DPDK patches and discussions
* OVS DPDK DMA-Dev library/Design Discussion
@ 2022-03-15 11:15 Stokes, Ian
  0 siblings, 0 replies; 58+ messages in thread
From: Stokes, Ian @ 2022-03-15 11:15 UTC (permalink / raw)
  To: Ilya Maximets, Maxime Coquelin (maxime.coquelin@redhat.com),
	Pai G, Sunil, Hu, Jiayu, ovs-dev, dev
  Cc: Kevin Traynor, Flavio Leitner, Mcnamara, John, Van Haaren, Harry,
	Ferriter, Cian, mcoqueli, fleitner, Gooch, Stephen,
	murali.krishna, Nee, Yuan Kuok, Nobuhiro Miki, wanjunjie,
	Raghupatruni, Madhusudana R, Asaf Sinai


Hi All,

We'd like to put a public meeting in place for the stakeholders of DPDK and OVS to discuss the next steps and design of the DSA library along with its integration in OVS.

There are a few different time zones involved, so we're trying to find the best fit.

Currently the suggestion is 2 PM on Tuesday the 15th.

https://meet.google.com/hme-pygf-bfb

The plan is for this to be a public meeting that can be shared with both the DPDK and OVS communities. For the moment I've invited the direct stakeholders from both communities as a starting point, as we'd like a time that suits these folks primarily; all are welcome to join the discussion.

Thanks
Ian





* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-05-03 19:38                                         ` Van Haaren, Harry
  2022-05-10 14:39                                           ` Van Haaren, Harry
@ 2022-05-24 12:12                                           ` Ilya Maximets
  1 sibling, 0 replies; 58+ messages in thread
From: Ilya Maximets @ 2022-05-24 12:12 UTC (permalink / raw)
  To: Van Haaren, Harry, Richardson, Bruce
  Cc: i.maximets, Mcnamara, John, Hu, Jiayu, Maxime Coquelin,
	Morten Brørup, Pai G, Sunil, Stokes, Ian, Ferriter, Cian,
	ovs-dev, dev, O'Driscoll, Tim, Finn, Emma

On 5/3/22 21:38, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Ilya Maximets <i.maximets@ovn.org>
>> Sent: Thursday, April 28, 2022 2:00 PM
>> To: Richardson, Bruce <bruce.richardson@intel.com>
>> Cc: i.maximets@ovn.org; Mcnamara, John <john.mcnamara@intel.com>; Hu, Jiayu
>> <jiayu.hu@intel.com>; Maxime Coquelin <maxime.coquelin@redhat.com>; Van
>> Haaren, Harry <harry.van.haaren@intel.com>; Morten Brørup
>> <mb@smartsharesystems.com>; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
>> <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
>> dev@openvswitch.org; dev@dpdk.org; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>> Finn, Emma <emma.finn@intel.com>
>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>
>> On 4/27/22 22:34, Bruce Richardson wrote:
>>> On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
>>>> On 4/20/22 18:41, Mcnamara, John wrote:
>>>>>> -----Original Message-----
>>>>>> From: Ilya Maximets <i.maximets@ovn.org>
>>>>>> Sent: Friday, April 8, 2022 10:58 AM
>>>>>> To: Hu, Jiayu <jiayu.hu@intel.com>; Maxime Coquelin
>>>>>> <maxime.coquelin@redhat.com>; Van Haaren, Harry
>>>>>> <harry.van.haaren@intel.com>; Morten Brørup
>> <mb@smartsharesystems.com>;
>>>>>> Richardson, Bruce <bruce.richardson@intel.com>
>>>>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
>>>>>> <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
>>>>>> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
>>>>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>>>>>> Finn, Emma <emma.finn@intel.com>
>>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>
>>>>>> On 4/8/22 09:13, Hu, Jiayu wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Ilya Maximets <i.maximets@ovn.org>
>>>>>>>> Sent: Thursday, April 7, 2022 10:40 PM
>>>>>>>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
>>>>>>>> <harry.van.haaren@intel.com>; Morten Brørup
>>>>>>>> <mb@smartsharesystems.com>; Richardson, Bruce
>>>>>>>> <bruce.richardson@intel.com>
>>>>>>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes,
>>>>>>>> Ian <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter,
>>>>>>>> Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
>>>>>>>> dev@dpdk.org; Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll,
>>>>>>>> Tim <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
>>>>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>>>
>>>>>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
>>>>>>>>> Hi Harry,
>>>>>>>>>
>>>>>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
>>>>>>>>>> Hi OVS & DPDK, Maintainers & Community,
>>>>>>>>>>
>>>>>>>>>> Top posting overview of discussion as replies to thread become
>>>>>>>>>> slower:
>>>>>>>>>> perhaps it is a good time to review and plan for next steps?
>>>>>>>>>>
>>>>>>>>>>  From my perspective, those most vocal in the thread seem to be
>>>>>>>>>> in favour of the clean rx/tx split ("defer work"), with the
>>>>>>>>>> tradeoff that the application must be aware of handling the async
>>>>>>>>>> DMA completions. If there are any concerns opposing upstreaming of
>>>>>>>>>> this method, please indicate this promptly, and we can continue
>>>>>>>>>> technical discussions here now.
>>>>>>>>>
>>>>>>>>> Wasn't there some discussions about handling the Virtio completions
>>>>>>>>> with the DMA engine? With that, we wouldn't need the deferral of work.
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> With the virtio completions handled by DMA itself, the vhost port
>>>>>>>> turns almost into a real HW NIC.  With that we will not need any
>>>>>>>> extra manipulations from the OVS side, i.e. no need to defer any work
>>>>>>>> while maintaining clear split between rx and tx operations.
>>>>>>>
>>>>>>> First, making DMA do 2B copy would sacrifice performance, and I think
>>>>>>> we all agree on that.
>>>>>>
>>>>>> I do not agree with that.  Yes, 2B copy by DMA will likely be slower than
>>>>>> done by CPU, however CPU is going away for dozens or even hundreds of
>>>>>> thousands of cycles to process a new packet batch or service other ports,
>>>>>> hence DMA will likely complete the transmission faster than waiting for
>>>>>> the CPU thread to come back to that task.  In any case, this has to be
>>>>>> tested.
>>>>>>
>>>>>>> Second, this method comes with an issue of ordering.
>>>>>>> For example, PMD thread0 enqueues 10 packets to vring0 first, then PMD
>>>>>>> thread1 enqueues 20 packets to vring0. If PMD thread0 and thread1 have
>>>>>>> own dedicated DMA device dma0 and dma1, flag/index update for the
>>>>>>> first 10 packets is done by dma0, and flag/index update for the left
>>>>>>> 20 packets is done by dma1. But there is no ordering guarantee among
>>>>>>> different DMA devices, so flag/index update may error. If PMD threads
>>>>>>> don't have dedicated DMA devices, which means DMA devices are shared
>>>>>>> among threads, we need lock and pay for lock contention in data-path.
>>>>>>> Or we can allocate DMA devices for vring dynamically to avoid DMA
>>>>>>> sharing among threads. But what's the overhead of allocation mechanism?
>>>>>> Who does it? Any thoughts?
>>>>>>
>>>>>> 1. DMA completion was discussed in the context of per-queue allocation,
>>>>>>    so there is no re-ordering in this case.
>>>>>>
>>>>>> 2. Overhead can be minimal if the allocated device can stick to the queue
>>>>>>    for a reasonable amount of time without re-allocation on every send.
>>>>>>    You may look at the XPS implementation in lib/dpif-netdev.c in OVS for
>>>>>>    an example of such a mechanism.  For sure it can not be the same, but
>>>>>>    ideas can be re-used.
>>>>>>
>>>>>> 3. Locking doesn't mean contention if resources are allocated/distributed
>>>>>>    thoughtfully.
>>>>>>
>>>>>> 4. Allocation can be done by either OVS or the vhost library itself.  I'd
>>>>>>    vote for doing that inside the vhost library, so any DPDK application
>>>>>>    and vhost ethdev can use it without re-inventing from scratch.  It
>>>>>>    also should be simpler from the API point of view if allocation and
>>>>>>    usage are in the same place.  But I don't have a strong opinion here
>>>>>>    as for now, since no real code examples exist, so it's hard to
>>>>>>    evaluate how they could look.
>>>>>>
>>>>>> But I feel like we're starting to run in circles here as I did already say
>>>>>> most of that before.
>>>>>
>>>>>
>>>>
>>>> Hi, John.
>>>>
>>>> Just reading this email as I was on PTO for the last 1.5 weeks
>>>> and didn't get through all the emails yet.
>>>>
>>>>> This does seem to be going in circles, especially since there seemed to be
>>>>> technical alignment on the last public call on March 29th.
>>>>
>>>> I guess, there is a typo in the date here.
>>>> It seems to be 26th, not 29th.
>>>>
>>>>> It is not feasible to do a real world implementation/POC of every design
>>>>> proposal.
>>>>
>>>> FWIW, I think it makes sense to PoC and test options that are
>>>> going to be simply unavailable going forward if not explored now.
>>>> Especially because we don't have any good solutions anyway
>>>> ("Deferral of Work" is architecturally wrong solution for OVS).
>>>>
>>>
>>> Hi Ilya,
>>>
>>> for those of us who haven't spent a long time working on OVS, can you
>>> perhaps explain a bit more as to why it is architecturally wrong? From my
>>> experience with DPDK, use of any lookaside accelerator, not just DMA but
>>> any crypto, compression or otherwise, requires asynchronous operation, and
>>> therefore some form of setting work aside temporarily to do other tasks.
>>
>> OVS doesn't use any lookaside accelerators and doesn't have any
>> infrastructure for them.
>>
>>
>> Let me create a DPDK analogy of what is proposed for OVS.
>>
>> DPDK has an ethdev API that abstracts different device drivers for
>> the application.  This API has a rte_eth_tx_burst() function that
>> is supposed to send packets through the particular network interface.
>>
>> Imagine now that there is a network card that is not capable of
>> sending packets right away and requires the application to come
>> back later to finish the operation.  That is an obvious problem,
>> because rte_eth_tx_burst() doesn't require any extra actions and
>> doesn't take ownership of packets that weren't consumed.
>>
>> The proposed solution for this problem is to change the ethdev API:
>>
>> 1. Allow rte_eth_tx_burst() to return -EINPROGRESS that effectively
>>    means that packets were acknowledged, but not actually sent yet.
>>
>> 2. Require the application to call the new rte_eth_process_async()
>>    function some time later, until it no longer returns -EINPROGRESS,
>>    in case the original rte_eth_tx_burst() call returned
>>    -EINPROGRESS.
>>
>> The main reason why this proposal is questionable:
>>
>> It's only one specific device that requires this special handling,
>> all other devices are capable of sending packets right away.
>> However, every DPDK application now has to implement some kind
>> of "Deferral of Work" mechanism in order to be compliant with
>> the updated DPDK ethdev API.
>>
>> Will DPDK make this API change?
>> I have no voice in DPDK API design decisions, but I'd argue against.
>>
>> Interestingly, that's not really an imaginary proposal.  That is
>> an exact change required for DPDK ethdev API in order to add
>> vhost async support to the vhost ethdev driver.
>>
>> Going back to OVS:
>>
>> An oversimplified architecture of OVS has 3 layers (top to bottom):
>>
>> 1. OFproto - the layer that handles OpenFlow.
>> 2. Datapath Interface - packet processing.
>> 3. Netdev - abstraction on top of all the different port types.
>>
>> Each layer has its own API that allows different implementations
>> of the same layer to be used interchangeably without any modifications
>> to higher layers.  That's what APIs and encapsulation are for.
>>
>> So, the Netdev layer has its own API, and this API is actually very
>> similar to DPDK's ethdev API, simply because they serve
>> the same purpose - abstraction on top of different network interfaces.
>> Besides different types of DPDK ports, there are also several types
>> of native Linux, BSD and Windows ports, and a variety of different tunnel
>> ports.
>>
>> Datapath interface layer is an "application" from the ethdev analogy
>> above.
>>
>> What is proposed by the "Deferral of Work" solution is to make pretty
>> much the same API change that I described, but to the netdev layer API
>> inside OVS, and to introduce a fairly complex (and questionable,
>> but I'm not going into that right now) machinery to handle that API
>> change in the datapath interface layer.
>>
>> So, exactly the same problem is here:
>>
>> If the API change is needed only for a single port type in a very
>> specific hardware environment, why do we need to change the common
>> API and rework a lot of the code in upper layers in order to accommodate
>> that API change, while it makes no practical sense for any other
>> port types or more generic hardware setups?
>> And similar changes will have to be done in any other DPDK application
>> that is not bound to specific hardware, but wants to support vhost
>> async.
>>
>> The right solution, IMO, is to make vhost async behave as any other
>> physical NIC, since it is essentially a physical NIC now (we're not
>> using DMA directly, it's a combined vhost+DMA solution), instead of
>> propagating quirks of the single device to a common API.
>>
>> And going back to DPDK, this implementation doesn't allow use of
>> vhost async in the DPDK's own vhost ethdev driver.
>>
>> My initial reply to the "Deferral of Work" RFC with pretty much
>> the same concerns:
>>   https://patchwork.ozlabs.org/project/openvswitch/patch/20210907111725.43672-2-cian.ferriter@intel.com/#2751799
>>
>> Best regards, Ilya Maximets.
> 
> 
> Hi Ilya,
> 
> Thanks for replying in more detail, understanding your perspective here helps to
> communicate the various solutions' benefits and drawbacks. Agreed the OfProto/Dpif/Netdev
> abstraction layers are strong abstractions in OVS, and in general they serve their purpose.
> 
> A key difference between OVS's usage of DPDK Ethdev TX and VHost TX is that the performance
> of each is very different: as you know, sending a 1500 byte packet over a physical NIC, or via
> VHost into a guest has a very different CPU cycle cost. Typically DPDK Tx takes ~5% CPU cycles
> while vhost copies are often ~30%, but can be > 50% in certain packet-sizes/configurations.

I understand the performance difference; it's general knowledge
that vhost rx/tx is costly.

At the same time, ethdev TX and vhost TX differ in that the first one
works with asynchronous devices, while the second one does not.  And
the goal here is to make vhost asynchronous too.
So, you're trying to increase the similarity between them by
differentiating the API... !?
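
For illustration, here is a rough, self-contained toy sketch in C (not any
real PMD; every name in it is invented for the example) of how a
synchronous tx_burst-style call can front an asynchronous device: it reaps
descriptors completed since the previous call before posting new ones, so
the caller never has to come back to "finish" a send.

/* Toy model only: a synchronous tx_burst-style call on top of an
 * asynchronous device.  Completions of earlier descriptors are reaped
 * at the start of the next call; posted packets belong to the device,
 * unposted ones stay with the caller. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define RING_SZ 64

struct toy_desc {
    void *pkt;
    bool in_use;
    volatile bool hw_done;    /* set asynchronously by the "device" */
};

static struct toy_desc ring[RING_SZ];
static uint16_t head, tail;

static uint16_t toy_tx_burst(void **pkts, uint16_t n)
{
    uint16_t sent = 0;

    /* Reap whatever the device finished since the last call. */
    while (ring[tail].in_use && ring[tail].hw_done) {
        free(ring[tail].pkt);             /* buffer returns to the pool */
        ring[tail].in_use = false;
        tail = (tail + 1) % RING_SZ;
    }

    /* Post as many new packets as there is ring space for. */
    while (sent < n && !ring[head].in_use) {
        ring[head] = (struct toy_desc){ .pkt = pkts[sent], .in_use = true };
        head = (head + 1) % RING_SZ;
        sent++;
    }
    return sent;              /* unsent packets remain owned by the caller */
}

The same pattern is what keeps the asynchrony of a physical NIC invisible
to the netdev/ethdev caller today.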

> Let's view the performance of the above example from the perspective of an actual deployment: OVS is
> very often deployed to provide an accelerated packet interface to a guest/VM via Vhost/Virtio.
> Surely improving performance of this primary use-case is a valid reason to consider changes and 
> improvements to an internal abstraction layer in OVS?

It was considered.  But it also makes sense to consider solutions
that don't require API changes.

Regarding that being a primary use-case: yes, OVS <-> VM communication
is a primary use-case.  However, usage of accelerators requires a specific
hardware platform, which may or may not be a primary use-case in the
future.  And DMA is not the only option even in the space of hardware
accelerators (vdpa?  I know that vdpa concept is fairly different from
DMA accelerators, but that doesn't invalidate my point).

> 
> Today DPDK tx and vhost tx are called via the same netdev abstraction, but we must ask the questions:
> 	- Is the netdev abstraction really the best it can be?

It is sufficient for what it needs to do.

> 	- Does adding an optional "async" feature to the abstraction improve performance significantly? (positive from including?)

Yes, but so do other possible implementations that don't require API
changes.  Possibly, the performance can be even higher.

> 	- Does adding the optional async feature cause actual degradation in DPIF implementations that don't support/use it? (negative due to including?)

It reduces the API clarity and increases the maintenance cost.

> 
> Of course strong abstractions are valuable, and of course changing them requires careful thought.
> But let's be clear - It is probably fair to say that OVS is not deployed because it has good abstractions internally.
> It is deployed because it is useful, and serves the need of an end-user. And part of the end-user needs is performance.

Good internal abstractions are the key to code maintainability.
We can make a complete mess of the code while it still works fast.
But that will cost us the ability to fix issues in the future, so end users
will stop using OVS over time because it's full of bugs that we are unable
to isolate and fix.  Little things add up.

> 
> The suggestion of integrating "Defer Work" method of exposing async in the OVS Datapath is well thought out,
> and a clean way of handling async work in a per-thread manner at the application layer. It is the most common way of integrating
> lookaside acceleration in software pipelines, and handling the async work at application thread level is the only logical place where
> the programmer can reason about tradeoffs for a specific use-case. Adding "dma acceleration to Vhost" will inevitably lead to
> compromises in the DPDK implementation, and ones that might (or might not) work for OVS and other apps.

I'm not convinced by "inevitably lead to compromises", and most
certainly not convinced that they will actually lead to degraded
performance in real-world cases.  Moving the DMA acceleration into the
vhost library and letting the hardware complete the transfers may even
make it faster due to lower delays between submitting the DMA job and
the packet becoming available to the VM on the other side.
And I'm pretty sure that moving the implementation into the vhost library
will allow many different applications to use that functionality
much more easily.
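
As a rough back-of-the-envelope illustration (the cycle counts are
assumptions for the sake of argument, not measurements): if a PMD thread
only revisits a given vhost queue after tens or hundreds of thousands of
cycles spent on other ports and batches, while the DMA engine finishes the
copy and the small flag/index update within a few thousand cycles of the
job being submitted, then device-side completion makes the packet visible
to the guest tens of thousands of cycles earlier than deferring that last
write until the CPU thread comes back.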

> 
> As you know, there have been OVS Conference presentations[1][2], RFCs and POCs[3][4][5][6], and community calls[7][8][9] on the topic.
> In the various presentations, the benefits of using application-level deferral of work are highlighted, and compared to other implementations
> which have non-desirable side-effects.

The presentations were one-sided (mostly pure marketing, but that is
expected given their purpose) and I objected to a lot of the
statements made during them.  I didn't get any solid arguments against my
objections.  Most of the stated "non-desirable side-effects" can be
avoided or may not be a problem at all with a thoughtful implementation.
But I'm repeating myself again.

> We haven't heard any objections that people won't use OVS if the netdev abstraction is changed.

The same statement can be made about every internal code change in every
other project.  End users don't know what the code looks like, so they do
not care.

> 
> It seems there is a trade-off decision to be made; 
> 	A) Change/improve the netdev abstraction to allow for async accelerations, and pay the cost of added app layer complexity

s/improve//

> 	B) Demand dma-acceleration is pushed down into vhost & below (as netdev abstraction is not going to be changed),
> 	    resulting in sub-par and potentially unusable code for any given app, as lower-layers cannot reason about app level specifics

s/demand/reasonably ask/

"sub-par and potentially unusable code"...  I can say the same about
the current async vhost API - it's unusable for any given app.  And
I can say that because I saw what is required in order to use it in OVS.
You found a way to hack it into OVS, it doesn't mean that it's a good
way of integrating it and it doesn't mean that it's possible to do the
same for any given app.  It will definitely be a lot of work for many
apps to integrate this functionality since a lot of a new code has to
be written just for it.  Try and implement support for async vhost in
vhost ethdev.

"potentially unusable code"... That is what experimental APIs are for.
You're creating a new API, waiting for several applications to try
to use it, gathering feedback, making API changes, then stabilizing
it when everyone is happy.

"lower-layers cannot reason about app level specifics"...  But that is
the whole point of DPDK - to be an abstraction layer, so the common stable
APIs can be used, and application developers not need to rewrite a half
of their application every time.

> 
> How can we the OVS/DPDK developers and users make a decision here?

There is an experimental API in DPDK, and it doesn't integrate well with
OVS.  The OVS community is asking to change that API (I didn't see a single
positive review on the RFCs), which is a normal process for experimental
APIs.  You're refusing to do so.

I don't know many DPDK applications, but at least VPP will not be able
to use vhost async, as it doesn't seem to use the vhost library and vhost
async is not available via the vhost ethdev in its current form.
If applications can't use an API, such an API is probably not good.

I don't see a clear path forward here.

Best regards, Ilya Maximets.

> 
> Regards, -Harry
> 
> [1] https://www.openvswitch.org/support/ovscon2020/#C3
> [2] https://www.openvswitch.org/support/ovscon2021/#T12
> [3] rawdev; https://patchwork.ozlabs.org/project/openvswitch/patch/20201023094845.35652-2-sunil.pai.g@intel.com/
> [4] defer work; http://patchwork.ozlabs.org/project/openvswitch/list/?series=261267&state=*
> [5] v3; http://patchwork.ozlabs.org/project/openvswitch/patch/20220104125242.1064162-2-sunil.pai.g@intel.com/
> [6] v4; http://patchwork.ozlabs.org/project/openvswitch/patch/20220321173640.326795-2-sunil.pai.g@intel.com/
> [7] Slides session 1; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/OVS%20vhost%20async%20datapath%20design%202022.pdf
> [8] Slides session 2; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/OVS%20vhost%20async%20datapath%20design%202022%20session%202.pdf
> [9] Slides session 3; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/ovs_datapath_design_2022%20session%203.pdf
> 



* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-05-13 10:34                         ` Bruce Richardson
@ 2022-05-16  9:04                           ` Morten Brørup
  0 siblings, 0 replies; 58+ messages in thread
From: Morten Brørup @ 2022-05-16  9:04 UTC (permalink / raw)
  To: Bruce Richardson, fengchengwen
  Cc: Pai G, Sunil, Ilya Maximets, Radha Mohan Chintakuntla,
	Veerasenareddy Burru, Gagandeep Singh, Nipun Gupta, Stokes, Ian,
	Hu, Jiayu, Ferriter, Cian, Van Haaren, Harry, maxime.coquelin,
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 13 May 2022 12.34
> 
> On Fri, May 13, 2022 at 05:48:35PM +0800, fengchengwen wrote:
> > On 2022/5/13 17:10, Bruce Richardson wrote:
> > > On Fri, May 13, 2022 at 04:52:10PM +0800, fengchengwen wrote:
> > >> On 2022/4/8 14:29, Pai G, Sunil wrote:
> > >>>> -----Original Message-----
> > >>>> From: Richardson, Bruce <bruce.richardson@intel.com>
> > >>>> Sent: Tuesday, April 5, 2022 5:38 PM
> > >>>> To: Ilya Maximets <i.maximets@ovn.org>; Chengwen Feng
> > >>>> <fengchengwen@huawei.com>; Radha Mohan Chintakuntla
> <radhac@marvell.com>;
> > >>>> Veerasenareddy Burru <vburru@marvell.com>; Gagandeep Singh
> > >>>> <g.singh@nxp.com>; Nipun Gupta <nipun.gupta@nxp.com>
> > >>>> Cc: Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> > >>>> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>;
> Ferriter, Cian
> > >>>> <cian.ferriter@intel.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>;
> > >>>> Maxime Coquelin (maxime.coquelin@redhat.com)
> <maxime.coquelin@redhat.com>;
> > >>>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> > >>>> <john.mcnamara@intel.com>; O'Driscoll, Tim
> <tim.odriscoll@intel.com>;
> > >>>> Finn, Emma <emma.finn@intel.com>
> > >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > >>>>
> > >>>> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
> > >>>>> On 3/30/22 16:09, Bruce Richardson wrote:
> > >>>>>> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
> > >>>>>>> On 3/30/22 13:12, Bruce Richardson wrote:
> > >>>>>>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets
> wrote:
> > >>>>>>>>> On 3/30/22 12:41, Ilya Maximets wrote:
> > >>>>>>>>>> Forking the thread to discuss a memory
> consistency/ordering model.
> > >>>>>>>>>>
> > >>>>>>>>>> AFAICT, dmadev can be anything from part of a CPU to a
> > >>>>>>>>>> completely separate PCI device.  However, I don't see any
> memory
> > >>>>>>>>>> ordering being enforced or even described in the dmadev
> API or
> > >>>> documentation.
> > >>>>>>>>>> Please, point me to the correct documentation, if I
> somehow missed
> > >>>> it.
> > >>>>>>>>>>
> > >>>>>>>>>> We have a DMA device (A) and a CPU core (B) writing
> respectively
> > >>>>>>>>>> the data and the descriptor info.  CPU core (C) is reading
> the
> > >>>>>>>>>> descriptor and the data it points too.
> > >>>>>>>>>>
> > >>>>>>>>>> A few things about that process:
> > >>>>>>>>>>
> > >>>>>>>>>> 1. There is no memory barrier between writes A and B (Did
> I miss
> > >>>>>>>>>>    them?).  Meaning that those operations can be seen by C
> in a
> > >>>>>>>>>>    different order regardless of barriers issued by C and
> > >>>> regardless
> > >>>>>>>>>>    of the nature of devices A and B.
> > >>>>>>>>>>
> > >>>>>>>>>> 2. Even if there is a write barrier between A and B, there
> is
> > >>>>>>>>>>    no guarantee that C will see these writes in the same
> order
> > >>>>>>>>>>    as C doesn't use real memory barriers because vhost
> > >>>>>>>>>> advertises
> > >>>>>>>>>
> > >>>>>>>>> s/advertises/does not advertise/
> > >>>>>>>>>
> > >>>>>>>>>>    VIRTIO_F_ORDER_PLATFORM.
> > >>>>>>>>>>
> > >>>>>>>>>> So, I'm getting to conclusion that there is a missing
> write
> > >>>>>>>>>> barrier on the vhost side and vhost itself must not
> advertise
> > >>>>>>>>>> the
> > >>>>>>>>>
> > >>>>>>>>> s/must not/must/
> > >>>>>>>>>
> > >>>>>>>>> Sorry, I wrote things backwards. :)
> > >>>>>>>>>
> > >>>>>>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use
> actual
> > >>>>>>>>>> memory barriers.
> > >>>>>>>>>>
> > >>>>>>>>>> Would like to hear some thoughts on that topic.  Is it a
> real
> > >>>> issue?
> > >>>>>>>>>> Is it an issue considering all possible CPU architectures
> and
> > >>>>>>>>>> DMA HW variants?
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> In terms of ordering of operations using dmadev:
> > >>>>>>>>
> > >>>>>>>> * Some DMA HW will perform all operations strictly in order
> e.g.
> > >>>> Intel
> > >>>>>>>>   IOAT, while other hardware may not guarantee order of
> > >>>> operations/do
> > >>>>>>>>   things in parallel e.g. Intel DSA. Therefore the dmadev
> API
> > >>>> provides the
> > >>>>>>>>   fence operation which allows the order to be enforced. The
> fence
> > >>>> can be
> > >>>>>>>>   thought of as a full memory barrier, meaning no jobs after
> the
> > >>>> barrier can
> > >>>>>>>>   be started until all those before it have completed.
> Obviously,
> > >>>> for HW
> > >>>>>>>>   where order is always enforced, this will be a no-op, but
> for
> > >>>> hardware that
> > >>>>>>>>   parallelizes, we want to reduce the fences to get best
> > >>>> performance.
> > >>>>>>>>
> > >>>>>>>> * For synchronization between DMA devices and CPUs, where a
> CPU can
> > >>>> only
> > >>>>>>>>   write after a DMA copy has been done, the CPU must wait
> for the
> > >>>> dma
> > >>>>>>>>   completion to guarantee ordering. Once the completion has
> been
> > >>>> returned
> > >>>>>>>>   the completed operation is globally visible to all cores.
> > >>>>>>>
> > >>>>>>> Thanks for explanation!  Some questions though:
> > >>>>>>>
> > >>>>>>> In our case one CPU waits for completion and another CPU is
> > >>>>>>> actually using the data.  IOW, "CPU must wait" is a bit
> ambiguous.
> > >>>> Which CPU must wait?
> > >>>>>>>
> > >>>>>>> Or should it be "Once the completion is visible on any core,
> the
> > >>>>>>> completed operation is globally visible to all cores." ?
> > >>>>>>>
> > >>>>>>
> > >>>>>> The latter.
> > >>>>>> Once the change to memory/cache is visible to any core, it is
> > >>>>>> visible to all ones. This applies to regular CPU memory writes
> too -
> > >>>>>> at least on IA, and I expect on many other architectures -
> once the
> > >>>>>> write is visible outside the current core it is visible to
> every
> > >>>>>> other core. Once the data hits the l1 or l2 cache of any core,
> any
> > >>>>>> subsequent requests for that data from any other core will
> "snoop"
> > >>>>>> the latest data from the cores cache, even if it has not made
> its
> > >>>>>> way down to a shared cache, e.g. l3 on most IA systems.
> > >>>>>
> > >>>>> It sounds like you're referring to the "multicopy atomicity" of
> the
> > >>>>> architecture.  However, that is not universally supported
> thing.
> > >>>>> AFAICT, POWER and older ARM systems doesn't support it, so
> writes
> > >>>>> performed by one core are not necessarily available to all
> other cores
> > >>>>> at the same time.  That means that if the CPU0 writes the data
> and the
> > >>>>> completion flag, CPU1 reads the completion flag and writes the
> ring,
> > >>>>> CPU2 may see the ring write, but may still not see the write of
> the
> > >>>>> data, even though there was a control dependency on CPU1.
> > >>>>> There should be a full memory barrier on CPU1 in order to
> fulfill the
> > >>>>> memory ordering requirements for CPU2, IIUC.
> > >>>>>
> > >>>>> In our scenario the CPU0 is a DMA device, which may or may not
> be part
> > >>>>> of a CPU and may have different memory consistency/ordering
> > >>>>> requirements.  So, the question is: does DPDK DMA API guarantee
> > >>>>> multicopy atomicity between DMA device and all CPU cores
> regardless of
> > >>>>> CPU architecture and a nature of the DMA device?
> > >>>>>
> > >>>>
> > >>>> Right now, it doesn't because this never came up in discussion.
> In order
> > >>>> to be useful, it sounds like it explicitly should do so. At
> least for the
> > >>>> Intel ioat and idxd driver cases, this will be supported, so we
> just need
> > >>>> to ensure all other drivers currently upstreamed can offer this
> too. If
> > >>>> they cannot, we cannot offer it as a global guarantee, and we
> should see
> > >>>> about adding a capability flag for this to indicate when the
> guarantee is
> > >>>> there or not.
> > >>>>
> > >>>> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok
> to
> > >>>> document for dmadev that once a DMA operation is completed, the
> op is
> > >>>> guaranteed visible to all cores/threads? If not, any thoughts on
> what
> > >>>> guarantees we can provide in this regard, or what capabilities
> should be
> > >>>> exposed?
> > >>>
> > >>>
> > >>>
> > >>> Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy
> Burru, @Gagandeep Singh, @Nipun Gupta,
> > >>> Requesting your valuable opinions for the queries on this thread.
> > >>
> > >> Sorry for the late reply; I didn't follow this thread.
> > >>
> > >> I don't think the DMA API should provide such guarantee because:
> > >> 1. DMA is an acceleration device, which is the same as
> encryption/decryption device or network device.
> > >> 2. For Hisilicon Kunpeng platform:
> > >>    The DMA device support:
> > >>      a) IO coherency: which mean it could read read the latest
> data which may stay the cache, and will
> > >>         invalidate cache's data and write data to DDR when write.
> > >>      b) Order in one request: which mean it only write completion
> descriptor after the copy is done.
> > >>         Note: orders between multiple requests can be implemented
> through the fence mechanism.
> > >>    The DMA driver only should:
> > >>      a) Add one write memory barrier(use lightweight mb) when
> doorbell.
> > >>    So once the DMA is completed the operation is guaranteed
> visible to all cores,
> > >>    And the 3rd core will observed the right order: core-B prepare
> data and issue request to DMA, DMA
> > >>    start work, core-B get completion status.
> > >> 3. I did a TI multi-core SoC many years ago, the SoC don't support
> cache coherence and consistency between
> > >>    cores. The SoC also have DMA device which have many channel.
> Here we do a hypothetical design the DMA
> > >>    driver with the DPDK DMA framework:
> > >>    The DMA driver should:
> > >>      a) write back DMA's src buffer, so that there are none cache
> data when DMA running.
> > >>      b) invalidate DMA's dst buffer
> > >>      c) do a full mb
> > >>      d) update DMA's registers.
> > >>    Then DMA will execute the copy task, it copy from DDR and write
> to DDR, and after copy it will modify
> > >>    it's status register to completed.
> > >>    In this case, the 3rd core will also observed the right order.
> > >>    A particular point of this is: If one buffer will shared on
> multiple core, application should explicit
> > >>    maintain the cache.
> > >>
> > >> Based on above, I don't think the DMA API should explicit add the
> descriptor, it's driver's and even
> > >> application(e.g. above TI's SoC)'s duty to make sure it.
> > >>
> > > Hi,
> > >
> > > thanks for that. So if I understand correctly, your current HW does
> provide
> > > this guarantee, but you don't think it should be always the case
> for
> > > dmadev, correct?
> >
> > Yes, our HW will provide the guarantee.
> > If some HW cannot provide it, it's the driver's and maybe the application's
> > duty to provide it.
> >
> > >
> > > Based on that, what do you think should be the guarantee on
> completion?
> > > Once a job is completed, the completion is visible to the
> submitting core,
> > > or the core reading the completion? Do you think it's acceptable to
> add a
> >
> > It will be visible to both cores.
> >
> > > capability flag for drivers to indicate that they do support a
> "globally
> > > visible" guarantee?
> >
> > I think the driver (together with the HW) should support the "globally visible"
> > guarantee.
> > And for some HW, even the application (or middleware) should care about
> > it.
> >
> 
> From a dmadev API viewpoint, whether the driver handles it or the HW
> itself, does not matter. However, if the application needs to take
> special actions to guarantee visibility, then that needs to be flagged
> as part of the dmadev API.
> 
> I see three possibilities:
> 1 Wait until we have a driver that does not have global visibility on
>   return from rte_dma_completed, and at that point add a flag indicating
>   the lack of that support. Until then, document that results of ops
>   will be globally visible.
> 2 Add a flag now to allow drivers to indicate *lack* of global
>   visibility, and document that results are visible unless flag is set.
> 3 Add a flag now to allow drivers to call out that all results are
>   g.v., and update drivers to use this flag.
> 
> I would be very much in favour of #1, because:
> * YAGNI principle - (subject to confirmation by other maintainers) if
>   we don't have a driver right now that needs non-g.v. behaviour we may
>   never need one.
> * In the absence of a concrete case where g.v. is not guaranteed, we
>   may struggle to document correctly what the actual guarantees are,
>   especially if submitter core and completer core are different.

A big +1 to that!

Perhaps the documentation can reflect that global visibility is provided by current DMA hardware, and if some future DMA hardware does not provide it, the API will be changed (in some unspecified manner) to reflect this. My point is: We should avoid that the API stability policy makes it impossible to add some future DMA hardware without global visibility. Requiring applications to handle such future DMA hardware differently is perfectly fine; the API (and its documentation) should just be open for it.
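
To make that concrete, below is a minimal sketch of what application-side
handling could look like if such hardware ever appears, in the spirit of
option #2 from Bruce's list.  The capability bit is purely hypothetical
(nothing like it exists in dmadev today); only rte_dma_info_get(),
struct rte_dma_info and rte_smp_rmb() are existing DPDK APIs.

#include <stdbool.h>
#include <rte_dmadev.h>
#include <rte_atomic.h>

/* Hypothetical capability bit -- NOT part of the dmadev API today. */
#define DMA_CAPA_NO_GLOBAL_VISIBILITY_HYP (1ULL << 62)

static bool dma_needs_extra_barrier;

static void probe_dma_visibility(int16_t dev_id)
{
    struct rte_dma_info info;

    if (rte_dma_info_get(dev_id, &info) == 0)
        /* Only a future device advertising the hypothetical flag would
         * require the application to order its own reads. */
        dma_needs_extra_barrier =
            (info.dev_capa & DMA_CAPA_NO_GLOBAL_VISIBILITY_HYP) != 0;
}

/* Called on the core that consumes the copied data after a completion
 * has been observed (possibly on a different core). */
static void before_using_dma_results(void)
{
    if (dma_needs_extra_barrier)
        rte_smp_rmb();
}

Current drivers would never set such a flag, so the extra barrier would
simply never be taken on today's hardware, while the API stays open for a
hypothetical future device.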

> 
> @Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh,
> @Nipun Gupta,
> As driver maintainers, can you please confirm if on receipt of a
> completion from HW/driver, the operation results are visible on all
> application cores, i.e. the app does not need additional barriers to
> propagate visibility to other cores. Your opinions on this discussion
> would also be useful.
> 
> Regards,
> /Bruce



* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-05-13  9:48                       ` fengchengwen
@ 2022-05-13 10:34                         ` Bruce Richardson
  2022-05-16  9:04                           ` Morten Brørup
  0 siblings, 1 reply; 58+ messages in thread
From: Bruce Richardson @ 2022-05-13 10:34 UTC (permalink / raw)
  To: fengchengwen
  Cc: Pai G, Sunil, Ilya Maximets, Radha Mohan Chintakuntla,
	Veerasenareddy Burru, Gagandeep Singh, Nipun Gupta, Stokes, Ian,
	Hu, Jiayu, Ferriter, Cian, Van Haaren, Harry,
	Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On Fri, May 13, 2022 at 05:48:35PM +0800, fengchengwen wrote:
> On 2022/5/13 17:10, Bruce Richardson wrote:
> > On Fri, May 13, 2022 at 04:52:10PM +0800, fengchengwen wrote:
> >> On 2022/4/8 14:29, Pai G, Sunil wrote:
> >>>> -----Original Message-----
> >>>> From: Richardson, Bruce <bruce.richardson@intel.com>
> >>>> Sent: Tuesday, April 5, 2022 5:38 PM
> >>>> To: Ilya Maximets <i.maximets@ovn.org>; Chengwen Feng
> >>>> <fengchengwen@huawei.com>; Radha Mohan Chintakuntla <radhac@marvell.com>;
> >>>> Veerasenareddy Burru <vburru@marvell.com>; Gagandeep Singh
> >>>> <g.singh@nxp.com>; Nipun Gupta <nipun.gupta@nxp.com>
> >>>> Cc: Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> >>>> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
> >>>> <cian.ferriter@intel.com>; Van Haaren, Harry <harry.van.haaren@intel.com>;
> >>>> Maxime Coquelin (maxime.coquelin@redhat.com) <maxime.coquelin@redhat.com>;
> >>>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> >>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> >>>> Finn, Emma <emma.finn@intel.com>
> >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
> >>>>> On 3/30/22 16:09, Bruce Richardson wrote:
> >>>>>> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
> >>>>>>> On 3/30/22 13:12, Bruce Richardson wrote:
> >>>>>>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
> >>>>>>>>> On 3/30/22 12:41, Ilya Maximets wrote:
> >>>>>>>>>> Forking the thread to discuss a memory consistency/ordering model.
> >>>>>>>>>>
> >>>>>>>>>> AFAICT, dmadev can be anything from part of a CPU to a
> >>>>>>>>>> completely separate PCI device.  However, I don't see any memory
> >>>>>>>>>> ordering being enforced or even described in the dmadev API or
> >>>> documentation.
> >>>>>>>>>> Please, point me to the correct documentation, if I somehow missed
> >>>> it.
> >>>>>>>>>>
> >>>>>>>>>> We have a DMA device (A) and a CPU core (B) writing respectively
> >>>>>>>>>> the data and the descriptor info.  CPU core (C) is reading the
> >>>>>>>>>> descriptor and the data it points to.
> >>>>>>>>>>
> >>>>>>>>>> A few things about that process:
> >>>>>>>>>>
> >>>>>>>>>> 1. There is no memory barrier between writes A and B (Did I miss
> >>>>>>>>>>    them?).  Meaning that those operations can be seen by C in a
> >>>>>>>>>>    different order regardless of barriers issued by C and
> >>>> regardless
> >>>>>>>>>>    of the nature of devices A and B.
> >>>>>>>>>>
> >>>>>>>>>> 2. Even if there is a write barrier between A and B, there is
> >>>>>>>>>>    no guarantee that C will see these writes in the same order
> >>>>>>>>>>    as C doesn't use real memory barriers because vhost
> >>>>>>>>>> advertises
> >>>>>>>>>
> >>>>>>>>> s/advertises/does not advertise/
> >>>>>>>>>
> >>>>>>>>>>    VIRTIO_F_ORDER_PLATFORM.
> >>>>>>>>>>
> >>>>>>>>>> So, I'm getting to conclusion that there is a missing write
> >>>>>>>>>> barrier on the vhost side and vhost itself must not advertise
> >>>>>>>>>> the
> >>>>>>>>>
> >>>>>>>>> s/must not/must/
> >>>>>>>>>
> >>>>>>>>> Sorry, I wrote things backwards. :)
> >>>>>>>>>
> >>>>>>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual
> >>>>>>>>>> memory barriers.
> >>>>>>>>>>
> >>>>>>>>>> Would like to hear some thoughts on that topic.  Is it a real
> >>>> issue?
> >>>>>>>>>> Is it an issue considering all possible CPU architectures and
> >>>>>>>>>> DMA HW variants?
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>> In terms of ordering of operations using dmadev:
> >>>>>>>>
> >>>>>>>> * Some DMA HW will perform all operations strictly in order e.g.
> >>>> Intel
> >>>>>>>>   IOAT, while other hardware may not guarantee order of
> >>>> operations/do
> >>>>>>>>   things in parallel e.g. Intel DSA. Therefore the dmadev API
> >>>> provides the
> >>>>>>>>   fence operation which allows the order to be enforced. The fence
> >>>> can be
> >>>>>>>>   thought of as a full memory barrier, meaning no jobs after the
> >>>> barrier can
> >>>>>>>>   be started until all those before it have completed. Obviously,
> >>>> for HW
> >>>>>>>>   where order is always enforced, this will be a no-op, but for
> >>>> hardware that
> >>>>>>>>   parallelizes, we want to reduce the fences to get best
> >>>> performance.
> >>>>>>>>
> >>>>>>>> * For synchronization between DMA devices and CPUs, where a CPU can
> >>>> only
> >>>>>>>>   write after a DMA copy has been done, the CPU must wait for the
> >>>> dma
> >>>>>>>>   completion to guarantee ordering. Once the completion has been
> >>>> returned
> >>>>>>>>   the completed operation is globally visible to all cores.
> >>>>>>>
> >>>>>>> Thanks for explanation!  Some questions though:
> >>>>>>>
> >>>>>>> In our case one CPU waits for completion and another CPU is
> >>>>>>> actually using the data.  IOW, "CPU must wait" is a bit ambiguous.
> >>>> Which CPU must wait?
> >>>>>>>
> >>>>>>> Or should it be "Once the completion is visible on any core, the
> >>>>>>> completed operation is globally visible to all cores." ?
> >>>>>>>
> >>>>>>
> >>>>>> The latter.
> >>>>>> Once the change to memory/cache is visible to any core, it is
> >>>>>> visible to all ones. This applies to regular CPU memory writes too -
> >>>>>> at least on IA, and I expect on many other architectures - once the
> >>>>>> write is visible outside the current core it is visible to every
> >>>>>> other core. Once the data hits the l1 or l2 cache of any core, any
> >>>>>> subsequent requests for that data from any other core will "snoop"
> >>>>>> the latest data from the cores cache, even if it has not made its
> >>>>>> way down to a shared cache, e.g. l3 on most IA systems.
> >>>>>
> >>>>> It sounds like you're referring to the "multicopy atomicity" of the
> >>>>> architecture.  However, that is not universally supported thing.
> >>>>> AFAICT, POWER and older ARM systems don't support it, so writes
> >>>>> performed by one core are not necessarily available to all other cores
> >>>>> at the same time.  That means that if the CPU0 writes the data and the
> >>>>> completion flag, CPU1 reads the completion flag and writes the ring,
> >>>>> CPU2 may see the ring write, but may still not see the write of the
> >>>>> data, even though there was a control dependency on CPU1.
> >>>>> There should be a full memory barrier on CPU1 in order to fulfill the
> >>>>> memory ordering requirements for CPU2, IIUC.
> >>>>>
> >>>>> In our scenario the CPU0 is a DMA device, which may or may not be part
> >>>>> of a CPU and may have different memory consistency/ordering
> >>>>> requirements.  So, the question is: does DPDK DMA API guarantee
> >>>>> multicopy atomicity between DMA device and all CPU cores regardless of
> >>>>> CPU architecture and a nature of the DMA device?
> >>>>>
> >>>>
> >>>> Right now, it doesn't because this never came up in discussion. In order
> >>>> to be useful, it sounds like it explicitly should do so. At least for the
> >>>> Intel ioat and idxd driver cases, this will be supported, so we just need
> >>>> to ensure all other drivers currently upstreamed can offer this too. If
> >>>> they cannot, we cannot offer it as a global guarantee, and we should see
> >>>> about adding a capability flag for this to indicate when the guarantee is
> >>>> there or not.
> >>>>
> >>>> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to
> >>>> document for dmadev that once a DMA operation is completed, the op is
> >>>> guaranteed visible to all cores/threads? If not, any thoughts on what
> >>>> guarantees we can provide in this regard, or what capabilities should be
> >>>> exposed?
> >>>
> >>>
> >>>
> >>> Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh, @Nipun Gupta,
> >>> Requesting your valuable opinions for the queries on this thread.
> >>
> >> Sorry for the late reply; I didn't follow this thread.
> >>
> >> I don't think the DMA API should provide such a guarantee because:
> >> 1. DMA is an acceleration device, which is the same as encryption/decryption device or network device.
> >> 2. For Hisilicon Kunpeng platform:
> >>    The DMA device support:
> >>      a) IO coherency: which means it could read the latest data, which may stay in the cache, and will
> >>         invalidate the cache's data and write data to DDR on write.
> >>      b) Order in one request: which means it only writes the completion descriptor after the copy is done.
> >>         Note: orders between multiple requests can be implemented through the fence mechanism.
> >>    The DMA driver only should:
> >>      a) Add one write memory barrier(use lightweight mb) when doorbell.
> >>    So once the DMA is completed, the operation is guaranteed visible to all cores,
> >>    and the 3rd core will observe the right order: core-B prepares data and issues a request to the DMA, the DMA
> >>    starts work, core-B gets the completion status.
> >> 3. I did a TI multi-core SoC many years ago; the SoC didn't support cache coherence and consistency between
> >>    cores. The SoC also had a DMA device with many channels. Here we do a hypothetical design of the DMA
> >>    driver with the DPDK DMA framework:
> >>    The DMA driver should:
> >>      a) write back the DMA's src buffer, so that there is no cached data while the DMA is running.
> >>      b) invalidate DMA's dst buffer
> >>      c) do a full mb
> >>      d) update DMA's registers.
> >>    Then the DMA will execute the copy task: it copies from DDR and writes to DDR, and after the copy it will set
> >>    its status register to completed.
> >>    In this case, the 3rd core will also observe the right order.
> >>    A particular point of this is: if one buffer is shared among multiple cores, the application should explicitly
> >>    maintain the cache.
> >>
> >> Based on the above, I don't think the DMA API should explicitly add such a guarantee; it's the driver's and even
> >> the application's (e.g. the above TI SoC's) duty to make sure of it.
> >>
> > Hi,
> > 
> > thanks for that. So if I understand correctly, your current HW does provide
> > this guarantee, but you don't think it should be always the case for
> > dmadev, correct?
> 
> Yes, our HW will provide the guarantee.
> If some HW cannot provide it, it's the driver's and maybe the application's duty to provide it.
> 
> > 
> > Based on that, what do you think should be the guarantee on completion?
> > Once a job is completed, the completion is visible to the submitting core,
> > or the core reading the completion? Do you think it's acceptable to add a
> 
> It will be visible to both cores.
> 
> > capability flag for drivers to indicate that they do support a "globally
> > visible" guarantee?
> 
> I think the driver (together with the HW) should support the "globally visible" guarantee.
> And for some HW, even the application (or middleware) should care about it.
> 

From a dmadev API viewpoint, whether the driver handles it or the HW
itself, does not matter. However, if the application needs to take special
actions to guarantee visibility, then that needs to be flagged as part of
the dmadev API.

I see three possibilities:
1 Wait until we have a driver that does not have global visibility on
  return from rte_dma_completed, and at that point add a flag indicating
  the lack of that support. Until then, document that results of ops will
  be globally visible.
2 Add a flag now to allow drivers to indicate *lack* of global visibility,
  and document that results are visible unless flag is set.
3 Add a flag now to allow drivers to call out that all results are g.v., and
  update drivers to use this flag.

I would be very much in favour of #1, because:
* YAGNI principle - (subject to confirmation by other maintainers) if we
  don't have a driver right now that needs non-g.v. behaviour we may never
  need one.
* In the absence of a concrete case where g.v. is not guaranteed, we may
  struggle to document correctly what the actual guarantees are, especially if
  submitter core and completer core are different.
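
For reference, a minimal sketch (device/vchan setup and error handling
omitted) of the usage model that option #1 would document, using only the
existing public dmadev calls; the key point is that no application-level
barrier is needed once rte_dma_completed() reports the job done:

#include <stdbool.h>
#include <rte_dmadev.h>

static void copy_and_wait(int16_t dev_id, uint16_t vchan,
                          rte_iova_t src, rte_iova_t dst, uint32_t len)
{
    uint16_t last_idx;
    bool error = false;

    /* Enqueue the copy and ring the doorbell. */
    rte_dma_copy(dev_id, vchan, src, dst, len, 0);
    rte_dma_submit(dev_id, vchan);

    /* Under option #1, once the job is reported complete the copied data
     * is documented as globally visible, so any core may consume it
     * without further synchronisation. */
    while (rte_dma_completed(dev_id, vchan, 1, &last_idx, &error) == 0)
        ;
}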

@Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh, @Nipun Gupta,
As driver maintainers, can you please confirm if on receipt of a completion
from HW/driver, the operation results are visible on all application cores,
i.e. the app does not need additional barriers to propagate visibility to
other cores. Your opinions on this discussion would also be useful.

Regards,
/Bruce


* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-05-13  9:10                     ` Bruce Richardson
@ 2022-05-13  9:48                       ` fengchengwen
  2022-05-13 10:34                         ` Bruce Richardson
  0 siblings, 1 reply; 58+ messages in thread
From: fengchengwen @ 2022-05-13  9:48 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Pai G, Sunil, Ilya Maximets, Radha Mohan Chintakuntla,
	Veerasenareddy Burru, Gagandeep Singh, Nipun Gupta, Stokes, Ian,
	Hu,  Jiayu, Ferriter, Cian, Van Haaren, Harry,
	Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On 2022/5/13 17:10, Bruce Richardson wrote:
> On Fri, May 13, 2022 at 04:52:10PM +0800, fengchengwen wrote:
>> On 2022/4/8 14:29, Pai G, Sunil wrote:
>>>> -----Original Message-----
>>>> From: Richardson, Bruce <bruce.richardson@intel.com>
>>>> Sent: Tuesday, April 5, 2022 5:38 PM
>>>> To: Ilya Maximets <i.maximets@ovn.org>; Chengwen Feng
>>>> <fengchengwen@huawei.com>; Radha Mohan Chintakuntla <radhac@marvell.com>;
>>>> Veerasenareddy Burru <vburru@marvell.com>; Gagandeep Singh
>>>> <g.singh@nxp.com>; Nipun Gupta <nipun.gupta@nxp.com>
>>>> Cc: Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
>>>> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
>>>> <cian.ferriter@intel.com>; Van Haaren, Harry <harry.van.haaren@intel.com>;
>>>> Maxime Coquelin (maxime.coquelin@redhat.com) <maxime.coquelin@redhat.com>;
>>>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
>>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>>>> Finn, Emma <emma.finn@intel.com>
>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>>
>>>> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
>>>>> On 3/30/22 16:09, Bruce Richardson wrote:
>>>>>> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
>>>>>>> On 3/30/22 13:12, Bruce Richardson wrote:
>>>>>>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
>>>>>>>>> On 3/30/22 12:41, Ilya Maximets wrote:
>>>>>>>>>> Forking the thread to discuss a memory consistency/ordering model.
>>>>>>>>>>
>>>>>>>>>> AFAICT, dmadev can be anything from part of a CPU to a
>>>>>>>>>> completely separate PCI device.  However, I don't see any memory
>>>>>>>>>> ordering being enforced or even described in the dmadev API or
>>>> documentation.
>>>>>>>>>> Please, point me to the correct documentation, if I somehow missed
>>>> it.
>>>>>>>>>>
>>>>>>>>>> We have a DMA device (A) and a CPU core (B) writing respectively
>>>>>>>>>> the data and the descriptor info.  CPU core (C) is reading the
>>>>>>>>>> descriptor and the data it points too.
>>>>>>>>>>
>>>>>>>>>> A few things about that process:
>>>>>>>>>>
>>>>>>>>>> 1. There is no memory barrier between writes A and B (Did I miss
>>>>>>>>>>    them?).  Meaning that those operations can be seen by C in a
>>>>>>>>>>    different order regardless of barriers issued by C and
>>>> regardless
>>>>>>>>>>    of the nature of devices A and B.
>>>>>>>>>>
>>>>>>>>>> 2. Even if there is a write barrier between A and B, there is
>>>>>>>>>>    no guarantee that C will see these writes in the same order
>>>>>>>>>>    as C doesn't use real memory barriers because vhost
>>>>>>>>>> advertises
>>>>>>>>>
>>>>>>>>> s/advertises/does not advertise/
>>>>>>>>>
>>>>>>>>>>    VIRTIO_F_ORDER_PLATFORM.
>>>>>>>>>>
>>>>>>>>>> So, I'm getting to conclusion that there is a missing write
>>>>>>>>>> barrier on the vhost side and vhost itself must not advertise
>>>>>>>>>> the
>>>>>>>>>
>>>>>>>>> s/must not/must/
>>>>>>>>>
>>>>>>>>> Sorry, I wrote things backwards. :)
>>>>>>>>>
>>>>>>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual
>>>>>>>>>> memory barriers.
>>>>>>>>>>
>>>>>>>>>> Would like to hear some thoughts on that topic.  Is it a real
>>>> issue?
>>>>>>>>>> Is it an issue considering all possible CPU architectures and
>>>>>>>>>> DMA HW variants?
>>>>>>>>>>
>>>>>>>>
>>>>>>>> In terms of ordering of operations using dmadev:
>>>>>>>>
>>>>>>>> * Some DMA HW will perform all operations strictly in order e.g.
>>>> Intel
>>>>>>>>   IOAT, while other hardware may not guarantee order of
>>>> operations/do
>>>>>>>>   things in parallel e.g. Intel DSA. Therefore the dmadev API
>>>> provides the
>>>>>>>>   fence operation which allows the order to be enforced. The fence
>>>> can be
>>>>>>>>   thought of as a full memory barrier, meaning no jobs after the
>>>> barrier can
>>>>>>>>   be started until all those before it have completed. Obviously,
>>>> for HW
>>>>>>>>   where order is always enforced, this will be a no-op, but for
>>>> hardware that
>>>>>>>>   parallelizes, we want to reduce the fences to get best
>>>> performance.
>>>>>>>>
>>>>>>>> * For synchronization between DMA devices and CPUs, where a CPU can
>>>> only
>>>>>>>>   write after a DMA copy has been done, the CPU must wait for the
>>>> dma
>>>>>>>>   completion to guarantee ordering. Once the completion has been
>>>> returned
>>>>>>>>   the completed operation is globally visible to all cores.
>>>>>>>
>>>>>>> Thanks for explanation!  Some questions though:
>>>>>>>
>>>>>>> In our case one CPU waits for completion and another CPU is
>>>>>>> actually using the data.  IOW, "CPU must wait" is a bit ambiguous.
>>>> Which CPU must wait?
>>>>>>>
>>>>>>> Or should it be "Once the completion is visible on any core, the
>>>>>>> completed operation is globally visible to all cores." ?
>>>>>>>
>>>>>>
>>>>>> The latter.
>>>>>> Once the change to memory/cache is visible to any core, it is
>>>>>> visible to all ones. This applies to regular CPU memory writes too -
>>>>>> at least on IA, and I expect on many other architectures - once the
>>>>>> write is visible outside the current core it is visible to every
>>>>>> other core. Once the data hits the l1 or l2 cache of any core, any
>>>>>> subsequent requests for that data from any other core will "snoop"
>>>>>> the latest data from the cores cache, even if it has not made its
>>>>>> way down to a shared cache, e.g. l3 on most IA systems.
>>>>>
>>>>> It sounds like you're referring to the "multicopy atomicity" of the
>>>>> architecture.  However, that is not universally supported thing.
>>>>> AFAICT, POWER and older ARM systems doesn't support it, so writes
>>>>> performed by one core are not necessarily available to all other cores
>>>>> at the same time.  That means that if the CPU0 writes the data and the
>>>>> completion flag, CPU1 reads the completion flag and writes the ring,
>>>>> CPU2 may see the ring write, but may still not see the write of the
>>>>> data, even though there was a control dependency on CPU1.
>>>>> There should be a full memory barrier on CPU1 in order to fulfill the
>>>>> memory ordering requirements for CPU2, IIUC.
>>>>>
>>>>> In our scenario the CPU0 is a DMA device, which may or may not be part
>>>>> of a CPU and may have different memory consistency/ordering
>>>>> requirements.  So, the question is: does DPDK DMA API guarantee
>>>>> multicopy atomicity between DMA device and all CPU cores regardless of
>>>>> CPU architecture and a nature of the DMA device?
>>>>>
>>>>
>>>> Right now, it doesn't because this never came up in discussion. In order
>>>> to be useful, it sounds like it explicitly should do so. At least for the
>>>> Intel ioat and idxd driver cases, this will be supported, so we just need
>>>> to ensure all other drivers currently upstreamed can offer this too. If
>>>> they cannot, we cannot offer it as a global guarantee, and we should see
>>>> about adding a capability flag for this to indicate when the guarantee is
>>>> there or not.
>>>>
>>>> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to
>>>> document for dmadev that once a DMA operation is completed, the op is
>>>> guaranteed visible to all cores/threads? If not, any thoughts on what
>>>> guarantees we can provide in this regard, or what capabilities should be
>>>> exposed?
>>>
>>>
>>>
>>> Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh, @Nipun Gupta,
>>> Requesting your valuable opinions for the queries on this thread.
>>
>> Sorry for the late reply; I hadn't been following this thread.
>>
>> I don't think the DMA API should provide such a guarantee, because:
>> 1. DMA is an acceleration device, just like an encryption/decryption device or a network device.
>> 2. For the Hisilicon Kunpeng platform:
>>    The DMA device supports:
>>      a) IO coherency: it can read the latest data, which may still sit in a CPU cache, and on writes it
>>         will invalidate the cached data and write the data to DDR.
>>      b) Ordering within one request: it writes the completion descriptor only after the copy is done.
>>         Note: ordering between multiple requests can be enforced through the fence mechanism.
>>    The DMA driver only needs to:
>>      a) add one write memory barrier (a lightweight mb) when ringing the doorbell.
>>    So once the DMA is completed, the operation is guaranteed visible to all cores,
>>    and a 3rd core will observe the right order: core-B prepares the data and issues the request to the
>>    DMA, the DMA does the copy, core-B gets the completion status.
>> 3. I worked on a TI multi-core SoC many years ago; the SoC didn't support cache coherence and consistency
>>    between cores. The SoC also had a DMA device with many channels. Here is a hypothetical design of a
>>    DMA driver for it within the DPDK DMA framework:
>>    The DMA driver should:
>>      a) write back the DMA's src buffer, so that no dirty cache lines remain while the DMA is running.
>>      b) invalidate the DMA's dst buffer
>>      c) do a full mb
>>      d) update the DMA's registers.
>>    Then the DMA will execute the copy task: it copies from DDR and writes to DDR, and after the copy it
>>    sets its status register to completed.
>>    In this case, the 3rd core will also observe the right order.
>>    A particular point here: if one buffer is shared by multiple cores, the application should explicitly
>>    maintain the cache.
>>
>> Based on the above, I don't think the DMA API should add this guarantee explicitly; it is the driver's,
>> and sometimes even the application's (e.g. the above TI SoC), duty to ensure it.
>>
> Hi,
> 
> thanks for that. So if I understand correctly, your current HW does provide
> this guarantee, but you don't think it should always be the case for
> dmadev, correct?

Yes, our HW provides the guarantee.
If some HW cannot provide it, it is the driver's, and maybe the application's, duty to provide it.

> 
> Based on that, what do you think should be the guarantee on completion?
> Once a job is completed, the completion is visible to the submitting core,
> or the core reading the completion? Do you think it's acceptable to add a

The completion will be visible to both cores.

> capability flag for drivers to indicate that they do support a "globally
> visible" guarantee?

I think the driver (together with the HW) should support the "globally visible" guarantee.
And for some HW, even the application (or middleware) should care about it.
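
To illustrate the case where the application should care: a minimal sketch of
the hand-off between the core that polls the completion and a third core that
consumes the data, assuming the DMA-written payload is visible to the polling
core once rte_dma_completed() returns (which is exactly the guarantee under
discussion). The struct desc, ring[] and process_payload() names are invented
for this example; rte_dma_completed() and the __atomic_* builtins are real
DPDK/C11 facilities.

#include <stdbool.h>
#include <stdint.h>
#include <rte_dmadev.h>

struct desc {
	void *data;      /* buffer the DMA engine wrote into */
	uint32_t ready;  /* published by the polling core */
};

extern struct desc ring[];
extern void process_payload(void *data);

/* Runs on the core that polls the dmadev (core-B in the discussion above). */
static void
publish_completions(int16_t dev_id, uint16_t vchan)
{
	uint16_t idx;
	bool error;

	if (rte_dma_completed(dev_id, vchan, 1, &idx, &error) == 1)
		/* Release store: orders core-B's view of the completed data
		 * before the 'ready' flag for any core that acquires it. */
		__atomic_store_n(&ring[idx].ready, 1, __ATOMIC_RELEASE);
}

/* Runs on a different core (the 3rd core) that consumes the copied data. */
static void
consume(uint16_t idx)
{
	if (__atomic_load_n(&ring[idx].ready, __ATOMIC_ACQUIRE))
		process_payload(ring[idx].data);
}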

> 
> Thanks,
> /Bruce
> 
> .
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-05-13  8:52                   ` fengchengwen
@ 2022-05-13  9:10                     ` Bruce Richardson
  2022-05-13  9:48                       ` fengchengwen
  0 siblings, 1 reply; 58+ messages in thread
From: Bruce Richardson @ 2022-05-13  9:10 UTC (permalink / raw)
  To: fengchengwen
  Cc: Pai G, Sunil, Ilya Maximets, Radha Mohan Chintakuntla,
	Veerasenareddy Burru, Gagandeep Singh, Nipun Gupta, Stokes, Ian,
	Hu, Jiayu, Ferriter, Cian, Van Haaren, Harry,
	Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On Fri, May 13, 2022 at 04:52:10PM +0800, fengchengwen wrote:
> On 2022/4/8 14:29, Pai G, Sunil wrote:
> >> -----Original Message-----
> >> From: Richardson, Bruce <bruce.richardson@intel.com>
> >> Sent: Tuesday, April 5, 2022 5:38 PM
> >> To: Ilya Maximets <i.maximets@ovn.org>; Chengwen Feng
> >> <fengchengwen@huawei.com>; Radha Mohan Chintakuntla <radhac@marvell.com>;
> >> Veerasenareddy Burru <vburru@marvell.com>; Gagandeep Singh
> >> <g.singh@nxp.com>; Nipun Gupta <nipun.gupta@nxp.com>
> >> Cc: Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> >> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
> >> <cian.ferriter@intel.com>; Van Haaren, Harry <harry.van.haaren@intel.com>;
> >> Maxime Coquelin (maxime.coquelin@redhat.com) <maxime.coquelin@redhat.com>;
> >> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> >> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> >> Finn, Emma <emma.finn@intel.com>
> >> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>
> >> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
> >>> On 3/30/22 16:09, Bruce Richardson wrote:
> >>>> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
> >>>>> On 3/30/22 13:12, Bruce Richardson wrote:
> >>>>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
> >>>>>>> On 3/30/22 12:41, Ilya Maximets wrote:
> >>>>>>>> Forking the thread to discuss a memory consistency/ordering model.
> >>>>>>>>
> >>>>>>>> AFAICT, dmadev can be anything from part of a CPU to a
> >>>>>>>> completely separate PCI device.  However, I don't see any memory
> >>>>>>>> ordering being enforced or even described in the dmadev API or
> >> documentation.
> >>>>>>>> Please, point me to the correct documentation, if I somehow missed
> >> it.
> >>>>>>>>
> >>>>>>>> We have a DMA device (A) and a CPU core (B) writing respectively
> >>>>>>>> the data and the descriptor info.  CPU core (C) is reading the
> >>>>>>>> descriptor and the data it points too.
> >>>>>>>>
> >>>>>>>> A few things about that process:
> >>>>>>>>
> >>>>>>>> 1. There is no memory barrier between writes A and B (Did I miss
> >>>>>>>>    them?).  Meaning that those operations can be seen by C in a
> >>>>>>>>    different order regardless of barriers issued by C and
> >> regardless
> >>>>>>>>    of the nature of devices A and B.
> >>>>>>>>
> >>>>>>>> 2. Even if there is a write barrier between A and B, there is
> >>>>>>>>    no guarantee that C will see these writes in the same order
> >>>>>>>>    as C doesn't use real memory barriers because vhost
> >>>>>>>> advertises
> >>>>>>>
> >>>>>>> s/advertises/does not advertise/
> >>>>>>>
> >>>>>>>>    VIRTIO_F_ORDER_PLATFORM.
> >>>>>>>>
> >>>>>>>> So, I'm getting to conclusion that there is a missing write
> >>>>>>>> barrier on the vhost side and vhost itself must not advertise
> >>>>>>>> the
> >>>>>>>
> >>>>>>> s/must not/must/
> >>>>>>>
> >>>>>>> Sorry, I wrote things backwards. :)
> >>>>>>>
> >>>>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual
> >>>>>>>> memory barriers.
> >>>>>>>>
> >>>>>>>> Would like to hear some thoughts on that topic.  Is it a real
> >> issue?
> >>>>>>>> Is it an issue considering all possible CPU architectures and
> >>>>>>>> DMA HW variants?
> >>>>>>>>
> >>>>>>
> >>>>>> In terms of ordering of operations using dmadev:
> >>>>>>
> >>>>>> * Some DMA HW will perform all operations strictly in order e.g.
> >> Intel
> >>>>>>   IOAT, while other hardware may not guarantee order of
> >> operations/do
> >>>>>>   things in parallel e.g. Intel DSA. Therefore the dmadev API
> >> provides the
> >>>>>>   fence operation which allows the order to be enforced. The fence
> >> can be
> >>>>>>   thought of as a full memory barrier, meaning no jobs after the
> >> barrier can
> >>>>>>   be started until all those before it have completed. Obviously,
> >> for HW
> >>>>>>   where order is always enforced, this will be a no-op, but for
> >> hardware that
> >>>>>>   parallelizes, we want to reduce the fences to get best
> >> performance.
> >>>>>>
> >>>>>> * For synchronization between DMA devices and CPUs, where a CPU can
> >> only
> >>>>>>   write after a DMA copy has been done, the CPU must wait for the
> >> dma
> >>>>>>   completion to guarantee ordering. Once the completion has been
> >> returned
> >>>>>>   the completed operation is globally visible to all cores.
> >>>>>
> >>>>> Thanks for explanation!  Some questions though:
> >>>>>
> >>>>> In our case one CPU waits for completion and another CPU is
> >>>>> actually using the data.  IOW, "CPU must wait" is a bit ambiguous.
> >> Which CPU must wait?
> >>>>>
> >>>>> Or should it be "Once the completion is visible on any core, the
> >>>>> completed operation is globally visible to all cores." ?
> >>>>>
> >>>>
> >>>> The latter.
> >>>> Once the change to memory/cache is visible to any core, it is
> >>>> visible to all ones. This applies to regular CPU memory writes too -
> >>>> at least on IA, and I expect on many other architectures - once the
> >>>> write is visible outside the current core it is visible to every
> >>>> other core. Once the data hits the l1 or l2 cache of any core, any
> >>>> subsequent requests for that data from any other core will "snoop"
> >>>> the latest data from the cores cache, even if it has not made its
> >>>> way down to a shared cache, e.g. l3 on most IA systems.
> >>>
> >>> It sounds like you're referring to the "multicopy atomicity" of the
> >>> architecture.  However, that is not universally supported thing.
> >>> AFAICT, POWER and older ARM systems doesn't support it, so writes
> >>> performed by one core are not necessarily available to all other cores
> >>> at the same time.  That means that if the CPU0 writes the data and the
> >>> completion flag, CPU1 reads the completion flag and writes the ring,
> >>> CPU2 may see the ring write, but may still not see the write of the
> >>> data, even though there was a control dependency on CPU1.
> >>> There should be a full memory barrier on CPU1 in order to fulfill the
> >>> memory ordering requirements for CPU2, IIUC.
> >>>
> >>> In our scenario the CPU0 is a DMA device, which may or may not be part
> >>> of a CPU and may have different memory consistency/ordering
> >>> requirements.  So, the question is: does DPDK DMA API guarantee
> >>> multicopy atomicity between DMA device and all CPU cores regardless of
> >>> CPU architecture and a nature of the DMA device?
> >>>
> >>
> >> Right now, it doesn't because this never came up in discussion. In order
> >> to be useful, it sounds like it explicitly should do so. At least for the
> >> Intel ioat and idxd driver cases, this will be supported, so we just need
> >> to ensure all other drivers currently upstreamed can offer this too. If
> >> they cannot, we cannot offer it as a global guarantee, and we should see
> >> about adding a capability flag for this to indicate when the guarantee is
> >> there or not.
> >>
> >> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to
> >> document for dmadev that once a DMA operation is completed, the op is
> >> guaranteed visible to all cores/threads? If not, any thoughts on what
> >> guarantees we can provide in this regard, or what capabilities should be
> >> exposed?
> > 
> > 
> > 
> > Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh, @Nipun Gupta,
> > Requesting your valuable opinions for the queries on this thread.
> 
> Sorry for the late reply; I hadn't been following this thread.
>
> I don't think the DMA API should provide such a guarantee, because:
> 1. DMA is an acceleration device, just like an encryption/decryption device or a network device.
> 2. For the Hisilicon Kunpeng platform:
>    The DMA device supports:
>      a) IO coherency: it can read the latest data, which may still sit in a CPU cache, and on writes it
>         will invalidate the cached data and write the data to DDR.
>      b) Ordering within one request: it writes the completion descriptor only after the copy is done.
>         Note: ordering between multiple requests can be enforced through the fence mechanism.
>    The DMA driver only needs to:
>      a) add one write memory barrier (a lightweight mb) when ringing the doorbell.
>    So once the DMA is completed, the operation is guaranteed visible to all cores,
>    and a 3rd core will observe the right order: core-B prepares the data and issues the request to the
>    DMA, the DMA does the copy, core-B gets the completion status.
> 3. I worked on a TI multi-core SoC many years ago; the SoC didn't support cache coherence and consistency
>    between cores. The SoC also had a DMA device with many channels. Here is a hypothetical design of a
>    DMA driver for it within the DPDK DMA framework:
>    The DMA driver should:
>      a) write back the DMA's src buffer, so that no dirty cache lines remain while the DMA is running.
>      b) invalidate the DMA's dst buffer
>      c) do a full mb
>      d) update the DMA's registers.
>    Then the DMA will execute the copy task: it copies from DDR and writes to DDR, and after the copy it
>    sets its status register to completed.
>    In this case, the 3rd core will also observe the right order.
>    A particular point here: if one buffer is shared by multiple cores, the application should explicitly
>    maintain the cache.
>
> Based on the above, I don't think the DMA API should add this guarantee explicitly; it is the driver's,
> and sometimes even the application's (e.g. the above TI SoC), duty to ensure it.
>
Hi,

thanks for that. So if I understand correctly, your current HW does provide
this guarantee, but you don't think it should always be the case for
dmadev, correct?

Based on that, what do you think should be the guarantee on completion?
Once a job is completed, the completion is visible to the submitting core,
or the core reading the completion? Do you think it's acceptable to add a
capability flag for drivers to indicate that they do support a "globally
visible" guarantee?

Thanks,
/Bruce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-08  6:29                 ` Pai G, Sunil
@ 2022-05-13  8:52                   ` fengchengwen
  2022-05-13  9:10                     ` Bruce Richardson
  0 siblings, 1 reply; 58+ messages in thread
From: fengchengwen @ 2022-05-13  8:52 UTC (permalink / raw)
  To: Pai G, Sunil, Richardson, Bruce, Ilya Maximets,
	Radha Mohan Chintakuntla, Veerasenareddy Burru, Gagandeep Singh,
	Nipun Gupta
  Cc: Stokes, Ian, Hu, Jiayu, Ferriter, Cian, Van Haaren, Harry,
	Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On 2022/4/8 14:29, Pai G, Sunil wrote:
>> -----Original Message-----
>> From: Richardson, Bruce <bruce.richardson@intel.com>
>> Sent: Tuesday, April 5, 2022 5:38 PM
>> To: Ilya Maximets <i.maximets@ovn.org>; Chengwen Feng
>> <fengchengwen@huawei.com>; Radha Mohan Chintakuntla <radhac@marvell.com>;
>> Veerasenareddy Burru <vburru@marvell.com>; Gagandeep Singh
>> <g.singh@nxp.com>; Nipun Gupta <nipun.gupta@nxp.com>
>> Cc: Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
>> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
>> <cian.ferriter@intel.com>; Van Haaren, Harry <harry.van.haaren@intel.com>;
>> Maxime Coquelin (maxime.coquelin@redhat.com) <maxime.coquelin@redhat.com>;
>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>> Finn, Emma <emma.finn@intel.com>
>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>
>> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
>>> On 3/30/22 16:09, Bruce Richardson wrote:
>>>> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
>>>>> On 3/30/22 13:12, Bruce Richardson wrote:
>>>>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
>>>>>>> On 3/30/22 12:41, Ilya Maximets wrote:
>>>>>>>> Forking the thread to discuss a memory consistency/ordering model.
>>>>>>>>
>>>>>>>> AFAICT, dmadev can be anything from part of a CPU to a
>>>>>>>> completely separate PCI device.  However, I don't see any memory
>>>>>>>> ordering being enforced or even described in the dmadev API or
>> documentation.
>>>>>>>> Please, point me to the correct documentation, if I somehow missed
>> it.
>>>>>>>>
>>>>>>>> We have a DMA device (A) and a CPU core (B) writing respectively
>>>>>>>> the data and the descriptor info.  CPU core (C) is reading the
>>>>>>>> descriptor and the data it points too.
>>>>>>>>
>>>>>>>> A few things about that process:
>>>>>>>>
>>>>>>>> 1. There is no memory barrier between writes A and B (Did I miss
>>>>>>>>    them?).  Meaning that those operations can be seen by C in a
>>>>>>>>    different order regardless of barriers issued by C and
>> regardless
>>>>>>>>    of the nature of devices A and B.
>>>>>>>>
>>>>>>>> 2. Even if there is a write barrier between A and B, there is
>>>>>>>>    no guarantee that C will see these writes in the same order
>>>>>>>>    as C doesn't use real memory barriers because vhost
>>>>>>>> advertises
>>>>>>>
>>>>>>> s/advertises/does not advertise/
>>>>>>>
>>>>>>>>    VIRTIO_F_ORDER_PLATFORM.
>>>>>>>>
>>>>>>>> So, I'm getting to conclusion that there is a missing write
>>>>>>>> barrier on the vhost side and vhost itself must not advertise
>>>>>>>> the
>>>>>>>
>>>>>>> s/must not/must/
>>>>>>>
>>>>>>> Sorry, I wrote things backwards. :)
>>>>>>>
>>>>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual
>>>>>>>> memory barriers.
>>>>>>>>
>>>>>>>> Would like to hear some thoughts on that topic.  Is it a real
>> issue?
>>>>>>>> Is it an issue considering all possible CPU architectures and
>>>>>>>> DMA HW variants?
>>>>>>>>
>>>>>>
>>>>>> In terms of ordering of operations using dmadev:
>>>>>>
>>>>>> * Some DMA HW will perform all operations strictly in order e.g.
>> Intel
>>>>>>   IOAT, while other hardware may not guarantee order of
>> operations/do
>>>>>>   things in parallel e.g. Intel DSA. Therefore the dmadev API
>> provides the
>>>>>>   fence operation which allows the order to be enforced. The fence
>> can be
>>>>>>   thought of as a full memory barrier, meaning no jobs after the
>> barrier can
>>>>>>   be started until all those before it have completed. Obviously,
>> for HW
>>>>>>   where order is always enforced, this will be a no-op, but for
>> hardware that
>>>>>>   parallelizes, we want to reduce the fences to get best
>> performance.
>>>>>>
>>>>>> * For synchronization between DMA devices and CPUs, where a CPU can
>> only
>>>>>>   write after a DMA copy has been done, the CPU must wait for the
>> dma
>>>>>>   completion to guarantee ordering. Once the completion has been
>> returned
>>>>>>   the completed operation is globally visible to all cores.
>>>>>
>>>>> Thanks for explanation!  Some questions though:
>>>>>
>>>>> In our case one CPU waits for completion and another CPU is
>>>>> actually using the data.  IOW, "CPU must wait" is a bit ambiguous.
>> Which CPU must wait?
>>>>>
>>>>> Or should it be "Once the completion is visible on any core, the
>>>>> completed operation is globally visible to all cores." ?
>>>>>
>>>>
>>>> The latter.
>>>> Once the change to memory/cache is visible to any core, it is
>>>> visible to all ones. This applies to regular CPU memory writes too -
>>>> at least on IA, and I expect on many other architectures - once the
>>>> write is visible outside the current core it is visible to every
>>>> other core. Once the data hits the l1 or l2 cache of any core, any
>>>> subsequent requests for that data from any other core will "snoop"
>>>> the latest data from the cores cache, even if it has not made its
>>>> way down to a shared cache, e.g. l3 on most IA systems.
>>>
>>> It sounds like you're referring to the "multicopy atomicity" of the
>>> architecture.  However, that is not universally supported thing.
>>> AFAICT, POWER and older ARM systems doesn't support it, so writes
>>> performed by one core are not necessarily available to all other cores
>>> at the same time.  That means that if the CPU0 writes the data and the
>>> completion flag, CPU1 reads the completion flag and writes the ring,
>>> CPU2 may see the ring write, but may still not see the write of the
>>> data, even though there was a control dependency on CPU1.
>>> There should be a full memory barrier on CPU1 in order to fulfill the
>>> memory ordering requirements for CPU2, IIUC.
>>>
>>> In our scenario the CPU0 is a DMA device, which may or may not be part
>>> of a CPU and may have different memory consistency/ordering
>>> requirements.  So, the question is: does DPDK DMA API guarantee
>>> multicopy atomicity between DMA device and all CPU cores regardless of
>>> CPU architecture and a nature of the DMA device?
>>>
>>
>> Right now, it doesn't because this never came up in discussion. In order
>> to be useful, it sounds like it explicitly should do so. At least for the
>> Intel ioat and idxd driver cases, this will be supported, so we just need
>> to ensure all other drivers currently upstreamed can offer this too. If
>> they cannot, we cannot offer it as a global guarantee, and we should see
>> about adding a capability flag for this to indicate when the guarantee is
>> there or not.
>>
>> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to
>> document for dmadev that once a DMA operation is completed, the op is
>> guaranteed visible to all cores/threads? If not, any thoughts on what
>> guarantees we can provide in this regard, or what capabilities should be
>> exposed?
> 
> 
> 
> Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh, @Nipun Gupta,
> Requesting your valuable opinions for the queries on this thread.

Sorry for the late reply; I hadn't been following this thread.

I don't think the DMA API should provide such a guarantee, because:
1. DMA is an acceleration device, just like an encryption/decryption device or a network device.
2. For the Hisilicon Kunpeng platform:
   The DMA device supports:
     a) IO coherency: it can read the latest data, which may still sit in a CPU cache, and on writes it
        will invalidate the cached data and write the data to DDR.
     b) Ordering within one request: it writes the completion descriptor only after the copy is done.
        Note: ordering between multiple requests can be enforced through the fence mechanism.
   The DMA driver only needs to:
     a) add one write memory barrier (a lightweight mb) when ringing the doorbell.
   So once the DMA is completed, the operation is guaranteed visible to all cores,
   and a 3rd core will observe the right order: core-B prepares the data and issues the request to the
   DMA, the DMA does the copy, core-B gets the completion status.
3. I worked on a TI multi-core SoC many years ago; the SoC didn't support cache coherence and consistency
   between cores. The SoC also had a DMA device with many channels. Here is a hypothetical design of a
   DMA driver for it within the DPDK DMA framework:
   The DMA driver should:
     a) write back the DMA's src buffer, so that no dirty cache lines remain while the DMA is running.
     b) invalidate the DMA's dst buffer
     c) do a full mb
     d) update the DMA's registers.
   Then the DMA will execute the copy task: it copies from DDR and writes to DDR, and after the copy it
   sets its status register to completed.
   In this case, the 3rd core will also observe the right order.
   A particular point here: if one buffer is shared by multiple cores, the application should explicitly
   maintain the cache.
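
To make the hypothetical driver in point 3 concrete, a rough sketch of its
submit path, mirroring steps a) to d). Every soc_* name and the register
layout are invented purely for illustration; only rte_mb(), the DPDK full
memory barrier, is a real API here.

#include <stddef.h>
#include <stdint.h>
#include <rte_atomic.h>

struct soc_dma_regs { volatile uint64_t doorbell; };
struct soc_dma_chan { struct soc_dma_regs *regs; };

/* Platform-specific cache maintenance, assumed to exist on such an SoC. */
extern void soc_cache_writeback(const void *addr, size_t len);
extern void soc_cache_invalidate(const void *addr, size_t len);
extern uint64_t soc_make_desc(const void *src, void *dst, size_t len);

static void
soc_dma_copy_submit(struct soc_dma_chan *chan, const void *src, void *dst,
		    size_t len)
{
	soc_cache_writeback(src, len);   /* a) no dirty src lines left in cache */
	soc_cache_invalidate(dst, len);  /* b) no stale dst lines to read later */
	rte_mb();                        /* c) full barrier before the doorbell */
	chan->regs->doorbell =
		soc_make_desc(src, dst, len); /* d) update the DMA's registers */
}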

Based on the above, I don't think the DMA API should add this guarantee explicitly; it is the driver's,
and sometimes even the application's (e.g. the above TI SoC), duty to ensure it.

> 
> 
>>
>> /Bruce
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-05-03 19:38                                         ` Van Haaren, Harry
@ 2022-05-10 14:39                                           ` Van Haaren, Harry
  2022-05-24 12:12                                           ` Ilya Maximets
  1 sibling, 0 replies; 58+ messages in thread
From: Van Haaren, Harry @ 2022-05-10 14:39 UTC (permalink / raw)
  To: Ilya Maximets, Richardson, Bruce
  Cc: Mcnamara, John, Hu, Jiayu, Maxime Coquelin, Morten Brørup,
	Pai G, Sunil, Stokes, Ian, Ferriter, Cian, ovs-dev, dev,
	O'Driscoll, Tim, Finn, Emma

> -----Original Message-----
> From: Van Haaren, Harry
> Sent: Tuesday, May 3, 2022 8:38 PM
> To: Ilya Maximets <i.maximets@ovn.org>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Mcnamara, John <john.mcnamara@intel.com>; Hu, Jiayu <Jiayu.Hu@intel.com>;
> Maxime Coquelin <maxime.coquelin@redhat.com>; Morten Brørup
> <mb@smartsharesystems.com>; Pai G, Sunil <Sunil.Pai.G@intel.com>; Stokes, Ian
> <ian.stokes@intel.com>; Ferriter, Cian <Cian.Ferriter@intel.com>; ovs-
> dev@openvswitch.org; dev@dpdk.org; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> Finn, Emma <Emma.Finn@intel.com>
> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> 
> > -----Original Message-----
> > From: Ilya Maximets <i.maximets@ovn.org>
> > Sent: Thursday, April 28, 2022 2:00 PM
> > To: Richardson, Bruce <bruce.richardson@intel.com>
> > Cc: i.maximets@ovn.org; Mcnamara, John <john.mcnamara@intel.com>; Hu,
> Jiayu
> > <jiayu.hu@intel.com>; Maxime Coquelin <maxime.coquelin@redhat.com>; Van
> > Haaren, Harry <harry.van.haaren@intel.com>; Morten Brørup
> > <mb@smartsharesystems.com>; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> > <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
> > dev@openvswitch.org; dev@dpdk.org; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> > Finn, Emma <emma.finn@intel.com>
> > Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >
> > On 4/27/22 22:34, Bruce Richardson wrote:
> > > On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
> > >> On 4/20/22 18:41, Mcnamara, John wrote:
> > >>>> -----Original Message-----
> > >>>> From: Ilya Maximets <i.maximets@ovn.org>
> > >>>> Sent: Friday, April 8, 2022 10:58 AM
> > >>>> To: Hu, Jiayu <jiayu.hu@intel.com>; Maxime Coquelin
> > >>>> <maxime.coquelin@redhat.com>; Van Haaren, Harry
> > >>>> <harry.van.haaren@intel.com>; Morten Brørup
> > <mb@smartsharesystems.com>;
> > >>>> Richardson, Bruce <bruce.richardson@intel.com>
> > >>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> > >>>> <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
> > >>>> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> > >>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> > >>>> Finn, Emma <emma.finn@intel.com>
> > >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > >>>>
> > >>>> On 4/8/22 09:13, Hu, Jiayu wrote:
> > >>>>>
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Ilya Maximets <i.maximets@ovn.org>
> > >>>>>> Sent: Thursday, April 7, 2022 10:40 PM
> > >>>>>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
> > >>>>>> <harry.van.haaren@intel.com>; Morten Brørup
> > >>>>>> <mb@smartsharesystems.com>; Richardson, Bruce
> > >>>>>> <bruce.richardson@intel.com>
> > >>>>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes,
> > >>>>>> Ian <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter,
> > >>>>>> Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> > >>>>>> dev@dpdk.org; Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll,
> > >>>>>> Tim <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> > >>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > >>>>>>
> > >>>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
> > >>>>>>> Hi Harry,
> > >>>>>>>
> > >>>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
> > >>>>>>>> Hi OVS & DPDK, Maintainers & Community,
> > >>>>>>>>
> > >>>>>>>> Top posting overview of discussion as replies to thread become
> > >>>> slower:
> > >>>>>>>> perhaps it is a good time to review and plan for next steps?
> > >>>>>>>>
> > >>>>>>>>  From my perspective, it those most vocal in the thread seem to be
> > >>>>>>>> in favour of the clean rx/tx split ("defer work"), with the
> > >>>>>>>> tradeoff that the application must be aware of handling the async
> > >>>>>>>> DMA completions. If there are any concerns opposing upstreaming of
> > >>>>>>>> this
> > >>>>>> method, please indicate this promptly, and we can continue technical
> > >>>>>> discussions here now.
> > >>>>>>>
> > >>>>>>> Wasn't there some discussions about handling the Virtio completions
> > >>>>>>> with the DMA engine? With that, we wouldn't need the deferral of work.
> > >>>>>>
> > >>>>>> +1
> > >>>>>>
> > >>>>>> With the virtio completions handled by DMA itself, the vhost port
> > >>>>>> turns almost into a real HW NIC.  With that we will not need any
> > >>>>>> extra manipulations from the OVS side, i.e. no need to defer any work
> > >>>>>> while maintaining clear split between rx and tx operations.
> > >>>>>
> > >>>>> First, making DMA do 2B copy would sacrifice performance, and I think
> > >>>>> we all agree on that.
> > >>>>
> > >>>> I do not agree with that.  Yes, 2B copy by DMA will likely be slower than
> > >>>> done by CPU, however CPU is going away for dozens or even hundreds of
> > >>>> thousands of cycles to process a new packet batch or service other ports,
> > >>>> hence DMA will likely complete the transmission faster than waiting for
> > >>>> the CPU thread to come back to that task.  In any case, this has to be
> > >>>> tested.
> > >>>>
> > >>>>> Second, this method comes with an issue of ordering.
> > >>>>> For example, PMD thread0 enqueue 10 packets to vring0 first, then PMD
> > >>>>> thread1 enqueue 20 packets to vring0. If PMD thread0 and threa1 have
> > >>>>> own dedicated DMA device dma0 and dma1, flag/index update for the
> > >>>>> first 10 packets is done by dma0, and flag/index update for the left
> > >>>>> 20 packets is done by dma1. But there is no ordering guarantee among
> > >>>>> different DMA devices, so flag/index update may error. If PMD threads
> > >>>>> don't have dedicated DMA devices, which means DMA devices are shared
> > >>>>> among threads, we need lock and pay for lock contention in data-path.
> > >>>>> Or we can allocate DMA devices for vring dynamically to avoid DMA
> > >>>>> sharing among threads. But what's the overhead of allocation mechanism?
> > >>>> Who does it? Any thoughts?
> > >>>>
> > >>>> 1. DMA completion was discussed in context of per-queue allocation, so
> > >>>> there
> > >>>>    is no re-ordering in this case.
> > >>>>
> > >>>> 2. Overhead can be minimal if allocated device can stick to the queue for
> > >>>> a
> > >>>>    reasonable amount of time without re-allocation on every send.  You may
> > >>>>    look at XPS implementation in lib/dpif-netdev.c in OVS for example of
> > >>>>    such mechanism.  For sure it can not be the same, but ideas can be re-
> > >>>> used.
> > >>>>
> > >>>> 3. Locking doesn't mean contention if resources are allocated/distributed
> > >>>>    thoughtfully.
> > >>>>
> > >>>> 4. Allocation can be done be either OVS or vhost library itself, I'd vote
> > >>>>    for doing that inside the vhost library, so any DPDK application and
> > >>>>    vhost ethdev can use it without re-inventing from scratch.  It also
> > >>>> should
> > >>>>    be simpler from the API point of view if allocation and usage are in
> > >>>>    the same place.  But I don't have a strong opinion here as for now,
> > >>>> since
> > >>>>    no real code examples exist, so it's hard to evaluate how they could
> > >>>> look
> > >>>>    like.
> > >>>>
> > >>>> But I feel like we're starting to run in circles here as I did already say
> > >>>> most of that before.
> > >>>
> > >>>
> > >>
> > >> Hi, John.
> > >>
> > >> Just reading this email as I was on PTO for a last 1.5 weeks
> > >> and didn't get through all the emails yet.
> > >>
> > >>> This does seem to be going in circles, especially since there seemed to be
> > technical alignment on the last public call on March 29th.
> > >>
> > >> I guess, there is a typo in the date here.
> > >> It seems to be 26th, not 29th.
> > >>
> > >>> It is not feasible to do a real world implementation/POC of every design
> > proposal.
> > >>
> > >> FWIW, I think it makes sense to PoC and test options that are
> > >> going to be simply unavailable going forward if not explored now.
> > >> Especially because we don't have any good solutions anyway
> > >> ("Deferral of Work" is architecturally wrong solution for OVS).
> > >>
> > >
> > > Hi Ilya,
> > >
> > > for those of us who haven't spent a long time working on OVS, can you
> > > perhaps explain a bit more as to why it is architecturally wrong? From my
> > > experience with DPDK, use of any lookaside accelerator, not just DMA but
> > > any crypto, compression or otherwise, requires asynchronous operation, and
> > > therefore some form of setting work aside temporarily to do other tasks.
> >
> > OVS doesn't use any lookaside accelerators and doesn't have any
> > infrastructure for them.
> >
> >
> > Let me create a DPDK analogy of what is proposed for OVS.
> >
> > DPDK has an ethdev API that abstracts different device drivers for
> > the application.  This API has a rte_eth_tx_burst() function that
> > is supposed to send packets through the particular network interface.
> >
> > Imagine now that there is a network card that is not capable of
> > sending packets right away and requires the application to come
> > back later to finish the operation.  That is an obvious problem,
> > because rte_eth_tx_burst() doesn't require any extra actions and
> > doesn't take ownership of packets that wasn't consumed.
> >
> > The proposed solution for this problem is to change the ethdev API:
> >
> > 1. Allow rte_eth_tx_burst() to return -EINPROGRESS that effectively
> >    means that packets was acknowledged, but not actually sent yet.
> >
> > 2. Require the application to call the new rte_eth_process_async()
> >    function sometime later until it doesn't return -EINPROGRESS
> >    anymore, in case the original rte_eth_tx_burst() call returned
> >    -EINPROGRESS.
> >
> > The main reason why this proposal is questionable:
> >
> > It's only one specific device that requires this special handling,
> > all other devices are capable of sending packets right away.
> > However, every DPDK application now has to implement some kind
> > of "Deferral of Work" mechanism in order to be compliant with
> > the updated DPDK ethdev API.
> >
> > Will DPDK make this API change?
> > I have no voice in DPDK API design decisions, but I'd argue against.
> >
> > Interestingly, that's not really an imaginary proposal.  That is
> > an exact change required for DPDK ethdev API in order to add
> > vhost async support to the vhost ethdev driver.
> >
> > Going back to OVS:
> >
> > An oversimplified architecture of OVS has 3 layers (top to bottom):
> >
> > 1. OFproto - the layer that handles OpenFlow.
> > 2. Datapath Interface - packet processing.
> > 3. Netdev - abstraction on top of all the different port types.
> >
> > Each layer has it's own API that allows different implementations
> > of the same layer to be used interchangeably without any modifications
> > to higher layers.  That's what APIs and encapsulation is for.
> >
> > So, Netdev layer has it's own API and this API is actually very
> > similar to the DPDK's ethdev API.  Simply because they are serving
> > the same purpose - abstraction on top of different network interfaces.
> > Beside different types of DPDK ports, there are also several types
> > of native linux, bsd and windows ports, variety of different tunnel
> > ports.
> >
> > Datapath interface layer is an "application" from the ethdev analogy
> > above.
> >
> > What is proposed by "Deferral of Work" solution is to make pretty
> > much the same API change that I described, but to netdev layer API
> > inside the OVS, and introduce a fairly complex (and questionable,
> > but I'm not going into that right now) machinery to handle that API
> > change into the datapath interface layer.
> >
> > So, exactly the same problem is here:
> >
> > If the API change is needed only for a single port type in a very
> > specific hardware environment, why we need to change the common
> > API and rework a lot of the code in upper layers in order to accommodate
> > that API change, while it makes no practical sense for any other
> > port types or more generic hardware setups?
> > And similar changes will have to be done in any other DPDK application
> > that is not bound to a specific hardware, but wants to support vhost
> > async.
> >
> > The right solution, IMO, is to make vhost async behave as any other
> > physical NIC, since it is essentially a physical NIC now (we're not
> > using DMA directly, it's a combined vhost+DMA solution), instead of
> > propagating quirks of the single device to a common API.
> >
> > And going back to DPDK, this implementation doesn't allow use of
> > vhost async in the DPDK's own vhost ethdev driver.
> >
> > My initial reply to the "Deferral of Work" RFC with pretty much
> > the same concerns:
> >
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210907111725.43672-
> > 2-cian.ferriter@intel.com/#2751799
> >
> > Best regards, Ilya Maximets.
> 
> 
> Hi Ilya,
> 
> Thanks for replying in more detail, understanding your perspective here helps to
> communicate the various solutions benefits and drawbacks. Agreed the
> OfProto/Dpif/Netdev
> abstraction layers are strong abstractions in OVS, and in general they serve their
> purpose.
> 
> A key difference between OVS's usage of DPDK Ethdev TX and VHost TX is that the
> performance
> of each is very different: as you know, sending a 1500 byte packet over a physical
> NIC, or via
> VHost into a guest has a very different CPU cycle cost. Typically DPDK Tx takes ~5%
> CPU cycles
> while vhost copies are often ~30%, but can be > 50% in certain packet-
> sizes/configurations.
> 
> Let's view the performance of the above example from the perspective of an actual
> deployment: OVS is
> very often deployed to provide an accelerated packet interface to a guest/VM via
> Vhost/Virtio.
> Surely improving performance of this primary use-case is a valid reason to consider
> changes and
> improvements to an internal abstraction layer in OVS?
> 
> Today DPDK tx and vhost tx are called via the same netdev abstraction, but we must
> ask the questions:
> 	- Is the netdev abstraction really the best it can be?
> 	- Does adding an optional "async" feature to the abstraction improve
> performance significantly? (positive from including?)
> 	- Does adding the optional async feature cause actual degradation in DPIF
> implementations that don't support/use it? (negative due to including?)
> 
> Of course strong abstractions are valuable, and of course changing them requires
> careful thought.
> But let's be clear - It is probably fair to say that OVS is not deployed because it has
> good abstractions internally.
> It is deployed because it is useful, and serves the need of an end-user. And part of
> the end-user needs is performance.
> 
> The suggestion of integrating "Defer Work" method of exposing async in the OVS
> Datapath is well thought out,
> and a clean way of handling async work in a per-thread manner at the application
> layer. It is the most common way of integrating
> lookaside acceleration in software pipelines, and handling the async work at
> application thread level is the only logical place where
> the programmer can reason about tradeoffs for a specific use-case. Adding "dma
> acceleration to Vhost" will inevitably lead to
> compromises in the DPDK implementation, and ones that might (or might not) work
> for OVS and other apps.
> 
> As you know, there have been OVS Conference presentations[1][2], RFCs and
> POCs[3][4][5][6], and community calls[7][8][9] on the topic.
> In the various presentations, the benefits of using application-level deferral of work
> are highlighted, and compared to other implementations
> which have non-desirable side-effects. We haven't heard any objections that people
> won't use OVS if the netdev abstraction is changed.
> 
> It seems there is a trade-off decision to be made;
> 	A) Change/improve the netdev abstraction to allow for async accelerations,
> and pay the cost of added app layer complexity
> 	B) Demand dma-acceleration is pushed down into vhost & below (as netdev
> abstraction is not going to be changed),
> 	    resulting in sub-par and potentially unusable code for any given app, as
> lower-layers cannot reason about app level specifics
> 
> How can we the OVS/DPDK developers and users make a decision here?

Ping on this topic - there's an ask here to find how best to move forward, so
input is welcome from everyone, and specifically from OVS maintainers & tech leaders.


> Regards, -Harry
> 
> [1] https://www.openvswitch.org/support/ovscon2020/#C3
> [2] https://www.openvswitch.org/support/ovscon2021/#T12
> [3] rawdev;
> https://patchwork.ozlabs.org/project/openvswitch/patch/20201023094845.35652-
> 2-sunil.pai.g@intel.com/
> [4] defer work;
> http://patchwork.ozlabs.org/project/openvswitch/list/?series=261267&state=*
> [5] v3;
> http://patchwork.ozlabs.org/project/openvswitch/patch/20220104125242.1064162
> -2-sunil.pai.g@intel.com/
> [6] v4;
> http://patchwork.ozlabs.org/project/openvswitch/patch/20220321173640.326795-
> 2-sunil.pai.g@intel.com/
> [7] Slides session 1; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-
> share/raw/main/OVS%20vhost%20async%20datapath%20design%202022.pdf
> [8] Slides session 2; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-
> share/raw/main/OVS%20vhost%20async%20datapath%20design%202022%20sessio
> n%202.pdf
> [9] Slides session 3; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-
> share/raw/main/ovs_datapath_design_2022%20session%203.pdf


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-28 12:59                                       ` Ilya Maximets
  2022-04-28 13:55                                         ` Bruce Richardson
@ 2022-05-03 19:38                                         ` Van Haaren, Harry
  2022-05-10 14:39                                           ` Van Haaren, Harry
  2022-05-24 12:12                                           ` Ilya Maximets
  1 sibling, 2 replies; 58+ messages in thread
From: Van Haaren, Harry @ 2022-05-03 19:38 UTC (permalink / raw)
  To: Ilya Maximets, Richardson, Bruce
  Cc: Mcnamara, John, Hu, Jiayu, Maxime Coquelin, Morten Brørup,
	Pai G, Sunil, Stokes, Ian, Ferriter, Cian, ovs-dev, dev,
	O'Driscoll, Tim, Finn, Emma

> -----Original Message-----
> From: Ilya Maximets <i.maximets@ovn.org>
> Sent: Thursday, April 28, 2022 2:00 PM
> To: Richardson, Bruce <bruce.richardson@intel.com>
> Cc: i.maximets@ovn.org; Mcnamara, John <john.mcnamara@intel.com>; Hu, Jiayu
> <jiayu.hu@intel.com>; Maxime Coquelin <maxime.coquelin@redhat.com>; Van
> Haaren, Harry <harry.van.haaren@intel.com>; Morten Brørup
> <mb@smartsharesystems.com>; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
> dev@openvswitch.org; dev@dpdk.org; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> Finn, Emma <emma.finn@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> 
> On 4/27/22 22:34, Bruce Richardson wrote:
> > On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
> >> On 4/20/22 18:41, Mcnamara, John wrote:
> >>>> -----Original Message-----
> >>>> From: Ilya Maximets <i.maximets@ovn.org>
> >>>> Sent: Friday, April 8, 2022 10:58 AM
> >>>> To: Hu, Jiayu <jiayu.hu@intel.com>; Maxime Coquelin
> >>>> <maxime.coquelin@redhat.com>; Van Haaren, Harry
> >>>> <harry.van.haaren@intel.com>; Morten Brørup
> <mb@smartsharesystems.com>;
> >>>> Richardson, Bruce <bruce.richardson@intel.com>
> >>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> >>>> <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
> >>>> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> >>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> >>>> Finn, Emma <emma.finn@intel.com>
> >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>> On 4/8/22 09:13, Hu, Jiayu wrote:
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Ilya Maximets <i.maximets@ovn.org>
> >>>>>> Sent: Thursday, April 7, 2022 10:40 PM
> >>>>>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
> >>>>>> <harry.van.haaren@intel.com>; Morten Brørup
> >>>>>> <mb@smartsharesystems.com>; Richardson, Bruce
> >>>>>> <bruce.richardson@intel.com>
> >>>>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes,
> >>>>>> Ian <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter,
> >>>>>> Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> >>>>>> dev@dpdk.org; Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll,
> >>>>>> Tim <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> >>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>>>
> >>>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
> >>>>>>> Hi Harry,
> >>>>>>>
> >>>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
> >>>>>>>> Hi OVS & DPDK, Maintainers & Community,
> >>>>>>>>
> >>>>>>>> Top posting overview of discussion as replies to thread become
> >>>> slower:
> >>>>>>>> perhaps it is a good time to review and plan for next steps?
> >>>>>>>>
> >>>>>>>>  From my perspective, it those most vocal in the thread seem to be
> >>>>>>>> in favour of the clean rx/tx split ("defer work"), with the
> >>>>>>>> tradeoff that the application must be aware of handling the async
> >>>>>>>> DMA completions. If there are any concerns opposing upstreaming of
> >>>>>>>> this
> >>>>>> method, please indicate this promptly, and we can continue technical
> >>>>>> discussions here now.
> >>>>>>>
> >>>>>>> Wasn't there some discussions about handling the Virtio completions
> >>>>>>> with the DMA engine? With that, we wouldn't need the deferral of work.
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> With the virtio completions handled by DMA itself, the vhost port
> >>>>>> turns almost into a real HW NIC.  With that we will not need any
> >>>>>> extra manipulations from the OVS side, i.e. no need to defer any work
> >>>>>> while maintaining clear split between rx and tx operations.
> >>>>>
> >>>>> First, making DMA do 2B copy would sacrifice performance, and I think
> >>>>> we all agree on that.
> >>>>
> >>>> I do not agree with that.  Yes, 2B copy by DMA will likely be slower than
> >>>> done by CPU, however CPU is going away for dozens or even hundreds of
> >>>> thousands of cycles to process a new packet batch or service other ports,
> >>>> hence DMA will likely complete the transmission faster than waiting for
> >>>> the CPU thread to come back to that task.  In any case, this has to be
> >>>> tested.
> >>>>
> >>>>> Second, this method comes with an issue of ordering.
> >>>>> For example, PMD thread0 enqueue 10 packets to vring0 first, then PMD
> >>>>> thread1 enqueue 20 packets to vring0. If PMD thread0 and threa1 have
> >>>>> own dedicated DMA device dma0 and dma1, flag/index update for the
> >>>>> first 10 packets is done by dma0, and flag/index update for the left
> >>>>> 20 packets is done by dma1. But there is no ordering guarantee among
> >>>>> different DMA devices, so flag/index update may error. If PMD threads
> >>>>> don't have dedicated DMA devices, which means DMA devices are shared
> >>>>> among threads, we need lock and pay for lock contention in data-path.
> >>>>> Or we can allocate DMA devices for vring dynamically to avoid DMA
> >>>>> sharing among threads. But what's the overhead of allocation mechanism?
> >>>> Who does it? Any thoughts?
> >>>>
> >>>> 1. DMA completion was discussed in context of per-queue allocation, so
> >>>> there
> >>>>    is no re-ordering in this case.
> >>>>
> >>>> 2. Overhead can be minimal if allocated device can stick to the queue for
> >>>> a
> >>>>    reasonable amount of time without re-allocation on every send.  You may
> >>>>    look at XPS implementation in lib/dpif-netdev.c in OVS for example of
> >>>>    such mechanism.  For sure it can not be the same, but ideas can be re-
> >>>> used.
> >>>>
> >>>> 3. Locking doesn't mean contention if resources are allocated/distributed
> >>>>    thoughtfully.
> >>>>
> >>>> 4. Allocation can be done be either OVS or vhost library itself, I'd vote
> >>>>    for doing that inside the vhost library, so any DPDK application and
> >>>>    vhost ethdev can use it without re-inventing from scratch.  It also
> >>>> should
> >>>>    be simpler from the API point of view if allocation and usage are in
> >>>>    the same place.  But I don't have a strong opinion here as for now,
> >>>> since
> >>>>    no real code examples exist, so it's hard to evaluate how they could
> >>>> look
> >>>>    like.
> >>>>
> >>>> But I feel like we're starting to run in circles here as I did already say
> >>>> most of that before.
> >>>
> >>>
> >>
> >> Hi, John.
> >>
> >> Just reading this email as I was on PTO for a last 1.5 weeks
> >> and didn't get through all the emails yet.
> >>
> >>> This does seem to be going in circles, especially since there seemed to be
> technical alignment on the last public call on March 29th.
> >>
> >> I guess, there is a typo in the date here.
> >> It seems to be 26th, not 29th.
> >>
> >>> It is not feasible to do a real world implementation/POC of every design
> proposal.
> >>
> >> FWIW, I think it makes sense to PoC and test options that are
> >> going to be simply unavailable going forward if not explored now.
> >> Especially because we don't have any good solutions anyway
> >> ("Deferral of Work" is architecturally wrong solution for OVS).
> >>
> >
> > Hi Ilya,
> >
> > for those of us who haven't spent a long time working on OVS, can you
> > perhaps explain a bit more as to why it is architecturally wrong? From my
> > experience with DPDK, use of any lookaside accelerator, not just DMA but
> > any crypto, compression or otherwise, requires asynchronous operation, and
> > therefore some form of setting work aside temporarily to do other tasks.
> 
> OVS doesn't use any lookaside accelerators and doesn't have any
> infrastructure for them.
> 
> 
> Let me create a DPDK analogy of what is proposed for OVS.
> 
> DPDK has an ethdev API that abstracts different device drivers for
> the application.  This API has a rte_eth_tx_burst() function that
> is supposed to send packets through the particular network interface.
> 
> Imagine now that there is a network card that is not capable of
> sending packets right away and requires the application to come
> back later to finish the operation.  That is an obvious problem,
> because rte_eth_tx_burst() doesn't require any extra actions and
> doesn't take ownership of packets that wasn't consumed.
> 
> The proposed solution for this problem is to change the ethdev API:
> 
> 1. Allow rte_eth_tx_burst() to return -EINPROGRESS that effectively
>    means that packets was acknowledged, but not actually sent yet.
> 
> 2. Require the application to call the new rte_eth_process_async()
>    function sometime later until it doesn't return -EINPROGRESS
>    anymore, in case the original rte_eth_tx_burst() call returned
>    -EINPROGRESS.
> 
> The main reason why this proposal is questionable:
> 
> It's only one specific device that requires this special handling,
> all other devices are capable of sending packets right away.
> However, every DPDK application now has to implement some kind
> of "Deferral of Work" mechanism in order to be compliant with
> the updated DPDK ethdev API.
> 
> Will DPDK make this API change?
> I have no voice in DPDK API design decisions, but I'd argue against.
> 
> Interestingly, that's not really an imaginary proposal.  That is
> an exact change required for DPDK ethdev API in order to add
> vhost async support to the vhost ethdev driver.
> 
> Going back to OVS:
> 
> An oversimplified architecture of OVS has 3 layers (top to bottom):
> 
> 1. OFproto - the layer that handles OpenFlow.
> 2. Datapath Interface - packet processing.
> 3. Netdev - abstraction on top of all the different port types.
> 
> Each layer has its own API that allows different implementations
> of the same layer to be used interchangeably without any modifications
> to higher layers.  That's what APIs and encapsulation are for.
> 
> So, the Netdev layer has its own API, and this API is actually very
> similar to DPDK's ethdev API, simply because they serve
> the same purpose - abstraction on top of different network interfaces.
> Besides different types of DPDK ports, there are also several types
> of native Linux, BSD and Windows ports, and a variety of different
> tunnel ports.
> 
> Datapath interface layer is an "application" from the ethdev analogy
> above.
> 
> What is proposed by "Deferral of Work" solution is to make pretty
> much the same API change that I described, but to netdev layer API
> inside the OVS, and introduce a fairly complex (and questionable,
> but I'm not going into that right now) machinery to handle that API
> change into the datapath interface layer.
> 
> So, exactly the same problem is here:
> 
> If the API change is needed only for a single port type in a very
> specific hardware environment, why do we need to change the common
> API and rework a lot of the code in upper layers in order to accommodate
> that API change, while it makes no practical sense for any other
> port types or more generic hardware setups?
> And similar changes will have to be made in any other DPDK application
> that is not bound to specific hardware, but wants to support vhost
> async.
> 
> The right solution, IMO, is to make vhost async behave like any other
> physical NIC, since it is essentially a physical NIC now (we're not
> using DMA directly, it's a combined vhost+DMA solution), instead of
> propagating quirks of a single device to a common API.
> 
> And going back to DPDK, this implementation doesn't allow use of
> vhost async in the DPDK's own vhost ethdev driver.
> 
> My initial reply to the "Deferral of Work" RFC with pretty much
> the same concerns:
>   https://patchwork.ozlabs.org/project/openvswitch/patch/20210907111725.43672-2-cian.ferriter@intel.com/#2751799
> 
> Best regards, Ilya Maximets.


Hi Ilya,

Thanks for replying in more detail; understanding your perspective here helps to
communicate the benefits and drawbacks of the various solutions. Agreed, the OFproto/Dpif/Netdev
abstraction layers are strong abstractions in OVS, and in general they serve their purpose.

A key difference between OVS's usage of DPDK ethdev Tx and vhost Tx is their CPU cost: as you know,
sending a 1500 byte packet over a physical NIC has a very different cycle cost from copying it via
vhost into a guest. Typically DPDK Tx takes ~5% of CPU cycles, while vhost copies are often ~30%
and can be > 50% for certain packet sizes/configurations.

Let's view the performance of the above example from the perspective of an actual deployment: OVS is
very often deployed to provide an accelerated packet interface to a guest/VM via Vhost/Virtio.
Surely improving performance of this primary use-case is a valid reason to consider changes and 
improvements to an internal abstraction layer in OVS?

Today DPDK Tx and vhost Tx are called via the same netdev abstraction, but we must ask the questions:
	- Is the netdev abstraction really the best it can be?
	- Does adding an optional "async" feature to the abstraction improve performance significantly? (the upside of including it)
	- Does adding the optional async feature cause actual degradation in DPIF implementations that don't support/use it? (the downside of including it)
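
To make the second and third questions concrete, here is a minimal sketch in C of what an optional async hook could look like; the type and member names (example_netdev_class, process_async) are invented for illustration and are not the actual OVS netdev provider API:

/* Hypothetical illustration only: example_netdev_class and
 * process_async are invented names, not the OVS netdev provider API. */
struct example_netdev;
struct example_packet_batch;

struct example_netdev_class {
    /* Existing synchronous send operation. */
    int (*send)(struct example_netdev *dev, int qid,
                struct example_packet_batch *batch);
    /* Optional: reclaim asynchronously offloaded (e.g. DMA) transmits.
     * Port types that complete sends synchronously leave this NULL. */
    int (*process_async)(struct example_netdev *dev, int qid);
};

static inline void
example_process_async(const struct example_netdev_class *class,
                      struct example_netdev *dev, int qid)
{
    /* A datapath that never configures an async-capable port pays only
     * this NULL check. */
    if (class->process_async) {
        class->process_async(dev, qid);
    }
}

Whether such a NULL-checked hook really leaves non-async DPIF implementations unaffected is exactly the third question above.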

Of course strong abstractions are valuable, and of course changing them requires careful thought.
But let's be clear - it is probably fair to say that OVS is not deployed because it has good abstractions internally.
It is deployed because it is useful and serves the needs of end-users. And part of those needs is performance.

The suggested "Defer Work" method of exposing async in the OVS datapath is well thought out,
and a clean way of handling async work in a per-thread manner at the application layer. It is the most common way of integrating
lookaside acceleration in software pipelines, and the application thread is the only logical place where
the programmer can reason about tradeoffs for a specific use-case. Adding "DMA acceleration to vhost" will inevitably lead to
compromises in the DPDK implementation, ones that might (or might not) work for OVS and other apps.
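
As a rough per-thread illustration (not the actual "Defer Work" patches; the bookkeeping structure and function names are invented), reclaiming completions from the owning PMD thread with the DPDK dmadev API could look like this:

#include <stdbool.h>
#include <stdint.h>
#include <rte_dmadev.h>

/* Invented per-thread bookkeeping; the actual patches track in-flight
 * copies per tx queue inside the datapath. */
struct dma_inflight {
    int16_t  dma_id;       /* dmadev assigned to this PMD thread */
    uint16_t vchan;
    uint16_t nb_inflight;  /* copies enqueued but not yet reclaimed */
};

/* Called once per iteration of the PMD thread's main loop: reclaim
 * whatever the DMA engine has finished so far, without blocking. */
static void
pmd_drain_dma(struct dma_inflight *st)
{
    bool error = false;
    uint16_t done;

    if (st->nb_inflight == 0) {
        return;
    }

    done = rte_dma_completed(st->dma_id, st->vchan, st->nb_inflight,
                             NULL, &error);
    st->nb_inflight -= done;
    /* On 'error' the real code would have to complete the affected
     * packets by CPU copy or free them. */
}

Because each PMD thread drains only the dmadev/vchan it owns, this path needs no locking.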

As you know, there have been OVS Conference presentations[1][2], RFCs and POCs[3][4][5][6], and community calls[7][8][9] on the topic.
In the various presentations, the benefits of using application-level deferral of work are highlighted and compared to other implementations
which have undesirable side-effects. We haven't heard any objections that people won't use OVS if the netdev abstraction is changed.

It seems there is a trade-off decision to be made:
	A) Change/improve the netdev abstraction to allow for async acceleration, and pay the cost of added app-layer complexity
	B) Demand that DMA acceleration be pushed down into vhost & below (as the netdev abstraction is not going to be changed),
	    resulting in sub-par and potentially unusable code for any given app, as lower layers cannot reason about app-level specifics

How can we, the OVS/DPDK developers and users, make a decision here?

Regards, -Harry

[1] https://www.openvswitch.org/support/ovscon2020/#C3
[2] https://www.openvswitch.org/support/ovscon2021/#T12
[3] rawdev; https://patchwork.ozlabs.org/project/openvswitch/patch/20201023094845.35652-2-sunil.pai.g@intel.com/
[4] defer work; http://patchwork.ozlabs.org/project/openvswitch/list/?series=261267&state=*
[5] v3; http://patchwork.ozlabs.org/project/openvswitch/patch/20220104125242.1064162-2-sunil.pai.g@intel.com/
[6] v4; http://patchwork.ozlabs.org/project/openvswitch/patch/20220321173640.326795-2-sunil.pai.g@intel.com/
[7] Slides session 1; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/OVS%20vhost%20async%20datapath%20design%202022.pdf
[8] Slides session 2; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/OVS%20vhost%20async%20datapath%20design%202022%20session%202.pdf
[9] Slides session 3; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/ovs_datapath_design_2022%20session%203.pdf


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-28 12:59                                       ` Ilya Maximets
@ 2022-04-28 13:55                                         ` Bruce Richardson
  2022-05-03 19:38                                         ` Van Haaren, Harry
  1 sibling, 0 replies; 58+ messages in thread
From: Bruce Richardson @ 2022-04-28 13:55 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Mcnamara, John, Hu, Jiayu, Maxime Coquelin, Van Haaren, Harry,
	Morten Brørup, Pai G, Sunil, Stokes, Ian, Ferriter, Cian,
	ovs-dev, dev, O'Driscoll, Tim, Finn, Emma

On Thu, Apr 28, 2022 at 02:59:37PM +0200, Ilya Maximets wrote:
> On 4/27/22 22:34, Bruce Richardson wrote:
> > On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
> >> On 4/20/22 18:41, Mcnamara, John wrote:
> >>>> -----Original Message-----
> >>>> From: Ilya Maximets <i.maximets@ovn.org>
> >>>> Sent: Friday, April 8, 2022 10:58 AM
> >>>> To: Hu, Jiayu <jiayu.hu@intel.com>; Maxime Coquelin
> >>>> <maxime.coquelin@redhat.com>; Van Haaren, Harry
> >>>> <harry.van.haaren@intel.com>; Morten Brørup <mb@smartsharesystems.com>;
> >>>> Richardson, Bruce <bruce.richardson@intel.com>
> >>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> >>>> <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
> >>>> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> >>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> >>>> Finn, Emma <emma.finn@intel.com>
> >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>> On 4/8/22 09:13, Hu, Jiayu wrote:
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Ilya Maximets <i.maximets@ovn.org>
> >>>>>> Sent: Thursday, April 7, 2022 10:40 PM
> >>>>>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
> >>>>>> <harry.van.haaren@intel.com>; Morten Brørup
> >>>>>> <mb@smartsharesystems.com>; Richardson, Bruce
> >>>>>> <bruce.richardson@intel.com>
> >>>>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes,
> >>>>>> Ian <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter,
> >>>>>> Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> >>>>>> dev@dpdk.org; Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll,
> >>>>>> Tim <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> >>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>>>
> >>>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
> >>>>>>> Hi Harry,
> >>>>>>>
> >>>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
> >>>>>>>> Hi OVS & DPDK, Maintainers & Community,
> >>>>>>>>
> >>>>>>>> Top posting overview of discussion as replies to thread become
> >>>> slower:
> >>>>>>>> perhaps it is a good time to review and plan for next steps?
> >>>>>>>>
> >>>>>>>>  From my perspective, it those most vocal in the thread seem to be
> >>>>>>>> in favour of the clean rx/tx split ("defer work"), with the
> >>>>>>>> tradeoff that the application must be aware of handling the async
> >>>>>>>> DMA completions. If there are any concerns opposing upstreaming of
> >>>>>>>> this
> >>>>>> method, please indicate this promptly, and we can continue technical
> >>>>>> discussions here now.
> >>>>>>>
> >>>>>>> Wasn't there some discussions about handling the Virtio completions
> >>>>>>> with the DMA engine? With that, we wouldn't need the deferral of work.
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> With the virtio completions handled by DMA itself, the vhost port
> >>>>>> turns almost into a real HW NIC.  With that we will not need any
> >>>>>> extra manipulations from the OVS side, i.e. no need to defer any work
> >>>>>> while maintaining clear split between rx and tx operations.
> >>>>>
> >>>>> First, making DMA do 2B copy would sacrifice performance, and I think
> >>>>> we all agree on that.
> >>>>
> >>>> I do not agree with that.  Yes, 2B copy by DMA will likely be slower than
> >>>> done by CPU, however CPU is going away for dozens or even hundreds of
> >>>> thousands of cycles to process a new packet batch or service other ports,
> >>>> hence DMA will likely complete the transmission faster than waiting for
> >>>> the CPU thread to come back to that task.  In any case, this has to be
> >>>> tested.
> >>>>
> >>>>> Second, this method comes with an issue of ordering.
> >>>>> For example, PMD thread0 enqueue 10 packets to vring0 first, then PMD
> >>>>> thread1 enqueue 20 packets to vring0. If PMD thread0 and threa1 have
> >>>>> own dedicated DMA device dma0 and dma1, flag/index update for the
> >>>>> first 10 packets is done by dma0, and flag/index update for the left
> >>>>> 20 packets is done by dma1. But there is no ordering guarantee among
> >>>>> different DMA devices, so flag/index update may error. If PMD threads
> >>>>> don't have dedicated DMA devices, which means DMA devices are shared
> >>>>> among threads, we need lock and pay for lock contention in data-path.
> >>>>> Or we can allocate DMA devices for vring dynamically to avoid DMA
> >>>>> sharing among threads. But what's the overhead of allocation mechanism?
> >>>> Who does it? Any thoughts?
> >>>>
> >>>> 1. DMA completion was discussed in context of per-queue allocation, so
> >>>> there
> >>>>    is no re-ordering in this case.
> >>>>
> >>>> 2. Overhead can be minimal if allocated device can stick to the queue for
> >>>> a
> >>>>    reasonable amount of time without re-allocation on every send.  You may
> >>>>    look at XPS implementation in lib/dpif-netdev.c in OVS for example of
> >>>>    such mechanism.  For sure it can not be the same, but ideas can be re-
> >>>> used.
> >>>>
> >>>> 3. Locking doesn't mean contention if resources are allocated/distributed
> >>>>    thoughtfully.
> >>>>
> >>>> 4. Allocation can be done be either OVS or vhost library itself, I'd vote
> >>>>    for doing that inside the vhost library, so any DPDK application and
> >>>>    vhost ethdev can use it without re-inventing from scratch.  It also
> >>>> should
> >>>>    be simpler from the API point of view if allocation and usage are in
> >>>>    the same place.  But I don't have a strong opinion here as for now,
> >>>> since
> >>>>    no real code examples exist, so it's hard to evaluate how they could
> >>>> look
> >>>>    like.
> >>>>
> >>>> But I feel like we're starting to run in circles here as I did already say
> >>>> most of that before.
> >>>
> >>>
> >>
> >> Hi, John.
> >>
> >> Just reading this email as I was on PTO for a last 1.5 weeks
> >> and didn't get through all the emails yet.
> >>
> >>> This does seem to be going in circles, especially since there seemed to be technical alignment on the last public call on March 29th.
> >>
> >> I guess, there is a typo in the date here.
> >> It seems to be 26th, not 29th.
> >>
> >>> It is not feasible to do a real world implementation/POC of every design proposal.
> >>
> >> FWIW, I think it makes sense to PoC and test options that are
> >> going to be simply unavailable going forward if not explored now.
> >> Especially because we don't have any good solutions anyway
> >> ("Deferral of Work" is architecturally wrong solution for OVS).
> >>
> > 
> > Hi Ilya,
> > 
> > for those of us who haven't spent a long time working on OVS, can you
> > perhaps explain a bit more as to why it is architecturally wrong? From my
> > experience with DPDK, use of any lookaside accelerator, not just DMA but
> > any crypto, compression or otherwise, requires asynchronous operation, and
> > therefore some form of setting work aside temporarily to do other tasks.
> 
> OVS doesn't use any lookaside accelerators and doesn't have any
> infrastructure for them.
> 
> 
> Let me create a DPDK analogy of what is proposed for OVS.
> 
> DPDK has an ethdev API that abstracts different device drivers for
> the application.  This API has a rte_eth_tx_burst() function that
> is supposed to send packets through the particular network interface.
> 
> Imagine now that there is a network card that is not capable of
> sending packets right away and requires the application to come
> back later to finish the operation.  That is an obvious problem,
> because rte_eth_tx_burst() doesn't require any extra actions and
> doesn't take ownership of packets that weren't consumed.
> 
> The proposed solution for this problem is to change the ethdev API:
> 
> 1. Allow rte_eth_tx_burst() to return -EINPROGRESS, which effectively
>    means that packets were acknowledged, but not actually sent yet.
> 
> 2. Require the application to call the new rte_eth_process_async()
>    function sometime later until it no longer returns -EINPROGRESS,
>    in case the original rte_eth_tx_burst() call returned
>    -EINPROGRESS.
> 
> The main reason why this proposal is questionable:
> 
> It's only one specific device that requires this special handling,
> all other devices are capable of sending packets right away.
> However, every DPDK application now has to implement some kind
> of "Deferral of Work" mechanism in order to be compliant with
> the updated DPDK ethdev API.
> 
> Will DPDK make this API change?
> I have no voice in DPDK API design decisions, but I'd argue against.
> 
> Interestingly, that's not really an imaginary proposal.  That is
> an exact change required for DPDK ethdev API in order to add
> vhost async support to the vhost ethdev driver.
> 
> Going back to OVS:
> 
> An oversimplified architecture of OVS has 3 layers (top to bottom):
> 
> 1. OFproto - the layer that handles OpenFlow.
> 2. Datapath Interface - packet processing.
> 3. Netdev - abstraction on top of all the different port types.
> 
> Each layer has its own API that allows different implementations
> of the same layer to be used interchangeably without any modifications
> to higher layers.  That's what APIs and encapsulation are for.
> 
> So, the Netdev layer has its own API, and this API is actually very
> similar to DPDK's ethdev API, simply because they serve
> the same purpose - abstraction on top of different network interfaces.
> Besides different types of DPDK ports, there are also several types
> of native Linux, BSD and Windows ports, and a variety of different
> tunnel ports.
> 
> Datapath interface layer is an "application" from the ethdev analogy
> above.
> 
> What is proposed by "Deferral of Work" solution is to make pretty
> much the same API change that I described, but to netdev layer API
> inside the OVS, and introduce a fairly complex (and questionable,
> but I'm not going into that right now) machinery to handle that API
> change into the datapath interface layer.
> 
> So, exactly the same problem is here:
> 
> If the API change is needed only for a single port type in a very
> specific hardware environment, why do we need to change the common
> API and rework a lot of the code in upper layers in order to accommodate
> that API change, while it makes no practical sense for any other
> port types or more generic hardware setups?
> And similar changes will have to be made in any other DPDK application
> that is not bound to specific hardware, but wants to support vhost
> async.
> 
> The right solution, IMO, is to make vhost async behave like any other
> physical NIC, since it is essentially a physical NIC now (we're not
> using DMA directly, it's a combined vhost+DMA solution), instead of
> propagating quirks of a single device to a common API.
> 
> And going back to DPDK, this implementation doesn't allow use of
> vhost async in the DPDK's own vhost ethdev driver.
> 
> My initial reply to the "Deferral of Work" RFC with pretty much
> the same concerns:
>   https://patchwork.ozlabs.org/project/openvswitch/patch/20210907111725.43672-2-cian.ferriter@intel.com/#2751799
> 
> Best regards, Ilya Maximets.

Thanks for the clear explanation. Gives me a much better idea of the view
from your side of things.

/Bruce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-27 20:34                                     ` Bruce Richardson
@ 2022-04-28 12:59                                       ` Ilya Maximets
  2022-04-28 13:55                                         ` Bruce Richardson
  2022-05-03 19:38                                         ` Van Haaren, Harry
  0 siblings, 2 replies; 58+ messages in thread
From: Ilya Maximets @ 2022-04-28 12:59 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: i.maximets, Mcnamara, John, Hu, Jiayu, Maxime Coquelin,
	Van Haaren, Harry, Morten Brørup, Pai G, Sunil, Stokes, Ian,
	Ferriter, Cian, ovs-dev, dev, O'Driscoll, Tim, Finn, Emma

On 4/27/22 22:34, Bruce Richardson wrote:
> On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
>> On 4/20/22 18:41, Mcnamara, John wrote:
>>>> -----Original Message-----
>>>> From: Ilya Maximets <i.maximets@ovn.org>
>>>> Sent: Friday, April 8, 2022 10:58 AM
>>>> To: Hu, Jiayu <jiayu.hu@intel.com>; Maxime Coquelin
>>>> <maxime.coquelin@redhat.com>; Van Haaren, Harry
>>>> <harry.van.haaren@intel.com>; Morten Brørup <mb@smartsharesystems.com>;
>>>> Richardson, Bruce <bruce.richardson@intel.com>
>>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
>>>> <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
>>>> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
>>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>>>> Finn, Emma <emma.finn@intel.com>
>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>>
>>>> On 4/8/22 09:13, Hu, Jiayu wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ilya Maximets <i.maximets@ovn.org>
>>>>>> Sent: Thursday, April 7, 2022 10:40 PM
>>>>>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
>>>>>> <harry.van.haaren@intel.com>; Morten Brørup
>>>>>> <mb@smartsharesystems.com>; Richardson, Bruce
>>>>>> <bruce.richardson@intel.com>
>>>>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes,
>>>>>> Ian <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter,
>>>>>> Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
>>>>>> dev@dpdk.org; Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll,
>>>>>> Tim <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
>>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>
>>>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
>>>>>>> Hi Harry,
>>>>>>>
>>>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
>>>>>>>> Hi OVS & DPDK, Maintainers & Community,
>>>>>>>>
>>>>>>>> Top posting overview of discussion as replies to thread become
>>>> slower:
>>>>>>>> perhaps it is a good time to review and plan for next steps?
>>>>>>>>
>>>>>>>>  From my perspective, it those most vocal in the thread seem to be
>>>>>>>> in favour of the clean rx/tx split ("defer work"), with the
>>>>>>>> tradeoff that the application must be aware of handling the async
>>>>>>>> DMA completions. If there are any concerns opposing upstreaming of
>>>>>>>> this
>>>>>> method, please indicate this promptly, and we can continue technical
>>>>>> discussions here now.
>>>>>>>
>>>>>>> Wasn't there some discussions about handling the Virtio completions
>>>>>>> with the DMA engine? With that, we wouldn't need the deferral of work.
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> With the virtio completions handled by DMA itself, the vhost port
>>>>>> turns almost into a real HW NIC.  With that we will not need any
>>>>>> extra manipulations from the OVS side, i.e. no need to defer any work
>>>>>> while maintaining clear split between rx and tx operations.
>>>>>
>>>>> First, making DMA do 2B copy would sacrifice performance, and I think
>>>>> we all agree on that.
>>>>
>>>> I do not agree with that.  Yes, 2B copy by DMA will likely be slower than
>>>> done by CPU, however CPU is going away for dozens or even hundreds of
>>>> thousands of cycles to process a new packet batch or service other ports,
>>>> hence DMA will likely complete the transmission faster than waiting for
>>>> the CPU thread to come back to that task.  In any case, this has to be
>>>> tested.
>>>>
>>>>> Second, this method comes with an issue of ordering.
>>>>> For example, PMD thread0 enqueue 10 packets to vring0 first, then PMD
>>>>> thread1 enqueue 20 packets to vring0. If PMD thread0 and threa1 have
>>>>> own dedicated DMA device dma0 and dma1, flag/index update for the
>>>>> first 10 packets is done by dma0, and flag/index update for the left
>>>>> 20 packets is done by dma1. But there is no ordering guarantee among
>>>>> different DMA devices, so flag/index update may error. If PMD threads
>>>>> don't have dedicated DMA devices, which means DMA devices are shared
>>>>> among threads, we need lock and pay for lock contention in data-path.
>>>>> Or we can allocate DMA devices for vring dynamically to avoid DMA
>>>>> sharing among threads. But what's the overhead of allocation mechanism?
>>>> Who does it? Any thoughts?
>>>>
>>>> 1. DMA completion was discussed in context of per-queue allocation, so
>>>> there
>>>>    is no re-ordering in this case.
>>>>
>>>> 2. Overhead can be minimal if allocated device can stick to the queue for
>>>> a
>>>>    reasonable amount of time without re-allocation on every send.  You may
>>>>    look at XPS implementation in lib/dpif-netdev.c in OVS for example of
>>>>    such mechanism.  For sure it can not be the same, but ideas can be re-
>>>> used.
>>>>
>>>> 3. Locking doesn't mean contention if resources are allocated/distributed
>>>>    thoughtfully.
>>>>
>>>> 4. Allocation can be done be either OVS or vhost library itself, I'd vote
>>>>    for doing that inside the vhost library, so any DPDK application and
>>>>    vhost ethdev can use it without re-inventing from scratch.  It also
>>>> should
>>>>    be simpler from the API point of view if allocation and usage are in
>>>>    the same place.  But I don't have a strong opinion here as for now,
>>>> since
>>>>    no real code examples exist, so it's hard to evaluate how they could
>>>> look
>>>>    like.
>>>>
>>>> But I feel like we're starting to run in circles here as I did already say
>>>> most of that before.
>>>
>>>
>>
>> Hi, John.
>>
>> Just reading this email as I was on PTO for a last 1.5 weeks
>> and didn't get through all the emails yet.
>>
>>> This does seem to be going in circles, especially since there seemed to be technical alignment on the last public call on March 29th.
>>
>> I guess, there is a typo in the date here.
>> It seems to be 26th, not 29th.
>>
>>> It is not feasible to do a real world implementation/POC of every design proposal.
>>
>> FWIW, I think it makes sense to PoC and test options that are
>> going to be simply unavailable going forward if not explored now.
>> Especially because we don't have any good solutions anyway
>> ("Deferral of Work" is architecturally wrong solution for OVS).
>>
> 
> Hi Ilya,
> 
> for those of us who haven't spent a long time working on OVS, can you
> perhaps explain a bit more as to why it is architecturally wrong? From my
> experience with DPDK, use of any lookaside accelerator, not just DMA but
> any crypto, compression or otherwise, requires asynchronous operation, and
> therefore some form of setting work aside temporarily to do other tasks.

OVS doesn't use any lookaside accelerators and doesn't have any
infrastructure for them.


Let me create a DPDK analogy of what is proposed for OVS.

DPDK has an ethdev API that abstracts different device drivers for
the application.  This API has a rte_eth_tx_burst() function that
is supposed to send packets through the particular network interface.

Imagine now that there is a network card that is not capable of
sending packets right away and requires the application to come
back later to finish the operation.  That is an obvious problem,
because rte_eth_tx_burst() doesn't require any extra actions and
doesn't take ownership of packets that weren't consumed.

The proposed solution for this problem is to change the ethdev API:

1. Allow rte_eth_tx_burst() to return -EINPROGRESS, which effectively
   means that packets were acknowledged, but not actually sent yet.

2. Require the application to call the new rte_eth_process_async()
   function sometime later until it no longer returns -EINPROGRESS,
   in case the original rte_eth_tx_burst() call returned
   -EINPROGRESS.

The main reason why this proposal is questionable:

It's only one specific device that requires this special handling,
all other devices are capable of sending packets right away.
However, every DPDK application now has to implement some kind
of "Deferral of Work" mechanism in order to be compliant with
the updated DPDK ethdev API.
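
To make that cost visible, here is a minimal sketch of what such a mechanism would look like; everything in it is hypothetical, since rte_eth_process_async() and an -EINPROGRESS return from the tx path do not exist in the real ethdev API:

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical API from the analogy above - neither function exists in
 * DPDK; a real rte_eth_tx_burst() returns uint16_t and cannot return
 * -EINPROGRESS. */
int hypothetical_tx_burst(uint16_t port, uint16_t queue,
                          void **pkts, uint16_t nb_pkts);
int rte_eth_process_async(uint16_t port, uint16_t queue);

#define MAX_PORTS  8
#define MAX_QUEUES 4

static bool pending[MAX_PORTS][MAX_QUEUES];  /* deferred-work flags */

static void
transmit(uint16_t port, uint16_t queue, void **pkts, uint16_t nb_pkts)
{
    if (hypothetical_tx_burst(port, queue, pkts, nb_pkts) == -EINPROGRESS) {
        /* Packets accepted but not sent yet: remember to come back. */
        pending[port][queue] = true;
    }
}

static void
main_loop_housekeeping(void)
{
    /* Every application would need a pass like this on each iteration
     * of its main loop, for every port/queue it ever transmitted on. */
    for (uint16_t p = 0; p < MAX_PORTS; p++) {
        for (uint16_t q = 0; q < MAX_QUEUES; q++) {
            if (pending[p][q] &&
                rte_eth_process_async(p, q) != -EINPROGRESS) {
                pending[p][q] = false;
            }
        }
    }
}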

Will DPDK make this API change?
I have no voice in DPDK API design decisions, but I'd argue against.

Interestingly, that's not really an imaginary proposal.  That is
an exact change required for DPDK ethdev API in order to add
vhost async support to the vhost ethdev driver.

Going back to OVS:

An oversimplified architecture of OVS has 3 layers (top to bottom):

1. OFproto - the layer that handles OpenFlow.
2. Datapath Interface - packet processing.
3. Netdev - abstraction on top of all the different port types.

Each layer has its own API that allows different implementations
of the same layer to be used interchangeably without any modifications
to higher layers.  That's what APIs and encapsulation are for.

So, the Netdev layer has its own API, and this API is actually very
similar to DPDK's ethdev API, simply because they serve
the same purpose - abstraction on top of different network interfaces.
Besides different types of DPDK ports, there are also several types
of native Linux, BSD and Windows ports, and a variety of different
tunnel ports.

Datapath interface layer is an "application" from the ethdev analogy
above.

What is proposed by "Deferral of Work" solution is to make pretty
much the same API change that I described, but to netdev layer API
inside the OVS, and introduce a fairly complex (and questionable,
but I'm not going into that right now) machinery to handle that API
change into the datapath interface layer.

So, exactly the same problem is here:

If the API change is needed only for a single port type in a very
specific hardware environment, why do we need to change the common
API and rework a lot of the code in upper layers in order to accommodate
that API change, while it makes no practical sense for any other
port types or more generic hardware setups?
And similar changes will have to be made in any other DPDK application
that is not bound to specific hardware, but wants to support vhost
async.

The right solution, IMO, is to make vhost async behave like any other
physical NIC, since it is essentially a physical NIC now (we're not
using DMA directly, it's a combined vhost+DMA solution), instead of
propagating quirks of a single device to a common API.

And going back to DPDK, this implementation doesn't allow use of
vhost async in the DPDK's own vhost ethdev driver.

My initial reply to the "Deferral of Work" RFC with pretty much
the same concerns:
  https://patchwork.ozlabs.org/project/openvswitch/patch/20210907111725.43672-2-cian.ferriter@intel.com/#2751799

Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-25 21:46                                   ` Ilya Maximets
  2022-04-27 14:55                                     ` Mcnamara, John
@ 2022-04-27 20:34                                     ` Bruce Richardson
  2022-04-28 12:59                                       ` Ilya Maximets
  1 sibling, 1 reply; 58+ messages in thread
From: Bruce Richardson @ 2022-04-27 20:34 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Mcnamara, John, Hu, Jiayu, Maxime Coquelin, Van Haaren, Harry,
	Morten Brørup, Pai G, Sunil, Stokes, Ian, Ferriter, Cian,
	ovs-dev, dev, O'Driscoll, Tim, Finn, Emma

On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
> On 4/20/22 18:41, Mcnamara, John wrote:
> >> -----Original Message-----
> >> From: Ilya Maximets <i.maximets@ovn.org>
> >> Sent: Friday, April 8, 2022 10:58 AM
> >> To: Hu, Jiayu <jiayu.hu@intel.com>; Maxime Coquelin
> >> <maxime.coquelin@redhat.com>; Van Haaren, Harry
> >> <harry.van.haaren@intel.com>; Morten Brørup <mb@smartsharesystems.com>;
> >> Richardson, Bruce <bruce.richardson@intel.com>
> >> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> >> <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
> >> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> >> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> >> Finn, Emma <emma.finn@intel.com>
> >> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>
> >> On 4/8/22 09:13, Hu, Jiayu wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Ilya Maximets <i.maximets@ovn.org>
> >>>> Sent: Thursday, April 7, 2022 10:40 PM
> >>>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
> >>>> <harry.van.haaren@intel.com>; Morten Brørup
> >>>> <mb@smartsharesystems.com>; Richardson, Bruce
> >>>> <bruce.richardson@intel.com>
> >>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes,
> >>>> Ian <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter,
> >>>> Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> >>>> dev@dpdk.org; Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll,
> >>>> Tim <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>> On 4/7/22 16:25, Maxime Coquelin wrote:
> >>>>> Hi Harry,
> >>>>>
> >>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
> >>>>>> Hi OVS & DPDK, Maintainers & Community,
> >>>>>>
> >>>>>> Top posting overview of discussion as replies to thread become
> >> slower:
> >>>>>> perhaps it is a good time to review and plan for next steps?
> >>>>>>
> >>>>>>  From my perspective, it those most vocal in the thread seem to be
> >>>>>> in favour of the clean rx/tx split ("defer work"), with the
> >>>>>> tradeoff that the application must be aware of handling the async
> >>>>>> DMA completions. If there are any concerns opposing upstreaming of
> >>>>>> this
> >>>> method, please indicate this promptly, and we can continue technical
> >>>> discussions here now.
> >>>>>
> >>>>> Wasn't there some discussions about handling the Virtio completions
> >>>>> with the DMA engine? With that, we wouldn't need the deferral of work.
> >>>>
> >>>> +1
> >>>>
> >>>> With the virtio completions handled by DMA itself, the vhost port
> >>>> turns almost into a real HW NIC.  With that we will not need any
> >>>> extra manipulations from the OVS side, i.e. no need to defer any work
> >>>> while maintaining clear split between rx and tx operations.
> >>>
> >>> First, making DMA do 2B copy would sacrifice performance, and I think
> >>> we all agree on that.
> >>
> >> I do not agree with that.  Yes, 2B copy by DMA will likely be slower than
> >> done by CPU, however CPU is going away for dozens or even hundreds of
> >> thousands of cycles to process a new packet batch or service other ports,
> >> hence DMA will likely complete the transmission faster than waiting for
> >> the CPU thread to come back to that task.  In any case, this has to be
> >> tested.
> >>
> >>> Second, this method comes with an issue of ordering.
> >>> For example, PMD thread0 enqueue 10 packets to vring0 first, then PMD
> >>> thread1 enqueue 20 packets to vring0. If PMD thread0 and threa1 have
> >>> own dedicated DMA device dma0 and dma1, flag/index update for the
> >>> first 10 packets is done by dma0, and flag/index update for the left
> >>> 20 packets is done by dma1. But there is no ordering guarantee among
> >>> different DMA devices, so flag/index update may error. If PMD threads
> >>> don't have dedicated DMA devices, which means DMA devices are shared
> >>> among threads, we need lock and pay for lock contention in data-path.
> >>> Or we can allocate DMA devices for vring dynamically to avoid DMA
> >>> sharing among threads. But what's the overhead of allocation mechanism?
> >> Who does it? Any thoughts?
> >>
> >> 1. DMA completion was discussed in context of per-queue allocation, so
> >> there
> >>    is no re-ordering in this case.
> >>
> >> 2. Overhead can be minimal if allocated device can stick to the queue for
> >> a
> >>    reasonable amount of time without re-allocation on every send.  You may
> >>    look at XPS implementation in lib/dpif-netdev.c in OVS for example of
> >>    such mechanism.  For sure it can not be the same, but ideas can be re-
> >> used.
> >>
> >> 3. Locking doesn't mean contention if resources are allocated/distributed
> >>    thoughtfully.
> >>
> >> 4. Allocation can be done be either OVS or vhost library itself, I'd vote
> >>    for doing that inside the vhost library, so any DPDK application and
> >>    vhost ethdev can use it without re-inventing from scratch.  It also
> >> should
> >>    be simpler from the API point of view if allocation and usage are in
> >>    the same place.  But I don't have a strong opinion here as for now,
> >> since
> >>    no real code examples exist, so it's hard to evaluate how they could
> >> look
> >>    like.
> >>
> >> But I feel like we're starting to run in circles here as I did already say
> >> most of that before.
> > 
> > 
> 
> Hi, John.
> 
> Just reading this email as I was on PTO for a last 1.5 weeks
> and didn't get through all the emails yet.
> 
> > This does seem to be going in circles, especially since there seemed to be technical alignment on the last public call on March 29th.
> 
> I guess, there is a typo in the date here.
> It seems to be 26th, not 29th.
> 
> > It is not feasible to do a real world implementation/POC of every design proposal.
> 
> FWIW, I think it makes sense to PoC and test options that are
> going to be simply unavailable going forward if not explored now.
> Especially because we don't have any good solutions anyway
> ("Deferral of Work" is architecturally wrong solution for OVS).
>

Hi Ilya,

for those of us who haven't spent a long time working on OVS, can you
perhaps explain a bit more as to why it is architecturally wrong? From my
experience with DPDK, use of any lookaside accelerator, not just DMA but
any crypto, compression or otherwise, requires asynchronous operation, and
therefore some form of setting work aside temporarily to do other tasks.

Thanks,
/Bruce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-25 21:46                                   ` Ilya Maximets
@ 2022-04-27 14:55                                     ` Mcnamara, John
  2022-04-27 20:34                                     ` Bruce Richardson
  1 sibling, 0 replies; 58+ messages in thread
From: Mcnamara, John @ 2022-04-27 14:55 UTC (permalink / raw)
  To: Ilya Maximets, Hu, Jiayu, Maxime Coquelin, Van Haaren, Harry,
	Morten Brørup, Richardson, Bruce
  Cc: Pai G, Sunil, Stokes, Ian, Ferriter, Cian, ovs-dev, dev,
	O'Driscoll, Tim, Finn, Emma



> -----Original Message-----
> From: Ilya Maximets <i.maximets@ovn.org>
> Sent: Monday, April 25, 2022 10:46 PM
> To: Mcnamara, John <john.mcnamara@intel.com>; Hu, Jiayu
> <jiayu.hu@intel.com>; Maxime Coquelin <maxime.coquelin@redhat.com>; Van
> Haaren, Harry <harry.van.haaren@intel.com>; Morten Brørup
> <mb@smartsharesystems.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes,
> Ian <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>;
> ovs-dev@openvswitch.org; dev@dpdk.org; O'Driscoll, Tim
> <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> 
> ...
> 
> FWIW, I think it makes sense to PoC and test options that are going to
> be simply unavailable going forward if not explored now.
> Especially because we don't have any good solutions anyway ("Deferral
> of Work" is architecturally wrong solution for OVS).

I agree that there is value in doing PoCs; we have been doing that for over a year based on different proposals, and none of them shows the potential of the Deferral of Work approach. It isn't productive to keep building PoCs indefinitely; at some point we need to make progress with merging a specific solution upstream.


> > Let's have another call so that we can move towards a single solution
> that the DPDK and OVS communities agree on. I'll set up a call for next
> week in a similar time slot to the previous one.
> 
> Is there any particular reason we can't use a mailing list to discuss
> that topic further?

The discussion can continue on the mailing list. It just seemed more efficient and interactive to discuss this in a meeting.

John
-- 



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-20 16:41                                 ` Mcnamara, John
@ 2022-04-25 21:46                                   ` Ilya Maximets
  2022-04-27 14:55                                     ` Mcnamara, John
  2022-04-27 20:34                                     ` Bruce Richardson
  0 siblings, 2 replies; 58+ messages in thread
From: Ilya Maximets @ 2022-04-25 21:46 UTC (permalink / raw)
  To: Mcnamara, John, Hu, Jiayu, Maxime Coquelin, Van Haaren, Harry,
	Morten Brørup, Richardson, Bruce
  Cc: i.maximets, Pai G, Sunil, Stokes, Ian, Ferriter, Cian, ovs-dev,
	dev, O'Driscoll, Tim, Finn, Emma

On 4/20/22 18:41, Mcnamara, John wrote:
>> -----Original Message-----
>> From: Ilya Maximets <i.maximets@ovn.org>
>> Sent: Friday, April 8, 2022 10:58 AM
>> To: Hu, Jiayu <jiayu.hu@intel.com>; Maxime Coquelin
>> <maxime.coquelin@redhat.com>; Van Haaren, Harry
>> <harry.van.haaren@intel.com>; Morten Brørup <mb@smartsharesystems.com>;
>> Richardson, Bruce <bruce.richardson@intel.com>
>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
>> <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
>> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>> Finn, Emma <emma.finn@intel.com>
>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>
>> On 4/8/22 09:13, Hu, Jiayu wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Ilya Maximets <i.maximets@ovn.org>
>>>> Sent: Thursday, April 7, 2022 10:40 PM
>>>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
>>>> <harry.van.haaren@intel.com>; Morten Brørup
>>>> <mb@smartsharesystems.com>; Richardson, Bruce
>>>> <bruce.richardson@intel.com>
>>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes,
>>>> Ian <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter,
>>>> Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
>>>> dev@dpdk.org; Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll,
>>>> Tim <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>>
>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
>>>>> Hi Harry,
>>>>>
>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
>>>>>> Hi OVS & DPDK, Maintainers & Community,
>>>>>>
>>>>>> Top posting overview of discussion as replies to thread become
>> slower:
>>>>>> perhaps it is a good time to review and plan for next steps?
>>>>>>
>>>>>>  From my perspective, it those most vocal in the thread seem to be
>>>>>> in favour of the clean rx/tx split ("defer work"), with the
>>>>>> tradeoff that the application must be aware of handling the async
>>>>>> DMA completions. If there are any concerns opposing upstreaming of
>>>>>> this
>>>> method, please indicate this promptly, and we can continue technical
>>>> discussions here now.
>>>>>
>>>>> Wasn't there some discussions about handling the Virtio completions
>>>>> with the DMA engine? With that, we wouldn't need the deferral of work.
>>>>
>>>> +1
>>>>
>>>> With the virtio completions handled by DMA itself, the vhost port
>>>> turns almost into a real HW NIC.  With that we will not need any
>>>> extra manipulations from the OVS side, i.e. no need to defer any work
>>>> while maintaining clear split between rx and tx operations.
>>>
>>> First, making DMA do 2B copy would sacrifice performance, and I think
>>> we all agree on that.
>>
>> I do not agree with that.  Yes, 2B copy by DMA will likely be slower than
>> done by CPU, however CPU is going away for dozens or even hundreds of
>> thousands of cycles to process a new packet batch or service other ports,
>> hence DMA will likely complete the transmission faster than waiting for
>> the CPU thread to come back to that task.  In any case, this has to be
>> tested.
>>
>>> Second, this method comes with an issue of ordering.
>>> For example, PMD thread0 enqueue 10 packets to vring0 first, then PMD
>>> thread1 enqueue 20 packets to vring0. If PMD thread0 and threa1 have
>>> own dedicated DMA device dma0 and dma1, flag/index update for the
>>> first 10 packets is done by dma0, and flag/index update for the left
>>> 20 packets is done by dma1. But there is no ordering guarantee among
>>> different DMA devices, so flag/index update may error. If PMD threads
>>> don't have dedicated DMA devices, which means DMA devices are shared
>>> among threads, we need lock and pay for lock contention in data-path.
>>> Or we can allocate DMA devices for vring dynamically to avoid DMA
>>> sharing among threads. But what's the overhead of allocation mechanism?
>> Who does it? Any thoughts?
>>
>> 1. DMA completion was discussed in context of per-queue allocation, so
>> there
>>    is no re-ordering in this case.
>>
>> 2. Overhead can be minimal if allocated device can stick to the queue for
>> a
>>    reasonable amount of time without re-allocation on every send.  You may
>>    look at XPS implementation in lib/dpif-netdev.c in OVS for example of
>>    such mechanism.  For sure it can not be the same, but ideas can be re-
>> used.
>>
>> 3. Locking doesn't mean contention if resources are allocated/distributed
>>    thoughtfully.
>>
>> 4. Allocation can be done be either OVS or vhost library itself, I'd vote
>>    for doing that inside the vhost library, so any DPDK application and
>>    vhost ethdev can use it without re-inventing from scratch.  It also
>> should
>>    be simpler from the API point of view if allocation and usage are in
>>    the same place.  But I don't have a strong opinion here as for now,
>> since
>>    no real code examples exist, so it's hard to evaluate how they could
>> look
>>    like.
>>
>> But I feel like we're starting to run in circles here as I did already say
>> most of that before.
> 
> 

Hi, John.

Just reading this email as I was on PTO for the last 1.5 weeks
and haven't got through all the emails yet.

> This does seem to be going in circles, especially since there seemed to be technical alignment on the last public call on March 29th.

I guess there is a typo in the date here.
It seems to be the 26th, not the 29th.

> It is not feasible to do a real world implementation/POC of every design proposal.

FWIW, I think it makes sense to PoC and test options that are
going to be simply unavailable going forward if not explored now,
especially because we don't have any good solutions anyway
("Deferral of Work" is an architecturally wrong solution for OVS).

> Let's have another call so that we can move towards a single solution that the DPDK and OVS communities agree on. I'll set up a call for next week in a similar time slot to the previous one.

Is there any particular reason we can't use a mailing list to
discuss that topic further?  Live discussions tend to cause
information loss as people start to forget what was already
discussed fairly quickly, and there is no reliable source to
refresh the memory (recordings are not really suitable for this
purpose, and we didn't have them before).

Anyway, I have a conflict tomorrow with another meeting, so I
will not be able to attend.

Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* OVS DPDK DMA-Dev library/Design Discussion
@ 2022-04-25 15:19 Mcnamara, John
  0 siblings, 0 replies; 58+ messages in thread
From: Mcnamara, John @ 2022-04-25 15:19 UTC (permalink / raw)
  To: Stokes, Ian, Pai G, Sunil, Hu, Jiayu, Ferriter, Cian, Van Haaren,
	Harry, Ilya Maximets,
	Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev
  Cc: O'Driscoll, Tim, Finn, Emma, Ralf Hoffmann, Harris, James R,
	Luse, Paul E, hemal.shah, Gooch, Stephen, Bing Zhao, Wiles,
	Keith, Prasanna Murugesan (prasanna),
	Ananyev, Konstantin, Yu, De, Tom Barbette, Zeng, ZhichaoX,
	Knight, Joshua, Scheurich, Jan, Sinai, Asaf

[-- Attachment #1: Type: text/plain, Size: 1466 bytes --]

Updated with meeting and presentation.

This meeting is a follow-up to the previous calls in March and the discussion which has happened since on the DPDK and OVS mailing lists.

Three approaches were presented in the previous calls:

*       "Defer work": Handle DMA completions at OVS PMD thread level
*       "v3": Handle DMA Tx completions from Rx context.
*       "v3 + lockless ring": Handle DMA Tx completions from Rx context + lockless ring to avoid contention.

After these calls, the discussion continued on the DPDK and OVS mailing lists, where an alternate approach has been proposed.

The newly-suggested approach:

*       "DMA VirtQ Completions": Add one or more additional transactions to each burst of DMA copies; a special transaction containing the memory write operation that makes the descriptors available to the Virtio driver. Also separate the actual kick of the guest from the data transfer.

Agenda for call 26th April:

*       Intel team will present slides to help understand the differences in architecture/designs.
*       Discuss the strengths/weaknesses/feasibility of the "DMA VirtQ Completions" approach, comparing to current best-candidate "Defer Work".
*       Work toward single-solution to be accepted upstream in DPDK and OVS

Slides: https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/blob/main/ovs_datapath_design_2022%20session%203.pdf

Google Meet: https://meet.google.com/hme-pygf-bfb






[-- Attachment #2: Type: text/html, Size: 2998 bytes --]

[-- Attachment #3: Type: text/calendar, Size: 6081 bytes --]

BEGIN:VCALENDAR
METHOD:REQUEST
PRODID:Microsoft Exchange Server 2010
VERSION:2.0
BEGIN:VTIMEZONE
TZID:GMT Standard Time
BEGIN:STANDARD
DTSTART:16010101T020000
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
RRULE:FREQ=YEARLY;INTERVAL=1;BYDAY=-1SU;BYMONTH=10
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:16010101T010000
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
RRULE:FREQ=YEARLY;INTERVAL=1;BYDAY=-1SU;BYMONTH=3
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
ORGANIZER;CN="Mcnamara, John":mailto:john.mcnamara@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Stokes, Ia
 n":mailto:ian.stokes@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Pai G, Sun
 il":mailto:sunil.pai.g@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Hu, Jiayu":
 mailto:jiayu.hu@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Ferriter, 
 Cian":mailto:cian.ferriter@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Van Haaren
 , Harry":mailto:harry.van.haaren@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Ilya Maxim
 ets:mailto:i.maximets@ovn.org
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Maxime Coq
 uelin (maxime.coquelin@redhat.com):mailto:maxime.coquelin@redhat.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=ovs-dev@op
 envswitch.org:mailto:ovs-dev@openvswitch.org
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=dev@dpdk.o
 rg:mailto:dev@dpdk.org
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="O'Driscoll
 , Tim":mailto:tim.odriscoll@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Finn, Emma"
 :mailto:emma.finn@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Ralf Hoffm
 ann:mailto:ralf.hoffmann@allegro-packets.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Harris, Ja
 mes R":mailto:james.r.harris@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Luse, Paul
  E":mailto:paul.e.luse@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=hemal.shah
 @broadcom.com:mailto:hemal.shah@broadcom.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Gooch, Ste
 phen":mailto:stephen.gooch@windriver.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Bing Zhao:
 mailto:bingz@nvidia.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Wiles, Kei
 th":mailto:keith.wiles@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Prasanna M
 urugesan (prasanna):mailto:prasanna@cisco.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Ananyev, K
 onstantin":mailto:konstantin.ananyev@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Yu, De":mai
 lto:de.yu@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Tom Barbet
 te:mailto:tom.barbette@uclouvain.be
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Zeng, Zhic
 haoX":mailto:zhichaox.zeng@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Knight, Jo
 shua":mailto:Joshua.Knight@netscout.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Scheurich,
  Jan":mailto:jan.scheurich@ericsson.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Sinai, Asa
 f":mailto:asafsi@radware.com
DESCRIPTION;LANGUAGE=en-US:Updated with meeting and presentation.\n\nThis m
 eeting is a follow-up to the previous calls in March and the discussion wh
 ich has happened since on the DPDK and OVS mailing lists.\n\nThree approac
 hes were presented in the previous calls:\n\n•       "Defer work": Handl
 e DMA completions at OVS PMD thread level\n•       "v3": Handle DMA Tx c
 ompletions from Rx context.\n•       "v3 + lockless ring": Handle DMA Tx
  completions from Rx context + lockless ring to avoid contention.\n\nAfter
  these calls\, the discussion continued on the DPDK and OVS mailing lists\
 , where an alternate approach has been proposed.\n\nThe newly-suggested ap
 proach:\n\n•       "DMA VirtQ Completions": Add an additional transactio
 n(s) to each burst of DMA copies\; a special transaction containing the me
 mory write operation that makes the descriptors available to the Virtio dr
 iver. Also separate the actual kick of the guest with the data transfer.\n
 \nAgenda for call 26th April:\n\n•       Intel team will present slides 
 to help understand the differences in architecture/designs.\n•       Dis
 cuss the strengths/weaknesses/feasibility of the "DMA VirtQ Completions" a
 pproach\, comparing to current best-candidate "Defer Work".\n•       Wor
 k toward single-solution to be accepted upstream in DPDK and OVS\n\nSlides
 : https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/blob/main/ovs
 _datapath_design_2022%20session%203.pdf\n\nGoogle Meet: https://meet.googl
 e.com/hme-pygf-bfb\n\n\n\n\n\n
UID:040000008200E00074C5B7101A82E0080000000000DED7A59755D801000000000000000
 0100000001E85284E2C159341818ACDC3087374E1
SUMMARY;LANGUAGE=en-US:OVS DPDK DMA-Dev library/Design Discussion
DTSTART;TZID=GMT Standard Time:20220426T140000
DTEND;TZID=GMT Standard Time:20220426T150000
CLASS:PUBLIC
PRIORITY:5
DTSTAMP:20220425T151916Z
TRANSP:OPAQUE
STATUS:CONFIRMED
SEQUENCE:1
LOCATION;LANGUAGE=en-US:https://meet.google.com/hme-pygf-bfb
X-MICROSOFT-CDO-APPT-SEQUENCE:1
X-MICROSOFT-CDO-OWNERAPPTID:1001736166
X-MICROSOFT-CDO-BUSYSTATUS:TENTATIVE
X-MICROSOFT-CDO-INTENDEDSTATUS:BUSY
X-MICROSOFT-CDO-ALLDAYEVENT:FALSE
X-MICROSOFT-CDO-IMPORTANCE:1
X-MICROSOFT-CDO-INSTTYPE:0
X-MICROSOFT-DONOTFORWARDMEETING:FALSE
X-MICROSOFT-DISALLOW-COUNTER:FALSE
X-MICROSOFT-LOCATIONS:[ { "DisplayName" : "https://meet.google.com/hme-pygf
 -bfb"\, "LocationAnnotation" : ""\, "LocationSource" : 0\, "Unresolved" : 
 true\, "LocationUri" : "" } ]
END:VEVENT
END:VCALENDAR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* OVS DPDK DMA-Dev library/Design Discussion
@ 2022-04-21 14:57 Mcnamara, John
  0 siblings, 0 replies; 58+ messages in thread
From: Mcnamara, John @ 2022-04-21 14:57 UTC (permalink / raw)
  To: Stokes, Ian, Pai G, Sunil, Hu, Jiayu, Ferriter, Cian, Van Haaren,
	Harry, Ilya Maximets,
	Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev
  Cc: O'Driscoll, Tim, Finn, Emma

[-- Attachment #1: Type: text/plain, Size: 1286 bytes --]

This meeting is a follow-up to the previous calls in March and the discussion which has happened since on the DPDK and OVS mailing lists.

Three approaches were presented in the previous calls:

*       "Defer work": Handle DMA completions at OVS PMD thread level
*       "v3": Handle DMA Tx completions from Rx context.
*       "v3 + lockless ring": Handle DMA Tx completions from Rx context + lockless ring to avoid contention.

After these calls, the discussion continued on the DPDK and OVS mailing lists, where an alternate approach has been proposed.

The newly-suggested approach:

*       "DMA VirtQ Completions": Add one or more additional transactions to each burst of DMA copies; a special transaction containing the memory write operation that makes the descriptors available to the Virtio driver. Also separate the actual kick of the guest from the data transfer.

Agenda for call 26th April:

*       The Intel team will present slides to help explain the differences between the architectures/designs.
*       Discuss the strengths/weaknesses/feasibility of the "DMA VirtQ Completions" approach, compared to the current best candidate, "Defer Work".
*       Work toward a single solution to be accepted upstream in DPDK and OVS.

Google Meet link and slides to follow.




[-- Attachment #2: Type: text/html, Size: 2367 bytes --]

[-- Attachment #3: Type: text/calendar, Size: 4091 bytes --]

BEGIN:VCALENDAR
METHOD:REQUEST
PRODID:Microsoft Exchange Server 2010
VERSION:2.0
BEGIN:VTIMEZONE
TZID:GMT Standard Time
BEGIN:STANDARD
DTSTART:16010101T020000
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
RRULE:FREQ=YEARLY;INTERVAL=1;BYDAY=-1SU;BYMONTH=10
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:16010101T010000
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
RRULE:FREQ=YEARLY;INTERVAL=1;BYDAY=-1SU;BYMONTH=3
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
ORGANIZER;CN="Mcnamara, John":mailto:john.mcnamara@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Stokes, Ia
 n":mailto:ian.stokes@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Pai G, Sun
 il":mailto:sunil.pai.g@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Hu, Jiayu":
 mailto:jiayu.hu@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Ferriter, 
 Cian":mailto:cian.ferriter@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Van Haaren
 , Harry":mailto:harry.van.haaren@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Ilya Maxim
 ets:mailto:i.maximets@ovn.org
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Maxime Coq
 uelin (maxime.coquelin@redhat.com):mailto:maxime.coquelin@redhat.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=ovs-dev@op
 envswitch.org:mailto:ovs-dev@openvswitch.org
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=dev@dpdk.o
 rg:mailto:dev@dpdk.org
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="O'Driscoll
 , Tim":mailto:tim.odriscoll@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Finn, Emma"
 :mailto:emma.finn@intel.com
DESCRIPTION;LANGUAGE=en-US:This meeting is a follow-up to the previous call
 s in March and the discussion which has happened since on the DPDK and OVS
  mailing lists.\n\nThree approaches were presented in the previous calls:\
 n\n•       "Defer work": Handle DMA completions at OVS PMD thread level\
 n•       "v3": Handle DMA Tx completions from Rx context.\n•       "v3
  + lockless ring": Handle DMA Tx completions from Rx context + lockless ri
 ng to avoid contention.\n\nAfter these calls\, the discussion continued on
  the DPDK and OVS mailing lists\, where an alternate approach has been pro
 posed.\n\nThe newly-suggested approach:\n\n•       "DMA VirtQ Completion
 s": Add an additional transaction(s) to each burst of DMA copies\; a speci
 al transaction containing the memory write operation that makes the descri
 ptors available to the Virtio driver. Also separate the actual kick of the
  guest with the data transfer.\n\nAgenda for call 26th April:\n\n•      
  Intel team will present slides to help understand the differences in arch
 itecture/designs.\n•       Discuss the strengths/weaknesses/feasibility 
 of the "DMA VirtQ Completions" approach\, comparing to current best-candid
 ate "Defer Work".\n•       Work toward single-solution to be accepted up
 stream in DPDK and OVS\n\nGoogle Meet link and slides to follow.\n\n\n\n
UID:040000008200E00074C5B7101A82E0080000000000DED7A59755D801000000000000000
 0100000001E85284E2C159341818ACDC3087374E1
SUMMARY;LANGUAGE=en-US:OVS DPDK DMA-Dev library/Design Discussion
DTSTART;TZID=GMT Standard Time:20220426T140000
DTEND;TZID=GMT Standard Time:20220426T150000
CLASS:PUBLIC
PRIORITY:5
DTSTAMP:20220421T145740Z
TRANSP:OPAQUE
STATUS:CONFIRMED
SEQUENCE:0
LOCATION;LANGUAGE=en-US:To be added
X-MICROSOFT-CDO-APPT-SEQUENCE:0
X-MICROSOFT-CDO-OWNERAPPTID:1001736166
X-MICROSOFT-CDO-BUSYSTATUS:TENTATIVE
X-MICROSOFT-CDO-INTENDEDSTATUS:BUSY
X-MICROSOFT-CDO-ALLDAYEVENT:FALSE
X-MICROSOFT-CDO-IMPORTANCE:1
X-MICROSOFT-CDO-INSTTYPE:0
X-MICROSOFT-DONOTFORWARDMEETING:FALSE
X-MICROSOFT-DISALLOW-COUNTER:FALSE
X-MICROSOFT-LOCATIONS:[ { "DisplayName" : "To be added"\, "LocationAnnotati
 on" : ""\, "LocationSource" : 0\, "Unresolved" : true\, "LocationUri" : ""
  } ]
END:VEVENT
END:VCALENDAR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
       [not found] <DM6PR11MB3227AC0014F321EB901BE385FC199@DM6PR11MB3227.namprd11.prod.outlook.com>
@ 2022-04-21 11:51 ` Mcnamara, John
  0 siblings, 0 replies; 58+ messages in thread
From: Mcnamara, John @ 2022-04-21 11:51 UTC (permalink / raw)
  To: Stokes, Ian, Pai G, Sunil, Hu, Jiayu, Ferriter, Cian, Van Haaren,
	Harry, Ilya Maximets,
	Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev
  Cc: O'Driscoll, Tim, Finn, Emma

[-- Attachment #1: Type: text/plain, Size: 1882 bytes --]



      -----Original Appointment-----
      From: Stokes, Ian <ian.stokes@intel.com>
      Sent: Thursday, March 24, 2022 3:37 PM
      To: Stokes, Ian; Pai G, Sunil; Hu, Jiayu; Ferriter, Cian; Van Haaren, Harry; Ilya Maximets; Maxime Coquelin (maxime.coquelin@redhat.com); ovs-dev@openvswitch.org; dev@dpdk.org
      Cc: Mcnamara, John; O'Driscoll, Tim; Finn, Emma
      Subject: OVS DPDK DMA-Dev library/Design Discussion
      When: 29 March 2022 14:00-15:00 (UTC+00:00) Dublin, Edinburgh, Lisbon, London.
      Where: Google Meet


      Hi All,

      This meeting is a follow-up to the call earlier this week.

      This week Sunil presented three different approaches to integrating DMA-Dev with OVS, along with their performance impacts.

      https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/blob/main/OVS%20vhost%20async%20datapath%20design%202022.pdf

      The approaches were as follows:

*       Defer work.
*       Tx completions from Rx context.
*       Tx completions from Rx context + lockless ring.

      The pros and cons of each approach were discussed, but no clear solution was reached.

      As such, a follow-up call was suggested to continue the discussion and reach a clear decision on the approach to take.

      Please see agenda as it stands below:

      Agenda
*       Opens
*       Continue discussion of the three approaches from last week (Defer work, "V3", V4; links to patches in Sunil's slides above)
*       Design feedback (please review the solutions above and last week's slide deck before the call so the discussion is informed)
*       Dynamic Allocation of DMA engine per queue
*       Code Availability (DPDK GitHub, OVS GitHub branches)

      Please feel free to respond with any other items to be added to the agenda.

      Google Meet: https://meet.google.com/hme-pygf-bfb

      Regards
      Ian


[-- Attachment #2: Type: text/html, Size: 4018 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-08  9:57                               ` Ilya Maximets
  2022-04-20 15:39                                 ` Mcnamara, John
@ 2022-04-20 16:41                                 ` Mcnamara, John
  2022-04-25 21:46                                   ` Ilya Maximets
  1 sibling, 1 reply; 58+ messages in thread
From: Mcnamara, John @ 2022-04-20 16:41 UTC (permalink / raw)
  To: Ilya Maximets, Hu, Jiayu, Maxime Coquelin, Van Haaren, Harry,
	Morten Brørup, Richardson, Bruce
  Cc: Pai G, Sunil, Stokes, Ian, Ferriter, Cian, ovs-dev, dev,
	O'Driscoll, Tim, Finn, Emma

> -----Original Message-----
> From: Ilya Maximets <i.maximets@ovn.org>
> Sent: Friday, April 8, 2022 10:58 AM
> To: Hu, Jiayu <jiayu.hu@intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; Morten Brørup <mb@smartsharesystems.com>;
> Richardson, Bruce <bruce.richardson@intel.com>
> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> Finn, Emma <emma.finn@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> 
> On 4/8/22 09:13, Hu, Jiayu wrote:
> >
> >
> >> -----Original Message-----
> >> From: Ilya Maximets <i.maximets@ovn.org>
> >> Sent: Thursday, April 7, 2022 10:40 PM
> >> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
> >> <harry.van.haaren@intel.com>; Morten Brørup
> >> <mb@smartsharesystems.com>; Richardson, Bruce
> >> <bruce.richardson@intel.com>
> >> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes,
> >> Ian <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter,
> >> Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> >> dev@dpdk.org; Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll,
> >> Tim <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> >> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>
> >> On 4/7/22 16:25, Maxime Coquelin wrote:
> >>> Hi Harry,
> >>>
> >>> On 4/7/22 16:04, Van Haaren, Harry wrote:
> >>>> Hi OVS & DPDK, Maintainers & Community,
> >>>>
> >>>> Top posting overview of discussion as replies to thread become
> slower:
> >>>> perhaps it is a good time to review and plan for next steps?
> >>>>
> >>>>  From my perspective, it those most vocal in the thread seem to be
> >>>> in favour of the clean rx/tx split ("defer work"), with the
> >>>> tradeoff that the application must be aware of handling the async
> >>>> DMA completions. If there are any concerns opposing upstreaming of
> >>>> this
> >> method, please indicate this promptly, and we can continue technical
> >> discussions here now.
> >>>
> >>> Wasn't there some discussions about handling the Virtio completions
> >>> with the DMA engine? With that, we wouldn't need the deferral of work.
> >>
> >> +1
> >>
> >> With the virtio completions handled by DMA itself, the vhost port
> >> turns almost into a real HW NIC.  With that we will not need any
> >> extra manipulations from the OVS side, i.e. no need to defer any work
> >> while maintaining clear split between rx and tx operations.
> >
> > First, making DMA do 2B copy would sacrifice performance, and I think
> > we all agree on that.
> 
> I do not agree with that.  Yes, 2B copy by DMA will likely be slower than
> done by CPU, however CPU is going away for dozens or even hundreds of
> thousands of cycles to process a new packet batch or service other ports,
> hence DMA will likely complete the transmission faster than waiting for
> the CPU thread to come back to that task.  In any case, this has to be
> tested.
> 
> > Second, this method comes with an issue of ordering.
> > For example, PMD thread0 enqueue 10 packets to vring0 first, then PMD
> > thread1 enqueue 20 packets to vring0. If PMD thread0 and threa1 have
> > own dedicated DMA device dma0 and dma1, flag/index update for the
> > first 10 packets is done by dma0, and flag/index update for the left
> > 20 packets is done by dma1. But there is no ordering guarantee among
> > different DMA devices, so flag/index update may error. If PMD threads
> > don't have dedicated DMA devices, which means DMA devices are shared
> > among threads, we need lock and pay for lock contention in data-path.
> > Or we can allocate DMA devices for vring dynamically to avoid DMA
> > sharing among threads. But what's the overhead of allocation mechanism?
> Who does it? Any thoughts?
> 
> 1. DMA completion was discussed in context of per-queue allocation, so
> there
>    is no re-ordering in this case.
> 
> 2. Overhead can be minimal if allocated device can stick to the queue for
> a
>    reasonable amount of time without re-allocation on every send.  You may
>    look at XPS implementation in lib/dpif-netdev.c in OVS for example of
>    such mechanism.  For sure it can not be the same, but ideas can be re-
> used.
> 
> 3. Locking doesn't mean contention if resources are allocated/distributed
>    thoughtfully.
> 
> 4. Allocation can be done be either OVS or vhost library itself, I'd vote
>    for doing that inside the vhost library, so any DPDK application and
>    vhost ethdev can use it without re-inventing from scratch.  It also
> should
>    be simpler from the API point of view if allocation and usage are in
>    the same place.  But I don't have a strong opinion here as for now,
> since
>    no real code examples exist, so it's hard to evaluate how they could
> look
>    like.
> 
> But I feel like we're starting to run in circles here as I did already say
> most of that before.


This does seem to be going in circles, especially since there seemed to be technical alignment on the last public call on March 29th. It is not feasible to do a real-world implementation/PoC of every design proposal. Let's have another call so that we can move towards a single solution that the DPDK and OVS communities agree on. I'll set up a call for next week in a similar time slot to the previous one.

John
--

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-08  9:57                               ` Ilya Maximets
@ 2022-04-20 15:39                                 ` Mcnamara, John
  2022-04-20 16:41                                 ` Mcnamara, John
  1 sibling, 0 replies; 58+ messages in thread
From: Mcnamara, John @ 2022-04-20 15:39 UTC (permalink / raw)
  To: Ilya Maximets, Hu, Jiayu, Maxime Coquelin, Van Haaren, Harry,
	Morten Brørup, Richardson, Bruce
  Cc: Pai G, Sunil, Stokes, Ian, Ferriter, Cian, ovs-dev, dev,
	O'Driscoll, Tim, Finn, Emma

> -----Original Message-----
> From: Ilya Maximets <i.maximets@ovn.org>
> Sent: Friday, April 8, 2022 10:58 AM
> To: Hu, Jiayu <jiayu.hu@intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; Morten Brørup <mb@smartsharesystems.com>;
> Richardson, Bruce <bruce.richardson@intel.com>
> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes,
> Ian <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>;
> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> Finn, Emma <emma.finn@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> 
> On 4/8/22 09:13, Hu, Jiayu wrote:
> >
> >
> >> -----Original Message-----
> >> From: Ilya Maximets <i.maximets@ovn.org>
> >> Sent: Thursday, April 7, 2022 10:40 PM
> >> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
> >> <harry.van.haaren@intel.com>; Morten Brørup
> >> <mb@smartsharesystems.com>; Richardson, Bruce
> >> <bruce.richardson@intel.com>
> >> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>;
> Stokes,
> >> Ian <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>;
> Ferriter,
> >> Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> >> dev@dpdk.org; Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll,
> >> Tim <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> >> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>
> >> On 4/7/22 16:25, Maxime Coquelin wrote:
> >>> Hi Harry,
> >>>
> >>> On 4/7/22 16:04, Van Haaren, Harry wrote:
> >>>> Hi OVS & DPDK, Maintainers & Community,
> >>>>
> >>>> Top posting overview of discussion as replies to thread become
> slower:
> >>>> perhaps it is a good time to review and plan for next steps?
> >>>>
> >>>>  From my perspective, it those most vocal in the thread seem to be
> >>>> in favour of the clean rx/tx split ("defer work"), with the
> >>>> tradeoff that the application must be aware of handling the async
> >>>> DMA completions. If there are any concerns opposing upstreaming of
> >>>> this
> >> method, please indicate this promptly, and we can continue technical
> >> discussions here now.
> >>>
> >>> Wasn't there some discussions about handling the Virtio completions
> >>> with the DMA engine? With that, we wouldn't need the deferral of
> work.
> >>
> >> +1
> >>
> >> With the virtio completions handled by DMA itself, the vhost port
> >> turns almost into a real HW NIC.  With that we will not need any
> >> extra manipulations from the OVS side, i.e. no need to defer any
> work
> >> while maintaining clear split between rx and tx operations.
> >
> > First, making DMA do 2B copy would sacrifice performance, and I think
> > we all agree on that.
> 
> I do not agree with that.  Yes, 2B copy by DMA will likely be slower
> than done by CPU, however CPU is going away for dozens or even hundreds
> of thousands of cycles to process a new packet batch or service other
> ports, hence DMA will likely complete the transmission faster than
> waiting for the CPU thread to come back to that task.  In any case,
> this has to be tested.
> 
> > Second, this method comes with an issue of ordering.
> > For example, PMD thread0 enqueue 10 packets to vring0 first, then PMD
> > thread1 enqueue 20 packets to vring0. If PMD thread0 and threa1 have
> > own dedicated DMA device dma0 and dma1, flag/index update for the
> > first 10 packets is done by dma0, and flag/index update for the left
> > 20 packets is done by dma1. But there is no ordering guarantee among
> > different DMA devices, so flag/index update may error. If PMD threads
> > don't have dedicated DMA devices, which means DMA devices are shared
> > among threads, we need lock and pay for lock contention in data-path.
> > Or we can allocate DMA devices for vring dynamically to avoid DMA
> > sharing among threads. But what's the overhead of allocation
> mechanism? Who does it? Any thoughts?
> 
> 1. DMA completion was discussed in context of per-queue allocation, so
> there
>    is no re-ordering in this case.
> 
> 2. Overhead can be minimal if allocated device can stick to the queue
> for a
>    reasonable amount of time without re-allocation on every send.  You
> may
>    look at XPS implementation in lib/dpif-netdev.c in OVS for example
> of
>    such mechanism.  For sure it can not be the same, but ideas can be
> re-used.
> 
> 3. Locking doesn't mean contention if resources are
> allocated/distributed
>    thoughtfully.
> 
> 4. Allocation can be done be either OVS or vhost library itself, I'd
> vote
>    for doing that inside the vhost library, so any DPDK application and
>    vhost ethdev can use it without re-inventing from scratch.  It also
> should
>    be simpler from the API point of view if allocation and usage are in
>    the same place.  But I don't have a strong opinion here as for now,
> since
>    no real code examples exist, so it's hard to evaluate how they could
> look
>    like.
> 
> But I feel like we're starting to run in circles here as I did already
> say most of that before.

This does seem to be going in circles, especially since there seemed to be technical alignment on the last public call on March 29th. It is not feasible to do a real-world implementation/PoC of every design proposal. Let's have another call so that we can move towards a single solution that the DPDK and OVS communities agree on. I'll set up a call for next week in a similar time slot to the previous one.

John
-- 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-08  7:13                             ` Hu, Jiayu
  2022-04-08  8:21                               ` Morten Brørup
@ 2022-04-08  9:57                               ` Ilya Maximets
  2022-04-20 15:39                                 ` Mcnamara, John
  2022-04-20 16:41                                 ` Mcnamara, John
  1 sibling, 2 replies; 58+ messages in thread
From: Ilya Maximets @ 2022-04-08  9:57 UTC (permalink / raw)
  To: Hu, Jiayu, Maxime Coquelin, Van Haaren, Harry,
	Morten Brørup, Richardson, Bruce
  Cc: i.maximets, Pai G, Sunil, Stokes, Ian, Ferriter, Cian, ovs-dev,
	dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On 4/8/22 09:13, Hu, Jiayu wrote:
> 
> 
>> -----Original Message-----
>> From: Ilya Maximets <i.maximets@ovn.org>
>> Sent: Thursday, April 7, 2022 10:40 PM
>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
>> <harry.van.haaren@intel.com>; Morten Brørup
>> <mb@smartsharesystems.com>; Richardson, Bruce
>> <bruce.richardson@intel.com>
>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
>> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
>> <cian.ferriter@intel.com>; ovs-dev@openvswitch.org; dev@dpdk.org;
>> Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll, Tim
>> <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>
>> On 4/7/22 16:25, Maxime Coquelin wrote:
>>> Hi Harry,
>>>
>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
>>>> Hi OVS & DPDK, Maintainers & Community,
>>>>
>>>> Top posting overview of discussion as replies to thread become slower:
>>>> perhaps it is a good time to review and plan for next steps?
>>>>
>>>>  From my perspective, those most vocal in the thread seem to be in
>>>> favour of the clean rx/tx split ("defer work"), with the tradeoff
>>>> that the application must be aware of handling the async DMA
>>>> completions. If there are any concerns opposing upstreaming of this
>> method, please indicate this promptly, and we can continue technical
>> discussions here now.
>>>
>>> Wasn't there some discussions about handling the Virtio completions
>>> with the DMA engine? With that, we wouldn't need the deferral of work.
>>
>> +1
>>
>> With the virtio completions handled by DMA itself, the vhost port turns
>> almost into a real HW NIC.  With that we will not need any extra
>> manipulations from the OVS side, i.e. no need to defer any work while
>> maintaining clear split between rx and tx operations.
> 
> First, making DMA do 2B copy would sacrifice performance, and I think
> we all agree on that.

I do not agree with that.  Yes, a 2B copy by DMA will likely be slower
than one done by the CPU; however, the CPU goes away for dozens or even
hundreds of thousands of cycles to process a new packet batch or service
other ports, so DMA will likely complete the transmission faster than
waiting for the CPU thread to come back to that task.  In any case,
this has to be tested.

> Second, this method comes with an issue of ordering.
> For example, PMD thread0 enqueue 10 packets to vring0 first, then PMD thread1
> enqueue 20 packets to vring0. If PMD thread0 and thread1 have their own dedicated
> DMA device dma0 and dma1, flag/index update for the first 10 packets is done by
> dma0, and flag/index update for the left 20 packets is done by dma1. But there
> is no ordering guarantee among different DMA devices, so flag/index update may
> error. If PMD threads don't have dedicated DMA devices, which means DMA
> devices are shared among threads, we need lock and pay for lock contention in
> data-path. Or we can allocate DMA devices for vring dynamically to avoid DMA
> sharing among threads. But what's the overhead of allocation mechanism? Who
> does it? Any thoughts?

1. DMA completion was discussed in context of per-queue allocation, so there
   is no re-ordering in this case.

2. Overhead can be minimal if allocated device can stick to the queue for a
   reasonable amount of time without re-allocation on every send.  You may
   look at XPS implementation in lib/dpif-netdev.c in OVS for example of
   such mechanism.  For sure it can not be the same, but ideas can be re-used.

3. Locking doesn't mean contention if resources are allocated/distributed
   thoughtfully.

4. Allocation can be done by either OVS or the vhost library itself; I'd vote
   for doing that inside the vhost library, so any DPDK application and
   vhost ethdev can use it without re-inventing it from scratch.  It also
   should be simpler from the API point of view if allocation and usage are
   in the same place.  But I don't have a strong opinion here for now, since
   no real code examples exist, so it's hard to evaluate how they would look.

But I feel like we're starting to run in circles here as I did already say
most of that before.
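
For what it's worth, a purely hypothetical sketch of point 2 (no such code exists in OVS or the vhost library today; every name below is made up for illustration) could cache the DMA assignment per tx queue and only revisit the allocator when the binding is missing or stale, loosely following the spirit of the XPS logic in lib/dpif-netdev.c:

    #include <stdint.h>

    #define DMA_REBIND_INTERVAL_MS 500

    struct txq_dma_binding {
        int16_t  dma_id;        /* -1 when unassigned */
        uint16_t vchan;
        uint64_t last_used_ms;
    };

    /* Returns the DMA device to use for this tx queue, re-allocating
     * only when the cached binding is absent or has gone stale. */
    static int16_t
    txq_dma_get(struct txq_dma_binding *b, uint64_t now_ms,
                int16_t (*alloc_fn)(uint16_t *vchan))
    {
        if (b->dma_id < 0 ||
            now_ms - b->last_used_ms > DMA_REBIND_INTERVAL_MS) {
            b->dma_id = alloc_fn(&b->vchan);   /* hypothetical allocator */
        }
        b->last_used_ms = now_ms;
        return b->dma_id;
    }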

> 
> Thanks,
> Jiayu
> 
>>
>> I'd vote for that.
>>
>>>
>>> Thanks,
>>> Maxime
>>>
>>>> In absence of continued technical discussion here, I suggest Sunil
>>>> and Ian collaborate on getting the OVS Defer-work approach, and DPDK
>>>> VHost Async patchsets available on GitHub for easier consumption and
>> future development (as suggested in slides presented on last call).
>>>>
>>>> Regards, -Harry
>>>>
>>>> No inline-replies below; message just for context.
>>>>
>>>>> -----Original Message-----
>>>>> From: Van Haaren, Harry
>>>>> Sent: Wednesday, March 30, 2022 10:02 AM
>>>>> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
>>>>> <bruce.richardson@intel.com>
>>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>>>>> <Sunil.Pai.G@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu,
>>>>> Jiayu <Jiayu.Hu@intel.com>; Ferriter, Cian
>>>>> <Cian.Ferriter@intel.com>; Ilya Maximets <i.maximets@ovn.org>;
>>>>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
>>>>> <john.mcnamara@intel.com>; O'Driscoll, Tim
>>>>> <tim.odriscoll@intel.com>; Finn, Emma <Emma.Finn@intel.com>
>>>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>> Sent: Tuesday, March 29, 2022 8:59 PM
>>>>>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Richardson,
>>>>>> Bruce <bruce.richardson@intel.com>
>>>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>>>>>> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu,
>>>>>> Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
>>>>>> <cian.ferriter@intel.com>; Ilya Maximets <i.maximets@ovn.org>;
>>>>>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
>>>>> John
>>>>>> <john.mcnamara@intel.com>; O'Driscoll, Tim
>>>>>> <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
>>>>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>
>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>> Sent: Tuesday, 29 March 2022 19.46
>>>>>>>
>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
>>>>>>>>
>>>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>>>> Sent: Tuesday, 29 March 2022 19.03
>>>>>>>>>
>>>>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>>>>>>>>
>>>>>>>>>>> Hi Morten,
>>>>>>>>>>>
>>>>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Having thought more about it, I think that a completely
>>>>>>>>> different
>>>>>>>>>>> architectural approach is required:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX
>>>>>>> and TX
>>>>>>>>>>> packet burst functions, each optimized for different CPU
>>>>>>>>>>> vector instruction sets. The availability of a DMA engine
>>>>>>>>>>> should be
>>>>>>>>> treated
>>>>>>>>>>> the same way. So I suggest that PMDs copying packet contents,
>>>>>>> e.g.
>>>>>>>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and
>> TX
>>>>>>>>> packet
>>>>>>>>>>> burst functions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In such an architecture, it would be the application's job
>>>>>>> to
>>>>>>>>>>> allocate DMA channels and assign them to the specific PMDs
>>>>>>>>>>> that
>>>>>>>>> should
>>>>>>>>>>> use them. But the actual use of the DMA channels would move
>>>>>>> down
>>>>>>>>> below
>>>>>>>>>>> the application and into the DPDK PMDs and libraries.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Med venlig hilsen / Kind regards, -Morten Brørup
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Morten,
>>>>>>>>>>>>>
>>>>>>>>>>>>> That's *exactly* how this architecture is designed &
>>>>>>>>> implemented.
>>>>>>>>>>>>> 1.    The DMA configuration and initialization is up to the
>>>>>>>>> application
>>>>>>>>>>> (OVS).
>>>>>>>>>>>>> 2.    The VHost library is passed the DMA-dev ID, and its
>>>>>>> new
>>>>>>>>> async
>>>>>>>>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking forward to talking on the call that just started.
>>>>>>>>> Regards, -
>>>>>>>>>>> Harry
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the
>>>>>>>>> patches.
>>>>>>>>>>>>
>>>>>>>>>>>> Then, I suppose that the TX completions can be handled in the
>>>>>>> TX
>>>>>>>>>>> function, and the RX completions can be handled in the RX
>>>>>>> function,
>>>>>>>>>>> just like the Ethdev PMDs handle packet descriptors:
>>>>>>>>>>>>
>>>>>>>>>>>> TX_Burst(tx_packet_array):
>>>>>>>>>>>> 1.    Clean up descriptors processed by the NIC chip. -->
>>>>>>> Process
>>>>>>>>> TX
>>>>>>>>>>> DMA channel completions. (Effectively, the 2nd pipeline
>>>>>>>>>>> stage.)
>>>>>>>>>>>> 2.    Pass on the tx_packet_array to the NIC chip
>>>>>>> descriptors. --
>>>>>>>>>> Pass
>>>>>>>>>>> on the tx_packet_array to the TX DMA channel. (Effectively,
>>>>>>>>>>> the
>>>>>>> 1st
>>>>>>>>>>> pipeline stage.)
>>>>>>>>>>>
>>>>>>>>>>> The problem is Tx function might not be called again, so
>>>>>>> enqueued
>>>>>>>>>>> packets in 2. may never be completed from a Virtio point of
>>>>>>> view.
>>>>>>>>> IOW,
>>>>>>>>>>> the packets will be copied to the Virtio descriptors buffers,
>>>>>>> but
>>>>>>>>> the
>>>>>>>>>>> descriptors will not be made available to the Virtio driver.
>>>>>>>>>>
>>>>>>>>>> In that case, the application needs to call TX_Burst()
>>>>>>> periodically
>>>>>>>>> with an empty array, for completion purposes.
>>>>>>>
>>>>>>> This is what the "defer work" does at the OVS thread-level, but
>>>>>>> instead of "brute-forcing" and *always* making the call, the defer
>>>>>>> work concept tracks
>>>>>>> *when* there is outstanding work (DMA copies) to be completed
>>>>>>> ("deferred work") and calls the generic completion function at
>>>>>>> that point.
>>>>>>>
>>>>>>> So "defer work" is generic infrastructure at the OVS thread level
>>>>>>> to handle work that needs to be done "later", e.g. DMA completion
>>>>>>> handling.
>>>>>>>
>>>>>>>
>>>>>>>>>> Or some sort of TX_Keepalive() function can be added to the
>>>>>>>>>> DPDK
>>>>>>>>> library, to handle DMA completion. It might even handle multiple
>>>>>>> DMA
>>>>>>>>> channels, if convenient - and if possible without locking or
>>>>>>>>> other weird complexity.
>>>>>>>
>>>>>>> That's exactly how it is done, the VHost library has a new API
>>>>>>> added, which allows for handling completions. And in the "Netdev
>>>>>>> layer" (~OVS ethdev
>>>>>>> abstraction)
>>>>>>> we add a function to allow the OVS thread to do those completions
>>>>>>> in a new Netdev-abstraction API called "async_process" where the
>>>>>>> completions can be checked.
>>>>>>>
>>>>>>> The only method to abstract them is to "hide" them somewhere that
>>>>>>> will always be polled, e.g. an ethdev port's RX function.  Both V3
>>>>>>> and V4 approaches use this method.
>>>>>>> This allows "completions" to be transparent to the app, at the
>>>>>>> tradeoff to having bad separation  of concerns as Rx and Tx are
>>>>>>> now tied-together.
>>>>>>>
>>>>>>> The point is, the Application layer must *somehow * handle of
>>>>>>> completions.
>>>>>>> So fundamentally there are 2 options for the Application level:
>>>>>>>
>>>>>>> A) Make the application periodically call a "handle completions"
>>>>>>> function
>>>>>>>     A1) Defer work, call when needed, and track "needed" at app
>>>>>>> layer, and calling into vhost txq complete as required.
>>>>>>>             Elegant in that "no work" means "no cycles spent" on
>>>>>>> checking DMA completions.
>>>>>>>     A2) Brute-force-always-call, and pay some overhead when not
>>>>>>> required.
>>>>>>>             Cycle-cost in "no work" scenarios. Depending on # of
>>>>>>> vhost queues, this adds up as polling required *per vhost txq*.
>>>>>>>             Also note that "checking DMA completions" means taking
>>>>>>> a virtq-lock, so this "brute-force" can needlessly increase
>>>>>>> x-thread contention!
>>>>>>
>>>>>> A side note: I don't see why locking is required to test for DMA
>> completions.
>>>>>> rte_dma_vchan_status() is lockless, e.g.:
>>>>>>
>>>>> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_
>>>>> dmadev.c#L
>>>>> 56
>>>>>> 0
>>>>>
>>>>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a
>>>>> lockfree manner from a single thread.
>>>>>
>>>>> The locks I refer to are at the OVS-netdev level, as virtq's are
>>>>> shared across OVS's dataplane threads.
>>>>> So the "M to N" comes from M dataplane threads to N virtqs, hence
>>>>> requiring some locking.
>>>>>
>>>>>
>>>>>>> B) Hide completions and live with the complexity/architectural
>>>>>>> sacrifice of mixed-RxTx.
>>>>>>>     Various downsides here in my opinion, see the slide deck
>>>>>>> presented earlier today for a summary.
>>>>>>>
>>>>>>> In my opinion, A1 is the most elegant solution, as it has a clean
>>>>>>> separation of concerns, does not  cause avoidable contention on
>>>>>>> virtq locks, and spends no cycles when there is no completion work
>>>>>>> to do.
>>>>>>>
>>>>>>
>>>>>> Thank you for elaborating, Harry.
>>>>>
>>>>> Thanks for part-taking in the discussion & providing your insight!
>>>>>
>>>>>> I strongly oppose against hiding any part of TX processing in an RX
>>>>>> function. It
>>>>> is just
>>>>>> wrong in so many ways!
>>>>>>
>>>>>> I agree that A1 is the most elegant solution. And being the most
>>>>>> elegant
>>>>> solution, it
>>>>>> is probably also the most future proof solution. :-)
>>>>>
>>>>> I think so too, yes.
>>>>>
>>>>>> I would also like to stress that DMA completion handling belongs in
>>>>>> the DPDK library, not in the application. And yes, the application
>>>>>> will be required to call
>>>>> some
>>>>>> "handle DMA completions" function in the DPDK library. But since
>>>>>> the
>>>>> application
>>>>>> already knows that it uses DMA, the application should also know
>>>>>> that it needs
>>>>> to
>>>>>> call this extra function - so I consider this requirement perfectly
>> acceptable.
>>>>>
>>>>> Agree here.
>>>>>
>>>>>> I prefer if the DPDK vhost library can hide its inner workings from
>>>>>> the
>>>>> application,
>>>>>> and just expose the additional "handle completions" function. This
>>>>>> also means
>>>>> that
>>>>>> the inner workings can be implemented as "defer work", or by some
>>>>>> other algorithm. And it can be tweaked and optimized later.
>>>>>
>>>>> Yes, the choice in how to call the handle_completions function is
>>>>> Application layer.
>>>>> For OVS we designed Defer Work, V3 and V4. But it is an App level
>>>>> choice, and every application is free to choose its own method.
>>>>>
>>>>>> Thinking about the long term perspective, this design pattern is
>>>>>> common for
>>>>> both
>>>>>> the vhost library and other DPDK libraries that could benefit from DMA
>> (e.g.
>>>>>> vmxnet3 and pcap PMDs), so it could be abstracted into the DMA
>>>>>> library or a separate library. But for now, we should focus on the
>>>>>> vhost use case, and just
>>>>> keep
>>>>>> the long term roadmap for using DMA in mind.
>>>>>
>>>>> Totally agree to keep long term roadmap in mind; but I'm not sure we
>>>>> can refactor logic out of vhost. When DMA-completions arrive, the
>>>>> virtQ needs to be updated; this causes a tight coupling between the
>>>>> DMA completion count, and the vhost library.
>>>>>
>>>>> As Ilya raised on the call yesterday, there is an "in_order"
>>>>> requirement in the vhost library, that per virtq the packets are
>>>>> presented to the guest "in order" of enqueue.
>>>>> (To be clear, *not* order of DMA-completion! As Jiayu mentioned, the
>>>>> Vhost library handles this today by re-ordering the DMA
>>>>> completions.)
>>>>>
>>>>>
>>>>>> Rephrasing what I said on the conference call: This vhost design
>>>>>> will become
>>>>> the
>>>>>> common design pattern for using DMA in DPDK libraries. If we get it
>>>>>> wrong, we
>>>>> are
>>>>>> stuck with it.
>>>>>
>>>>> Agree, and if we get it right, then we're stuck with it too! :)
>>>>>
>>>>>
>>>>>>>>>> Here is another idea, inspired by a presentation at one of the
>>>>>>> DPDK
>>>>>>>>> Userspace conferences. It may be wishful thinking, though:
>>>>>>>>>>
>>>>>>>>>> Add an additional transaction to each DMA burst; a special
>>>>>>>>> transaction containing the memory write operation that makes the
>>>>>>>>> descriptors available to the Virtio driver.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That is something that can work, so long as the receiver is
>>>>>>> operating
>>>>>>>>> in
>>>>>>>>> polling mode. For cases where virtio interrupts are enabled, you
>>>>>>> still
>>>>>>>>> need
>>>>>>>>> to do a write to the eventfd in the kernel in vhost to signal
>>>>>>>>> the virtio side. That's not something that can be offloaded to a
>>>>>>>>> DMA engine, sadly, so we still need some form of completion
>>>>>>>>> call.
>>>>>>>>
>>>>>>>> I guess that virtio interrupts is the most widely deployed
>>>>>>>> scenario,
>>>>>>> so let's ignore
>>>>>>>> the DMA TX completion transaction for now - and call it a
>>>>>>>> possible
>>>>>>> future
>>>>>>>> optimization for specific use cases. So it seems that some form
>>>>>>>> of
>>>>>>> completion call
>>>>>>>> is unavoidable.
>>>>>>>
>>>>>>> Agree to leave this aside, there is in theory a potential
>>>>>>> optimization, but unlikely to be of large value.
>>>>>>>
>>>>>>
>>>>>> One more thing: When using DMA to pass on packets into a guest,
>> there could
>>>>> be a
>>>>>> delay from the DMA completes until the guest is signaled. Is there any
>> CPU
>>>>> cache
>>>>>> hotness regarding the guest's access to the packet data to consider here?
>> I.e. if
>>>>> we
>>>>>> wait signaling the guest, the packet data may get cold.
>>>>>
>>>>> Interesting question; we can likely spawn a new thread around this topic!
>>>>> In short, it depends on how/where the DMA hardware writes the copy.
>>>>>
>>>>> With technologies like DDIO, the "dest" part of the copy will be in LLC.
>> The core
>>>>> reading the
>>>>> dest data will benefit from the LLC locality (instead of snooping it from a
>> remote
>>>>> core's L1/L2).
>>>>>
>>>>> Delays in notifying the guest could result in LLC capacity eviction, yes.
>>>>> The application layer decides how often/promptly to check for
>> completions,
>>>>> and notify the guest of them. Calling the function more often will result
>> in less
>>>>> delay in that portion of the pipeline.
>>>>>
>>>>> Overall, there are caching benefits with DMA acceleration, and the
>> application
>>>>> can control
>>>>> the latency introduced between dma-completion done in HW, and Guest
>> vring
>>>>> update.
>>>>
>>>
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-08  7:13                             ` Hu, Jiayu
@ 2022-04-08  8:21                               ` Morten Brørup
  2022-04-08  9:57                               ` Ilya Maximets
  1 sibling, 0 replies; 58+ messages in thread
From: Morten Brørup @ 2022-04-08  8:21 UTC (permalink / raw)
  To: Hu, Jiayu, Ilya Maximets, Maxime Coquelin, Van Haaren, Harry,
	Richardson, Bruce
  Cc: Pai G, Sunil, Stokes, Ian, Ferriter, Cian, ovs-dev, dev,
	Mcnamara, John, O'Driscoll, Tim, Finn, Emma

> From: Hu, Jiayu [mailto:jiayu.hu@intel.com]
> Sent: Friday, 8 April 2022 09.14
> 
> > From: Ilya Maximets <i.maximets@ovn.org>
> >
> > On 4/7/22 16:25, Maxime Coquelin wrote:
> > > Hi Harry,
> > >
> > > On 4/7/22 16:04, Van Haaren, Harry wrote:
> > >> Hi OVS & DPDK, Maintainers & Community,
> > >>
> > >> Top posting overview of discussion as replies to thread become
> slower:
> > >> perhaps it is a good time to review and plan for next steps?
> > >>
> > >>  From my perspective, it those most vocal in the thread seem to be
> in
> > >> favour of the clean rx/tx split ("defer work"), with the tradeoff
> > >> that the application must be aware of handling the async DMA
> > >> completions. If there are any concerns opposing upstreaming of
> this
> > method, please indicate this promptly, and we can continue technical
> > discussions here now.
> > >
> > > Wasn't there some discussions about handling the Virtio completions
> > > with the DMA engine? With that, we wouldn't need the deferral of
> work.
> >
> > +1
> >
> > With the virtio completions handled by DMA itself, the vhost port
> turns
> > almost into a real HW NIC.  With that we will not need any extra
> > manipulations from the OVS side, i.e. no need to defer any work while
> > maintaining clear split between rx and tx operations.
> 
> First, making DMA do 2B copy would sacrifice performance, and I think
> we all agree on that. Second, this method comes with an issue of
> ordering.
> For example, PMD thread0 enqueue 10 packets to vring0 first, then PMD
> thread1
> enqueue 20 packets to vring0. If PMD thread0 and threa1 have own
> dedicated
> DMA device dma0 and dma1, flag/index update for the first 10 packets is
> done by
> dma0, and flag/index update for the left 20 packets is done by dma1.
> But there
> is no ordering guarantee among different DMA devices, so flag/index
> update may
> error. If PMD threads don't have dedicated DMA devices, which means DMA
> devices are shared among threads, we need lock and pay for lock
> contention in
> data-path. Or we can allocate DMA devices for vring dynamically to
> avoid DMA
> sharing among threads. But what's the overhead of allocation mechanism?
> Who
> does it? Any thoughts?
> 

Think of it like a hardware NIC... what are the constraints for a hardware NIC:

Two threads writing simultaneously into the same NIC TX queue is not possible, and would be an application design error. With a hardware NIC, you use separate TX queues for each thread.

Having two threads write into the same TX queue while maintaining ordering is not possible without additional application logic. This could be a pipeline stage with a lockless multi-producer, single-consumer ring in front of the NIC TX queue, or it could be a critical section preventing one thread from writing while another thread is writing.

Either way, multiple threads writing simultaneously into the same NIC TX queue is not possible with hardware NIC drivers; the serialization must be implemented in the application. So why would anyone expect it to be possible for virtual NIC drivers (such as the vhost)?
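
For illustration, the pipeline-stage variant could be sketched with an rte_ring roughly as below (the ring and the helper names are placeholders, not existing OVS code): several PMD threads enqueue lock-free into a multi-producer ring, and the single thread that owns the vhost TX queue drains it, which preserves enqueue order without a lock in the producers.

    #include <rte_ring.h>
    #include <rte_mbuf.h>

    #define STAGE_RING_SIZE 1024

    /* Multi-producer enqueue (default), single-consumer dequeue. */
    static struct rte_ring *
    stage_ring_create(int socket_id)
    {
        return rte_ring_create("vhost_txq_stage", STAGE_RING_SIZE,
                               socket_id, RING_F_SC_DEQ);
    }

    /* Producers: any PMD thread. */
    static inline unsigned int
    stage_enqueue(struct rte_ring *r, struct rte_mbuf **pkts, unsigned int n)
    {
        return rte_ring_enqueue_burst(r, (void *const *)pkts, n, NULL);
    }

    /* Consumer: the single thread that owns the vhost TX queue. */
    static inline unsigned int
    stage_drain(struct rte_ring *r, struct rte_mbuf **pkts, unsigned int n)
    {
        return rte_ring_dequeue_burst(r, (void **)pkts, n, NULL);
    }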


> Thanks,
> Jiayu
> 
> >
> > I'd vote for that.
> >
> > >
> > > Thanks,
> > > Maxime
> > >
> > >> In absence of continued technical discussion here, I suggest Sunil
> > >> and Ian collaborate on getting the OVS Defer-work approach, and
> DPDK
> > >> VHost Async patchsets available on GitHub for easier consumption
> and
> > future development (as suggested in slides presented on last call).
> > >>
> > >> Regards, -Harry
> > >>
> > >> No inline-replies below; message just for context.
> > >>
> > >>> -----Original Message-----
> > >>> From: Van Haaren, Harry
> > >>> Sent: Wednesday, March 30, 2022 10:02 AM
> > >>> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
> > >>> <bruce.richardson@intel.com>
> > >>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
> > >>> <Sunil.Pai.G@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu,
> > >>> Jiayu <Jiayu.Hu@intel.com>; Ferriter, Cian
> > >>> <Cian.Ferriter@intel.com>; Ilya Maximets <i.maximets@ovn.org>;
> > >>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> > >>> <john.mcnamara@intel.com>; O'Driscoll, Tim
> > >>> <tim.odriscoll@intel.com>; Finn, Emma <Emma.Finn@intel.com>
> > >>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Morten Brørup <mb@smartsharesystems.com>
> > >>>> Sent: Tuesday, March 29, 2022 8:59 PM
> > >>>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Richardson,
> > >>>> Bruce <bruce.richardson@intel.com>
> > >>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
> > >>>> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu,
> > >>>> Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
> > >>>> <cian.ferriter@intel.com>; Ilya Maximets <i.maximets@ovn.org>;
> > >>>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
> > >>> John
> > >>>> <john.mcnamara@intel.com>; O'Driscoll, Tim
> > >>>> <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> > >>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> > >>>>
> > >>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > >>>>> Sent: Tuesday, 29 March 2022 19.46
> > >>>>>
> > >>>>>> From: Morten Brørup <mb@smartsharesystems.com>
> > >>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
> > >>>>>>
> > >>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > >>>>>>> Sent: Tuesday, 29 March 2022 19.03
> > >>>>>>>
> > >>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup
> wrote:
> > >>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > >>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
> > >>>>>>>>>
> > >>>>>>>>> Hi Morten,
> > >>>>>>>>>
> > >>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
> > >>>>>>>>>>> From: Van Haaren, Harry
> [mailto:harry.van.haaren@intel.com]
> > >>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
> > >>>>>>>>>>>
> > >>>>>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
> > >>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Having thought more about it, I think that a completely
> > >>>>>>> different
> > >>>>>>>>> architectural approach is required:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX
> > >>>>> and TX
> > >>>>>>>>> packet burst functions, each optimized for different CPU
> > >>>>>>>>> vector instruction sets. The availability of a DMA engine
> > >>>>>>>>> should be
> > >>>>>>> treated
> > >>>>>>>>> the same way. So I suggest that PMDs copying packet
> contents,
> > >>>>> e.g.
> > >>>>>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and
> > TX
> > >>>>>>> packet
> > >>>>>>>>> burst functions.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Similarly for the DPDK vhost library.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In such an architecture, it would be the application's
> job
> > >>>>> to
> > >>>>>>>>> allocate DMA channels and assign them to the specific PMDs
> > >>>>>>>>> that
> > >>>>>>> should
> > >>>>>>>>> use them. But the actual use of the DMA channels would move
> > >>>>> down
> > >>>>>>> below
> > >>>>>>>>> the application and into the DPDK PMDs and libraries.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Med venlig hilsen / Kind regards, -Morten Brørup
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi Morten,
> > >>>>>>>>>>>
> > >>>>>>>>>>> That's *exactly* how this architecture is designed &
> > >>>>>>> implemented.
> > >>>>>>>>>>> 1.    The DMA configuration and initialization is up to
> the
> > >>>>>>> application
> > >>>>>>>>> (OVS).
> > >>>>>>>>>>> 2.    The VHost library is passed the DMA-dev ID, and its
> > >>>>> new
> > >>>>>>> async
> > >>>>>>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Looking forward to talking on the call that just started.
> > >>>>>>> Regards, -
> > >>>>>>>>> Harry
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at
> the
> > >>>>>>> patches.
> > >>>>>>>>>>
> > >>>>>>>>>> Then, I suppose that the TX completions can be handled in
> the
> > >>>>> TX
> > >>>>>>>>> function, and the RX completions can be handled in the RX
> > >>>>> function,
> > >>>>>>>>> just like the Ethdev PMDs handle packet descriptors:
> > >>>>>>>>>>
> > >>>>>>>>>> TX_Burst(tx_packet_array):
> > >>>>>>>>>> 1.    Clean up descriptors processed by the NIC chip. -->
> > >>>>> Process
> > >>>>>>> TX
> > >>>>>>>>> DMA channel completions. (Effectively, the 2nd pipeline
> > >>>>>>>>> stage.)
> > >>>>>>>>>> 2.    Pass on the tx_packet_array to the NIC chip
> > >>>>> descriptors. --
> > >>>>>>>> Pass
> > >>>>>>>>> on the tx_packet_array to the TX DMA channel. (Effectively,
> > >>>>>>>>> the
> > >>>>> 1st
> > >>>>>>>>> pipeline stage.)
> > >>>>>>>>>
> > >>>>>>>>> The problem is Tx function might not be called again, so
> > >>>>> enqueued
> > >>>>>>>>> packets in 2. may never be completed from a Virtio point of
> > >>>>> view.
> > >>>>>>> IOW,
> > >>>>>>>>> the packets will be copied to the Virtio descriptors
> buffers,
> > >>>>> but
> > >>>>>>> the
> > >>>>>>>>> descriptors will not be made available to the Virtio
> driver.
> > >>>>>>>>
> > >>>>>>>> In that case, the application needs to call TX_Burst()
> > >>>>> periodically
> > >>>>>>> with an empty array, for completion purposes.
> > >>>>>
> > >>>>> This is what the "defer work" does at the OVS thread-level, but
> > >>>>> instead of "brute-forcing" and *always* making the call, the
> defer
> > >>>>> work concept tracks
> > >>>>> *when* there is outstanding work (DMA copies) to be completed
> > >>>>> ("deferred work") and calls the generic completion function at
> > >>>>> that point.
> > >>>>>
> > >>>>> So "defer work" is generic infrastructure at the OVS thread
> level
> > >>>>> to handle work that needs to be done "later", e.g. DMA
> completion
> > >>>>> handling.
> > >>>>>
> > >>>>>
> > >>>>>>>> Or some sort of TX_Keepalive() function can be added to the
> > >>>>>>>> DPDK
> > >>>>>>> library, to handle DMA completion. It might even handle
> multiple
> > >>>>> DMA
> > >>>>>>> channels, if convenient - and if possible without locking or
> > >>>>>>> other weird complexity.
> > >>>>>
> > >>>>> That's exactly how it is done, the VHost library has a new API
> > >>>>> added, which allows for handling completions. And in the
> "Netdev
> > >>>>> layer" (~OVS ethdev
> > >>>>> abstraction)
> > >>>>> we add a function to allow the OVS thread to do those
> completions
> > >>>>> in a new Netdev-abstraction API called "async_process" where
> the
> > >>>>> completions can be checked.
> > >>>>>
> > >>>>> The only method to abstract them is to "hide" them somewhere
> that
> > >>>>> will always be polled, e.g. an ethdev port's RX function.  Both
> V3
> > >>>>> and V4 approaches use this method.
> > >>>>> This allows "completions" to be transparent to the app, at the
> > >>>>> tradeoff to having bad separation  of concerns as Rx and Tx are
> > >>>>> now tied-together.
> > >>>>>
> > >>>>> The point is, the Application layer must *somehow * handle of
> > >>>>> completions.
> > >>>>> So fundamentally there are 2 options for the Application level:
> > >>>>>
> > >>>>> A) Make the application periodically call a "handle
> completions"
> > >>>>> function
> > >>>>>     A1) Defer work, call when needed, and track "needed" at app
> > >>>>> layer, and calling into vhost txq complete as required.
> > >>>>>             Elegant in that "no work" means "no cycles spent"
> on
> > >>>>> checking DMA completions.
> > >>>>>     A2) Brute-force-always-call, and pay some overhead when not
> > >>>>> required.
> > >>>>>             Cycle-cost in "no work" scenarios. Depending on #
> of
> > >>>>> vhost queues, this adds up as polling required *per vhost txq*.
> > >>>>>             Also note that "checking DMA completions" means
> taking
> > >>>>> a virtq-lock, so this "brute-force" can needlessly increase
> > >>>>> x-thread contention!
> > >>>>
> > >>>> A side note: I don't see why locking is required to test for DMA
> > completions.
> > >>>> rte_dma_vchan_status() is lockless, e.g.:
> > >>>>
> > >>>
> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_
> > >>> dmadev.c#L
> > >>> 56
> > >>>> 0
> > >>>
> > >>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a
> > >>> lockfree manner from a single thread.
> > >>>
> > >>> The locks I refer to are at the OVS-netdev level, as virtq's are
> > >>> shared across OVS's dataplane threads.
> > >>> So the "M to N" comes from M dataplane threads to N virtqs, hence
> > >>> requiring some locking.
> > >>>
> > >>>
> > >>>>> B) Hide completions and live with the complexity/architectural
> > >>>>> sacrifice of mixed-RxTx.
> > >>>>>     Various downsides here in my opinion, see the slide deck
> > >>>>> presented earlier today for a summary.
> > >>>>>
> > >>>>> In my opinion, A1 is the most elegant solution, as it has a
> clean
> > >>>>> separation of concerns, does not  cause avoidable contention on
> > >>>>> virtq locks, and spends no cycles when there is no completion
> work
> > >>>>> to do.
> > >>>>>
> > >>>>
> > >>>> Thank you for elaborating, Harry.
> > >>>
> > >>> Thanks for part-taking in the discussion & providing your
> insight!
> > >>>
> > >>>> I strongly oppose against hiding any part of TX processing in an
> RX
> > >>>> function. It
> > >>> is just
> > >>>> wrong in so many ways!
> > >>>>
> > >>>> I agree that A1 is the most elegant solution. And being the most
> > >>>> elegant
> > >>> solution, it
> > >>>> is probably also the most future proof solution. :-)
> > >>>
> > >>> I think so too, yes.
> > >>>
> > >>>> I would also like to stress that DMA completion handling belongs
> in
> > >>>> the DPDK library, not in the application. And yes, the
> application
> > >>>> will be required to call
> > >>> some
> > >>>> "handle DMA completions" function in the DPDK library. But since
> > >>>> the
> > >>> application
> > >>>> already knows that it uses DMA, the application should also know
> > >>>> that it needs
> > >>> to
> > >>>> call this extra function - so I consider this requirement
> perfectly
> > acceptable.
> > >>>
> > >>> Agree here.
> > >>>
> > >>>> I prefer if the DPDK vhost library can hide its inner workings
> from
> > >>>> the
> > >>> application,
> > >>>> and just expose the additional "handle completions" function.
> This
> > >>>> also means
> > >>> that
> > >>>> the inner workings can be implemented as "defer work", or by
> some
> > >>>> other algorithm. And it can be tweaked and optimized later.
> > >>>
> > >>> Yes, the choice in how to call the handle_completions function is
> > >>> Application layer.
> > >>> For OVS we designed Defer Work, V3 and V4. But it is an App level
> > >>> choice, and every application is free to choose its own method.
> > >>>
> > >>>> Thinking about the long term perspective, this design pattern is
> > >>>> common for
> > >>> both
> > >>>> the vhost library and other DPDK libraries that could benefit
> from DMA
> > (e.g.
> > >>>> vmxnet3 and pcap PMDs), so it could be abstracted into the DMA
> > >>>> library or a separate library. But for now, we should focus on
> the
> > >>>> vhost use case, and just
> > >>> keep
> > >>>> the long term roadmap for using DMA in mind.
> > >>>
> > >>> Totally agree to keep long term roadmap in mind; but I'm not sure
> we
> > >>> can refactor logic out of vhost. When DMA-completions arrive, the
> > >>> virtQ needs to be updated; this causes a tight coupling between
> the
> > >>> DMA completion count, and the vhost library.
> > >>>
> > >>> As Ilya raised on the call yesterday, there is an "in_order"
> > >>> requirement in the vhost library, that per virtq the packets are
> > >>> presented to the guest "in order" of enqueue.
> > >>> (To be clear, *not* order of DMA-completion! As Jiayu mentioned,
> the
> > >>> Vhost library handles this today by re-ordering the DMA
> > >>> completions.)
> > >>>
> > >>>
> > >>>> Rephrasing what I said on the conference call: This vhost design
> > >>>> will become
> > >>> the
> > >>>> common design pattern for using DMA in DPDK libraries. If we get
> it
> > >>>> wrong, we
> > >>> are
> > >>>> stuck with it.
> > >>>
> > >>> Agree, and if we get it right, then we're stuck with it too! :)
> > >>>
> > >>>
> > >>>>>>>> Here is another idea, inspired by a presentation at one of
> the
> > >>>>> DPDK
> > >>>>>>> Userspace conferences. It may be wishful thinking, though:
> > >>>>>>>>
> > >>>>>>>> Add an additional transaction to each DMA burst; a special
> > >>>>>>> transaction containing the memory write operation that makes
> the
> > >>>>>>> descriptors available to the Virtio driver.
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> That is something that can work, so long as the receiver is
> > >>>>> operating
> > >>>>>>> in
> > >>>>>>> polling mode. For cases where virtio interrupts are enabled,
> you
> > >>>>> still
> > >>>>>>> need
> > >>>>>>> to do a write to the eventfd in the kernel in vhost to signal
> > >>>>>>> the virtio side. That's not something that can be offloaded
> to a
> > >>>>>>> DMA engine, sadly, so we still need some form of completion
> > >>>>>>> call.
> > >>>>>>
> > >>>>>> I guess that virtio interrupts is the most widely deployed
> > >>>>>> scenario,
> > >>>>> so let's ignore
> > >>>>>> the DMA TX completion transaction for now - and call it a
> > >>>>>> possible
> > >>>>> future
> > >>>>>> optimization for specific use cases. So it seems that some
> form
> > >>>>>> of
> > >>>>> completion call
> > >>>>>> is unavoidable.
> > >>>>>
> > >>>>> Agree to leave this aside, there is in theory a potential
> > >>>>> optimization, but unlikely to be of large value.
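[For illustration, a minimal sketch of the "extra DMA transaction" idea discussed above, using the public dmadev API and assuming a polling-mode guest. The buffer and flag addresses (dst_iova, avail_flag_host, avail_flag_guest) are hypothetical placeholders for the real virtqueue layout, and the eventfd kick needed for interrupt mode would still require a CPU-side completion call, as noted above.]

#include <rte_dmadev.h>
#include <rte_mbuf.h>

static int
enqueue_burst_with_flag_write(int16_t dma_id, uint16_t vchan,
                              struct rte_mbuf **pkts, uint16_t n,
                              rte_iova_t *dst_iova,        /* guest buffers */
                              rte_iova_t avail_flag_host,  /* staged flag/index value */
                              rte_iova_t avail_flag_guest) /* vring flag/index location */
{
    uint16_t i;

    /* Payload copies; their relative order does not matter. */
    for (i = 0; i < n; i++) {
        if (rte_dma_copy(dma_id, vchan, rte_mbuf_data_iova(pkts[i]),
                         dst_iova[i], rte_pktmbuf_data_len(pkts[i]), 0) < 0)
            return -1;
    }

    /* Final small copy that flips the avail flag/index.  The fence makes the
     * hardware finish all previous copies first, so a polling guest never
     * sees the flag before the data. */
    if (rte_dma_copy(dma_id, vchan, avail_flag_host, avail_flag_guest,
                     sizeof(uint16_t), RTE_DMA_OP_FLAG_FENCE) < 0)
        return -1;

    return rte_dma_submit(dma_id, vchan);
}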
> > >>>>>
> > >>>>
> > >>>> One more thing: When using DMA to pass on packets into a guest,
> > there could
> > >>> be a
> > >>>> delay from when the DMA completes until the guest is signaled. Is
> there any
> > CPU
> > >>> cache
> > >>>> hotness regarding the guest's access to the packet data to
> consider here?
> > I.e. if
> > >>> we
> > >>>> delay signaling the guest, the packet data may get cold.
> > >>>
> > >>> Interesting question; we can likely spawn a new thread around
> this topic!
> > >>> In short, it depends on how/where the DMA hardware writes the
> copy.
> > >>>
> > >>> With technologies like DDIO, the "dest" part of the copy will be
> in LLC.
> > The core
> > >>> reading the
> > >>> dest data will benefit from the LLC locality (instead of snooping
> it from a
> > remote
> > >>> core's L1/L2).
> > >>>
> > >>> Delays in notifying the guest could result in LLC capacity
> eviction, yes.
> > >>> The application layer decides how often/promptly to check for
> > completions,
> > >>> and notify the guest of them. Calling the function more often
> will result
> > in less
> > >>> delay in that portion of the pipeline.
> > >>>
> > >>> Overall, there are caching benefits with DMA acceleration, and
> the
> > application
> > >>> can control
> > >>> the latency introduced between dma-completion done in HW, and
> Guest
> > vring
> > >>> update.
> > >>
> > >


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-07 14:39                           ` Ilya Maximets
  2022-04-07 14:42                             ` Van Haaren, Harry
@ 2022-04-08  7:13                             ` Hu, Jiayu
  2022-04-08  8:21                               ` Morten Brørup
  2022-04-08  9:57                               ` Ilya Maximets
  1 sibling, 2 replies; 58+ messages in thread
From: Hu, Jiayu @ 2022-04-08  7:13 UTC (permalink / raw)
  To: Ilya Maximets, Maxime Coquelin, Van Haaren, Harry,
	Morten Brørup, Richardson, Bruce
  Cc: Pai G, Sunil, Stokes, Ian, Ferriter, Cian, ovs-dev, dev,
	Mcnamara, John, O'Driscoll, Tim, Finn, Emma



> -----Original Message-----
> From: Ilya Maximets <i.maximets@ovn.org>
> Sent: Thursday, April 7, 2022 10:40 PM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; Morten Brørup
> <mb@smartsharesystems.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
> <cian.ferriter@intel.com>; ovs-dev@openvswitch.org; dev@dpdk.org;
> Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll, Tim
> <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> 
> On 4/7/22 16:25, Maxime Coquelin wrote:
> > Hi Harry,
> >
> > On 4/7/22 16:04, Van Haaren, Harry wrote:
> >> Hi OVS & DPDK, Maintainers & Community,
> >>
> >> Top posting overview of discussion as replies to thread become slower:
> >> perhaps it is a good time to review and plan for next steps?
> >>
> >>  From my perspective, those most vocal in the thread seem to be in
> >> favour of the clean rx/tx split ("defer work"), with the tradeoff
> >> that the application must be aware of handling the async DMA
> >> completions. If there are any concerns opposing upstreaming of this
> method, please indicate this promptly, and we can continue technical
> discussions here now.
> >
> > Wasn't there some discussions about handling the Virtio completions
> > with the DMA engine? With that, we wouldn't need the deferral of work.
> 
> +1
> 
> With the virtio completions handled by DMA itself, the vhost port turns
> almost into a real HW NIC.  With that we will not need any extra
> manipulations from the OVS side, i.e. no need to defer any work while
> maintaining clear split between rx and tx operations.

First, making the DMA engine do the 2B flag copy would sacrifice performance,
and I think we all agree on that. Second, this method comes with an ordering
issue. For example, PMD thread0 enqueues 10 packets to vring0 first, then PMD
thread1 enqueues 20 packets to vring0. If thread0 and thread1 each have their
own dedicated DMA device, dma0 and dma1, the flag/index update for the first
10 packets is done by dma0 and the update for the remaining 20 packets is done
by dma1. But there is no ordering guarantee among different DMA devices, so
the flag/index update may be applied out of order. If PMD threads don't have
dedicated DMA devices, i.e. DMA devices are shared among threads, we need a
lock and have to pay for lock contention in the data path. Alternatively, we
could allocate a DMA device to each vring dynamically to avoid sharing among
threads. But what is the overhead of such an allocation mechanism, and who
performs it? Any thoughts?

Thanks,
Jiayu
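[For illustration, a hypothetical sketch of the per-vring allocation option mentioned above: each vring is bound to exactly one (dmadev, vchan) pair at setup time, so all flag/index updates for that vring stay on one channel and complete in submission order, with submissions still serialized by the existing per-virtq lock. The structure and names are illustrative only, not the OVS/vhost patches, and the allocation-overhead question remains open.]

#include <stdint.h>

#define MAX_VRINGS 1024

struct vring_dma_binding {
    int16_t  dma_id;  /* dmadev serving this vring, -1 if unbound */
    uint16_t vchan;   /* virtual channel on that dmadev */
};

static struct vring_dma_binding vring_dma[MAX_VRINGS];

/* Control path: bind (or re-bind) a vring to a DMA channel once, e.g. when
 * the virtqueue is enabled, not per packet. */
static void
vring_dma_bind(uint32_t vring_id, int16_t dma_id, uint16_t vchan)
{
    vring_dma[vring_id].dma_id = dma_id;
    vring_dma[vring_id].vchan  = vchan;
}

/* Data path: a PMD thread that already holds the virtq lock looks up the
 * single channel for this vring, so flag/index updates are never split
 * across DMA devices and no extra lock is introduced. */
static inline int16_t
vring_dma_get(uint32_t vring_id, uint16_t *vchan)
{
    *vchan = vring_dma[vring_id].vchan;
    return vring_dma[vring_id].dma_id;
}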

> 
> I'd vote for that.
> 
> >
> > Thanks,
> > Maxime
> >
> >> In absence of continued technical discussion here, I suggest Sunil
> >> and Ian collaborate on getting the OVS Defer-work approach, and DPDK
> >> VHost Async patchsets available on GitHub for easier consumption and
> future development (as suggested in slides presented on last call).
> >>
> >> Regards, -Harry
> >>
> >> No inline-replies below; message just for context.
> >>
> >>> -----Original Message-----
> >>> From: Van Haaren, Harry
> >>> Sent: Wednesday, March 30, 2022 10:02 AM
> >>> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
> >>> <bruce.richardson@intel.com>
> >>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
> >>> <Sunil.Pai.G@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu,
> >>> Jiayu <Jiayu.Hu@intel.com>; Ferriter, Cian
> >>> <Cian.Ferriter@intel.com>; Ilya Maximets <i.maximets@ovn.org>;
> >>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> >>> <john.mcnamara@intel.com>; O'Driscoll, Tim
> >>> <tim.odriscoll@intel.com>; Finn, Emma <Emma.Finn@intel.com>
> >>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> >>>
> >>>> -----Original Message-----
> >>>> From: Morten Brørup <mb@smartsharesystems.com>
> >>>> Sent: Tuesday, March 29, 2022 8:59 PM
> >>>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Richardson,
> >>>> Bruce <bruce.richardson@intel.com>
> >>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
> >>>> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu,
> >>>> Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
> >>>> <cian.ferriter@intel.com>; Ilya Maximets <i.maximets@ovn.org>;
> >>>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
> >>> John
> >>>> <john.mcnamara@intel.com>; O'Driscoll, Tim
> >>>> <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> >>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> >>>>> Sent: Tuesday, 29 March 2022 19.46
> >>>>>
> >>>>>> From: Morten Brørup <mb@smartsharesystems.com>
> >>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
> >>>>>>
> >>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>>>>> Sent: Tuesday, 29 March 2022 19.03
> >>>>>>>
> >>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> >>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
> >>>>>>>>>
> >>>>>>>>> Hi Morten,
> >>>>>>>>>
> >>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
> >>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> >>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
> >>>>>>>>>>>
> >>>>>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
> >>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
> >>>>>>>>>>>>
> >>>>>>>>>>>> Having thought more about it, I think that a completely
> >>>>>>> different
> >>>>>>>>> architectural approach is required:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX
> >>>>> and TX
> >>>>>>>>> packet burst functions, each optimized for different CPU
> >>>>>>>>> vector instruction sets. The availability of a DMA engine
> >>>>>>>>> should be
> >>>>>>> treated
> >>>>>>>>> the same way. So I suggest that PMDs copying packet contents,
> >>>>> e.g.
> >>>>>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and
> TX
> >>>>>>> packet
> >>>>>>>>> burst functions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Similarly for the DPDK vhost library.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In such an architecture, it would be the application's job
> >>>>> to
> >>>>>>>>> allocate DMA channels and assign them to the specific PMDs
> >>>>>>>>> that
> >>>>>>> should
> >>>>>>>>> use them. But the actual use of the DMA channels would move
> >>>>> down
> >>>>>>> below
> >>>>>>>>> the application and into the DPDK PMDs and libraries.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Med venlig hilsen / Kind regards, -Morten Brørup
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Morten,
> >>>>>>>>>>>
> >>>>>>>>>>> That's *exactly* how this architecture is designed &
> >>>>>>> implemented.
> >>>>>>>>>>> 1.    The DMA configuration and initialization is up to the
> >>>>>>> application
> >>>>>>>>> (OVS).
> >>>>>>>>>>> 2.    The VHost library is passed the DMA-dev ID, and its
> >>>>> new
> >>>>>>> async
> >>>>>>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
> >>>>>>>>>>>
> >>>>>>>>>>> Looking forward to talking on the call that just started.
> >>>>>>> Regards, -
> >>>>>>>>> Harry
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the
> >>>>>>> patches.
> >>>>>>>>>>
> >>>>>>>>>> Then, I suppose that the TX completions can be handled in the
> >>>>> TX
> >>>>>>>>> function, and the RX completions can be handled in the RX
> >>>>> function,
> >>>>>>>>> just like the Ethdev PMDs handle packet descriptors:
> >>>>>>>>>>
> >>>>>>>>>> TX_Burst(tx_packet_array):
> >>>>>>>>>> 1.    Clean up descriptors processed by the NIC chip. -->
> >>>>> Process
> >>>>>>> TX
> >>>>>>>>> DMA channel completions. (Effectively, the 2nd pipeline
> >>>>>>>>> stage.)
> >>>>>>>>>> 2.    Pass on the tx_packet_array to the NIC chip
> >>>>> descriptors. --
> >>>>>>>> Pass
> >>>>>>>>> on the tx_packet_array to the TX DMA channel. (Effectively,
> >>>>>>>>> the
> >>>>> 1st
> >>>>>>>>> pipeline stage.)
> >>>>>>>>>
> >>>>>>>>> The problem is Tx function might not be called again, so
> >>>>> enqueued
> >>>>>>>>> packets in 2. may never be completed from a Virtio point of
> >>>>> view.
> >>>>>>> IOW,
> >>>>>>>>> the packets will be copied to the Virtio descriptors buffers,
> >>>>> but
> >>>>>>> the
> >>>>>>>>> descriptors will not be made available to the Virtio driver.
> >>>>>>>>
> >>>>>>>> In that case, the application needs to call TX_Burst()
> >>>>> periodically
> >>>>>>> with an empty array, for completion purposes.
> >>>>>
> >>>>> This is what the "defer work" does at the OVS thread-level, but
> >>>>> instead of "brute-forcing" and *always* making the call, the defer
> >>>>> work concept tracks
> >>>>> *when* there is outstanding work (DMA copies) to be completed
> >>>>> ("deferred work") and calls the generic completion function at
> >>>>> that point.
> >>>>>
> >>>>> So "defer work" is generic infrastructure at the OVS thread level
> >>>>> to handle work that needs to be done "later", e.g. DMA completion
> >>>>> handling.
> >>>>>
> >>>>>
> >>>>>>>> Or some sort of TX_Keepalive() function can be added to the
> >>>>>>>> DPDK
> >>>>>>> library, to handle DMA completion. It might even handle multiple
> >>>>> DMA
> >>>>>>> channels, if convenient - and if possible without locking or
> >>>>>>> other weird complexity.
> >>>>>
> >>>>> That's exactly how it is done, the VHost library has a new API
> >>>>> added, which allows for handling completions. And in the "Netdev
> >>>>> layer" (~OVS ethdev
> >>>>> abstraction)
> >>>>> we add a function to allow the OVS thread to do those completions
> >>>>> in a new Netdev-abstraction API called "async_process" where the
> >>>>> completions can be checked.
> >>>>>
> >>>>> The only method to abstract them is to "hide" them somewhere that
> >>>>> will always be polled, e.g. an ethdev port's RX function.  Both V3
> >>>>> and V4 approaches use this method.
> >>>>> This allows "completions" to be transparent to the app, at the
> >>>>> tradeoff to having bad separation  of concerns as Rx and Tx are
> >>>>> now tied-together.
> >>>>>
> >>>>> The point is, the Application layer must *somehow* handle
> >>>>> completions.
> >>>>> So fundamentally there are 2 options for the Application level:
> >>>>>
> >>>>> A) Make the application periodically call a "handle completions"
> >>>>> function
> >>>>>     A1) Defer work, call when needed, and track "needed" at app
> >>>>> layer, and calling into vhost txq complete as required.
> >>>>>             Elegant in that "no work" means "no cycles spent" on
> >>>>> checking DMA completions.
> >>>>>     A2) Brute-force-always-call, and pay some overhead when not
> >>>>> required.
> >>>>>             Cycle-cost in "no work" scenarios. Depending on # of
> >>>>> vhost queues, this adds up as polling required *per vhost txq*.
> >>>>>             Also note that "checking DMA completions" means taking
> >>>>> a virtq-lock, so this "brute-force" can needlessly increase
> >>>>> x-thread contention!
> >>>>
> >>>> A side note: I don't see why locking is required to test for DMA
> completions.
> >>>> rte_dma_vchan_status() is lockless, e.g.:
> >>>>
> >>>> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L560
> >>>
> >>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a
> >>> lockfree manner from a single thread.
> >>>
> >>> The locks I refer to are at the OVS-netdev level, as virtq's are
> >>> shared across OVS's dataplane threads.
> >>> So the "M to N" comes from M dataplane threads to N virtqs, hence
> >>> requiring some locking.
> >>>
> >>>
> >>>>> B) Hide completions and live with the complexity/architectural
> >>>>> sacrifice of mixed-RxTx.
> >>>>>     Various downsides here in my opinion, see the slide deck
> >>>>> presented earlier today for a summary.
> >>>>>
> >>>>> In my opinion, A1 is the most elegant solution, as it has a clean
> >>>>> separation of concerns, does not  cause avoidable contention on
> >>>>> virtq locks, and spends no cycles when there is no completion work
> >>>>> to do.
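[For illustration, a minimal sketch of the A1 "defer work" idea at the application layer. vhost_txq_poll_completions() is a hypothetical stand-in for the vhost async completion call discussed in this thread; the per-txq counter and structure are illustrative only, not the actual OVS patches.]

#include <stdint.h>

/* Hypothetical stand-in for the vhost async "handle completions" call;
 * returns how many enqueued copies have completed for this virtqueue. */
uint16_t vhost_txq_poll_completions(int vid, uint16_t virtq_id);

struct txq_defer_state {
    int      vid;          /* vhost device id */
    uint16_t virtq_id;     /* guest virtqueue index */
    uint32_t dma_inflight; /* copies submitted but not yet completed */
};

/* A1: the PMD thread keeps a list of txqs with dma_inflight != 0 and only
 * walks that list, so "no work" really does mean "no cycles spent" and no
 * virtq lock is taken needlessly. */
static void
txq_defer_work_run(struct txq_defer_state *s)
{
    if (s->dma_inflight == 0)
        return; /* nothing deferred for this txq */
    s->dma_inflight -= vhost_txq_poll_completions(s->vid, s->virtq_id);
}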
> >>>>>
> >>>>
> >>>> Thank you for elaborating, Harry.
> >>>
> >>> Thanks for partaking in the discussion & providing your insight!
> >>>
> >>>> I strongly oppose against hiding any part of TX processing in an RX
> >>>> function. It
> >>> is just
> >>>> wrong in so many ways!
> >>>>
> >>>> I agree that A1 is the most elegant solution. And being the most
> >>>> elegant
> >>> solution, it
> >>>> is probably also the most future proof solution. :-)
> >>>
> >>> I think so too, yes.
> >>>
> >>>> I would also like to stress that DMA completion handling belongs in
> >>>> the DPDK library, not in the application. And yes, the application
> >>>> will be required to call
> >>> some
> >>>> "handle DMA completions" function in the DPDK library. But since
> >>>> the
> >>> application
> >>>> already knows that it uses DMA, the application should also know
> >>>> that it needs
> >>> to
> >>>> call this extra function - so I consider this requirement perfectly
> acceptable.
> >>>
> >>> Agree here.
> >>>
> >>>> I prefer if the DPDK vhost library can hide its inner workings from
> >>>> the
> >>> application,
> >>>> and just expose the additional "handle completions" function. This
> >>>> also means
> >>> that
> >>>> the inner workings can be implemented as "defer work", or by some
> >>>> other algorithm. And it can be tweaked and optimized later.
> >>>
> >>> Yes, the choice of how to call the handle_completions function is made
> >>> at the Application layer.
> >>> For OVS we designed Defer Work, V3 and V4. But it is an App level
> >>> choice, and every application is free to choose its own method.
> >>>
> >>>> Thinking about the long term perspective, this design pattern is
> >>>> common for
> >>> both
> >>>> the vhost library and other DPDK libraries that could benefit from DMA
> (e.g.
> >>>> vmxnet3 and pcap PMDs), so it could be abstracted into the DMA
> >>>> library or a separate library. But for now, we should focus on the
> >>>> vhost use case, and just
> >>> keep
> >>>> the long term roadmap for using DMA in mind.
> >>>
> >>> Totally agree to keep long term roadmap in mind; but I'm not sure we
> >>> can refactor logic out of vhost. When DMA-completions arrive, the
> >>> virtQ needs to be updated; this causes a tight coupling between the
> >>> DMA completion count, and the vhost library.
> >>>
> >>> As Ilya raised on the call yesterday, there is an "in_order"
> >>> requirement in the vhost library, that per virtq the packets are
> >>> presented to the guest "in order" of enqueue.
> >>> (To be clear, *not* order of DMA-completion! As Jiayu mentioned, the
> >>> Vhost library handles this today by re-ordering the DMA
> >>> completions.)
> >>>
> >>>
> >>>> Rephrasing what I said on the conference call: This vhost design
> >>>> will become
> >>> the
> >>>> common design pattern for using DMA in DPDK libraries. If we get it
> >>>> wrong, we
> >>> are
> >>>> stuck with it.
> >>>
> >>> Agree, and if we get it right, then we're stuck with it too! :)
> >>>
> >>>
> >>>>>>>> Here is another idea, inspired by a presentation at one of the
> >>>>> DPDK
> >>>>>>> Userspace conferences. It may be wishful thinking, though:
> >>>>>>>>
> >>>>>>>> Add an additional transaction to each DMA burst; a special
> >>>>>>> transaction containing the memory write operation that makes the
> >>>>>>> descriptors available to the Virtio driver.
> >>>>>>>>
> >>>>>>>
> >>>>>>> That is something that can work, so long as the receiver is
> >>>>> operating
> >>>>>>> in
> >>>>>>> polling mode. For cases where virtio interrupts are enabled, you
> >>>>> still
> >>>>>>> need
> >>>>>>> to do a write to the eventfd in the kernel in vhost to signal
> >>>>>>> the virtio side. That's not something that can be offloaded to a
> >>>>>>> DMA engine, sadly, so we still need some form of completion
> >>>>>>> call.
> >>>>>>
> >>>>>> I guess that virtio interrupts is the most widely deployed
> >>>>>> scenario,
> >>>>> so let's ignore
> >>>>>> the DMA TX completion transaction for now - and call it a
> >>>>>> possible
> >>>>> future
> >>>>>> optimization for specific use cases. So it seems that some form
> >>>>>> of
> >>>>> completion call
> >>>>>> is unavoidable.
> >>>>>
> >>>>> Agree to leave this aside, there is in theory a potential
> >>>>> optimization, but unlikely to be of large value.
> >>>>>
> >>>>
> >>>> One more thing: When using DMA to pass on packets into a guest,
> there could
> >>> be a
> >>>> delay from when the DMA completes until the guest is signaled. Is there any
> CPU
> >>> cache
> >>>> hotness regarding the guest's access to the packet data to consider here?
> I.e. if
> >>> we
> >>>> delay signaling the guest, the packet data may get cold.
> >>>
> >>> Interesting question; we can likely spawn a new thread around this topic!
> >>> In short, it depends on how/where the DMA hardware writes the copy.
> >>>
> >>> With technologies like DDIO, the "dest" part of the copy will be in LLC.
> The core
> >>> reading the
> >>> dest data will benefit from the LLC locality (instead of snooping it from a
> remote
> >>> core's L1/L2).
> >>>
> >>> Delays in notifying the guest could result in LLC capacity eviction, yes.
> >>> The application layer decides how often/promptly to check for
> completions,
> >>> and notify the guest of them. Calling the function more often will result
> in less
> >>> delay in that portion of the pipeline.
> >>>
> >>> Overall, there are caching benefits with DMA acceleration, and the
> application
> >>> can control
> >>> the latency introduced between dma-completion done in HW, and Guest
> vring
> >>> update.
> >>
> >


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-05 12:07               ` Bruce Richardson
@ 2022-04-08  6:29                 ` Pai G, Sunil
  2022-05-13  8:52                   ` fengchengwen
  0 siblings, 1 reply; 58+ messages in thread
From: Pai G, Sunil @ 2022-04-08  6:29 UTC (permalink / raw)
  To: Richardson, Bruce, Ilya Maximets, Chengwen Feng,
	Radha Mohan Chintakuntla, Veerasenareddy Burru, Gagandeep Singh,
	Nipun Gupta
  Cc: Stokes, Ian, Hu, Jiayu, Ferriter, Cian, Van Haaren, Harry,
	Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

> -----Original Message-----
> From: Richardson, Bruce <bruce.richardson@intel.com>
> Sent: Tuesday, April 5, 2022 5:38 PM
> To: Ilya Maximets <i.maximets@ovn.org>; Chengwen Feng
> <fengchengwen@huawei.com>; Radha Mohan Chintakuntla <radhac@marvell.com>;
> Veerasenareddy Burru <vburru@marvell.com>; Gagandeep Singh
> <g.singh@nxp.com>; Nipun Gupta <nipun.gupta@nxp.com>
> Cc: Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
> <cian.ferriter@intel.com>; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Maxime Coquelin (maxime.coquelin@redhat.com) <maxime.coquelin@redhat.com>;
> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> Finn, Emma <emma.finn@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> 
> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
> > On 3/30/22 16:09, Bruce Richardson wrote:
> > > On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
> > >> On 3/30/22 13:12, Bruce Richardson wrote:
> > >>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
> > >>>> On 3/30/22 12:41, Ilya Maximets wrote:
> > >>>>> Forking the thread to discuss a memory consistency/ordering model.
> > >>>>>
> > >>>>> AFAICT, dmadev can be anything from part of a CPU to a
> > >>>>> completely separate PCI device.  However, I don't see any memory
> > >>>>> ordering being enforced or even described in the dmadev API or
> documentation.
> > >>>>> Please, point me to the correct documentation, if I somehow missed
> it.
> > >>>>>
> > >>>>> We have a DMA device (A) and a CPU core (B) writing respectively
> > >>>>> the data and the descriptor info.  CPU core (C) is reading the
> > >>>>> descriptor and the data it points to.
> > >>>>>
> > >>>>> A few things about that process:
> > >>>>>
> > >>>>> 1. There is no memory barrier between writes A and B (Did I miss
> > >>>>>    them?).  Meaning that those operations can be seen by C in a
> > >>>>>    different order regardless of barriers issued by C and
> regardless
> > >>>>>    of the nature of devices A and B.
> > >>>>>
> > >>>>> 2. Even if there is a write barrier between A and B, there is
> > >>>>>    no guarantee that C will see these writes in the same order
> > >>>>>    as C doesn't use real memory barriers because vhost
> > >>>>> advertises
> > >>>>
> > >>>> s/advertises/does not advertise/
> > >>>>
> > >>>>>    VIRTIO_F_ORDER_PLATFORM.
> > >>>>>
> > >>>>> So, I'm getting to conclusion that there is a missing write
> > >>>>> barrier on the vhost side and vhost itself must not advertise
> > >>>>> the
> > >>>>
> > >>>> s/must not/must/
> > >>>>
> > >>>> Sorry, I wrote things backwards. :)
> > >>>>
> > >>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual
> > >>>>> memory barriers.
> > >>>>>
> > >>>>> Would like to hear some thoughts on that topic.  Is it a real
> issue?
> > >>>>> Is it an issue considering all possible CPU architectures and
> > >>>>> DMA HW variants?
> > >>>>>
> > >>>
> > >>> In terms of ordering of operations using dmadev:
> > >>>
> > >>> * Some DMA HW will perform all operations strictly in order e.g.
> Intel
> > >>>   IOAT, while other hardware may not guarantee order of
> operations/do
> > >>>   things in parallel e.g. Intel DSA. Therefore the dmadev API
> provides the
> > >>>   fence operation which allows the order to be enforced. The fence
> can be
> > >>>   thought of as a full memory barrier, meaning no jobs after the
> barrier can
> > >>>   be started until all those before it have completed. Obviously,
> for HW
> > >>>   where order is always enforced, this will be a no-op, but for
> hardware that
> > >>>   parallelizes, we want to reduce the fences to get best
> performance.
> > >>>
> > >>> * For synchronization between DMA devices and CPUs, where a CPU can
> only
> > >>>   write after a DMA copy has been done, the CPU must wait for the
> dma
> > >>>   completion to guarantee ordering. Once the completion has been
> returned
> > >>>   the completed operation is globally visible to all cores.
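[For illustration, a sketch of the two ordering mechanisms described above using the public dmadev API: a fenced job for DMA-to-DMA ordering, and the completion poll for DMA-to-CPU synchronization. The addresses are placeholders, and whether a completion seen on one core is visible to all cores is exactly the question raised in the follow-up below.]

#include <stdbool.h>
#include <rte_dmadev.h>

static int
ordered_copy_example(int16_t dev_id, uint16_t vchan,
                     rte_iova_t data_src, rte_iova_t data_dst, uint32_t len,
                     rte_iova_t flag_src, rte_iova_t flag_dst)
{
    uint16_t last_idx, done = 0;
    bool error = false;

    /* Payload copy: the HW may run enqueued jobs out of order / in parallel. */
    if (rte_dma_copy(dev_id, vchan, data_src, data_dst, len, 0) < 0)
        return -1;

    /* Fenced flag write: starts only after all previously enqueued jobs on
     * this vchan have completed; SUBMIT flushes the batch to hardware. */
    if (rte_dma_copy(dev_id, vchan, flag_src, flag_dst, sizeof(uint16_t),
                     RTE_DMA_OP_FLAG_FENCE | RTE_DMA_OP_FLAG_SUBMIT) < 0)
        return -1;

    /* DMA/CPU synchronization: busy-poll (for illustration only) until both
     * jobs are reported complete before the CPU touches the destination. */
    while (done < 2 && !error)
        done += rte_dma_completed(dev_id, vchan, 2 - done, &last_idx, &error);

    return error ? -1 : 0;
}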
> > >>
> > >> Thanks for explanation!  Some questions though:
> > >>
> > >> In our case one CPU waits for completion and another CPU is
> > >> actually using the data.  IOW, "CPU must wait" is a bit ambiguous.
> Which CPU must wait?
> > >>
> > >> Or should it be "Once the completion is visible on any core, the
> > >> completed operation is globally visible to all cores." ?
> > >>
> > >
> > > The latter.
> > > Once the change to memory/cache is visible to any core, it is
> > > visible to all ones. This applies to regular CPU memory writes too -
> > > at least on IA, and I expect on many other architectures - once the
> > > write is visible outside the current core it is visible to every
> > > other core. Once the data hits the l1 or l2 cache of any core, any
> > > subsequent requests for that data from any other core will "snoop"
> > > the latest data from the cores cache, even if it has not made its
> > > way down to a shared cache, e.g. l3 on most IA systems.
> >
> > It sounds like you're referring to the "multicopy atomicity" of the
> > architecture.  However, that is not a universally supported thing.
> > AFAICT, POWER and older ARM systems don't support it, so writes
> > performed by one core are not necessarily available to all other cores
> > at the same time.  That means that if the CPU0 writes the data and the
> > completion flag, CPU1 reads the completion flag and writes the ring,
> > CPU2 may see the ring write, but may still not see the write of the
> > data, even though there was a control dependency on CPU1.
> > There should be a full memory barrier on CPU1 in order to fulfill the
> > memory ordering requirements for CPU2, IIUC.
> >
> > In our scenario the CPU0 is a DMA device, which may or may not be part
> > of a CPU and may have different memory consistency/ordering
> > requirements.  So, the question is: does DPDK DMA API guarantee
> > multicopy atomicity between DMA device and all CPU cores regardless of
> > CPU architecture and a nature of the DMA device?
> >
> 
> Right now, it doesn't because this never came up in discussion. In order
> to be useful, it sounds like it explicitly should do so. At least for the
> Intel ioat and idxd driver cases, this will be supported, so we just need
> to ensure all other drivers currently upstreamed can offer this too. If
> they cannot, we cannot offer it as a global guarantee, and we should see
> about adding a capability flag for this to indicate when the guarantee is
> there or not.
> 
> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to
> document for dmadev that once a DMA operation is completed, the op is
> guaranteed visible to all cores/threads? If not, any thoughts on what
> guarantees we can provide in this regard, or what capabilities should be
> exposed?



Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh, @Nipun Gupta,
Requesting your valuable opinions on the queries raised in this thread.
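[For illustration, the capability-flag suggestion above could look roughly like the sketch below. RTE_DMA_CAPA_COMPLETION_GLOBAL_VISIBILITY is a hypothetical name on an arbitrary spare bit, used only for this example; no such capability exists in dmadev today.]

#include <stdbool.h>
#include <rte_bitops.h>
#include <rte_dmadev.h>

/* Hypothetical capability bit: "a completed op is visible to all cores". */
#define RTE_DMA_CAPA_COMPLETION_GLOBAL_VISIBILITY RTE_BIT64(63)

static bool
dma_completion_globally_visible(int16_t dev_id)
{
    struct rte_dma_info info;

    if (rte_dma_info_get(dev_id, &info) != 0)
        return false;
    /* Only rely on "completed == globally visible" when the driver
     * advertises it; otherwise the caller must add its own barriers. */
    return (info.dev_capa & RTE_DMA_CAPA_COMPLETION_GLOBAL_VISIBILITY) != 0;
}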


> 
> /Bruce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-07 15:46                                 ` Maxime Coquelin
@ 2022-04-07 16:04                                   ` Bruce Richardson
  0 siblings, 0 replies; 58+ messages in thread
From: Bruce Richardson @ 2022-04-07 16:04 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: Ilya Maximets, Van Haaren, Harry, Morten Brørup, Pai G,
	Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian, ovs-dev, dev,
	Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On Thu, Apr 07, 2022 at 05:46:32PM +0200, Maxime Coquelin wrote:
> 
> 
> On 4/7/22 17:01, Ilya Maximets wrote:
> > On 4/7/22 16:42, Van Haaren, Harry wrote:
> > > > -----Original Message-----
> > > > From: Ilya Maximets <i.maximets@ovn.org>
> > > > Sent: Thursday, April 7, 2022 3:40 PM
> > > > To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
> > > > <harry.van.haaren@intel.com>; Morten Brørup <mb@smartsharesystems.com>;
> > > > Richardson, Bruce <bruce.richardson@intel.com>
> > > > Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> > > > <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
> > > > <cian.ferriter@intel.com>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
> > > > John <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> > > > Finn, Emma <emma.finn@intel.com>
> > > > Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > > > 
> > > > On 4/7/22 16:25, Maxime Coquelin wrote:
> > > > > Hi Harry,
> > > > > 
> > > > > On 4/7/22 16:04, Van Haaren, Harry wrote:
> > > > > > Hi OVS & DPDK, Maintainers & Community,
> > > > > > 
> > > > > > Top posting overview of discussion as replies to thread become slower:
> > > > > > perhaps it is a good time to review and plan for next steps?
> > > > > > 
> > > > > >   From my perspective, those most vocal in the thread seem to be in favour
> > > > of the clean
> > > > > > rx/tx split ("defer work"), with the tradeoff that the application must be
> > > > aware of handling
> > > > > > the async DMA completions. If there are any concerns opposing upstreaming
> > > > of this method,
> > > > > > please indicate this promptly, and we can continue technical discussions here
> > > > now.
> > > > > 
> > > > > Wasn't there some discussions about handling the Virtio completions with
> > > > > the DMA engine? With that, we wouldn't need the deferral of work.
> > > > 
> > > > +1
> > > 
> > > Yes there was, the DMA/virtq completions thread here for reference;
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html
> > > 
> > > I do not believe that there is a viable path to actually implementing it, and particularly
> > > not in the more complex cases; e.g. virtio with guest-interrupt enabled.
> > > 
> > > The thread above mentions additional threads and various other options; none of which
> > > I believe to be a clean or workable solution. I'd like input from other folks more familiar
> > > with the exact implementations of VHost/vrings, as well as those with DMA engine expertise.
> > 
> > I tend to trust Maxime as a vhost maintainer in such questions. :)
> > 
> > In my own opinion though, the implementation is possible and the concerns
> > don't sound deal-breaking, as solutions for them might work well enough.  So
> > I think the viability should be tested out before the solution is disregarded,
> > especially because the decision will form the API of the vhost library.
> 
> I agree, we need a PoC adding interrupt support to dmadev API using
> eventfd, and adding a thread in Vhost library that polls for DMA
> interrupts and calls vhost_vring_call if needed.
>
Hi Maxime,

A couple of questions, perhaps you can clarify. Firstly, why would an eventfd
be needed for the interrupts? Can they not just use the regular interrupt
handling like other devices in DPDK, e.g. a read on the /dev node?

In terms of the new thread - what is this thread going to handle? Is it
going to take interrupts from all dma operations and handle the
completion/cleanup of all jobs for all queues once the DMA engine is
finished? Or is it just going to be woken up periodically to check for the
edge case where a virtio queue is sleeping with interrupts enabled while it
has completed - but unsignalled to the VM - packets?

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-07 15:01                               ` Ilya Maximets
@ 2022-04-07 15:46                                 ` Maxime Coquelin
  2022-04-07 16:04                                   ` Bruce Richardson
  0 siblings, 1 reply; 58+ messages in thread
From: Maxime Coquelin @ 2022-04-07 15:46 UTC (permalink / raw)
  To: Ilya Maximets, Van Haaren, Harry, Morten Brørup, Richardson, Bruce
  Cc: Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian, ovs-dev,
	dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma



On 4/7/22 17:01, Ilya Maximets wrote:
> On 4/7/22 16:42, Van Haaren, Harry wrote:
>>> -----Original Message-----
>>> From: Ilya Maximets <i.maximets@ovn.org>
>>> Sent: Thursday, April 7, 2022 3:40 PM
>>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
>>> <harry.van.haaren@intel.com>; Morten Brørup <mb@smartsharesystems.com>;
>>> Richardson, Bruce <bruce.richardson@intel.com>
>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
>>> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
>>> <cian.ferriter@intel.com>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
>>> John <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>>> Finn, Emma <emma.finn@intel.com>
>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>
>>> On 4/7/22 16:25, Maxime Coquelin wrote:
>>>> Hi Harry,
>>>>
>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
>>>>> Hi OVS & DPDK, Maintainers & Community,
>>>>>
>>>>> Top posting overview of discussion as replies to thread become slower:
>>>>> perhaps it is a good time to review and plan for next steps?
>>>>>
>>>>>   From my perspective, those most vocal in the thread seem to be in favour
>>> of the clean
>>>>> rx/tx split ("defer work"), with the tradeoff that the application must be
>>> aware of handling
>>>>> the async DMA completions. If there are any concerns opposing upstreaming
>>> of this method,
>>>>> please indicate this promptly, and we can continue technical discussions here
>>> now.
>>>>
>>>> Wasn't there some discussions about handling the Virtio completions with
>>>> the DMA engine? With that, we wouldn't need the deferral of work.
>>>
>>> +1
>>
>> Yes there was, the DMA/virtq completions thread here for reference;
>> https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html
>>
>> I do not believe that there is a viable path to actually implementing it, and particularly
>> not in the more complex cases; e.g. virtio with guest-interrupt enabled.
>>
>> The thread above mentions additional threads and various other options; none of which
>> I believe to be a clean or workable solution. I'd like input from other folks more familiar
>> with the exact implementations of VHost/vrings, as well as those with DMA engine expertise.
> 
> I tend to trust Maxime as a vhost maintainer in such questions. :)
> 
> In my own opinion though, the implementation is possible and the concerns
> don't sound deal-breaking, as solutions for them might work well enough.  So
> I think the viability should be tested out before the solution is disregarded,
> especially because the decision will form the API of the vhost library.

I agree, we need a PoC adding interrupt support to dmadev API using
eventfd, and adding a thread in Vhost library that polls for DMA
interrupts and calls vhost_vring_call if needed.
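[For illustration, a rough sketch of what such a PoC thread could look like, assuming dmadev gains eventfd-based completion interrupts (it has none today). rte_vhost_vring_call() is the existing public API for kicking the guest; everything about how the eventfd is obtained and how completed descriptors are marked is hypothetical.]

#include <stdint.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <rte_vhost.h>

struct dma_irq_ctx {
    int      efd;      /* eventfd signalled by the (hypothetical) DMA IRQ */
    int      vid;      /* vhost device owning the virtqueue */
    uint16_t virtq_id; /* virtqueue served by this DMA channel */
};

static void *
vhost_dma_irq_thread(void *arg)
{
    struct dma_irq_ctx *ctx = arg; /* one channel shown; a real PoC would
                                    * register one eventfd per DMA channel */
    struct epoll_event ev = { .events = EPOLLIN, .data.ptr = ctx };
    struct epoll_event out[8];
    int epfd = epoll_create1(0);

    epoll_ctl(epfd, EPOLL_CTL_ADD, ctx->efd, &ev);

    for (;;) {
        int n = epoll_wait(epfd, out, 8, -1);

        for (int i = 0; i < n; i++) {
            struct dma_irq_ctx *c = out[i].data.ptr;
            uint64_t cnt;

            if (read(c->efd, &cnt, sizeof(cnt)) < 0)
                continue; /* spurious wakeup */
            /* Mark the completed copies in the vring here, then kick the
             * guest; the kick is skipped internally if the guest has
             * interrupts disabled. */
            rte_vhost_vring_call(c->vid, c->virtq_id);
        }
    }
    return NULL;
}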

>>
>>
>>> With the virtio completions handled by DMA itself, the vhost port
>>> turns almost into a real HW NIC.  With that we will not need any
>>> extra manipulations from the OVS side, i.e. no need to defer any
>>> work while maintaining clear split between rx and tx operations.
>>>
>>> I'd vote for that.
>>>
>>>>
>>>> Thanks,
>>>> Maxime
>>
>> Thanks for the prompt responses, and let's understand if there is a viable workable way
>> to totally hide DMA-completions from the application.
>>
>> Regards,  -Harry
>>
>>
>>>>> In absence of continued technical discussion here, I suggest Sunil and Ian
>>> collaborate on getting
>>>>> the OVS Defer-work approach, and DPDK VHost Async patchsets available on
>>> GitHub for easier
>>>>> consumption and future development (as suggested in slides presented on
>>> last call).
>>>>>
>>>>> Regards, -Harry
>>>>>
>>>>> No inline-replies below; message just for context.
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Van Haaren, Harry
>>>>>> Sent: Wednesday, March 30, 2022 10:02 AM
>>>>>> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
>>>>>> <bruce.richardson@intel.com>
>>>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>>>>>> <Sunil.Pai.G@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
>>>>>> <Jiayu.Hu@intel.com>; Ferriter, Cian <Cian.Ferriter@intel.com>; Ilya
>>> Maximets
>>>>>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org;
>>> Mcnamara,
>>>>>> John <john.mcnamara@intel.com>; O'Driscoll, Tim
>>> <tim.odriscoll@intel.com>;
>>>>>> Finn, Emma <Emma.Finn@intel.com>
>>>>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>> Sent: Tuesday, March 29, 2022 8:59 PM
>>>>>>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Richardson, Bruce
>>>>>>> <bruce.richardson@intel.com>
>>>>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>>>>>>> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
>>>>>>> <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; Ilya
>>> Maximets
>>>>>>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org;
>>> Mcnamara,
>>>>>> John
>>>>>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>>> Finn,
>>>>>>> Emma <emma.finn@intel.com>
>>>>>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>>
>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>> Sent: Tuesday, 29 March 2022 19.46
>>>>>>>>
>>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
>>>>>>>>>
>>>>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>>>>> Sent: Tuesday, 29 March 2022 19.03
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>>>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Morten,
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Having thought more about it, I think that a completely
>>>>>>>>>> different
>>>>>>>>>>>> architectural approach is required:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX
>>>>>>>> and TX
>>>>>>>>>>>> packet burst functions, each optimized for different CPU vector
>>>>>>>>>>>> instruction sets. The availability of a DMA engine should be
>>>>>>>>>> treated
>>>>>>>>>>>> the same way. So I suggest that PMDs copying packet contents,
>>>>>>>> e.g.
>>>>>>>>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
>>>>>>>>>> packet
>>>>>>>>>>>> burst functions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In such an architecture, it would be the application's job
>>>>>>>> to
>>>>>>>>>>>> allocate DMA channels and assign them to the specific PMDs that
>>>>>>>>>> should
>>>>>>>>>>>> use them. But the actual use of the DMA channels would move
>>>>>>>> down
>>>>>>>>>> below
>>>>>>>>>>>> the application and into the DPDK PMDs and libraries.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Med venlig hilsen / Kind regards,
>>>>>>>>>>>>>>> -Morten Brørup
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Morten,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's *exactly* how this architecture is designed &
>>>>>>>>>> implemented.
>>>>>>>>>>>>>> 1.    The DMA configuration and initialization is up to the
>>>>>>>>>> application
>>>>>>>>>>>> (OVS).
>>>>>>>>>>>>>> 2.    The VHost library is passed the DMA-dev ID, and its
>>>>>>>> new
>>>>>>>>>> async
>>>>>>>>>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looking forward to talking on the call that just started.
>>>>>>>>>> Regards, -
>>>>>>>>>>>> Harry
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the
>>>>>>>>>> patches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then, I suppose that the TX completions can be handled in the
>>>>>>>> TX
>>>>>>>>>>>> function, and the RX completions can be handled in the RX
>>>>>>>> function,
>>>>>>>>>>>> just like the Ethdev PMDs handle packet descriptors:
>>>>>>>>>>>>>
>>>>>>>>>>>>> TX_Burst(tx_packet_array):
>>>>>>>>>>>>> 1.    Clean up descriptors processed by the NIC chip. -->
>>>>>>>> Process
>>>>>>>>>> TX
>>>>>>>>>>>> DMA channel completions. (Effectively, the 2nd pipeline stage.)
>>>>>>>>>>>>> 2.    Pass on the tx_packet_array to the NIC chip
>>>>>>>> descriptors. --
>>>>>>>>>>> Pass
>>>>>>>>>>>> on the tx_packet_array to the TX DMA channel. (Effectively, the
>>>>>>>> 1st
>>>>>>>>>>>> pipeline stage.)
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is Tx function might not be called again, so
>>>>>>>> enqueued
>>>>>>>>>>>> packets in 2. may never be completed from a Virtio point of
>>>>>>>> view.
>>>>>>>>>> IOW,
>>>>>>>>>>>> the packets will be copied to the Virtio descriptors buffers,
>>>>>>>> but
>>>>>>>>>> the
>>>>>>>>>>>> descriptors will not be made available to the Virtio driver.
>>>>>>>>>>>
>>>>>>>>>>> In that case, the application needs to call TX_Burst()
>>>>>>>> periodically
>>>>>>>>>> with an empty array, for completion purposes.
>>>>>>>>
>>>>>>>> This is what the "defer work" does at the OVS thread-level, but instead
>>>>>>>> of
>>>>>>>> "brute-forcing" and *always* making the call, the defer work concept
>>>>>>>> tracks
>>>>>>>> *when* there is outstanding work (DMA copies) to be completed
>>>>>>>> ("deferred work")
>>>>>>>> and calls the generic completion function at that point.
>>>>>>>>
>>>>>>>> So "defer work" is generic infrastructure at the OVS thread level to
>>>>>>>> handle
>>>>>>>> work that needs to be done "later", e.g. DMA completion handling.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK
>>>>>>>>>> library, to handle DMA completion. It might even handle multiple
>>>>>>>> DMA
>>>>>>>>>> channels, if convenient - and if possible without locking or other
>>>>>>>>>> weird complexity.
>>>>>>>>
>>>>>>>> That's exactly how it is done, the VHost library has a new API added,
>>>>>>>> which allows
>>>>>>>> for handling completions. And in the "Netdev layer" (~OVS ethdev
>>>>>>>> abstraction)
>>>>>>>> we add a function to allow the OVS thread to do those completions in a
>>>>>>>> new
>>>>>>>> Netdev-abstraction API called "async_process" where the completions can
>>>>>>>> be checked.
>>>>>>>>
>>>>>>>> The only method to abstract them is to "hide" them somewhere that will
>>>>>>>> always be
>>>>>>>> polled, e.g. an ethdev port's RX function.  Both V3 and V4 approaches
>>>>>>>> use this method.
>>>>>>>> This allows "completions" to be transparent to the app, at the tradeoff
>>>>>>>> to having bad
>>>>>>>> separation  of concerns as Rx and Tx are now tied-together.
>>>>>>>>
>>>>>>>> The point is, the Application layer must *somehow* handle
>>>>>>>> completions.
>>>>>>>> So fundamentally there are 2 options for the Application level:
>>>>>>>>
>>>>>>>> A) Make the application periodically call a "handle completions"
>>>>>>>> function
>>>>>>>>      A1) Defer work, call when needed, and track "needed" at app
>>>>>>>> layer, and calling into vhost txq complete as required.
>>>>>>>>              Elegant in that "no work" means "no cycles spent" on
>>>>>>>> checking DMA completions.
>>>>>>>>      A2) Brute-force-always-call, and pay some overhead when not
>>>>>>>> required.
>>>>>>>>              Cycle-cost in "no work" scenarios. Depending on # of
>>>>>>>> vhost queues, this adds up as polling required *per vhost txq*.
>>>>>>>>              Also note that "checking DMA completions" means taking a
>>>>>>>> virtq-lock, so this "brute-force" can needlessly increase x-thread
>>>>>>>> contention!
>>>>>>>
>>>>>>> A side note: I don't see why locking is required to test for DMA
>>> completions.
>>>>>>> rte_dma_vchan_status() is lockless, e.g.:
>>>>>>>
>>>>>>
>>>>>>> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L560
>>>>>>
>>>>>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lockfree
>>> manner
>>>>>> from a single thread.
>>>>>>
>>>>>> The locks I refer to are at the OVS-netdev level, as virtq's are shared across
>>> OVS's
>>>>>> dataplane threads.
>>>>>> So the "M to N" comes from M dataplane threads to N virtqs, hence
>>> requiring
>>>>>> some locking.
>>>>>>
>>>>>>
>>>>>>>> B) Hide completions and live with the complexity/architectural
>>>>>>>> sacrifice of mixed-RxTx.
>>>>>>>>      Various downsides here in my opinion, see the slide deck
>>>>>>>> presented earlier today for a summary.
>>>>>>>>
>>>>>>>> In my opinion, A1 is the most elegant solution, as it has a clean
>>>>>>>> separation of concerns, does not  cause
>>>>>>>> avoidable contention on virtq locks, and spends no cycles when there is
>>>>>>>> no completion work to do.
>>>>>>>>
>>>>>>>
>>>>>>> Thank you for elaborating, Harry.
>>>>>>
>>>>>> Thanks for partaking in the discussion & providing your insight!
>>>>>>
>>>>>>> I strongly oppose against hiding any part of TX processing in an RX function.
>>> It
>>>>>> is just
>>>>>>> wrong in so many ways!
>>>>>>>
>>>>>>> I agree that A1 is the most elegant solution. And being the most elegant
>>>>>> solution, it
>>>>>>> is probably also the most future proof solution. :-)
>>>>>>
>>>>>> I think so too, yes.
>>>>>>
>>>>>>> I would also like to stress that DMA completion handling belongs in the
>>> DPDK
>>>>>>> library, not in the application. And yes, the application will be required to
>>> call
>>>>>> some
>>>>>>> "handle DMA completions" function in the DPDK library. But since the
>>>>>> application
>>>>>>> already knows that it uses DMA, the application should also know that it
>>> needs
>>>>>> to
>>>>>>> call this extra function - so I consider this requirement perfectly acceptable.
>>>>>>
>>>>>> Agree here.
>>>>>>
>>>>>>> I prefer if the DPDK vhost library can hide its inner workings from the
>>>>>> application,
>>>>>>> and just expose the additional "handle completions" function. This also
>>> means
>>>>>> that
>>>>>>> the inner workings can be implemented as "defer work", or by some other
>>>>>>> algorithm. And it can be tweaked and optimized later.
>>>>>>
>>>>>> Yes, the choice of how to call the handle_completions function is made at
>>>>>> the Application layer.
>>>>>> For OVS we designed Defer Work, V3 and V4. But it is an App level choice,
>>> and
>>>>>> every
>>>>>> application is free to choose its own method.
>>>>>>
>>>>>>> Thinking about the long term perspective, this design pattern is common
>>> for
>>>>>> both
>>>>>>> the vhost library and other DPDK libraries that could benefit from DMA (e.g.
>>>>>>> vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or
>>> a
>>>>>>> separate library. But for now, we should focus on the vhost use case, and
>>> just
>>>>>> keep
>>>>>>> the long term roadmap for using DMA in mind.
>>>>>>
>>>>>> Totally agree to keep long term roadmap in mind; but I'm not sure we can
>>>>>> refactor
>>>>>> logic out of vhost. When DMA-completions arrive, the virtQ needs to be
>>>>>> updated;
>>>>>> this causes a tight coupling between the DMA completion count, and the
>>> vhost
>>>>>> library.
>>>>>>
>>>>>> As Ilya raised on the call yesterday, there is an "in_order" requirement in the
>>>>>> vhost
>>>>>> library, that per virtq the packets are presented to the guest "in order" of
>>>>>> enqueue.
>>>>>> (To be clear, *not* order of DMA-completion! As Jiayu mentioned, the Vhost
>>>>>> library
>>>>>> handles this today by re-ordering the DMA completions.)
>>>>>>
>>>>>>
>>>>>>> Rephrasing what I said on the conference call: This vhost design will
>>> become
>>>>>> the
>>>>>>> common design pattern for using DMA in DPDK libraries. If we get it wrong,
>>> we
>>>>>> are
>>>>>>> stuck with it.
>>>>>>
>>>>>> Agree, and if we get it right, then we're stuck with it too! :)
>>>>>>
>>>>>>
>>>>>>>>>>> Here is another idea, inspired by a presentation at one of the
>>>>>>>> DPDK
>>>>>>>>>> Userspace conferences. It may be wishful thinking, though:
>>>>>>>>>>>
>>>>>>>>>>> Add an additional transaction to each DMA burst; a special
>>>>>>>>>> transaction containing the memory write operation that makes the
>>>>>>>>>> descriptors available to the Virtio driver.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> That is something that can work, so long as the receiver is
>>>>>>>> operating
>>>>>>>>>> in
>>>>>>>>>> polling mode. For cases where virtio interrupts are enabled, you
>>>>>>>> still
>>>>>>>>>> need
>>>>>>>>>> to do a write to the eventfd in the kernel in vhost to signal the
>>>>>>>>>> virtio
>>>>>>>>>> side. That's not something that can be offloaded to a DMA engine,
>>>>>>>>>> sadly, so
>>>>>>>>>> we still need some form of completion call.
>>>>>>>>>
>>>>>>>>> I guess that virtio interrupts is the most widely deployed scenario,
>>>>>>>> so let's ignore
>>>>>>>>> the DMA TX completion transaction for now - and call it a possible
>>>>>>>> future
>>>>>>>>> optimization for specific use cases. So it seems that some form of
>>>>>>>> completion call
>>>>>>>>> is unavoidable.
>>>>>>>>
>>>>>>>> Agree to leave this aside, there is in theory a potential optimization,
>>>>>>>> but
>>>>>>>> unlikely to be of large value.
>>>>>>>>
>>>>>>>
>>>>>>> One more thing: When using DMA to pass on packets into a guest, there
>>> could
>>>>>> be a
>>>>>>> delay from when the DMA completes until the guest is signaled. Is there any CPU
>>>>>> cache
>>>>>>> hotness regarding the guest's access to the packet data to consider here?
>>> I.e. if
>>>>>> we
>>>>>>> delay signaling the guest, the packet data may get cold.
>>>>>>
>>>>>> Interesting question; we can likely spawn a new thread around this topic!
>>>>>> In short, it depends on how/where the DMA hardware writes the copy.
>>>>>>
>>>>>> With technologies like DDIO, the "dest" part of the copy will be in LLC. The
>>> core
>>>>>> reading the
>>>>>> dest data will benefit from the LLC locality (instead of snooping it from a
>>> remote
>>>>>> core's L1/L2).
>>>>>>
>>>>>> Delays in notifying the guest could result in LLC capacity eviction, yes.
>>>>>> The application layer decides how often/promptly to check for completions,
>>>>>> and notify the guest of them. Calling the function more often will result in
>>> less
>>>>>> delay in that portion of the pipeline.
>>>>>>
>>>>>> Overall, there are caching benefits with DMA acceleration, and the
>>> application
>>>>>> can control
>>>>>> the latency introduced between dma-completion done in HW, and Guest
>>> vring
>>>>>> update.
>>>>>
>>>>
>>
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-07 14:42                             ` Van Haaren, Harry
@ 2022-04-07 15:01                               ` Ilya Maximets
  2022-04-07 15:46                                 ` Maxime Coquelin
  0 siblings, 1 reply; 58+ messages in thread
From: Ilya Maximets @ 2022-04-07 15:01 UTC (permalink / raw)
  To: Van Haaren, Harry, Maxime Coquelin, Morten Brørup,
	Richardson, Bruce
  Cc: i.maximets, Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian,
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On 4/7/22 16:42, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Ilya Maximets <i.maximets@ovn.org>
>> Sent: Thursday, April 7, 2022 3:40 PM
>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
>> <harry.van.haaren@intel.com>; Morten Brørup <mb@smartsharesystems.com>;
>> Richardson, Bruce <bruce.richardson@intel.com>
>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
>> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
>> <cian.ferriter@intel.com>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
>> John <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>> Finn, Emma <emma.finn@intel.com>
>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>
>> On 4/7/22 16:25, Maxime Coquelin wrote:
>>> Hi Harry,
>>>
>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
>>>> Hi OVS & DPDK, Maintainers & Community,
>>>>
>>>> Top posting overview of discussion as replies to thread become slower:
>>>> perhaps it is a good time to review and plan for next steps?
>>>>
>>>>  From my perspective, those most vocal in the thread seem to be in favour
>> of the clean
>>>> rx/tx split ("defer work"), with the tradeoff that the application must be
>> aware of handling
>>>> the async DMA completions. If there are any concerns opposing upstreaming
>> of this method,
>>>> please indicate this promptly, and we can continue technical discussions here
>> now.
>>>
>>> Wasn't there some discussions about handling the Virtio completions with
>>> the DMA engine? With that, we wouldn't need the deferral of work.
>>
>> +1
> 
> Yes there was, the DMA/virtq completions thread here for reference;
> https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html
> 
> I do not believe that there is a viable path to actually implementing it, and particularly
> not in the more complex cases; e.g. virtio with guest-interrupt enabled.
> 
> The thread above mentions additional threads and various other options; none of which
> I believe to be a clean or workable solution. I'd like input from other folks more familiar
> with the exact implementations of VHost/vrings, as well as those with DMA engine expertise.

I tend to trust Maxime as a vhost maintainer in such questions. :)

In my own opinion though, the implementation is possible and the concerns
don't sound deal-breaking, as solutions for them might work well enough.  So I
think the viability should be tested out before the solution is disregarded,
especially because the decision will form the API of the vhost library.

> 
> 
>> With the virtio completions handled by DMA itself, the vhost port
>> turns almost into a real HW NIC.  With that we will not need any
>> extra manipulations from the OVS side, i.e. no need to defer any
>> work while maintaining clear split between rx and tx operations.
>>
>> I'd vote for that.
>>
>>>
>>> Thanks,
>>> Maxime
> 
> Thanks for the prompt responses, and let's understand if there is a viable workable way
> to totally hide DMA-completions from the application.
> 
> Regards,  -Harry
> 
> 
>>>> In absence of continued technical discussion here, I suggest Sunil and Ian
>> collaborate on getting
>>>> the OVS Defer-work approach, and DPDK VHost Async patchsets available on
>> GitHub for easier
>>>> consumption and future development (as suggested in slides presented on
>> last call).
>>>>
>>>> Regards, -Harry
>>>>
>>>> No inline-replies below; message just for context.
>>>>
>>>>> -----Original Message-----
>>>>> From: Van Haaren, Harry
>>>>> Sent: Wednesday, March 30, 2022 10:02 AM
>>>>> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
>>>>> <bruce.richardson@intel.com>
>>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>>>>> <Sunil.Pai.G@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
>>>>> <Jiayu.Hu@intel.com>; Ferriter, Cian <Cian.Ferriter@intel.com>; Ilya
>> Maximets
>>>>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org;
>> Mcnamara,
>>>>> John <john.mcnamara@intel.com>; O'Driscoll, Tim
>> <tim.odriscoll@intel.com>;
>>>>> Finn, Emma <Emma.Finn@intel.com>
>>>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>> Sent: Tuesday, March 29, 2022 8:59 PM
>>>>>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Richardson, Bruce
>>>>>> <bruce.richardson@intel.com>
>>>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>>>>>> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
>>>>>> <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; Ilya
>> Maximets
>>>>>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org;
>> Mcnamara,
>>>>> John
>>>>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>> Finn,
>>>>>> Emma <emma.finn@intel.com>
>>>>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>
>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>> Sent: Tuesday, 29 March 2022 19.46
>>>>>>>
>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
>>>>>>>>
>>>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>>>> Sent: Tuesday, 29 March 2022 19.03
>>>>>>>>>
>>>>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>>>>>>>>
>>>>>>>>>>> Hi Morten,
>>>>>>>>>>>
>>>>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Having thought more about it, I think that a completely
>>>>>>>>> different
>>>>>>>>>>> architectural approach is required:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX
>>>>>>> and TX
>>>>>>>>>>> packet burst functions, each optimized for different CPU vector
>>>>>>>>>>> instruction sets. The availability of a DMA engine should be
>>>>>>>>> treated
>>>>>>>>>>> the same way. So I suggest that PMDs copying packet contents,
>>>>>>> e.g.
>>>>>>>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
>>>>>>>>> packet
>>>>>>>>>>> burst functions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In such an architecture, it would be the application's job
>>>>>>> to
>>>>>>>>>>> allocate DMA channels and assign them to the specific PMDs that
>>>>>>>>> should
>>>>>>>>>>> use them. But the actual use of the DMA channels would move
>>>>>>> down
>>>>>>>>> below
>>>>>>>>>>> the application and into the DPDK PMDs and libraries.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Med venlig hilsen / Kind regards,
>>>>>>>>>>>>>> -Morten Brørup
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Morten,
>>>>>>>>>>>>>
>>>>>>>>>>>>> That's *exactly* how this architecture is designed &
>>>>>>>>> implemented.
>>>>>>>>>>>>> 1.    The DMA configuration and initialization is up to the
>>>>>>>>> application
>>>>>>>>>>> (OVS).
>>>>>>>>>>>>> 2.    The VHost library is passed the DMA-dev ID, and its
>>>>>>> new
>>>>>>>>> async
>>>>>>>>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking forward to talking on the call that just started.
>>>>>>>>> Regards, -
>>>>>>>>>>> Harry
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the
>>>>>>>>> patches.
>>>>>>>>>>>>
>>>>>>>>>>>> Then, I suppose that the TX completions can be handled in the
>>>>>>> TX
>>>>>>>>>>> function, and the RX completions can be handled in the RX
>>>>>>> function,
>>>>>>>>>>> just like the Ethdev PMDs handle packet descriptors:
>>>>>>>>>>>>
>>>>>>>>>>>> TX_Burst(tx_packet_array):
>>>>>>>>>>>> 1.    Clean up descriptors processed by the NIC chip. -->
>>>>>>> Process
>>>>>>>>> TX
>>>>>>>>>>> DMA channel completions. (Effectively, the 2nd pipeline stage.)
>>>>>>>>>>>> 2.    Pass on the tx_packet_array to the NIC chip
>>>>>>> descriptors. --
>>>>>>>>>> Pass
>>>>>>>>>>> on the tx_packet_array to the TX DMA channel. (Effectively, the
>>>>>>> 1st
>>>>>>>>>>> pipeline stage.)
>>>>>>>>>>>
>>>>>>>>>>> The problem is Tx function might not be called again, so
>>>>>>> enqueued
>>>>>>>>>>> packets in 2. may never be completed from a Virtio point of
>>>>>>> view.
>>>>>>>>> IOW,
>>>>>>>>>>> the packets will be copied to the Virtio descriptors buffers,
>>>>>>> but
>>>>>>>>> the
>>>>>>>>>>> descriptors will not be made available to the Virtio driver.
>>>>>>>>>>
>>>>>>>>>> In that case, the application needs to call TX_Burst()
>>>>>>> periodically
>>>>>>>>> with an empty array, for completion purposes.
>>>>>>>
>>>>>>> This is what the "defer work" does at the OVS thread-level, but instead
>>>>>>> of
>>>>>>> "brute-forcing" and *always* making the call, the defer work concept
>>>>>>> tracks
>>>>>>> *when* there is outstanding work (DMA copies) to be completed
>>>>>>> ("deferred work")
>>>>>>> and calls the generic completion function at that point.
>>>>>>>
>>>>>>> So "defer work" is generic infrastructure at the OVS thread level to
>>>>>>> handle
>>>>>>> work that needs to be done "later", e.g. DMA completion handling.
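
To make the defer-work idea above concrete, here is a minimal sketch in C. It assumes the vhost async API shape from the DPDK patchsets under discussion (rte_vhost_submit_enqueue_burst() / rte_vhost_poll_enqueue_completed(), whose exact signatures were still evolving at the time), and the txq_state bookkeeping names are hypothetical:

    #include <stdint.h>
    #include <rte_mbuf.h>
    #include <rte_vhost_async.h>

    /* Hypothetical per-txq bookkeeping: remember how many async copies
     * are still in flight, i.e. the "deferred work". */
    struct txq_state {
        int vid;            /* vhost device id */
        uint16_t qid;       /* virtqueue id */
        int16_t dma_id;     /* dmadev used by this txq */
        uint16_t vchan;
        uint32_t inflight;
    };

    static void
    txq_send(struct txq_state *s, struct rte_mbuf **pkts, uint16_t n)
    {
        /* Stage 1: hand the packets to the DMA engine via the vhost async API. */
        uint16_t queued = rte_vhost_submit_enqueue_burst(s->vid, s->qid, pkts, n,
                                                         s->dma_id, s->vchan);
        s->inflight += queued;
    }

    /* Stage 2: called by the poll loop only while s->inflight != 0,
     * so "no work" really does mean "no cycles spent". */
    static void
    txq_complete(struct txq_state *s)
    {
        struct rte_mbuf *done[32];
        uint16_t n = rte_vhost_poll_enqueue_completed(s->vid, s->qid, done, 32,
                                                      s->dma_id, s->vchan);
        s->inflight -= n;
        rte_pktmbuf_free_bulk(done, n);
    }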
>>>>>>>
>>>>>>>
>>>>>>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK
>>>>>>>>> library, to handle DMA completion. It might even handle multiple
>>>>>>> DMA
>>>>>>>>> channels, if convenient - and if possible without locking or other
>>>>>>>>> weird complexity.
>>>>>>>
>>>>>>> That's exactly how it is done, the VHost library has a new API added,
>>>>>>> which allows
>>>>>>> for handling completions. And in the "Netdev layer" (~OVS ethdev
>>>>>>> abstraction)
>>>>>>> we add a function to allow the OVS thread to do those completions in a
>>>>>>> new
>>>>>>> Netdev-abstraction API called "async_process" where the completions can
>>>>>>> be checked.
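
As a rough illustration of that "async_process" idea (this is not the actual OVS patch; the struct and function names below are hypothetical), the netdev abstraction grows one extra callback that the PMD thread can invoke for deferred completions:

    /* Hypothetical netdev-level hook, sketching the "async_process" idea. */
    struct netdev_async_ops {
        /* Process pending async (DMA) completions for one txq; returns the
         * number of packets completed, 0 if nothing was pending. */
        int (*async_process)(void *netdev, int txq_id);
    };

    /* PMD thread fragment: only txqs that were flagged as having deferred
     * work are revisited. */
    static void
    pmd_flush_deferred(const struct netdev_async_ops *ops, void *netdev,
                       const int *deferred_txqs, int n_deferred)
    {
        for (int i = 0; i < n_deferred; i++) {
            ops->async_process(netdev, deferred_txqs[i]);
        }
    }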
>>>>>>>
>>>>>>> The only method to abstract them is to "hide" them somewhere that will
>>>>>>> always be
>>>>>>> polled, e.g. an ethdev port's RX function.  Both V3 and V4 approaches
>>>>>>> use this method.
>>>>>>> This allows "completions" to be transparent to the app, at the tradeoff
>>>>>>> of having bad
>>>>>>> separation of concerns, as Rx and Tx are now tied together.
>>>>>>>
>>>>>>> The point is, the Application layer must *somehow* handle
>>>>>>> completions.
>>>>>>> So fundamentally there are 2 options for the Application level:
>>>>>>>
>>>>>>> A) Make the application periodically call a "handle completions"
>>>>>>> function
>>>>>>>     A1) Defer work, call when needed, and track "needed" at app
>>>>>>> layer, and calling into vhost txq complete as required.
>>>>>>>             Elegant in that "no work" means "no cycles spent" on
>>>>>>> checking DMA completions.
>>>>>>>     A2) Brute-force-always-call, and pay some overhead when not
>>>>>>> required.
>>>>>>>             Cycle-cost in "no work" scenarios. Depending on # of
>>>>>>> vhost queues, this adds up as polling required *per vhost txq*.
>>>>>>>             Also note that "checking DMA completions" means taking a
>>>>>>> virtq-lock, so this "brute-force" can needlessly increase x-thread
>>>>>>> contention!
>>>>>>
>>>>>> A side note: I don't see why locking is required to test for DMA
>> completions.
>>>>>> rte_dma_vchan_status() is lockless, e.g.:
>>>>>>
>>>>>
>> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L
>>>>> 56
>>>>>> 0
>>>>>
>>>>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lockfree
>> manner
>>>>> from a single thread.
>>>>>
>>>>> The locks I refer to are at the OVS-netdev level, as virtq's are shared across
>> OVS's
>>>>> dataplane threads.
>>>>> So the "M to N" comes from M dataplane threads to N virtqs, hence
>> requiring
>>>>> some locking.
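
A small sketch of the distinction being drawn here (illustrative only; vhost_txq_complete() is a hypothetical stand-in for the vhost completion call, and the spinlock stands in for OVS's netdev-level txq locking): the dmadev query itself needs no lock, but completing a virtq that M threads can touch does.

    #include <stdbool.h>
    #include <rte_dmadev.h>
    #include <rte_spinlock.h>

    /* Safe to call without a lock from the single thread owning this dma_id. */
    static bool
    dma_vchan_busy(int16_t dma_id, uint16_t vchan)
    {
        enum rte_dma_vchan_status st;

        if (rte_dma_vchan_status(dma_id, vchan, &st) < 0)
            return false;
        return st == RTE_DMA_VCHAN_ACTIVE;
    }

    /* The virtq, however, is shared across M dataplane threads, so the
     * completion call is serialized with a per-txq lock. */
    struct shared_txq {
        rte_spinlock_t lock;
        int vid;
        uint16_t qid;
    };

    void vhost_txq_complete(int vid, uint16_t qid);   /* hypothetical */

    static void
    shared_txq_complete(struct shared_txq *t)
    {
        rte_spinlock_lock(&t->lock);
        vhost_txq_complete(t->vid, t->qid);
        rte_spinlock_unlock(&t->lock);
    }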
>>>>>
>>>>>
>>>>>>> B) Hide completions and live with the complexity/architectural
>>>>>>> sacrifice of mixed-RxTx.
>>>>>>>     Various downsides here in my opinion, see the slide deck
>>>>>>> presented earlier today for a summary.
>>>>>>>
>>>>>>> In my opinion, A1 is the most elegant solution, as it has a clean
>>>>>>> separation of concerns, does not  cause
>>>>>>> avoidable contention on virtq locks, and spends no cycles when there is
>>>>>>> no completion work to do.
>>>>>>>
>>>>>>
>>>>>> Thank you for elaborating, Harry.
>>>>>
>>>>> Thanks for part-taking in the discussion & providing your insight!
>>>>>
>>>>>> I strongly oppose hiding any part of TX processing in an RX function.
>> It
>>>>> is just
>>>>>> wrong in so many ways!
>>>>>>
>>>>>> I agree that A1 is the most elegant solution. And being the most elegant
>>>>> solution, it
>>>>>> is probably also the most future proof solution. :-)
>>>>>
>>>>> I think so too, yes.
>>>>>
>>>>>> I would also like to stress that DMA completion handling belongs in the
>> DPDK
>>>>>> library, not in the application. And yes, the application will be required to
>> call
>>>>> some
>>>>>> "handle DMA completions" function in the DPDK library. But since the
>>>>> application
>>>>>> already knows that it uses DMA, the application should also know that it
>> needs
>>>>> to
>>>>>> call this extra function - so I consider this requirement perfectly acceptable.
>>>>>
>>>>> Agree here.
>>>>>
>>>>>> I prefer if the DPDK vhost library can hide its inner workings from the
>>>>> application,
>>>>>> and just expose the additional "handle completions" function. This also
>> means
>>>>> that
>>>>>> the inner workings can be implemented as "defer work", or by some other
>>>>>> algorithm. And it can be tweaked and optimized later.
>>>>>
>>>>> Yes, the choice in how to call the handle_completions function is Application
>>>>> layer.
>>>>> For OVS we designed Defer Work, V3 and V4. But it is an App level choice,
>> and
>>>>> every
>>>>> application is free to choose its own method.
>>>>>
>>>>>> Thinking about the long term perspective, this design pattern is common
>> for
>>>>> both
>>>>>> the vhost library and other DPDK libraries that could benefit from DMA (e.g.
>>>>>> vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or
>> a
>>>>>> separate library. But for now, we should focus on the vhost use case, and
>> just
>>>>> keep
>>>>>> the long term roadmap for using DMA in mind.
>>>>>
>>>>> Totally agree to keep long term roadmap in mind; but I'm not sure we can
>>>>> refactor
>>>>> logic out of vhost. When DMA-completions arrive, the virtQ needs to be
>>>>> updated;
>>>>> this causes a tight coupling between the DMA completion count, and the
>> vhost
>>>>> library.
>>>>>
>>>>> As Ilya raised on the call yesterday, there is an "in_order" requirement in the
>>>>> vhost
>>>>> library, that per virtq the packets are presented to the guest "in order" of
>>>>> enqueue.
>>>>> (To be clear, *not* order of DMA-completion! As Jiayu mentioned, the Vhost
>>>>> library
>>>>> handles this today by re-ordering the DMA completions.)
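
A toy sketch of that in-order requirement (not the vhost library's actual implementation): completions are tracked in enqueue order, and only the contiguous completed prefix may be exposed to the guest.

    #include <stdbool.h>
    #include <stdint.h>

    #define INFLIGHT_RING 256   /* illustrative size */

    struct inflight {
        bool done[INFLIGHT_RING];
        uint32_t head;   /* oldest copy not yet exposed to the guest */
        uint32_t tail;   /* next enqueue slot, in enqueue order */
    };

    /* DMA completions may arrive in any order. */
    static void
    mark_done(struct inflight *f, uint32_t slot)
    {
        f->done[slot % INFLIGHT_RING] = true;
    }

    /* Only the contiguous completed prefix advances the guest-visible index. */
    static uint32_t
    expose_in_order(struct inflight *f)
    {
        uint32_t exposed = 0;

        while (f->head != f->tail && f->done[f->head % INFLIGHT_RING]) {
            f->done[f->head % INFLIGHT_RING] = false;
            f->head++;      /* caller bumps the virtq used index by 'exposed' */
            exposed++;
        }
        return exposed;
    }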
>>>>>
>>>>>
>>>>>> Rephrasing what I said on the conference call: This vhost design will
>> become
>>>>> the
>>>>>> common design pattern for using DMA in DPDK libraries. If we get it wrong,
>> we
>>>>> are
>>>>>> stuck with it.
>>>>>
>>>>> Agree, and if we get it right, then we're stuck with it too! :)
>>>>>
>>>>>
>>>>>>>>>> Here is another idea, inspired by a presentation at one of the
>>>>>>> DPDK
>>>>>>>>> Userspace conferences. It may be wishful thinking, though:
>>>>>>>>>>
>>>>>>>>>> Add an additional transaction to each DMA burst; a special
>>>>>>>>> transaction containing the memory write operation that makes the
>>>>>>>>> descriptors available to the Virtio driver.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That is something that can work, so long as the receiver is
>>>>>>> operating
>>>>>>>>> in
>>>>>>>>> polling mode. For cases where virtio interrupts are enabled, you
>>>>>>> still
>>>>>>>>> need
>>>>>>>>> to do a write to the eventfd in the kernel in vhost to signal the
>>>>>>>>> virtio
>>>>>>>>> side. That's not something that can be offloaded to a DMA engine,
>>>>>>>>> sadly, so
>>>>>>>>> we still need some form of completion call.
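
For reference, a sketch of what that "extra DMA element" idea could look like with the dmadev API (illustrative only; it assumes a split virtqueue, a polling guest, and pre-staged source data, and it glosses over the ordering details and the interrupt case raised above):

    #include <stdint.h>
    #include <rte_dmadev.h>

    /* Enqueue 'n' payload copies, then a final 2-byte copy that publishes the
     * pre-computed used->idx, so a polling guest sees the buffers without a
     * CPU completion call.  The eventfd kick for interrupt mode still needs
     * the CPU and cannot be expressed this way. */
    static int
    dma_burst_with_idx_update(int16_t dma_id, uint16_t vchan,
                              const rte_iova_t *src, const rte_iova_t *dst,
                              const uint32_t *len, uint16_t n,
                              rte_iova_t staged_idx, rte_iova_t used_idx)
    {
        for (uint16_t i = 0; i < n; i++) {
            if (rte_dma_copy(dma_id, vchan, src[i], dst[i], len[i], 0) < 0)
                return -1;
        }
        /* Fence so the index write is ordered after the payload copies,
         * and submit the whole batch. */
        if (rte_dma_copy(dma_id, vchan, staged_idx, used_idx, sizeof(uint16_t),
                         RTE_DMA_OP_FLAG_FENCE | RTE_DMA_OP_FLAG_SUBMIT) < 0)
            return -1;
        return 0;
    }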
>>>>>>>>
>>>>>>>> I guess that virtio interrupts is the most widely deployed scenario,
>>>>>>> so let's ignore
>>>>>>>> the DMA TX completion transaction for now - and call it a possible
>>>>>>> future
>>>>>>>> optimization for specific use cases. So it seems that some form of
>>>>>>> completion call
>>>>>>>> is unavoidable.
>>>>>>>
>>>>>>> Agree to leave this aside, there is in theory a potential optimization,
>>>>>>> but
>>>>>>> unlikely to be of large value.
>>>>>>>
>>>>>>
>>>>>> One more thing: When using DMA to pass on packets into a guest, there
>> could
>>>>> be a
>>>>>> delay from when the DMA completes until the guest is signaled. Is there any CPU
>>>>> cache
>>>>>> hotness regarding the guest's access to the packet data to consider here?
>> I.e. if
>>>>> we
>>>>>> wait to signal the guest, the packet data may get cold.
>>>>>
>>>>> Interesting question; we can likely spawn a new thread around this topic!
>>>>> In short, it depends on how/where the DMA hardware writes the copy.
>>>>>
>>>>> With technologies like DDIO, the "dest" part of the copy will be in LLC. The
>> core
>>>>> reading the
>>>>> dest data will benefit from the LLC locality (instead of snooping it from a
>> remote
>>>>> core's L1/L2).
>>>>>
>>>>> Delays in notifying the guest could result in LLC capacity eviction, yes.
>>>>> The application layer decides how often/promptly to check for completions,
>>>>> and notify the guest of them. Calling the function more often will result in
>> less
>>>>> delay in that portion of the pipeline.
>>>>>
>>>>> Overall, there are caching benefits with DMA acceleration, and the
>> application
>>>>> can control
>>>>> the latency introduced between dma-completion done in HW, and Guest
>> vring
>>>>> update.
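
Related detail, for what it's worth: the dmadev API already exposes a cache-placement hint that plays into this, though whether a given device honours it is driver-specific. An illustrative call:

    #include <rte_dmadev.h>

    /* Hint that the destination write should land in LLC (DDIO-style),
     * keeping the copied packet warm for the core that reads it next. */
    static inline int
    dma_copy_to_llc(int16_t dma_id, uint16_t vchan,
                    rte_iova_t src, rte_iova_t dst, uint32_t len)
    {
        return rte_dma_copy(dma_id, vchan, src, dst, len,
                            RTE_DMA_OP_FLAG_LLC | RTE_DMA_OP_FLAG_SUBMIT);
    }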
>>>>
>>>
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-07 14:39                           ` Ilya Maximets
@ 2022-04-07 14:42                             ` Van Haaren, Harry
  2022-04-07 15:01                               ` Ilya Maximets
  2022-04-08  7:13                             ` Hu, Jiayu
  1 sibling, 1 reply; 58+ messages in thread
From: Van Haaren, Harry @ 2022-04-07 14:42 UTC (permalink / raw)
  To: Ilya Maximets, Maxime Coquelin, Morten Brørup, Richardson, Bruce
  Cc: Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian, ovs-dev,
	dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

> -----Original Message-----
> From: Ilya Maximets <i.maximets@ovn.org>
> Sent: Thursday, April 7, 2022 3:40 PM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; Morten Brørup <mb@smartsharesystems.com>;
> Richardson, Bruce <bruce.richardson@intel.com>
> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
> <cian.ferriter@intel.com>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
> John <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> Finn, Emma <emma.finn@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> 
> On 4/7/22 16:25, Maxime Coquelin wrote:
> > Hi Harry,
> >
> > On 4/7/22 16:04, Van Haaren, Harry wrote:
> >> Hi OVS & DPDK, Maintainers & Community,
> >>
> >> Top posting overview of discussion as replies to thread become slower:
> >> perhaps it is a good time to review and plan for next steps?
> >>
> >>  From my perspective, those most vocal in the thread seem to be in favour
> of the clean
> >> rx/tx split ("defer work"), with the tradeoff that the application must be
> aware of handling
> >> the async DMA completions. If there are any concerns opposing upstreaming
> of this method,
> >> please indicate this promptly, and we can continue technical discussions here
> now.
> >
> > Wasn't there some discussion about handling the Virtio completions with
> > the DMA engine? With that, we wouldn't need the deferral of work.
> 
> +1

Yes there was, the DMA/virtq completions thread here for reference;
https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html

I do not believe that there is a viable path to actually implementing it, and particularly
not in the more complex cases; e.g. virtio with guest-interrupt enabled.

The thread above mentions additional threads and various other options; none of which
I believe to be a clean or workable solution. I'd like input from other folks more familiar
with the exact implementations of VHost/vrings, as well as those with DMA engine expertise.


> With the virtio completions handled by DMA itself, the vhost port
> turns almost into a real HW NIC.  With that we will not need any
> extra manipulations from the OVS side, i.e. no need to defer any
> work while maintaining clear split between rx and tx operations.
> 
> I'd vote for that.
> 
> >
> > Thanks,
> > Maxime

Thanks for the prompt responses, and let's understand whether there is a viable, workable way
to totally hide DMA completions from the application.

Regards,  -Harry


> >> In absence of continued technical discussion here, I suggest Sunil and Ian
> collaborate on getting
> >> the OVS Defer-work approach, and DPDK VHost Async patchsets available on
> GitHub for easier
> >> consumption and future development (as suggested in slides presented on
> last call).
> >>
> >> Regards, -Harry
> >>
> >> No inline-replies below; message just for context.
> >>
> >>> -----Original Message-----
> >>> From: Van Haaren, Harry
> >>> Sent: Wednesday, March 30, 2022 10:02 AM
> >>> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
> >>> <bruce.richardson@intel.com>
> >>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
> >>> <Sunil.Pai.G@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
> >>> <Jiayu.Hu@intel.com>; Ferriter, Cian <Cian.Ferriter@intel.com>; Ilya
> Maximets
> >>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org;
> Mcnamara,
> >>> John <john.mcnamara@intel.com>; O'Driscoll, Tim
> <tim.odriscoll@intel.com>;
> >>> Finn, Emma <Emma.Finn@intel.com>
> >>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> >>>
> >>>> -----Original Message-----
> >>>> From: Morten Brørup <mb@smartsharesystems.com>
> >>>> Sent: Tuesday, March 29, 2022 8:59 PM
> >>>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Richardson, Bruce
> >>>> <bruce.richardson@intel.com>
> >>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
> >>>> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
> >>>> <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; Ilya
> Maximets
> >>>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org;
> Mcnamara,
> >>> John
> >>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> Finn,
> >>>> Emma <emma.finn@intel.com>
> >>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> >>>>> Sent: Tuesday, 29 March 2022 19.46
> >>>>>
> >>>>>> From: Morten Brørup <mb@smartsharesystems.com>
> >>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
> >>>>>>
> >>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>>>>> Sent: Tuesday, 29 March 2022 19.03
> >>>>>>>
> >>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> >>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
> >>>>>>>>>
> >>>>>>>>> Hi Morten,
> >>>>>>>>>
> >>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
> >>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> >>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
> >>>>>>>>>>>
> >>>>>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
> >>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
> >>>>>>>>>>>>
> >>>>>>>>>>>> Having thought more about it, I think that a completely
> >>>>>>> different
> >>>>>>>>> architectural approach is required:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX
> >>>>> and TX
> >>>>>>>>> packet burst functions, each optimized for different CPU vector
> >>>>>>>>> instruction sets. The availability of a DMA engine should be
> >>>>>>> treated
> >>>>>>>>> the same way. So I suggest that PMDs copying packet contents,
> >>>>> e.g.
> >>>>>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> >>>>>>> packet
> >>>>>>>>> burst functions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Similarly for the DPDK vhost library.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In such an architecture, it would be the application's job
> >>>>> to
> >>>>>>>>> allocate DMA channels and assign them to the specific PMDs that
> >>>>>>> should
> >>>>>>>>> use them. But the actual use of the DMA channels would move
> >>>>> down
> >>>>>>> below
> >>>>>>>>> the application and into the DPDK PMDs and libraries.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Med venlig hilsen / Kind regards,
> >>>>>>>>>>>> -Morten Brørup
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Morten,
> >>>>>>>>>>>
> >>>>>>>>>>> That's *exactly* how this architecture is designed &
> >>>>>>> implemented.
> >>>>>>>>>>> 1.    The DMA configuration and initialization is up to the
> >>>>>>> application
> >>>>>>>>> (OVS).
> >>>>>>>>>>> 2.    The VHost library is passed the DMA-dev ID, and its
> >>>>> new
> >>>>>>> async
> >>>>>>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
> >>>>>>>>>>>
> >>>>>>>>>>> Looking forward to talking on the call that just started.
> >>>>>>> Regards, -
> >>>>>>>>> Harry
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the
> >>>>>>> patches.
> >>>>>>>>>>
> >>>>>>>>>> Then, I suppose that the TX completions can be handled in the
> >>>>> TX
> >>>>>>>>> function, and the RX completions can be handled in the RX
> >>>>> function,
> >>>>>>>>> just like the Ethdev PMDs handle packet descriptors:
> >>>>>>>>>>
> >>>>>>>>>> TX_Burst(tx_packet_array):
> >>>>>>>>>> 1.    Clean up descriptors processed by the NIC chip. -->
> >>>>> Process
> >>>>>>> TX
> >>>>>>>>> DMA channel completions. (Effectively, the 2nd pipeline stage.)
> >>>>>>>>>> 2.    Pass on the tx_packet_array to the NIC chip
> >>>>> descriptors. --
> >>>>>>>> Pass
> >>>>>>>>> on the tx_packet_array to the TX DMA channel. (Effectively, the
> >>>>> 1st
> >>>>>>>>> pipeline stage.)
> >>>>>>>>>
> >>>>>>>>> The problem is Tx function might not be called again, so
> >>>>> enqueued
> >>>>>>>>> packets in 2. may never be completed from a Virtio point of
> >>>>> view.
> >>>>>>> IOW,
> >>>>>>>>> the packets will be copied to the Virtio descriptors buffers,
> >>>>> but
> >>>>>>> the
> >>>>>>>>> descriptors will not be made available to the Virtio driver.
> >>>>>>>>
> >>>>>>>> In that case, the application needs to call TX_Burst()
> >>>>> periodically
> >>>>>>> with an empty array, for completion purposes.
> >>>>>
> >>>>> This is what the "defer work" does at the OVS thread-level, but instead
> >>>>> of
> >>>>> "brute-forcing" and *always* making the call, the defer work concept
> >>>>> tracks
> >>>>> *when* there is outstanding work (DMA copies) to be completed
> >>>>> ("deferred work")
> >>>>> and calls the generic completion function at that point.
> >>>>>
> >>>>> So "defer work" is generic infrastructure at the OVS thread level to
> >>>>> handle
> >>>>> work that needs to be done "later", e.g. DMA completion handling.
> >>>>>
> >>>>>
> >>>>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK
> >>>>>>> library, to handle DMA completion. It might even handle multiple
> >>>>> DMA
> >>>>>>> channels, if convenient - and if possible without locking or other
> >>>>>>> weird complexity.
> >>>>>
> >>>>> That's exactly how it is done, the VHost library has a new API added,
> >>>>> which allows
> >>>>> for handling completions. And in the "Netdev layer" (~OVS ethdev
> >>>>> abstraction)
> >>>>> we add a function to allow the OVS thread to do those completions in a
> >>>>> new
> >>>>> Netdev-abstraction API called "async_process" where the completions can
> >>>>> be checked.
> >>>>>
> >>>>> The only method to abstract them is to "hide" them somewhere that will
> >>>>> always be
> >>>>> polled, e.g. an ethdev port's RX function.  Both V3 and V4 approaches
> >>>>> use this method.
> >>>>> This allows "completions" to be transparent to the app, at the tradeoff
> >>>>> of having bad
> >>>>> separation of concerns, as Rx and Tx are now tied together.
> >>>>>
> >>>>> The point is, the Application layer must *somehow* handle
> >>>>> completions.
> >>>>> So fundamentally there are 2 options for the Application level:
> >>>>>
> >>>>> A) Make the application periodically call a "handle completions"
> >>>>> function
> >>>>>     A1) Defer work, call when needed, and track "needed" at app
> >>>>> layer, and calling into vhost txq complete as required.
> >>>>>             Elegant in that "no work" means "no cycles spent" on
> >>>>> checking DMA completions.
> >>>>>     A2) Brute-force-always-call, and pay some overhead when not
> >>>>> required.
> >>>>>             Cycle-cost in "no work" scenarios. Depending on # of
> >>>>> vhost queues, this adds up as polling required *per vhost txq*.
> >>>>>             Also note that "checking DMA completions" means taking a
> >>>>> virtq-lock, so this "brute-force" can needlessly increase x-thread
> >>>>> contention!
> >>>>
> >>>> A side note: I don't see why locking is required to test for DMA
> completions.
> >>>> rte_dma_vchan_status() is lockless, e.g.:
> >>>>
> >>>
> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L
> >>> 56
> >>>> 0
> >>>
> >>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lockfree
> manner
> >>> from a single thread.
> >>>
> >>> The locks I refer to are at the OVS-netdev level, as virtq's are shared across
> OVS's
> >>> dataplane threads.
> >>> So the "M to N" comes from M dataplane threads to N virtqs, hence
> requiring
> >>> some locking.
> >>>
> >>>
> >>>>> B) Hide completions and live with the complexity/architectural
> >>>>> sacrifice of mixed-RxTx.
> >>>>>     Various downsides here in my opinion, see the slide deck
> >>>>> presented earlier today for a summary.
> >>>>>
> >>>>> In my opinion, A1 is the most elegant solution, as it has a clean
> >>>>> separation of concerns, does not  cause
> >>>>> avoidable contention on virtq locks, and spends no cycles when there is
> >>>>> no completion work to do.
> >>>>>
> >>>>
> >>>> Thank you for elaborating, Harry.
> >>>
> >>> Thanks for part-taking in the discussion & providing your insight!
> >>>
> >>>> I strongly oppose hiding any part of TX processing in an RX function.
> It
> >>> is just
> >>>> wrong in so many ways!
> >>>>
> >>>> I agree that A1 is the most elegant solution. And being the most elegant
> >>> solution, it
> >>>> is probably also the most future proof solution. :-)
> >>>
> >>> I think so too, yes.
> >>>
> >>>> I would also like to stress that DMA completion handling belongs in the
> DPDK
> >>>> library, not in the application. And yes, the application will be required to
> call
> >>> some
> >>>> "handle DMA completions" function in the DPDK library. But since the
> >>> application
> >>>> already knows that it uses DMA, the application should also know that it
> needs
> >>> to
> >>>> call this extra function - so I consider this requirement perfectly acceptable.
> >>>
> >>> Agree here.
> >>>
> >>>> I prefer if the DPDK vhost library can hide its inner workings from the
> >>> application,
> >>>> and just expose the additional "handle completions" function. This also
> means
> >>> that
> >>>> the inner workings can be implemented as "defer work", or by some other
> >>>> algorithm. And it can be tweaked and optimized later.
> >>>
> >>> Yes, the choice in how to call the handle_completions function is Application
> >>> layer.
> >>> For OVS we designed Defer Work, V3 and V4. But it is an App level choice,
> and
> >>> every
> >>> application is free to choose its own method.
> >>>
> >>>> Thinking about the long term perspective, this design pattern is common
> for
> >>> both
> >>>> the vhost library and other DPDK libraries that could benefit from DMA (e.g.
> >>>> vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or
> a
> >>>> separate library. But for now, we should focus on the vhost use case, and
> just
> >>> keep
> >>>> the long term roadmap for using DMA in mind.
> >>>
> >>> Totally agree to keep long term roadmap in mind; but I'm not sure we can
> >>> refactor
> >>> logic out of vhost. When DMA-completions arrive, the virtQ needs to be
> >>> updated;
> >>> this causes a tight coupling between the DMA completion count, and the
> vhost
> >>> library.
> >>>
> >>> As Ilya raised on the call yesterday, there is an "in_order" requirement in the
> >>> vhost
> >>> library, that per virtq the packets are presented to the guest "in order" of
> >>> enqueue.
> >>> (To be clear, *not* order of DMA-completion! As Jiayu mentioned, the Vhost
> >>> library
> >>> handles this today by re-ordering the DMA completions.)
> >>>
> >>>
> >>>> Rephrasing what I said on the conference call: This vhost design will
> become
> >>> the
> >>>> common design pattern for using DMA in DPDK libraries. If we get it wrong,
> we
> >>> are
> >>>> stuck with it.
> >>>
> >>> Agree, and if we get it right, then we're stuck with it too! :)
> >>>
> >>>
> >>>>>>>> Here is another idea, inspired by a presentation at one of the
> >>>>> DPDK
> >>>>>>> Userspace conferences. It may be wishful thinking, though:
> >>>>>>>>
> >>>>>>>> Add an additional transaction to each DMA burst; a special
> >>>>>>> transaction containing the memory write operation that makes the
> >>>>>>> descriptors available to the Virtio driver.
> >>>>>>>>
> >>>>>>>
> >>>>>>> That is something that can work, so long as the receiver is
> >>>>> operating
> >>>>>>> in
> >>>>>>> polling mode. For cases where virtio interrupts are enabled, you
> >>>>> still
> >>>>>>> need
> >>>>>>> to do a write to the eventfd in the kernel in vhost to signal the
> >>>>>>> virtio
> >>>>>>> side. That's not something that can be offloaded to a DMA engine,
> >>>>>>> sadly, so
> >>>>>>> we still need some form of completion call.
> >>>>>>
> >>>>>> I guess that virtio interrupts is the most widely deployed scenario,
> >>>>> so let's ignore
> >>>>>> the DMA TX completion transaction for now - and call it a possible
> >>>>> future
> >>>>>> optimization for specific use cases. So it seems that some form of
> >>>>> completion call
> >>>>>> is unavoidable.
> >>>>>
> >>>>> Agree to leave this aside, there is in theory a potential optimization,
> >>>>> but
> >>>>> unlikely to be of large value.
> >>>>>
> >>>>
> >>>> One more thing: When using DMA to pass on packets into a guest, there
> could
> >>> be a
> >>>> delay from when the DMA completes until the guest is signaled. Is there any CPU
> >>> cache
> >>>> hotness regarding the guest's access to the packet data to consider here?
> I.e. if
> >>> we
> >>>> wait to signal the guest, the packet data may get cold.
> >>>
> >>> Interesting question; we can likely spawn a new thread around this topic!
> >>> In short, it depends on how/where the DMA hardware writes the copy.
> >>>
> >>> With technologies like DDIO, the "dest" part of the copy will be in LLC. The
> core
> >>> reading the
> >>> dest data will benefit from the LLC locality (instead of snooping it from a
> remote
> >>> core's L1/L2).
> >>>
> >>> Delays in notifying the guest could result in LLC capacity eviction, yes.
> >>> The application layer decides how often/promptly to check for completions,
> >>> and notify the guest of them. Calling the function more often will result in
> less
> >>> delay in that portion of the pipeline.
> >>>
> >>> Overall, there are caching benefits with DMA acceleration, and the
> application
> >>> can control
> >>> the latency introduced between dma-completion done in HW, and Guest
> vring
> >>> update.
> >>
> >


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-07 14:25                         ` Maxime Coquelin
@ 2022-04-07 14:39                           ` Ilya Maximets
  2022-04-07 14:42                             ` Van Haaren, Harry
  2022-04-08  7:13                             ` Hu, Jiayu
  0 siblings, 2 replies; 58+ messages in thread
From: Ilya Maximets @ 2022-04-07 14:39 UTC (permalink / raw)
  To: Maxime Coquelin, Van Haaren, Harry, Morten Brørup,
	Richardson, Bruce
  Cc: i.maximets, Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian,
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On 4/7/22 16:25, Maxime Coquelin wrote:
> Hi Harry,
> 
> On 4/7/22 16:04, Van Haaren, Harry wrote:
>> Hi OVS & DPDK, Maintainers & Community,
>>
>> Top posting overview of discussion as replies to thread become slower:
>> perhaps it is a good time to review and plan for next steps?
>>
>>  From my perspective, those most vocal in the thread seem to be in favour of the clean
>> rx/tx split ("defer work"), with the tradeoff that the application must be aware of handling
>> the async DMA completions. If there are any concerns opposing upstreaming of this method,
>> please indicate this promptly, and we can continue technical discussions here now.
> 
> Wasn't there some discussion about handling the Virtio completions with
> the DMA engine? With that, we wouldn't need the deferral of work.

+1

With the virtio completions handled by DMA itself, the vhost port
turns almost into a real HW NIC.  With that we will not need any
extra manipulations from the OVS side, i.e. no need to defer any
work while maintaining clear split between rx and tx operations.

I'd vote for that.

> 
> Thanks,
> Maxime
> 
>> In absence of continued technical discussion here, I suggest Sunil and Ian collaborate on getting
>> the OVS Defer-work approach, and DPDK VHost Async patchsets available on GitHub for easier
>> consumption and future development (as suggested in slides presented on last call).
>>
>> Regards, -Harry
>>
>> No inline-replies below; message just for context.
>>
>>> -----Original Message-----
>>> From: Van Haaren, Harry
>>> Sent: Wednesday, March 30, 2022 10:02 AM
>>> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
>>> <bruce.richardson@intel.com>
>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>>> <Sunil.Pai.G@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
>>> <Jiayu.Hu@intel.com>; Ferriter, Cian <Cian.Ferriter@intel.com>; Ilya Maximets
>>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
>>> John <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>>> Finn, Emma <Emma.Finn@intel.com>
>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>
>>>> -----Original Message-----
>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>> Sent: Tuesday, March 29, 2022 8:59 PM
>>>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Richardson, Bruce
>>>> <bruce.richardson@intel.com>
>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>>>> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
>>>> <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; Ilya Maximets
>>>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
>>> John
>>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>; Finn,
>>>> Emma <emma.finn@intel.com>
>>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>>
>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>> Sent: Tuesday, 29 March 2022 19.46
>>>>>
>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
>>>>>>
>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>> Sent: Tuesday, 29 March 2022 19.03
>>>>>>>
>>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>>>>>>
>>>>>>>>> Hi Morten,
>>>>>>>>>
>>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>>>>>>
>>>>>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>>>>>>
>>>>>>>>>>>> Having thought more about it, I think that a completely
>>>>>>> different
>>>>>>>>> architectural approach is required:
>>>>>>>>>>>>
>>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX
>>>>> and TX
>>>>>>>>> packet burst functions, each optimized for different CPU vector
>>>>>>>>> instruction sets. The availability of a DMA engine should be
>>>>>>> treated
>>>>>>>>> the same way. So I suggest that PMDs copying packet contents,
>>>>> e.g.
>>>>>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
>>>>>>> packet
>>>>>>>>> burst functions.
>>>>>>>>>>>>
>>>>>>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>>>>>>
>>>>>>>>>>>> In such an architecture, it would be the application's job
>>>>> to
>>>>>>>>> allocate DMA channels and assign them to the specific PMDs that
>>>>>>> should
>>>>>>>>> use them. But the actual use of the DMA channels would move
>>>>> down
>>>>>>> below
>>>>>>>>> the application and into the DPDK PMDs and libraries.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Med venlig hilsen / Kind regards,
>>>>>>>>>>>> -Morten Brørup
>>>>>>>>>>>
>>>>>>>>>>> Hi Morten,
>>>>>>>>>>>
>>>>>>>>>>> That's *exactly* how this architecture is designed &
>>>>>>> implemented.
>>>>>>>>>>> 1.    The DMA configuration and initialization is up to the
>>>>>>> application
>>>>>>>>> (OVS).
>>>>>>>>>>> 2.    The VHost library is passed the DMA-dev ID, and its
>>>>> new
>>>>>>> async
>>>>>>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>>>>>>>>>
>>>>>>>>>>> Looking forward to talking on the call that just started.
>>>>>>> Regards, -
>>>>>>>>> Harry
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the
>>>>>>> patches.
>>>>>>>>>>
>>>>>>>>>> Then, I suppose that the TX completions can be handled in the
>>>>> TX
>>>>>>>>> function, and the RX completions can be handled in the RX
>>>>> function,
>>>>>>>>> just like the Ethdev PMDs handle packet descriptors:
>>>>>>>>>>
>>>>>>>>>> TX_Burst(tx_packet_array):
>>>>>>>>>> 1.    Clean up descriptors processed by the NIC chip. -->
>>>>> Process
>>>>>>> TX
>>>>>>>>> DMA channel completions. (Effectively, the 2nd pipeline stage.)
>>>>>>>>>> 2.    Pass on the tx_packet_array to the NIC chip
>>>>> descriptors. --
>>>>>>>> Pass
>>>>>>>>> on the tx_packet_array to the TX DMA channel. (Effectively, the
>>>>> 1st
>>>>>>>>> pipeline stage.)
>>>>>>>>>
>>>>>>>>> The problem is Tx function might not be called again, so
>>>>> enqueued
>>>>>>>>> packets in 2. may never be completed from a Virtio point of
>>>>> view.
>>>>>>> IOW,
>>>>>>>>> the packets will be copied to the Virtio descriptors buffers,
>>>>> but
>>>>>>> the
>>>>>>>>> descriptors will not be made available to the Virtio driver.
>>>>>>>>
>>>>>>>> In that case, the application needs to call TX_Burst()
>>>>> periodically
>>>>>>> with an empty array, for completion purposes.
>>>>>
>>>>> This is what the "defer work" does at the OVS thread-level, but instead
>>>>> of
>>>>> "brute-forcing" and *always* making the call, the defer work concept
>>>>> tracks
>>>>> *when* there is outstanding work (DMA copies) to be completed
>>>>> ("deferred work")
>>>>> and calls the generic completion function at that point.
>>>>>
>>>>> So "defer work" is generic infrastructure at the OVS thread level to
>>>>> handle
>>>>> work that needs to be done "later", e.g. DMA completion handling.
>>>>>
>>>>>
>>>>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK
>>>>>>> library, to handle DMA completion. It might even handle multiple
>>>>> DMA
>>>>>>> channels, if convenient - and if possible without locking or other
>>>>>>> weird complexity.
>>>>>
>>>>> That's exactly how it is done, the VHost library has a new API added,
>>>>> which allows
>>>>> for handling completions. And in the "Netdev layer" (~OVS ethdev
>>>>> abstraction)
>>>>> we add a function to allow the OVS thread to do those completions in a
>>>>> new
>>>>> Netdev-abstraction API called "async_process" where the completions can
>>>>> be checked.
>>>>>
>>>>> The only method to abstract them is to "hide" them somewhere that will
>>>>> always be
>>>>> polled, e.g. an ethdev port's RX function.  Both V3 and V4 approaches
>>>>> use this method.
>>>>> This allows "completions" to be transparent to the app, at the tradeoff
>>>>> of having bad
>>>>> separation of concerns, as Rx and Tx are now tied together.
>>>>>
>>>>> The point is, the Application layer must *somehow* handle
>>>>> completions.
>>>>> So fundamentally there are 2 options for the Application level:
>>>>>
>>>>> A) Make the application periodically call a "handle completions"
>>>>> function
>>>>>     A1) Defer work, call when needed, and track "needed" at app
>>>>> layer, and calling into vhost txq complete as required.
>>>>>             Elegant in that "no work" means "no cycles spent" on
>>>>> checking DMA completions.
>>>>>     A2) Brute-force-always-call, and pay some overhead when not
>>>>> required.
>>>>>             Cycle-cost in "no work" scenarios. Depending on # of
>>>>> vhost queues, this adds up as polling required *per vhost txq*.
>>>>>             Also note that "checking DMA completions" means taking a
>>>>> virtq-lock, so this "brute-force" can needlessly increase x-thread
>>>>> contention!
>>>>
>>>> A side note: I don't see why locking is required to test for DMA completions.
>>>> rte_dma_vchan_status() is lockless, e.g.:
>>>>
>>> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L
>>> 56
>>>> 0
>>>
>>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lockfree manner
>>> from a single thread.
>>>
>>> The locks I refer to are at the OVS-netdev level, as virtq's are shared across OVS's
>>> dataplane threads.
>>> So the "M to N" comes from M dataplane threads to N virtqs, hence requiring
>>> some locking.
>>>
>>>
>>>>> B) Hide completions and live with the complexity/architectural
>>>>> sacrifice of mixed-RxTx.
>>>>>     Various downsides here in my opinion, see the slide deck
>>>>> presented earlier today for a summary.
>>>>>
>>>>> In my opinion, A1 is the most elegant solution, as it has a clean
>>>>> separation of concerns, does not  cause
>>>>> avoidable contention on virtq locks, and spends no cycles when there is
>>>>> no completion work to do.
>>>>>
>>>>
>>>> Thank you for elaborating, Harry.
>>>
>>> Thanks for part-taking in the discussion & providing your insight!
>>>
>>>> I strongly oppose hiding any part of TX processing in an RX function. It
>>> is just
>>>> wrong in so many ways!
>>>>
>>>> I agree that A1 is the most elegant solution. And being the most elegant
>>> solution, it
>>>> is probably also the most future proof solution. :-)
>>>
>>> I think so too, yes.
>>>
>>>> I would also like to stress that DMA completion handling belongs in the DPDK
>>>> library, not in the application. And yes, the application will be required to call
>>> some
>>>> "handle DMA completions" function in the DPDK library. But since the
>>> application
>>>> already knows that it uses DMA, the application should also know that it needs
>>> to
>>>> call this extra function - so I consider this requirement perfectly acceptable.
>>>
>>> Agree here.
>>>
>>>> I prefer if the DPDK vhost library can hide its inner workings from the
>>> application,
>>>> and just expose the additional "handle completions" function. This also means
>>> that
>>>> the inner workings can be implemented as "defer work", or by some other
>>>> algorithm. And it can be tweaked and optimized later.
>>>
>>> Yes, the choice in how to call the handle_completions function is Application
>>> layer.
>>> For OVS we designed Defer Work, V3 and V4. But it is an App level choice, and
>>> every
>>> application is free to choose its own method.
>>>
>>>> Thinking about the long term perspective, this design pattern is common for
>>> both
>>>> the vhost library and other DPDK libraries that could benefit from DMA (e.g.
>>>> vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or a
>>>> separate library. But for now, we should focus on the vhost use case, and just
>>> keep
>>>> the long term roadmap for using DMA in mind.
>>>
>>> Totally agree to keep long term roadmap in mind; but I'm not sure we can
>>> refactor
>>> logic out of vhost. When DMA-completions arrive, the virtQ needs to be
>>> updated;
>>> this causes a tight coupling between the DMA completion count, and the vhost
>>> library.
>>>
>>> As Ilya raised on the call yesterday, there is an "in_order" requirement in the
>>> vhost
>>> library, that per virtq the packets are presented to the guest "in order" of
>>> enqueue.
>>> (To be clear, *not* order of DMA-completion! As Jiayu mentioned, the Vhost
>>> library
>>> handles this today by re-ordering the DMA completions.)
>>>
>>>
>>>> Rephrasing what I said on the conference call: This vhost design will become
>>> the
>>>> common design pattern for using DMA in DPDK libraries. If we get it wrong, we
>>> are
>>>> stuck with it.
>>>
>>> Agree, and if we get it right, then we're stuck with it too! :)
>>>
>>>
>>>>>>>> Here is another idea, inspired by a presentation at one of the
>>>>> DPDK
>>>>>>> Userspace conferences. It may be wishful thinking, though:
>>>>>>>>
>>>>>>>> Add an additional transaction to each DMA burst; a special
>>>>>>> transaction containing the memory write operation that makes the
>>>>>>> descriptors available to the Virtio driver.
>>>>>>>>
>>>>>>>
>>>>>>> That is something that can work, so long as the receiver is
>>>>> operating
>>>>>>> in
>>>>>>> polling mode. For cases where virtio interrupts are enabled, you
>>>>> still
>>>>>>> need
>>>>>>> to do a write to the eventfd in the kernel in vhost to signal the
>>>>>>> virtio
>>>>>>> side. That's not something that can be offloaded to a DMA engine,
>>>>>>> sadly, so
>>>>>>> we still need some form of completion call.
>>>>>>
>>>>>> I guess that virtio interrupts is the most widely deployed scenario,
>>>>> so let's ignore
>>>>>> the DMA TX completion transaction for now - and call it a possible
>>>>> future
>>>>>> optimization for specific use cases. So it seems that some form of
>>>>> completion call
>>>>>> is unavoidable.
>>>>>
>>>>> Agree to leave this aside, there is in theory a potential optimization,
>>>>> but
>>>>> unlikely to be of large value.
>>>>>
>>>>
>>>> One more thing: When using DMA to pass on packets into a guest, there could
>>> be a
>>>> delay from when the DMA completes until the guest is signaled. Is there any CPU
>>> cache
>>>> hotness regarding the guest's access to the packet data to consider here? I.e. if
>>> we
>>>> wait to signal the guest, the packet data may get cold.
>>>
>>> Interesting question; we can likely spawn a new thread around this topic!
>>> In short, it depends on how/where the DMA hardware writes the copy.
>>>
>>> With technologies like DDIO, the "dest" part of the copy will be in LLC. The core
>>> reading the
>>> dest data will benefit from the LLC locality (instead of snooping it from a remote
>>> core's L1/L2).
>>>
>>> Delays in notifying the guest could result in LLC capacity eviction, yes.
>>> The application layer decides how often/promptly to check for completions,
>>> and notify the guest of them. Calling the function more often will result in less
>>> delay in that portion of the pipeline.
>>>
>>> Overall, there are caching benefits with DMA acceleration, and the application
>>> can control
>>> the latency introduced between dma-completion done in HW, and Guest vring
>>> update.
>>
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-07 14:04                       ` Van Haaren, Harry
@ 2022-04-07 14:25                         ` Maxime Coquelin
  2022-04-07 14:39                           ` Ilya Maximets
  0 siblings, 1 reply; 58+ messages in thread
From: Maxime Coquelin @ 2022-04-07 14:25 UTC (permalink / raw)
  To: Van Haaren, Harry, Morten Brørup, Richardson, Bruce
  Cc: Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian,
	Ilya Maximets, ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim,
	Finn, Emma

Hi Harry,

On 4/7/22 16:04, Van Haaren, Harry wrote:
> Hi OVS & DPDK, Maintainers & Community,
> 
> Top posting overview of discussion as replies to thread become slower:
> perhaps it is a good time to review and plan for next steps?
> 
>  From my perspective, those most vocal in the thread seem to be in favour of the clean
> rx/tx split ("defer work"), with the tradeoff that the application must be aware of handling
> the async DMA completions. If there are any concerns opposing upstreaming of this method,
> please indicate this promptly, and we can continue technical discussions here now.

Wasn't there some discussion about handling the Virtio completions with
the DMA engine? With that, we wouldn't need the deferral of work.

Thanks,
Maxime

> In absence of continued technical discussion here, I suggest Sunil and Ian collaborate on getting
> the OVS Defer-work approach, and DPDK VHost Async patchsets available on GitHub for easier
> consumption and future development (as suggested in slides presented on last call).
> 
> Regards, -Harry
> 
> No inline-replies below; message just for context.
> 
>> -----Original Message-----
>> From: Van Haaren, Harry
>> Sent: Wednesday, March 30, 2022 10:02 AM
>> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
>> <bruce.richardson@intel.com>
>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>> <Sunil.Pai.G@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
>> <Jiayu.Hu@intel.com>; Ferriter, Cian <Cian.Ferriter@intel.com>; Ilya Maximets
>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
>> John <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>> Finn, Emma <Emma.Finn@intel.com>
>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>
>>> -----Original Message-----
>>> From: Morten Brørup <mb@smartsharesystems.com>
>>> Sent: Tuesday, March 29, 2022 8:59 PM
>>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Richardson, Bruce
>>> <bruce.richardson@intel.com>
>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>>> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
>>> <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; Ilya Maximets
>>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
>> John
>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>; Finn,
>>> Emma <emma.finn@intel.com>
>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>
>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>> Sent: Tuesday, 29 March 2022 19.46
>>>>
>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
>>>>>
>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>> Sent: Tuesday, 29 March 2022 19.03
>>>>>>
>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>>>>>
>>>>>>>> Hi Morten,
>>>>>>>>
>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>>>>>
>>>>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>>>>>
>>>>>>>>>>> Having thought more about it, I think that a completely
>>>>>> different
>>>>>>>> architectural approach is required:
>>>>>>>>>>>
>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX
>>>> and TX
>>>>>>>> packet burst functions, each optimized for different CPU vector
>>>>>>>> instruction sets. The availability of a DMA engine should be
>>>>>> treated
>>>>>>>> the same way. So I suggest that PMDs copying packet contents,
>>>> e.g.
>>>>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
>>>>>> packet
>>>>>>>> burst functions.
>>>>>>>>>>>
>>>>>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>>>>>
>>>>>>>>>>> In such an architecture, it would be the application's job
>>>> to
>>>>>>>> allocate DMA channels and assign them to the specific PMDs that
>>>>>> should
>>>>>>>> use them. But the actual use of the DMA channels would move
>>>> down
>>>>>> below
>>>>>>>> the application and into the DPDK PMDs and libraries.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Med venlig hilsen / Kind regards,
>>>>>>>>>>> -Morten Brørup
>>>>>>>>>>
>>>>>>>>>> Hi Morten,
>>>>>>>>>>
>>>>>>>>>> That's *exactly* how this architecture is designed &
>>>>>> implemented.
>>>>>>>>>> 1.	The DMA configuration and initialization is up to the
>>>>>> application
>>>>>>>> (OVS).
>>>>>>>>>> 2.	The VHost library is passed the DMA-dev ID, and its
>>>> new
>>>>>> async
>>>>>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>>>>>>>>
>>>>>>>>>> Looking forward to talking on the call that just started.
>>>>>> Regards, -
>>>>>>>> Harry
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the
>>>>>> patches.
>>>>>>>>>
>>>>>>>>> Then, I suppose that the TX completions can be handled in the
>>>> TX
>>>>>>>> function, and the RX completions can be handled in the RX
>>>> function,
>>>>>>>> just like the Ethdev PMDs handle packet descriptors:
>>>>>>>>>
>>>>>>>>> TX_Burst(tx_packet_array):
>>>>>>>>> 1.	Clean up descriptors processed by the NIC chip. -->
>>>> Process
>>>>>> TX
>>>>>>>> DMA channel completions. (Effectively, the 2nd pipeline stage.)
>>>>>>>>> 2.	Pass on the tx_packet_array to the NIC chip
>>>> descriptors. --
>>>>>>> Pass
>>>>>>>> on the tx_packet_array to the TX DMA channel. (Effectively, the
>>>> 1st
>>>>>>>> pipeline stage.)
>>>>>>>>
>>>>>>>> The problem is Tx function might not be called again, so
>>>> enqueued
>>>>>>>> packets in 2. may never be completed from a Virtio point of
>>>> view.
>>>>>> IOW,
>>>>>>>> the packets will be copied to the Virtio descriptors buffers,
>>>> but
>>>>>> the
>>>>>>>> descriptors will not be made available to the Virtio driver.
>>>>>>>
>>>>>>> In that case, the application needs to call TX_Burst()
>>>> periodically
>>>>>> with an empty array, for completion purposes.
>>>>
>>>> This is what the "defer work" does at the OVS thread-level, but instead
>>>> of
>>>> "brute-forcing" and *always* making the call, the defer work concept
>>>> tracks
>>>> *when* there is outstanding work (DMA copies) to be completed
>>>> ("deferred work")
>>>> and calls the generic completion function at that point.
>>>>
>>>> So "defer work" is generic infrastructure at the OVS thread level to
>>>> handle
>>>> work that needs to be done "later", e.g. DMA completion handling.
>>>>
>>>>
>>>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK
>>>>>> library, to handle DMA completion. It might even handle multiple
>>>> DMA
>>>>>> channels, if convenient - and if possible without locking or other
>>>>>> weird complexity.
>>>>
>>>> That's exactly how it is done, the VHost library has a new API added,
>>>> which allows
>>>> for handling completions. And in the "Netdev layer" (~OVS ethdev
>>>> abstraction)
>>>> we add a function to allow the OVS thread to do those completions in a
>>>> new
>>>> Netdev-abstraction API called "async_process" where the completions can
>>>> be checked.
>>>>
>>>> The only method to abstract them is to "hide" them somewhere that will
>>>> always be
>>>> polled, e.g. an ethdev port's RX function.  Both V3 and V4 approaches
>>>> use this method.
>>>> This allows "completions" to be transparent to the app, at the tradeoff
>>>> of having bad
>>>> separation of concerns, as Rx and Tx are now tied together.
>>>>
>>>> The point is, the Application layer must *somehow* handle
>>>> completions.
>>>> So fundamentally there are 2 options for the Application level:
>>>>
>>>> A) Make the application periodically call a "handle completions"
>>>> function
>>>> 	A1) Defer work, call when needed, and track "needed" at app
>>>> layer, and calling into vhost txq complete as required.
>>>> 	        Elegant in that "no work" means "no cycles spent" on
>>>> checking DMA completions.
>>>> 	A2) Brute-force-always-call, and pay some overhead when not
>>>> required.
>>>> 	        Cycle-cost in "no work" scenarios. Depending on # of
>>>> vhost queues, this adds up as polling required *per vhost txq*.
>>>> 	        Also note that "checking DMA completions" means taking a
>>>> virtq-lock, so this "brute-force" can needlessly increase x-thread
>>>> contention!
>>>
>>> A side note: I don't see why locking is required to test for DMA completions.
>>> rte_dma_vchan_status() is lockless, e.g.:
>>>
>> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L
>> 56
>>> 0
>>
>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lockfree manner
>> from a single thread.
>>
>> The locks I refer to are at the OVS-netdev level, as virtq's are shared across OVS's
>> dataplane threads.
>> So the "M to N" comes from M dataplane threads to N virtqs, hence requiring
>> some locking.
>>
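For reference, the lockless per-channel check mentioned in the side note above is roughly
the following (a sketch only; not every driver implements rte_dma_vchan_status(), hence the
"not idle" fallback):

#include <stdbool.h>
#include <rte_dmadev.h>

/* Lockless: called only from the single thread that owns this dma_id/vchan,
 * so no virtq or other lock is involved at this level. */
static bool
dma_channel_idle(int16_t dma_id, uint16_t vchan)
{
    enum rte_dma_vchan_status st;

    if (rte_dma_vchan_status(dma_id, vchan, &st) != 0)
        return false; /* not supported by the driver: assume not idle */
    return st == RTE_DMA_VCHAN_IDLE;
}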
>>
>>>> B) Hide completions and live with the complexity/architectural
>>>> sacrifice of mixed-RxTx.
>>>> 	Various downsides here in my opinion, see the slide deck
>>>> presented earlier today for a summary.
>>>>
>>>> In my opinion, A1 is the most elegant solution, as it has a clean
>>>> separation of concerns, does not  cause
>>>> avoidable contention on virtq locks, and spends no cycles when there is
>>>> no completion work to do.
>>>>
>>>
>>> Thank you for elaborating, Harry.
>>
>> Thanks for part-taking in the discussion & providing your insight!
>>
>>> I strongly oppose against hiding any part of TX processing in an RX function. It
>> is just
>>> wrong in so many ways!
>>>
>>> I agree that A1 is the most elegant solution. And being the most elegant
>> solution, it
>>> is probably also the most future proof solution. :-)
>>
>> I think so too, yes.
>>
>>> I would also like to stress that DMA completion handling belongs in the DPDK
>>> library, not in the application. And yes, the application will be required to call
>> some
>>> "handle DMA completions" function in the DPDK library. But since the
>> application
>>> already knows that it uses DMA, the application should also know that it needs
>> to
>>> call this extra function - so I consider this requirement perfectly acceptable.
>>
>> Agree here.
>>
>>> I prefer if the DPDK vhost library can hide its inner workings from the
>> application,
>>> and just expose the additional "handle completions" function. This also means
>> that
>>> the inner workings can be implemented as "defer work", or by some other
>>> algorithm. And it can be tweaked and optimized later.
>>
>> Yes, the choice in how to call the handle_completions function is Application
>> layer.
>> For OVS we designed Defer Work, V3 and V4. But it is an App level choice, and
>> every
>> application is free to choose its own method.
>>
>>> Thinking about the long term perspective, this design pattern is common for
>> both
>>> the vhost library and other DPDK libraries that could benefit from DMA (e.g.
>>> vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or a
>>> separate library. But for now, we should focus on the vhost use case, and just
>> keep
>>> the long term roadmap for using DMA in mind.
>>
>> Totally agree to keep long term roadmap in mind; but I'm not sure we can
>> refactor
>> logic out of vhost. When DMA-completions arrive, the virtQ needs to be
>> updated;
>> this causes a tight coupling between the DMA completion count, and the vhost
>> library.
>>
>> As Ilya raised on the call yesterday, there is an "in_order" requirement in the
>> vhost
>> library, that per virtq the packets are presented to the guest "in order" of
>> enqueue.
>> (To be clear, *not* order of DMA-completion! As Jiayu mentioned, the Vhost
>> library
>> handles this today by re-ordering the DMA completions.)
>>
>>
>>> Rephrasing what I said on the conference call: This vhost design will become
>> the
>>> common design pattern for using DMA in DPDK libraries. If we get it wrong, we
>> are
>>> stuck with it.
>>
>> Agree, and if we get it right, then we're stuck with it too! :)
>>
>>
>>>>>>> Here is another idea, inspired by a presentation at one of the
>>>> DPDK
>>>>>> Userspace conferences. It may be wishful thinking, though:
>>>>>>>
>>>>>>> Add an additional transaction to each DMA burst; a special
>>>>>> transaction containing the memory write operation that makes the
>>>>>> descriptors available to the Virtio driver.
>>>>>>>
>>>>>>
>>>>>> That is something that can work, so long as the receiver is
>>>> operating
>>>>>> in
>>>>>> polling mode. For cases where virtio interrupts are enabled, you
>>>> still
>>>>>> need
>>>>>> to do a write to the eventfd in the kernel in vhost to signal the
>>>>>> virtio
>>>>>> side. That's not something that can be offloaded to a DMA engine,
>>>>>> sadly, so
>>>>>> we still need some form of completion call.
>>>>>
>>>>> I guess that virtio interrupts is the most widely deployed scenario,
>>>> so let's ignore
>>>>> the DMA TX completion transaction for now - and call it a possible
>>>> future
>>>>> optimization for specific use cases. So it seems that some form of
>>>> completion call
>>>>> is unavoidable.
>>>>
>>>> Agree to leave this aside, there is in theory a potential optimization,
>>>> but
>>>> unlikely to be of large value.
>>>>
>>>
>>> One more thing: When using DMA to pass on packets into a guest, there could
>> be a
>>> delay from the DMA completes until the guest is signaled. Is there any CPU
>> cache
>>> hotness regarding the guest's access to the packet data to consider here? I.e. if
>> we
>>> wait signaling the guest, the packet data may get cold.
>>
>> Interesting question; we can likely spawn a new thread around this topic!
>> In short, it depends on how/where the DMA hardware writes the copy.
>>
>> With technologies like DDIO, the "dest" part of the copy will be in LLC. The core
>> reading the
>> dest data will benefit from the LLC locality (instead of snooping it from a remote
>> core's L1/L2).
>>
>> Delays in notifying the guest could result in LLC capacity eviction, yes.
>> The application layer decides how often/promptly to check for completions,
>> and notify the guest of them. Calling the function more often will result in less
>> delay in that portion of the pipeline.
>>
>> Overall, there are caching benefits with DMA acceleration, and the application
>> can control
>> the latency introduced between dma-completion done in HW, and Guest vring
>> update.
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-30  9:01                     ` Van Haaren, Harry
@ 2022-04-07 14:04                       ` Van Haaren, Harry
  2022-04-07 14:25                         ` Maxime Coquelin
  0 siblings, 1 reply; 58+ messages in thread
From: Van Haaren, Harry @ 2022-04-07 14:04 UTC (permalink / raw)
  To: Morten Brørup, Richardson,  Bruce
  Cc: Maxime Coquelin, Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter,
	Cian, Ilya Maximets, ovs-dev, dev, Mcnamara, John,
	O'Driscoll, Tim, Finn, Emma

Hi OVS & DPDK, Maintainers & Community,

Top-posting an overview of the discussion, as replies to the thread have become slower:
perhaps it is a good time to review and plan the next steps?

From my perspective, those most vocal in the thread seem to be in favour of the clean
rx/tx split ("defer work"), with the tradeoff that the application must be aware of handling
the async DMA completions. If there are any concerns opposing upstreaming of this method,
please indicate them promptly, and we can continue the technical discussion here now.
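
For anyone catching up on the thread, a minimal sketch of the A1 "defer work" pattern at
the application level is below. All identifiers are illustrative only; they are not the
actual OVS or vhost async APIs from the patchsets:

#include <stdint.h>

#define MAX_TXQ 64

/* Hypothetical per-queue hook; in the real patches this would call into the
 * vhost async completion API for the corresponding virtqueue. */
extern uint32_t txq_poll_completions(int qid);

struct dp_thread {
    uint32_t dma_outstanding[MAX_TXQ]; /* copies enqueued, not yet completed */
    uint64_t deferred_work_mask;       /* bit n set => txq n has pending DMA */
};

/* Transmit path: remember that this queue now has deferred work. */
static void
txq_record_dma_enqueue(struct dp_thread *t, int qid, uint32_t n_copies)
{
    t->dma_outstanding[qid] += n_copies;
    t->deferred_work_mask |= UINT64_C(1) << qid;
}

/* Main loop: no deferred work means no cycles spent and no virtq lock taken. */
static void
dp_thread_defer_work_run(struct dp_thread *t)
{
    uint64_t mask = t->deferred_work_mask;

    while (mask) {
        int qid = __builtin_ctzll(mask);

        mask &= mask - 1;
        t->dma_outstanding[qid] -= txq_poll_completions(qid);
        if (t->dma_outstanding[qid] == 0)
            t->deferred_work_mask &= ~(UINT64_C(1) << qid);
    }
}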

In the absence of continued technical discussion here, I suggest Sunil and Ian collaborate on
getting the OVS defer-work approach and the DPDK VHost Async patchsets available on GitHub for
easier consumption and future development (as suggested in the slides presented on the last call).

Regards, -Harry

No inline-replies below; message just for context.

> -----Original Message-----
> From: Van Haaren, Harry
> Sent: Wednesday, March 30, 2022 10:02 AM
> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
> <Sunil.Pai.G@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
> <Jiayu.Hu@intel.com>; Ferriter, Cian <Cian.Ferriter@intel.com>; Ilya Maximets
> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
> John <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> Finn, Emma <Emma.Finn@intel.com>
> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> 
> > -----Original Message-----
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Tuesday, March 29, 2022 8:59 PM
> > To: Van Haaren, Harry <harry.van.haaren@intel.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>
> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
> > <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
> > <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; Ilya Maximets
> > <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
> John
> > <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>; Finn,
> > Emma <emma.finn@intel.com>
> > Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> >
> > > From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > > Sent: Tuesday, 29 March 2022 19.46
> > >
> > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > Sent: Tuesday, March 29, 2022 6:14 PM
> > > >
> > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > Sent: Tuesday, 29 March 2022 19.03
> > > > >
> > > > > On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > > > > > > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > > > > > > Sent: Tuesday, 29 March 2022 18.24
> > > > > > >
> > > > > > > Hi Morten,
> > > > > > >
> > > > > > > On 3/29/22 16:44, Morten Brørup wrote:
> > > > > > > >> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > > > > > > >> Sent: Tuesday, 29 March 2022 15.02
> > > > > > > >>
> > > > > > > >>> From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > >>> Sent: Tuesday, March 29, 2022 1:51 PM
> > > > > > > >>>
> > > > > > > >>> Having thought more about it, I think that a completely
> > > > > different
> > > > > > > architectural approach is required:
> > > > > > > >>>
> > > > > > > >>> Many of the DPDK Ethernet PMDs implement a variety of RX
> > > and TX
> > > > > > > packet burst functions, each optimized for different CPU vector
> > > > > > > instruction sets. The availability of a DMA engine should be
> > > > > treated
> > > > > > > the same way. So I suggest that PMDs copying packet contents,
> > > e.g.
> > > > > > > memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> > > > > packet
> > > > > > > burst functions.
> > > > > > > >>>
> > > > > > > >>> Similarly for the DPDK vhost library.
> > > > > > > >>>
> > > > > > > >>> In such an architecture, it would be the application's job
> > > to
> > > > > > > allocate DMA channels and assign them to the specific PMDs that
> > > > > should
> > > > > > > use them. But the actual use of the DMA channels would move
> > > down
> > > > > below
> > > > > > > the application and into the DPDK PMDs and libraries.
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> Med venlig hilsen / Kind regards,
> > > > > > > >>> -Morten Brørup
> > > > > > > >>
> > > > > > > >> Hi Morten,
> > > > > > > >>
> > > > > > > >> That's *exactly* how this architecture is designed &
> > > > > implemented.
> > > > > > > >> 1.	The DMA configuration and initialization is up to the
> > > > > application
> > > > > > > (OVS).
> > > > > > > >> 2.	The VHost library is passed the DMA-dev ID, and its
> > > new
> > > > > async
> > > > > > > rx/tx APIs, and uses the DMA device to accelerate the copy.
> > > > > > > >>
> > > > > > > >> Looking forward to talking on the call that just started.
> > > > > Regards, -
> > > > > > > Harry
> > > > > > > >>
> > > > > > > >
> > > > > > > > OK, thanks - as I said on the call, I haven't looked at the
> > > > > patches.
> > > > > > > >
> > > > > > > > Then, I suppose that the TX completions can be handled in the
> > > TX
> > > > > > > function, and the RX completions can be handled in the RX
> > > function,
> > > > > > > just like the Ethdev PMDs handle packet descriptors:
> > > > > > > >
> > > > > > > > TX_Burst(tx_packet_array):
> > > > > > > > 1.	Clean up descriptors processed by the NIC chip. -->
> > > Process
> > > > > TX
> > > > > > > DMA channel completions. (Effectively, the 2nd pipeline stage.)
> > > > > > > > 2.	Pass on the tx_packet_array to the NIC chip
> > > descriptors. --
> > > > > > Pass
> > > > > > > on the tx_packet_array to the TX DMA channel. (Effectively, the
> > > 1st
> > > > > > > pipeline stage.)
> > > > > > >
> > > > > > > The problem is Tx function might not be called again, so
> > > enqueued
> > > > > > > packets in 2. may never be completed from a Virtio point of
> > > view.
> > > > > IOW,
> > > > > > > the packets will be copied to the Virtio descriptors buffers,
> > > but
> > > > > the
> > > > > > > descriptors will not be made available to the Virtio driver.
> > > > > >
> > > > > > In that case, the application needs to call TX_Burst()
> > > periodically
> > > > > with an empty array, for completion purposes.
> > >
> > > This is what the "defer work" does at the OVS thread-level, but instead
> > > of
> > > "brute-forcing" and *always* making the call, the defer work concept
> > > tracks
> > > *when* there is outstanding work (DMA copies) to be completed
> > > ("deferred work")
> > > and calls the generic completion function at that point.
> > >
> > > So "defer work" is generic infrastructure at the OVS thread level to
> > > handle
> > > work that needs to be done "later", e.g. DMA completion handling.
> > >
> > >
> > > > > > Or some sort of TX_Keepalive() function can be added to the DPDK
> > > > > library, to handle DMA completion. It might even handle multiple
> > > DMA
> > > > > channels, if convenient - and if possible without locking or other
> > > > > weird complexity.
> > >
> > > That's exactly how it is done, the VHost library has a new API added,
> > > which allows
> > > for handling completions. And in the "Netdev layer" (~OVS ethdev
> > > abstraction)
> > > we add a function to allow the OVS thread to do those completions in a
> > > new
> > > Netdev-abstraction API called "async_process" where the completions can
> > > be checked.
> > >
> > > The only method to abstract them is to "hide" them somewhere that will
> > > always be
> > > polled, e.g. an ethdev port's RX function.  Both V3 and V4 approaches
> > > use this method.
> > > This allows "completions" to be transparent to the app, at the tradeoff
> > > to having bad
> > > separation  of concerns as Rx and Tx are now tied-together.
> > >
> > > The point is, the Application layer must *somehow * handle of
> > > completions.
> > > So fundamentally there are 2 options for the Application level:
> > >
> > > A) Make the application periodically call a "handle completions"
> > > function
> > > 	A1) Defer work, call when needed, and track "needed" at app
> > > layer, and calling into vhost txq complete as required.
> > > 	        Elegant in that "no work" means "no cycles spent" on
> > > checking DMA completions.
> > > 	A2) Brute-force-always-call, and pay some overhead when not
> > > required.
> > > 	        Cycle-cost in "no work" scenarios. Depending on # of
> > > vhost queues, this adds up as polling required *per vhost txq*.
> > > 	        Also note that "checking DMA completions" means taking a
> > > virtq-lock, so this "brute-force" can needlessly increase x-thread
> > > contention!
> >
> > A side note: I don't see why locking is required to test for DMA completions.
> > rte_dma_vchan_status() is lockless, e.g.:
> >
> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L
> 56
> > 0
> 
> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lockfree manner
> from a single thread.
> 
> The locks I refer to are at the OVS-netdev level, as virtq's are shared across OVS's
> dataplane threads.
> So the "M to N" comes from M dataplane threads to N virtqs, hence requiring
> some locking.
> 
> 
> > > B) Hide completions and live with the complexity/architectural
> > > sacrifice of mixed-RxTx.
> > > 	Various downsides here in my opinion, see the slide deck
> > > presented earlier today for a summary.
> > >
> > > In my opinion, A1 is the most elegant solution, as it has a clean
> > > separation of concerns, does not  cause
> > > avoidable contention on virtq locks, and spends no cycles when there is
> > > no completion work to do.
> > >
> >
> > Thank you for elaborating, Harry.
> 
> Thanks for part-taking in the discussion & providing your insight!
> 
> > I strongly oppose against hiding any part of TX processing in an RX function. It
> is just
> > wrong in so many ways!
> >
> > I agree that A1 is the most elegant solution. And being the most elegant
> solution, it
> > is probably also the most future proof solution. :-)
> 
> I think so too, yes.
> 
> > I would also like to stress that DMA completion handling belongs in the DPDK
> > library, not in the application. And yes, the application will be required to call
> some
> > "handle DMA completions" function in the DPDK library. But since the
> application
> > already knows that it uses DMA, the application should also know that it needs
> to
> > call this extra function - so I consider this requirement perfectly acceptable.
> 
> Agree here.
> 
> > I prefer if the DPDK vhost library can hide its inner workings from the
> application,
> > and just expose the additional "handle completions" function. This also means
> that
> > the inner workings can be implemented as "defer work", or by some other
> > algorithm. And it can be tweaked and optimized later.
> 
> Yes, the choice in how to call the handle_completions function is Application
> layer.
> For OVS we designed Defer Work, V3 and V4. But it is an App level choice, and
> every
> application is free to choose its own method.
> 
> > Thinking about the long term perspective, this design pattern is common for
> both
> > the vhost library and other DPDK libraries that could benefit from DMA (e.g.
> > vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or a
> > separate library. But for now, we should focus on the vhost use case, and just
> keep
> > the long term roadmap for using DMA in mind.
> 
> Totally agree to keep long term roadmap in mind; but I'm not sure we can
> refactor
> logic out of vhost. When DMA-completions arrive, the virtQ needs to be
> updated;
> this causes a tight coupling between the DMA completion count, and the vhost
> library.
> 
> As Ilya raised on the call yesterday, there is an "in_order" requirement in the
> vhost
> library, that per virtq the packets are presented to the guest "in order" of
> enqueue.
> (To be clear, *not* order of DMA-completion! As Jiayu mentioned, the Vhost
> library
> handles this today by re-ordering the DMA completions.)
> 
> 
> > Rephrasing what I said on the conference call: This vhost design will become
> the
> > common design pattern for using DMA in DPDK libraries. If we get it wrong, we
> are
> > stuck with it.
> 
> Agree, and if we get it right, then we're stuck with it too! :)
> 
> 
> > > > > > Here is another idea, inspired by a presentation at one of the
> > > DPDK
> > > > > Userspace conferences. It may be wishful thinking, though:
> > > > > >
> > > > > > Add an additional transaction to each DMA burst; a special
> > > > > transaction containing the memory write operation that makes the
> > > > > descriptors available to the Virtio driver.
> > > > > >
> > > > >
> > > > > That is something that can work, so long as the receiver is
> > > operating
> > > > > in
> > > > > polling mode. For cases where virtio interrupts are enabled, you
> > > still
> > > > > need
> > > > > to do a write to the eventfd in the kernel in vhost to signal the
> > > > > virtio
> > > > > side. That's not something that can be offloaded to a DMA engine,
> > > > > sadly, so
> > > > > we still need some form of completion call.
> > > >
> > > > I guess that virtio interrupts is the most widely deployed scenario,
> > > so let's ignore
> > > > the DMA TX completion transaction for now - and call it a possible
> > > future
> > > > optimization for specific use cases. So it seems that some form of
> > > completion call
> > > > is unavoidable.
> > >
> > > Agree to leave this aside, there is in theory a potential optimization,
> > > but
> > > unlikely to be of large value.
> > >
> >
> > One more thing: When using DMA to pass on packets into a guest, there could
> be a
> > delay from the DMA completes until the guest is signaled. Is there any CPU
> cache
> > hotness regarding the guest's access to the packet data to consider here? I.e. if
> we
> > wait signaling the guest, the packet data may get cold.
> 
> Interesting question; we can likely spawn a new thread around this topic!
> In short, it depends on how/where the DMA hardware writes the copy.
> 
> With technologies like DDIO, the "dest" part of the copy will be in LLC. The core
> reading the
> dest data will benefit from the LLC locality (instead of snooping it from a remote
> core's L1/L2).
> 
> Delays in notifying the guest could result in LLC capacity eviction, yes.
> The application layer decides how often/promptly to check for completions,
> and notify the guest of them. Calling the function more often will result in less
> delay in that portion of the pipeline.
> 
> Overall, there are caching benefits with DMA acceleration, and the application
> can control
> the latency introduced between dma-completion done in HW, and Guest vring
> update.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-04-05 11:29             ` Ilya Maximets
@ 2022-04-05 12:07               ` Bruce Richardson
  2022-04-08  6:29                 ` Pai G, Sunil
  0 siblings, 1 reply; 58+ messages in thread
From: Bruce Richardson @ 2022-04-05 12:07 UTC (permalink / raw)
  To: Ilya Maximets, Chengwen Feng, Radha Mohan Chintakuntla,
	Veerasenareddy Burru, Gagandeep Singh, Nipun Gupta
  Cc: Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian, Van Haaren,
	Harry, Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
> On 3/30/22 16:09, Bruce Richardson wrote:
> > On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
> >> On 3/30/22 13:12, Bruce Richardson wrote:
> >>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
> >>>> On 3/30/22 12:41, Ilya Maximets wrote:
> >>>>> Forking the thread to discuss a memory consistency/ordering model.
> >>>>>
> >>>>> AFAICT, dmadev can be anything from part of a CPU to a completely
> >>>>> separate PCI device.  However, I don't see any memory ordering being
> >>>>> enforced or even described in the dmadev API or documentation.
> >>>>> Please, point me to the correct documentation, if I somehow missed it.
> >>>>>
> >>>>> We have a DMA device (A) and a CPU core (B) writing respectively
> >>>>> the data and the descriptor info.  CPU core (C) is reading the
> >>>>> descriptor and the data it points too.
> >>>>>
> >>>>> A few things about that process:
> >>>>>
> >>>>> 1. There is no memory barrier between writes A and B (Did I miss
> >>>>>    them?).  Meaning that those operations can be seen by C in a
> >>>>>    different order regardless of barriers issued by C and regardless
> >>>>>    of the nature of devices A and B.
> >>>>>
> >>>>> 2. Even if there is a write barrier between A and B, there is
> >>>>>    no guarantee that C will see these writes in the same order
> >>>>>    as C doesn't use real memory barriers because vhost advertises
> >>>>
> >>>> s/advertises/does not advertise/
> >>>>
> >>>>>    VIRTIO_F_ORDER_PLATFORM.
> >>>>>
> >>>>> So, I'm getting to conclusion that there is a missing write barrier
> >>>>> on the vhost side and vhost itself must not advertise the
> >>>>
> >>>> s/must not/must/
> >>>>
> >>>> Sorry, I wrote things backwards. :)
> >>>>
> >>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual memory
> >>>>> barriers.
> >>>>>
> >>>>> Would like to hear some thoughts on that topic.  Is it a real issue?
> >>>>> Is it an issue considering all possible CPU architectures and DMA
> >>>>> HW variants?
> >>>>>
> >>>
> >>> In terms of ordering of operations using dmadev:
> >>>
> >>> * Some DMA HW will perform all operations strictly in order e.g. Intel
> >>>   IOAT, while other hardware may not guarantee order of operations/do
> >>>   things in parallel e.g. Intel DSA. Therefore the dmadev API provides the
> >>>   fence operation which allows the order to be enforced. The fence can be
> >>>   thought of as a full memory barrier, meaning no jobs after the barrier can
> >>>   be started until all those before it have completed. Obviously, for HW
> >>>   where order is always enforced, this will be a no-op, but for hardware that
> >>>   parallelizes, we want to reduce the fences to get best performance.
> >>>
> >>> * For synchronization between DMA devices and CPUs, where a CPU can only
> >>>   write after a DMA copy has been done, the CPU must wait for the dma
> >>>   completion to guarantee ordering. Once the completion has been returned
> >>>   the completed operation is globally visible to all cores.
> >>
> >> Thanks for explanation!  Some questions though:
> >>
> >> In our case one CPU waits for completion and another CPU is actually using
> >> the data.  IOW, "CPU must wait" is a bit ambiguous.  Which CPU must wait?
> >>
> >> Or should it be "Once the completion is visible on any core, the completed
> >> operation is globally visible to all cores." ?
> >>
> > 
> > The latter.
> > Once the change to memory/cache is visible to any core, it is visible to
> > all ones. This applies to regular CPU memory writes too - at least on IA,
> > and I expect on many other architectures - once the write is visible
> > outside the current core it is visible to every other core. Once the data
> > hits the l1 or l2 cache of any core, any subsequent requests for that data
> > from any other core will "snoop" the latest data from the cores cache, even
> > if it has not made its way down to a shared cache, e.g. l3 on most IA
> > systems.
> 
> It sounds like you're referring to the "multicopy atomicity" of the
> architecture.  However, that is not universally supported thing.
> AFAICT, POWER and older ARM systems doesn't support it, so writes
> performed by one core are not necessarily available to all other
> cores at the same time.  That means that if the CPU0 writes the data
> and the completion flag, CPU1 reads the completion flag and writes
> the ring, CPU2 may see the ring write, but may still not see the
> write of the data, even though there was a control dependency on CPU1.
> There should be a full memory barrier on CPU1 in order to fulfill
> the memory ordering requirements for CPU2, IIUC.
> 
> In our scenario the CPU0 is a DMA device, which may or may not be
> part of a CPU and may have different memory consistency/ordering
> requirements.  So, the question is: does DPDK DMA API guarantee
> multicopy atomicity between DMA device and all CPU cores regardless
> of CPU architecture and a nature of the DMA device?
> 

Right now, it doesn't, because this never came up in the discussion. In order to
be useful, it sounds like it should explicitly do so. At least for the
Intel ioat and idxd driver cases, this will be supported, so we just need
to ensure all other drivers currently upstreamed can offer this too. If
they cannot, we cannot offer it as a global guarantee, and we should see
about adding a capability flag to indicate when the guarantee is
there or not.
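
If such a capability flag were added, an application could gate its assumptions on it
along these lines (a rough sketch only; the flag below is a made-up name used purely for
illustration and does not exist in dmadev):

#include <stdbool.h>
#include <stdint.h>
#include <rte_dmadev.h>

/* Hypothetical capability bit -- NOT part of the dmadev API. */
#define APP_DMA_CAPA_GLOBAL_VISIBILITY (UINT64_C(1) << 33)

static bool
dma_completions_globally_visible(int16_t dev_id)
{
    struct rte_dma_info info;

    if (rte_dma_info_get(dev_id, &info) != 0)
        return false;
    return (info.dev_capa & APP_DMA_CAPA_GLOBAL_VISIBILITY) != 0;
}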

Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to document
for dmadev that once a DMA operation is completed, the op is guaranteed
visible to all cores/threads? If not, any thoughts on what guarantees we
can provide in this regard, or what capabilities should be exposed?

/Bruce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-30 14:09           ` Bruce Richardson
@ 2022-04-05 11:29             ` Ilya Maximets
  2022-04-05 12:07               ` Bruce Richardson
  0 siblings, 1 reply; 58+ messages in thread
From: Ilya Maximets @ 2022-04-05 11:29 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: i.maximets, Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian,
	Van Haaren, Harry, Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On 3/30/22 16:09, Bruce Richardson wrote:
> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
>> On 3/30/22 13:12, Bruce Richardson wrote:
>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
>>>> On 3/30/22 12:41, Ilya Maximets wrote:
>>>>> Forking the thread to discuss a memory consistency/ordering model.
>>>>>
>>>>> AFAICT, dmadev can be anything from part of a CPU to a completely
>>>>> separate PCI device.  However, I don't see any memory ordering being
>>>>> enforced or even described in the dmadev API or documentation.
>>>>> Please, point me to the correct documentation, if I somehow missed it.
>>>>>
>>>>> We have a DMA device (A) and a CPU core (B) writing respectively
>>>>> the data and the descriptor info.  CPU core (C) is reading the
>>>>> descriptor and the data it points too.
>>>>>
>>>>> A few things about that process:
>>>>>
>>>>> 1. There is no memory barrier between writes A and B (Did I miss
>>>>>    them?).  Meaning that those operations can be seen by C in a
>>>>>    different order regardless of barriers issued by C and regardless
>>>>>    of the nature of devices A and B.
>>>>>
>>>>> 2. Even if there is a write barrier between A and B, there is
>>>>>    no guarantee that C will see these writes in the same order
>>>>>    as C doesn't use real memory barriers because vhost advertises
>>>>
>>>> s/advertises/does not advertise/
>>>>
>>>>>    VIRTIO_F_ORDER_PLATFORM.
>>>>>
>>>>> So, I'm getting to conclusion that there is a missing write barrier
>>>>> on the vhost side and vhost itself must not advertise the
>>>>
>>>> s/must not/must/
>>>>
>>>> Sorry, I wrote things backwards. :)
>>>>
>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual memory
>>>>> barriers.
>>>>>
>>>>> Would like to hear some thoughts on that topic.  Is it a real issue?
>>>>> Is it an issue considering all possible CPU architectures and DMA
>>>>> HW variants?
>>>>>
>>>
>>> In terms of ordering of operations using dmadev:
>>>
>>> * Some DMA HW will perform all operations strictly in order e.g. Intel
>>>   IOAT, while other hardware may not guarantee order of operations/do
>>>   things in parallel e.g. Intel DSA. Therefore the dmadev API provides the
>>>   fence operation which allows the order to be enforced. The fence can be
>>>   thought of as a full memory barrier, meaning no jobs after the barrier can
>>>   be started until all those before it have completed. Obviously, for HW
>>>   where order is always enforced, this will be a no-op, but for hardware that
>>>   parallelizes, we want to reduce the fences to get best performance.
>>>
>>> * For synchronization between DMA devices and CPUs, where a CPU can only
>>>   write after a DMA copy has been done, the CPU must wait for the dma
>>>   completion to guarantee ordering. Once the completion has been returned
>>>   the completed operation is globally visible to all cores.
>>
>> Thanks for explanation!  Some questions though:
>>
>> In our case one CPU waits for completion and another CPU is actually using
>> the data.  IOW, "CPU must wait" is a bit ambiguous.  Which CPU must wait?
>>
>> Or should it be "Once the completion is visible on any core, the completed
>> operation is globally visible to all cores." ?
>>
> 
> The latter.
> Once the change to memory/cache is visible to any core, it is visible to
> all ones. This applies to regular CPU memory writes too - at least on IA,
> and I expect on many other architectures - once the write is visible
> outside the current core it is visible to every other core. Once the data
> hits the l1 or l2 cache of any core, any subsequent requests for that data
> from any other core will "snoop" the latest data from the cores cache, even
> if it has not made its way down to a shared cache, e.g. l3 on most IA
> systems.

It sounds like you're referring to the "multicopy atomicity" of the
architecture.  However, that is not a universally supported thing.
AFAICT, POWER and older ARM systems don't support it, so writes
performed by one core are not necessarily available to all other
cores at the same time.  That means that if CPU0 writes the data
and the completion flag, CPU1 reads the completion flag and writes
the ring, CPU2 may see the ring write, but may still not see the
write of the data, even though there was a control dependency on CPU1.
There should be a full memory barrier on CPU1 in order to fulfill
the memory ordering requirements for CPU2, IIUC.

In our scenario CPU0 is a DMA device, which may or may not be
part of a CPU and may have different memory consistency/ordering
requirements.  So, the question is: does the DPDK DMA API guarantee
multicopy atomicity between the DMA device and all CPU cores regardless
of CPU architecture and the nature of the DMA device?
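
To put the CPU1 side of that in code, a stripped-down sketch of what the vhost thread
would have to do (whether a full rte_smp_mb() is required, or a lighter barrier suffices
per architecture, is exactly the open question; the vq helpers are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <rte_atomic.h>
#include <rte_dmadev.h>

struct vq;                                                    /* illustrative */
extern void vq_used_idx_advance(struct vq *vq, uint16_t n);   /* illustrative */

/* Sketch of the CPU1 (vhost thread) role in the scenario above. */
static void
vhost_thread_complete(int16_t dma_id, uint16_t vchan, struct vq *vq)
{
    bool error = false;
    uint16_t n_done;

    /* 1. Observe that the DMA device (CPU0 in the example) has finished. */
    n_done = rte_dma_completed(dma_id, vchan, 64, NULL, &error);
    if (n_done == 0 || error)
        return;

    /* 2. The barrier being argued for: order the DMA-written packet data
     *    before the ring update that CPU2 (the guest) will observe. */
    rte_smp_mb();

    /* 3. Make the descriptors available to the guest. */
    vq_used_idx_advance(vq, n_done);
}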

> 
>> And the main question:
>>   Are these synchronization claims documented somewhere?
>>
> 
> Not explicitly, no. However, the way DMA devices works in the regard if
> global observability is absolutely no different from how crypto,
> compression, or any other hardware devices work. Doing a memory copy using
> a DMA device is exactly the same as doing a no-op crypto, or compression
> job with the output going to a separate output buffer. In all cases, a job
> cannot be considered completed until you get a hardware completion
> notification for it, and one you get that notification, it is globally
> observable by all entities.
> 
> The only different for dmadev APIs is that we do have the capability to
> specify that jobs must be done in a specific order, using a fence flag,
> which is documented in the API documentation.
> 
> /Bruce


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-30  9:25                     ` Maxime Coquelin
  2022-03-30 10:20                       ` Bruce Richardson
@ 2022-03-30 14:27                       ` Hu, Jiayu
  1 sibling, 0 replies; 58+ messages in thread
From: Hu, Jiayu @ 2022-03-30 14:27 UTC (permalink / raw)
  To: Maxime Coquelin, Ilya Maximets, Morten Brørup, Richardson, Bruce
  Cc: Van Haaren, Harry, Pai G, Sunil, Stokes, Ian, Ferriter, Cian,
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Wednesday, March 30, 2022 5:25 PM
> To: Hu, Jiayu <jiayu.hu@intel.com>; Ilya Maximets <i.maximets@ovn.org>;
> Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; Pai G, Sunil
> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Ferriter, Cian
> <cian.ferriter@intel.com>; ovs-dev@openvswitch.org; dev@dpdk.org;
> Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll, Tim
> <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> 
> 
> 
> On 3/30/22 04:02, Hu, Jiayu wrote:
> >
> >
> >> -----Original Message-----
> >> From: Ilya Maximets <i.maximets@ovn.org>
> >> Sent: Wednesday, March 30, 2022 1:45 AM
> >> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
> >> <bruce.richardson@intel.com>
> >> Cc: i.maximets@ovn.org; Maxime Coquelin
> <maxime.coquelin@redhat.com>;
> >> Van Haaren, Harry <harry.van.haaren@intel.com>; Pai G, Sunil
> >> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu,
> >> Jiayu <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>;
> >> ovs- dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> >> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> >> Finn, Emma <emma.finn@intel.com>
> >> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>
> >> On 3/29/22 19:13, Morten Brørup wrote:
> >>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>> Sent: Tuesday, 29 March 2022 19.03
> >>>>
> >>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> >>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>>>> Sent: Tuesday, 29 March 2022 18.24
> >>>>>>
> >>>>>> Hi Morten,
> >>>>>>
> >>>>>> On 3/29/22 16:44, Morten Brørup wrote:
> >>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> >>>>>>>> Sent: Tuesday, 29 March 2022 15.02
> >>>>>>>>
> >>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
> >>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
> >>>>>>>>>
> >>>>>>>>> Having thought more about it, I think that a completely
> >>>> different
> >>>>>> architectural approach is required:
> >>>>>>>>>
> >>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and
> >>>>>>>>> TX
> >>>>>> packet burst functions, each optimized for different CPU vector
> >>>>>> instruction sets. The availability of a DMA engine should be
> >>>> treated
> >>>>>> the same way. So I suggest that PMDs copying packet contents, e.g.
> >>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> >>>> packet
> >>>>>> burst functions.
> >>>>>>>>>
> >>>>>>>>> Similarly for the DPDK vhost library.
> >>>>>>>>>
> >>>>>>>>> In such an architecture, it would be the application's job to
> >>>>>> allocate DMA channels and assign them to the specific PMDs that
> >>>> should
> >>>>>> use them. But the actual use of the DMA channels would move down
> >>>> below
> >>>>>> the application and into the DPDK PMDs and libraries.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Med venlig hilsen / Kind regards, -Morten Brørup
> >>>>>>>>
> >>>>>>>> Hi Morten,
> >>>>>>>>
> >>>>>>>> That's *exactly* how this architecture is designed &
> >>>> implemented.
> >>>>>>>> 1.	The DMA configuration and initialization is up to the
> >>>> application
> >>>>>> (OVS).
> >>>>>>>> 2.	The VHost library is passed the DMA-dev ID, and its new
> >>>> async
> >>>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
> >>>>>>>>
> >>>>>>>> Looking forward to talking on the call that just started.
> >>>> Regards, -
> >>>>>> Harry
> >>>>>>>>
> >>>>>>>
> >>>>>>> OK, thanks - as I said on the call, I haven't looked at the
> >>>> patches.
> >>>>>>>
> >>>>>>> Then, I suppose that the TX completions can be handled in the TX
> >>>>>> function, and the RX completions can be handled in the RX
> >>>>>> function, just like the Ethdev PMDs handle packet descriptors:
> >>>>>>>
> >>>>>>> TX_Burst(tx_packet_array):
> >>>>>>> 1.	Clean up descriptors processed by the NIC chip. --> Process
> >>>> TX
> >>>>>> DMA channel completions. (Effectively, the 2nd pipeline stage.)
> >>>>>>> 2.	Pass on the tx_packet_array to the NIC chip descriptors. --
> >>>>> Pass
> >>>>>> on the tx_packet_array to the TX DMA channel. (Effectively, the
> >>>>>> 1st pipeline stage.)
> >>>>>>
> >>>>>> The problem is Tx function might not be called again, so enqueued
> >>>>>> packets in 2. may never be completed from a Virtio point of view.
> >>>> IOW,
> >>>>>> the packets will be copied to the Virtio descriptors buffers, but
> >>>> the
> >>>>>> descriptors will not be made available to the Virtio driver.
> >>>>>
> >>>>> In that case, the application needs to call TX_Burst()
> >>>>> periodically
> >>>> with an empty array, for completion purposes.
> >>>>>
> >>>>> Or some sort of TX_Keepalive() function can be added to the DPDK
> >>>> library, to handle DMA completion. It might even handle multiple
> >>>> DMA channels, if convenient - and if possible without locking or
> >>>> other weird complexity.
> >>>>>
> >>>>> Here is another idea, inspired by a presentation at one of the
> >>>>> DPDK
> >>>> Userspace conferences. It may be wishful thinking, though:
> >>>>>
> >>>>> Add an additional transaction to each DMA burst; a special
> >>>> transaction containing the memory write operation that makes the
> >>>> descriptors available to the Virtio driver.
> >>
> >> I was talking with Maxime after the call today about the same idea.
> >> And it looks fairly doable, I would say.
> >
> > If the idea is making DMA update used ring's index (2B) and packed
> > ring descriptor's flag (2B), yes, it will work functionally. But
> > considering the offloading cost of DMA, it would hurt performance. In
> > addition, the latency of small copy of DMA is much higher than that of CPU.
> So it will also increase latency.
> 
> I agree writing back descriptors using DMA can be sub-optimal, especially for
> packed ring where the head desc flags have to be written last.
> 
> Are you sure about latency? With current solution, the descriptors write-
> backs can happen quite some time after the DMA transfers are done, isn't it?

Yes, the guest can get the notification immediately after the copy is completed by the DMA
engine. But if we compare the latency of the CPU doing a 2B copy with the DMA engine doing
a 2B copy, the latter is much higher.
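
For concreteness, the extra write-back transaction being debated would look roughly like
the following with the dmadev API (a sketch; addresses and sizes are illustrative, and as
noted above it only helps a guest that is polling):

#include <stdint.h>
#include <rte_dmadev.h>

/* Enqueue the payload copy, then a fenced 2-byte copy that updates the
 * used ring index / descriptor flag, so the guest sees the descriptors
 * without a CPU write-back. */
static int
enqueue_with_flag_writeback(int16_t dma_id, uint16_t vchan,
                            rte_iova_t src, rte_iova_t dst, uint32_t len,
                            rte_iova_t flag_src, rte_iova_t flag_dst)
{
    int ret;

    /* Payload copy (in practice, many of these per burst). */
    ret = rte_dma_copy(dma_id, vchan, src, dst, len, 0);
    if (ret < 0)
        return ret;

    /* Fenced: must not start before the preceding copies have completed. */
    ret = rte_dma_copy(dma_id, vchan, flag_src, flag_dst, 2,
                       RTE_DMA_OP_FLAG_FENCE | RTE_DMA_OP_FLAG_SUBMIT);
    return ret < 0 ? ret : 0;
}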

> 
> >>
> >>>>>
> >>>>
> >>>> That is something that can work, so long as the receiver is
> >>>> operating in polling mode. For cases where virtio interrupts are
> >>>> enabled, you still need to do a write to the eventfd in the kernel
> >>>> in vhost to signal the virtio side. That's not something that can
> >>>> be offloaded to a DMA engine, sadly, so we still need some form of
> completion call.
> >>>
> >>> I guess that virtio interrupts is the most widely deployed scenario,
> >>> so let's ignore the DMA TX completion transaction for now - and call
> >>> it a possible future optimization for specific use cases. So it
> >>> seems that some form of completion call is unavoidable.
> >>>
> >>
> >> We could separate the actual kick of the guest with the data transfer.
> >> If interrupts are enabled, this means that the guest is not actively polling,
> i.e.
> >> we can allow some extra latency by performing the actual kick from
> >> the rx context, or, as Maxime said, if DMA engine can generate
> >> interrupts when the DMA queue is empty, vhost thread may listen to
> >> them and kick the guest if needed.  This will additionally remove the
> >> extra system call from the fast path.
> >
> > Separating kick with data transfer is a very good idea. But it
> > requires a dedicated control plane thread to kick guest after DMA
> > interrupt. Anyway, we can try this optimization in the future.
> 
> Yes it requires a dedicated thread, but I don't think this is really an issue.
> Interrupt mode can be considered as slow-path.

Agree. It's worth a try.

Thanks,
Jiayu
> 
> >
> > Thanks,
> > Jiayu


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-30 11:41         ` Ilya Maximets
@ 2022-03-30 14:09           ` Bruce Richardson
  2022-04-05 11:29             ` Ilya Maximets
  0 siblings, 1 reply; 58+ messages in thread
From: Bruce Richardson @ 2022-03-30 14:09 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian, Van Haaren,
	Harry, Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
> On 3/30/22 13:12, Bruce Richardson wrote:
> > On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
> >> On 3/30/22 12:41, Ilya Maximets wrote:
> >>> Forking the thread to discuss a memory consistency/ordering model.
> >>>
> >>> AFAICT, dmadev can be anything from part of a CPU to a completely
> >>> separate PCI device.  However, I don't see any memory ordering being
> >>> enforced or even described in the dmadev API or documentation.
> >>> Please, point me to the correct documentation, if I somehow missed it.
> >>>
> >>> We have a DMA device (A) and a CPU core (B) writing respectively
> >>> the data and the descriptor info.  CPU core (C) is reading the
> >>> descriptor and the data it points too.
> >>>
> >>> A few things about that process:
> >>>
> >>> 1. There is no memory barrier between writes A and B (Did I miss
> >>>    them?).  Meaning that those operations can be seen by C in a
> >>>    different order regardless of barriers issued by C and regardless
> >>>    of the nature of devices A and B.
> >>>
> >>> 2. Even if there is a write barrier between A and B, there is
> >>>    no guarantee that C will see these writes in the same order
> >>>    as C doesn't use real memory barriers because vhost advertises
> >>
> >> s/advertises/does not advertise/
> >>
> >>>    VIRTIO_F_ORDER_PLATFORM.
> >>>
> >>> So, I'm getting to conclusion that there is a missing write barrier
> >>> on the vhost side and vhost itself must not advertise the
> >>
> >> s/must not/must/
> >>
> >> Sorry, I wrote things backwards. :)
> >>
> >>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual memory
> >>> barriers.
> >>>
> >>> Would like to hear some thoughts on that topic.  Is it a real issue?
> >>> Is it an issue considering all possible CPU architectures and DMA
> >>> HW variants?
> >>>
> > 
> > In terms of ordering of operations using dmadev:
> > 
> > * Some DMA HW will perform all operations strictly in order e.g. Intel
> >   IOAT, while other hardware may not guarantee order of operations/do
> >   things in parallel e.g. Intel DSA. Therefore the dmadev API provides the
> >   fence operation which allows the order to be enforced. The fence can be
> >   thought of as a full memory barrier, meaning no jobs after the barrier can
> >   be started until all those before it have completed. Obviously, for HW
> >   where order is always enforced, this will be a no-op, but for hardware that
> >   parallelizes, we want to reduce the fences to get best performance.
> > 
> > * For synchronization between DMA devices and CPUs, where a CPU can only
> >   write after a DMA copy has been done, the CPU must wait for the dma
> >   completion to guarantee ordering. Once the completion has been returned
> >   the completed operation is globally visible to all cores.
> 
> Thanks for explanation!  Some questions though:
> 
> In our case one CPU waits for completion and another CPU is actually using
> the data.  IOW, "CPU must wait" is a bit ambiguous.  Which CPU must wait?
> 
> Or should it be "Once the completion is visible on any core, the completed
> operation is globally visible to all cores." ?
> 

The latter.
Once the change to memory/cache is visible to any core, it is visible to
all of them. This applies to regular CPU memory writes too - at least on IA,
and I expect on many other architectures - once the write is visible
outside the current core it is visible to every other core. Once the data
hits the L1 or L2 cache of any core, any subsequent requests for that data
from any other core will "snoop" the latest data from that core's cache, even
if it has not made its way down to a shared cache, e.g. L3 on most IA
systems.

> And the main question:
>   Are these synchronization claims documented somewhere?
> 

Not explicitly, no. However, the way DMA devices work with regard to
global observability is absolutely no different from how crypto,
compression, or any other hardware devices work. Doing a memory copy using
a DMA device is exactly the same as doing a no-op crypto or compression
job with the output going to a separate output buffer. In all cases, a job
cannot be considered completed until you get a hardware completion
notification for it, and once you get that notification, it is globally
observable by all entities.

The only difference for the dmadev APIs is that we do have the capability to
specify that jobs must be done in a specific order, using a fence flag,
which is documented in the API documentation.

/Bruce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-30 11:12       ` Bruce Richardson
@ 2022-03-30 11:41         ` Ilya Maximets
  2022-03-30 14:09           ` Bruce Richardson
  0 siblings, 1 reply; 58+ messages in thread
From: Ilya Maximets @ 2022-03-30 11:41 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: i.maximets, Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian,
	Van Haaren, Harry, Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On 3/30/22 13:12, Bruce Richardson wrote:
> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
>> On 3/30/22 12:41, Ilya Maximets wrote:
>>> Forking the thread to discuss a memory consistency/ordering model.
>>>
>>> AFAICT, dmadev can be anything from part of a CPU to a completely
>>> separate PCI device.  However, I don't see any memory ordering being
>>> enforced or even described in the dmadev API or documentation.
>>> Please, point me to the correct documentation, if I somehow missed it.
>>>
>>> We have a DMA device (A) and a CPU core (B) writing respectively
>>> the data and the descriptor info.  CPU core (C) is reading the
>>> descriptor and the data it points too.
>>>
>>> A few things about that process:
>>>
>>> 1. There is no memory barrier between writes A and B (Did I miss
>>>    them?).  Meaning that those operations can be seen by C in a
>>>    different order regardless of barriers issued by C and regardless
>>>    of the nature of devices A and B.
>>>
>>> 2. Even if there is a write barrier between A and B, there is
>>>    no guarantee that C will see these writes in the same order
>>>    as C doesn't use real memory barriers because vhost advertises
>>
>> s/advertises/does not advertise/
>>
>>>    VIRTIO_F_ORDER_PLATFORM.
>>>
>>> So, I'm getting to conclusion that there is a missing write barrier
>>> on the vhost side and vhost itself must not advertise the
>>
>> s/must not/must/
>>
>> Sorry, I wrote things backwards. :)
>>
>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual memory
>>> barriers.
>>>
>>> Would like to hear some thoughts on that topic.  Is it a real issue?
>>> Is it an issue considering all possible CPU architectures and DMA
>>> HW variants?
>>>
> 
> In terms of ordering of operations using dmadev:
> 
> * Some DMA HW will perform all operations strictly in order e.g. Intel
>   IOAT, while other hardware may not guarantee order of operations/do
>   things in parallel e.g. Intel DSA. Therefore the dmadev API provides the
>   fence operation which allows the order to be enforced. The fence can be
>   thought of as a full memory barrier, meaning no jobs after the barrier can
>   be started until all those before it have completed. Obviously, for HW
>   where order is always enforced, this will be a no-op, but for hardware that
>   parallelizes, we want to reduce the fences to get best performance.
> 
> * For synchronization between DMA devices and CPUs, where a CPU can only
>   write after a DMA copy has been done, the CPU must wait for the dma
>   completion to guarantee ordering. Once the completion has been returned
>   the completed operation is globally visible to all cores.

Thanks for the explanation!  Some questions though:

In our case one CPU waits for completion and another CPU is actually using
the data.  IOW, "CPU must wait" is a bit ambiguous.  Which CPU must wait?

Or should it be "Once the completion is visible on any core, the completed
operation is globally visible to all cores." ?

And the main question:
  Are these synchronization claims documented somewhere?

Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-30 10:52     ` Ilya Maximets
@ 2022-03-30 11:12       ` Bruce Richardson
  2022-03-30 11:41         ` Ilya Maximets
  0 siblings, 1 reply; 58+ messages in thread
From: Bruce Richardson @ 2022-03-30 11:12 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian, Van Haaren,
	Harry, Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
> On 3/30/22 12:41, Ilya Maximets wrote:
> > Forking the thread to discuss a memory consistency/ordering model.
> > 
> > AFAICT, dmadev can be anything from part of a CPU to a completely
> > separate PCI device.  However, I don't see any memory ordering being
> > enforced or even described in the dmadev API or documentation.
> > Please, point me to the correct documentation, if I somehow missed it.
> > 
> > We have a DMA device (A) and a CPU core (B) writing respectively
> > the data and the descriptor info.  CPU core (C) is reading the
> > descriptor and the data it points too.
> > 
> > A few things about that process:
> > 
> > 1. There is no memory barrier between writes A and B (Did I miss
> >    them?).  Meaning that those operations can be seen by C in a
> >    different order regardless of barriers issued by C and regardless
> >    of the nature of devices A and B.
> > 
> > 2. Even if there is a write barrier between A and B, there is
> >    no guarantee that C will see these writes in the same order
> >    as C doesn't use real memory barriers because vhost advertises
> 
> s/advertises/does not advertise/
> 
> >    VIRTIO_F_ORDER_PLATFORM.
> > 
> > So, I'm getting to conclusion that there is a missing write barrier
> > on the vhost side and vhost itself must not advertise the
> 
> s/must not/must/
> 
> Sorry, I wrote things backwards. :)
> 
> > VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual memory
> > barriers.
> > 
> > Would like to hear some thoughts on that topic.  Is it a real issue?
> > Is it an issue considering all possible CPU architectures and DMA
> > HW variants?
> > 

In terms of ordering of operations using dmadev:

* Some DMA HW will perform all operations strictly in order e.g. Intel
  IOAT, while other hardware may not guarantee order of operations/do
  things in parallel e.g. Intel DSA. Therefore the dmadev API provides the
  fence operation which allows the order to be enforced. The fence can be
  thought of as a full memory barrier, meaning no jobs after the barrier can
  be started until all those before it have completed. Obviously, for HW
  where order is always enforced, this will be a no-op, but for hardware that
  parallelizes, we want to reduce the fences to get best performance.

* For synchronization between DMA devices and CPUs, where a CPU can only
  write after a DMA copy has been done, the CPU must wait for the dma
  completion to guarantee ordering. Once the completion has been returned
  the completed operation is globally visible to all cores.

Hope this is clear.
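
In code, those two points amount to something like this (minimal sketch, error handling
trimmed):

#include <stdbool.h>
#include <stdint.h>
#include <rte_dmadev.h>

static void
copy_then_wait(int16_t dev_id, uint16_t vchan,
               rte_iova_t src, rte_iova_t dst, uint32_t len)
{
    bool error = false;
    uint16_t n_done;

    /* Enqueue and submit; a later op that must not start before this one
     * completes on parallelizing HW (e.g. DSA) would pass
     * RTE_DMA_OP_FLAG_FENCE. */
    rte_dma_copy(dev_id, vchan, src, dst, len, RTE_DMA_OP_FLAG_SUBMIT);

    /* The CPU that needs to act on the copied data waits for completion;
     * once reported, the copy is visible to the other cores. */
    do {
        n_done = rte_dma_completed(dev_id, vchan, 1, NULL, &error);
    } while (n_done == 0 && !error);
}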

/Bruce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-30 10:41   ` Ilya Maximets
@ 2022-03-30 10:52     ` Ilya Maximets
  2022-03-30 11:12       ` Bruce Richardson
  0 siblings, 1 reply; 58+ messages in thread
From: Ilya Maximets @ 2022-03-30 10:52 UTC (permalink / raw)
  To: Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian, Van Haaren,
	Harry, Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev
  Cc: i.maximets, Mcnamara, John, O'Driscoll, Tim, Finn, Emma,
	Richardson, Bruce

On 3/30/22 12:41, Ilya Maximets wrote:
> Forking the thread to discuss a memory consistency/ordering model.
> 
> AFAICT, dmadev can be anything from part of a CPU to a completely
> separate PCI device.  However, I don't see any memory ordering being
> enforced or even described in the dmadev API or documentation.
> Please, point me to the correct documentation, if I somehow missed it.
> 
> We have a DMA device (A) and a CPU core (B) writing respectively
> the data and the descriptor info.  CPU core (C) is reading the
> descriptor and the data it points too.
> 
> A few things about that process:
> 
> 1. There is no memory barrier between writes A and B (Did I miss
>    them?).  Meaning that those operations can be seen by C in a
>    different order regardless of barriers issued by C and regardless
>    of the nature of devices A and B.
> 
> 2. Even if there is a write barrier between A and B, there is
>    no guarantee that C will see these writes in the same order
>    as C doesn't use real memory barriers because vhost advertises

s/advertises/does not advertise/

>    VIRTIO_F_ORDER_PLATFORM.
> 
> So, I'm getting to conclusion that there is a missing write barrier
> on the vhost side and vhost itself must not advertise the

s/must not/must/

Sorry, I wrote things backwards. :)

> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual memory
> barriers.
> 
> Would like to hear some thoughts on that topic.  Is it a real issue?
> Is it an issue considering all possible CPU architectures and DMA
> HW variants?
> 
> Best regards, Ilya Maximets.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-28 18:19 ` Pai G, Sunil
  2022-03-29 12:51   ` Morten Brørup
@ 2022-03-30 10:41   ` Ilya Maximets
  2022-03-30 10:52     ` Ilya Maximets
  1 sibling, 1 reply; 58+ messages in thread
From: Ilya Maximets @ 2022-03-30 10:41 UTC (permalink / raw)
  To: Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian, Van Haaren,
	Harry, Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev
  Cc: i.maximets, Mcnamara, John, O'Driscoll, Tim, Finn, Emma,
	Richardson, Bruce

Forking the thread to discuss a memory consistency/ordering model.

AFAICT, dmadev can be anything from part of a CPU to a completely
separate PCI device.  However, I don't see any memory ordering being
enforced or even described in the dmadev API or documentation.
Please, point me to the correct documentation, if I somehow missed it.

We have a DMA device (A) and a CPU core (B) writing respectively
the data and the descriptor info.  CPU core (C) is reading the
descriptor and the data it points to.

A few things about that process:

1. There is no memory barrier between writes A and B (Did I miss
   them?).  Meaning that those operations can be seen by C in a
   different order regardless of barriers issued by C and regardless
   of the nature of devices A and B.

2. Even if there is a write barrier between A and B, there is
   no guarantee that C will see these writes in the same order
   as C doesn't use real memory barriers because vhost advertises
   VIRTIO_F_ORDER_PLATFORM.

So, I'm getting to conclusion that there is a missing write barrier
on the vhost side and vhost itself must not advertise the
VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual memory
barriers.

Would like to hear some thoughts on that topic.  Is it a real issue?
Is it an issue considering all possible CPU architectures and DMA
HW variants?
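
To make the concern concrete, a rough sketch (not the actual vhost code) of
the publication step in question, with a write barrier at the point that
appears to be missing; whether rte_wmb() is sufficient across all CPU and
DMA HW combinations is exactly the open question above:

#include <stdint.h>
#include <rte_atomic.h>   /* rte_wmb() */

/* Sketch only: (A) the DMA device has written the packet data, (B) the
 * vhost thread now publishes the descriptor/used index, (C) another core
 * (the guest) reads both.
 */
static void
publish_after_dma(volatile uint16_t *used_idx, uint16_t new_idx)
{
        /* ...the vhost thread has already observed the DMA completion... */

        /* Barrier in question: order the DMA'd data before the index
         * update (B) from the point of view of the reading core (C).
         */
        rte_wmb();

        *used_idx = new_idx;   /* (B): make the buffers visible to the guest */
}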

Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-30  9:25                     ` Maxime Coquelin
@ 2022-03-30 10:20                       ` Bruce Richardson
  2022-03-30 14:27                       ` Hu, Jiayu
  1 sibling, 0 replies; 58+ messages in thread
From: Bruce Richardson @ 2022-03-30 10:20 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: Hu, Jiayu, Ilya Maximets, Morten Brørup, Van Haaren, Harry,
	Pai G, Sunil, Stokes, Ian, Ferriter, Cian, ovs-dev, dev,
	Mcnamara, John, O'Driscoll, Tim, Finn, Emma

On Wed, Mar 30, 2022 at 11:25:05AM +0200, Maxime Coquelin wrote:
> 
> 
> On 3/30/22 04:02, Hu, Jiayu wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Ilya Maximets <i.maximets@ovn.org>
> > > Sent: Wednesday, March 30, 2022 1:45 AM
> > > To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
> > > <bruce.richardson@intel.com>
> > > Cc: i.maximets@ovn.org; Maxime Coquelin <maxime.coquelin@redhat.com>;
> > > Van Haaren, Harry <harry.van.haaren@intel.com>; Pai G, Sunil
> > > <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
> > > <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
> > > dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> > > <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> > > Finn, Emma <emma.finn@intel.com>
> > > Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > > 
> > > On 3/29/22 19:13, Morten Brørup wrote:
> > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > Sent: Tuesday, 29 March 2022 19.03
> > > > > 
> > > > > On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > > > > > > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > > > > > > Sent: Tuesday, 29 March 2022 18.24
> > > > > > > 
> > > > > > > Hi Morten,
> > > > > > > 
> > > > > > > On 3/29/22 16:44, Morten Brørup wrote:
> > > > > > > > > From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > > > > > > > > Sent: Tuesday, 29 March 2022 15.02
> > > > > > > > > 
> > > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > > Sent: Tuesday, March 29, 2022 1:51 PM
> > > > > > > > > > 
> > > > > > > > > > Having thought more about it, I think that a completely
> > > > > different
> > > > > > > architectural approach is required:
> > > > > > > > > > 
> > > > > > > > > > Many of the DPDK Ethernet PMDs implement a variety of RX and TX
> > > > > > > packet burst functions, each optimized for different CPU vector
> > > > > > > instruction sets. The availability of a DMA engine should be
> > > > > treated
> > > > > > > the same way. So I suggest that PMDs copying packet contents, e.g.
> > > > > > > memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> > > > > packet
> > > > > > > burst functions.
> > > > > > > > > > 
> > > > > > > > > > Similarly for the DPDK vhost library.
> > > > > > > > > > 
> > > > > > > > > > In such an architecture, it would be the application's job to
> > > > > > > allocate DMA channels and assign them to the specific PMDs that
> > > > > should
> > > > > > > use them. But the actual use of the DMA channels would move down
> > > > > below
> > > > > > > the application and into the DPDK PMDs and libraries.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Med venlig hilsen / Kind regards, -Morten Brørup
> > > > > > > > > 
> > > > > > > > > Hi Morten,
> > > > > > > > > 
> > > > > > > > > That's *exactly* how this architecture is designed &
> > > > > implemented.
> > > > > > > > > 1.	The DMA configuration and initialization is up to the
> > > > > application
> > > > > > > (OVS).
> > > > > > > > > 2.	The VHost library is passed the DMA-dev ID, and its new
> > > > > async
> > > > > > > rx/tx APIs, and uses the DMA device to accelerate the copy.
> > > > > > > > > 
> > > > > > > > > Looking forward to talking on the call that just started.
> > > > > Regards, -
> > > > > > > Harry
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > OK, thanks - as I said on the call, I haven't looked at the
> > > > > patches.
> > > > > > > > 
> > > > > > > > Then, I suppose that the TX completions can be handled in the TX
> > > > > > > function, and the RX completions can be handled in the RX function,
> > > > > > > just like the Ethdev PMDs handle packet descriptors:
> > > > > > > > 
> > > > > > > > TX_Burst(tx_packet_array):
> > > > > > > > 1.	Clean up descriptors processed by the NIC chip. --> Process
> > > > > TX
> > > > > > > DMA channel completions. (Effectively, the 2nd pipeline stage.)
> > > > > > > > 2.	Pass on the tx_packet_array to the NIC chip descriptors. --
> > > > > > Pass
> > > > > > > on the tx_packet_array to the TX DMA channel. (Effectively, the 1st
> > > > > > > pipeline stage.)
> > > > > > > 
> > > > > > > The problem is Tx function might not be called again, so enqueued
> > > > > > > packets in 2. may never be completed from a Virtio point of view.
> > > > > IOW,
> > > > > > > the packets will be copied to the Virtio descriptors buffers, but
> > > > > the
> > > > > > > descriptors will not be made available to the Virtio driver.
> > > > > > 
> > > > > > In that case, the application needs to call TX_Burst() periodically
> > > > > with an empty array, for completion purposes.
> > > > > > 
> > > > > > Or some sort of TX_Keepalive() function can be added to the DPDK
> > > > > library, to handle DMA completion. It might even handle multiple DMA
> > > > > channels, if convenient - and if possible without locking or other
> > > > > weird complexity.
> > > > > > 
> > > > > > Here is another idea, inspired by a presentation at one of the DPDK
> > > > > Userspace conferences. It may be wishful thinking, though:
> > > > > > 
> > > > > > Add an additional transaction to each DMA burst; a special
> > > > > transaction containing the memory write operation that makes the
> > > > > descriptors available to the Virtio driver.
> > > 
> > > I was talking with Maxime after the call today about the same idea.
> > > And it looks fairly doable, I would say.
> > 
> > If the idea is to have the DMA update the used ring's index (2B) and the packed ring
> > descriptor's flag (2B), yes, it will work functionally. But considering the offloading
> > cost of DMA, it would hurt performance. In addition, the latency of a small DMA copy
> > is much higher than that of a CPU copy, so it will also increase latency.
> 
> I agree writing back descriptors using DMA can be sub-optimal,
> especially for packed ring where the head desc flags have to be written
> last.
> 

I think we'll have to try it out to check how it works. If we are already
doing hardware offload, adding one additional job to the DMA list may be a
very minor overhead.

[Incidentally, for something like a head-pointer update, using a fill
operation rather than a copy may be a good choice, as it avoids the need for
a memory read transaction from the DMA engine, since the data to be written
is already carried in the submitted descriptor.]
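
A rough sketch of that fill idea using the dmadev fill API; dev_id, vchan
and used_idx_iova are placeholders, and the minimum fill length/alignment
is HW- and driver-dependent:

#include <stdint.h>
#include <rte_dmadev.h>

/* Sketch only: enqueue the new index value as a DMA fill rather than a
 * copy, so the engine need not read the value from memory first. Fenced so
 * it is not written before the preceding payload copies have completed.
 */
static int
enqueue_index_update(int16_t dev_id, uint16_t vchan,
                     rte_iova_t used_idx_iova, uint16_t new_idx)
{
        return rte_dma_fill(dev_id, vchan, (uint64_t)new_idx, used_idx_iova,
                            sizeof(uint16_t),
                            RTE_DMA_OP_FLAG_FENCE | RTE_DMA_OP_FLAG_SUBMIT);
}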

> Are you sure about latency? With the current solution, the descriptor
> write-backs can happen quite some time after the DMA transfers are done,
> can't they?
> 

For a polling receiver, having the DMA engine automatically do any
head-pointer updates after the copies are done should indeed lead to the
lowest latency. For DMA engines that perform operations in parallel (such
as Intel DSA), we just need to ensure proper fencing of operations.

> > > 
> > > > > > 
> > > > > 
> > > > > That is something that can work, so long as the receiver is operating
> > > > > in polling mode. For cases where virtio interrupts are enabled, you
> > > > > still need to do a write to the eventfd in the kernel in vhost to
> > > > > signal the virtio side. That's not something that can be offloaded to
> > > > > a DMA engine, sadly, so we still need some form of completion call.
> > > > 
> > > > I guess that virtio interrupts is the most widely deployed scenario,
> > > > so let's ignore the DMA TX completion transaction for now - and call
> > > > it a possible future optimization for specific use cases. So it seems
> > > > that some form of completion call is unavoidable.
> > > > 
> > > 
> > > We could separate the actual kick of the guest with the data transfer.
> > > If interrupts are enabled, this means that the guest is not actively polling, i.e.
> > > we can allow some extra latency by performing the actual kick from the rx
> > > context, or, as Maxime said, if DMA engine can generate interrupts when the
> > > DMA queue is empty, vhost thread may listen to them and kick the guest if
> > > needed.  This will additionally remove the extra system call from the fast
> > > path.
> > 
> > Separating the kick from the data transfer is a very good idea. But it requires a
> > dedicated control-plane thread to kick the guest after the DMA interrupt. Anyway,
> > we can try this optimization in the future.
> 
> Yes it requires a dedicated thread, but I don't think this is really an
> issue. Interrupt mode can be considered as slow-path.
> 

While not overly familiar with virtio/vhost, my main concern about
interrupt mode is not the handling of interrupts themselves when interrupt
mode is enabled, but rather ensuring that we have correct behaviour when
interrupt mode is enabled while copy operations are in-flight. Is it
possible for a guest to enable interrupt mode, receive packets but never be
woken up?

/Bruce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-30  2:02                   ` Hu, Jiayu
@ 2022-03-30  9:25                     ` Maxime Coquelin
  2022-03-30 10:20                       ` Bruce Richardson
  2022-03-30 14:27                       ` Hu, Jiayu
  0 siblings, 2 replies; 58+ messages in thread
From: Maxime Coquelin @ 2022-03-30  9:25 UTC (permalink / raw)
  To: Hu, Jiayu, Ilya Maximets, Morten Brørup, Richardson, Bruce
  Cc: Van Haaren, Harry, Pai G, Sunil, Stokes, Ian, Ferriter, Cian,
	ovs-dev, dev, Mcnamara, John, O'Driscoll, Tim, Finn, Emma



On 3/30/22 04:02, Hu, Jiayu wrote:
> 
> 
>> -----Original Message-----
>> From: Ilya Maximets <i.maximets@ovn.org>
>> Sent: Wednesday, March 30, 2022 1:45 AM
>> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
>> <bruce.richardson@intel.com>
>> Cc: i.maximets@ovn.org; Maxime Coquelin <maxime.coquelin@redhat.com>;
>> Van Haaren, Harry <harry.van.haaren@intel.com>; Pai G, Sunil
>> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
>> <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
>> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>> Finn, Emma <emma.finn@intel.com>
>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>
>> On 3/29/22 19:13, Morten Brørup wrote:
>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>> Sent: Tuesday, 29 March 2022 19.03
>>>>
>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>>>
>>>>>> Hi Morten,
>>>>>>
>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>>>
>>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>>>
>>>>>>>>> Having thought more about it, I think that a completely
>>>> different
>>>>>> architectural approach is required:
>>>>>>>>>
>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
>>>>>> packet burst functions, each optimized for different CPU vector
>>>>>> instruction sets. The availability of a DMA engine should be
>>>> treated
>>>>>> the same way. So I suggest that PMDs copying packet contents, e.g.
>>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
>>>> packet
>>>>>> burst functions.
>>>>>>>>>
>>>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>>>
>>>>>>>>> In such an architecture, it would be the application's job to
>>>>>> allocate DMA channels and assign them to the specific PMDs that
>>>> should
>>>>>> use them. But the actual use of the DMA channels would move down
>>>> below
>>>>>> the application and into the DPDK PMDs and libraries.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Med venlig hilsen / Kind regards, -Morten Brørup
>>>>>>>>
>>>>>>>> Hi Morten,
>>>>>>>>
>>>>>>>> That's *exactly* how this architecture is designed &
>>>> implemented.
>>>>>>>> 1.	The DMA configuration and initialization is up to the
>>>> application
>>>>>> (OVS).
>>>>>>>> 2.	The VHost library is passed the DMA-dev ID, and its new
>>>> async
>>>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>>>>>>
>>>>>>>> Looking forward to talking on the call that just started.
>>>> Regards, -
>>>>>> Harry
>>>>>>>>
>>>>>>>
>>>>>>> OK, thanks - as I said on the call, I haven't looked at the
>>>> patches.
>>>>>>>
>>>>>>> Then, I suppose that the TX completions can be handled in the TX
>>>>>> function, and the RX completions can be handled in the RX function,
>>>>>> just like the Ethdev PMDs handle packet descriptors:
>>>>>>>
>>>>>>> TX_Burst(tx_packet_array):
>>>>>>> 1.	Clean up descriptors processed by the NIC chip. --> Process
>>>> TX
>>>>>> DMA channel completions. (Effectively, the 2nd pipeline stage.)
>>>>>>> 2.	Pass on the tx_packet_array to the NIC chip descriptors. --
>>>>> Pass
>>>>>> on the tx_packet_array to the TX DMA channel. (Effectively, the 1st
>>>>>> pipeline stage.)
>>>>>>
>>>>>> The problem is Tx function might not be called again, so enqueued
>>>>>> packets in 2. may never be completed from a Virtio point of view.
>>>> IOW,
>>>>>> the packets will be copied to the Virtio descriptors buffers, but
>>>> the
>>>>>> descriptors will not be made available to the Virtio driver.
>>>>>
>>>>> In that case, the application needs to call TX_Burst() periodically
>>>> with an empty array, for completion purposes.
>>>>>
>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK
>>>> library, to handle DMA completion. It might even handle multiple DMA
>>>> channels, if convenient - and if possible without locking or other
>>>> weird complexity.
>>>>>
>>>>> Here is another idea, inspired by a presentation at one of the DPDK
>>>> Userspace conferences. It may be wishful thinking, though:
>>>>>
>>>>> Add an additional transaction to each DMA burst; a special
>>>> transaction containing the memory write operation that makes the
>>>> descriptors available to the Virtio driver.
>>
>> I was talking with Maxime after the call today about the same idea.
>> And it looks fairly doable, I would say.
> 
> If the idea is to have the DMA update the used ring's index (2B) and the packed ring
> descriptor's flag (2B), yes, it will work functionally. But considering the offloading
> cost of DMA, it would hurt performance. In addition, the latency of a small DMA copy
> is much higher than that of a CPU copy, so it will also increase latency.

I agree writing back descriptors using DMA can be sub-optimal,
especially for packed ring where the head desc flags have to be written
last.

Are you sure about latency? With the current solution, the descriptor
write-backs can happen quite some time after the DMA transfers are done,
can't they?

>>
>>>>>
>>>>
>>>> That is something that can work, so long as the receiver is operating
>>>> in polling mode. For cases where virtio interrupts are enabled, you
>>>> still need to do a write to the eventfd in the kernel in vhost to
>>>> signal the virtio side. That's not something that can be offloaded to
>>>> a DMA engine, sadly, so we still need some form of completion call.
>>>
>>> I guess that virtio interrupts is the most widely deployed scenario,
>>> so let's ignore the DMA TX completion transaction for now - and call
>>> it a possible future optimization for specific use cases. So it seems
>>> that some form of completion call is unavoidable.
>>>
>>
>> We could separate the actual kick of the guest with the data transfer.
>> If interrupts are enabled, this means that the guest is not actively polling, i.e.
>> we can allow some extra latency by performing the actual kick from the rx
>> context, or, as Maxime said, if DMA engine can generate interrupts when the
>> DMA queue is empty, vhost thread may listen to them and kick the guest if
>> needed.  This will additionally remove the extra system call from the fast
>> path.
> 
> Separating the kick from the data transfer is a very good idea. But it requires a
> dedicated control-plane thread to kick the guest after the DMA interrupt. Anyway,
> we can try this optimization in the future.

Yes it requires a dedicated thread, but I don't think this is really an
issue. Interrupt mode can be considered as slow-path.

> 
> Thanks,
> Jiayu


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 19:59                   ` Morten Brørup
@ 2022-03-30  9:01                     ` Van Haaren, Harry
  2022-04-07 14:04                       ` Van Haaren, Harry
  0 siblings, 1 reply; 58+ messages in thread
From: Van Haaren, Harry @ 2022-03-30  9:01 UTC (permalink / raw)
  To: Morten Brørup, Richardson,  Bruce
  Cc: Maxime Coquelin, Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter,
	Cian, Ilya Maximets, ovs-dev, dev, Mcnamara, John,
	O'Driscoll, Tim, Finn, Emma

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Tuesday, March 29, 2022 8:59 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
> <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; Ilya Maximets
> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>; Finn,
> Emma <emma.finn@intel.com>
> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> 
> > From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > Sent: Tuesday, 29 March 2022 19.46
> >
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: Tuesday, March 29, 2022 6:14 PM
> > >
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Tuesday, 29 March 2022 19.03
> > > >
> > > > On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > > > > > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > > > > > Sent: Tuesday, 29 March 2022 18.24
> > > > > >
> > > > > > Hi Morten,
> > > > > >
> > > > > > On 3/29/22 16:44, Morten Brørup wrote:
> > > > > > >> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > > > > > >> Sent: Tuesday, 29 March 2022 15.02
> > > > > > >>
> > > > > > >>> From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > >>> Sent: Tuesday, March 29, 2022 1:51 PM
> > > > > > >>>
> > > > > > >>> Having thought more about it, I think that a completely
> > > > different
> > > > > > architectural approach is required:
> > > > > > >>>
> > > > > > >>> Many of the DPDK Ethernet PMDs implement a variety of RX
> > and TX
> > > > > > packet burst functions, each optimized for different CPU vector
> > > > > > instruction sets. The availability of a DMA engine should be
> > > > treated
> > > > > > the same way. So I suggest that PMDs copying packet contents,
> > e.g.
> > > > > > memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> > > > packet
> > > > > > burst functions.
> > > > > > >>>
> > > > > > >>> Similarly for the DPDK vhost library.
> > > > > > >>>
> > > > > > >>> In such an architecture, it would be the application's job
> > to
> > > > > > allocate DMA channels and assign them to the specific PMDs that
> > > > should
> > > > > > use them. But the actual use of the DMA channels would move
> > down
> > > > below
> > > > > > the application and into the DPDK PMDs and libraries.
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> Med venlig hilsen / Kind regards,
> > > > > > >>> -Morten Brørup
> > > > > > >>
> > > > > > >> Hi Morten,
> > > > > > >>
> > > > > > >> That's *exactly* how this architecture is designed &
> > > > implemented.
> > > > > > >> 1.	The DMA configuration and initialization is up to the
> > > > application
> > > > > > (OVS).
> > > > > > >> 2.	The VHost library is passed the DMA-dev ID, and its
> > new
> > > > async
> > > > > > rx/tx APIs, and uses the DMA device to accelerate the copy.
> > > > > > >>
> > > > > > >> Looking forward to talking on the call that just started.
> > > > Regards, -
> > > > > > Harry
> > > > > > >>
> > > > > > >
> > > > > > > OK, thanks - as I said on the call, I haven't looked at the
> > > > patches.
> > > > > > >
> > > > > > > Then, I suppose that the TX completions can be handled in the
> > TX
> > > > > > function, and the RX completions can be handled in the RX
> > function,
> > > > > > just like the Ethdev PMDs handle packet descriptors:
> > > > > > >
> > > > > > > TX_Burst(tx_packet_array):
> > > > > > > 1.	Clean up descriptors processed by the NIC chip. -->
> > Process
> > > > TX
> > > > > > DMA channel completions. (Effectively, the 2nd pipeline stage.)
> > > > > > > 2.	Pass on the tx_packet_array to the NIC chip
> > descriptors. --
> > > > > Pass
> > > > > > on the tx_packet_array to the TX DMA channel. (Effectively, the
> > 1st
> > > > > > pipeline stage.)
> > > > > >
> > > > > > The problem is Tx function might not be called again, so
> > enqueued
> > > > > > packets in 2. may never be completed from a Virtio point of
> > view.
> > > > IOW,
> > > > > > the packets will be copied to the Virtio descriptors buffers,
> > but
> > > > the
> > > > > > descriptors will not be made available to the Virtio driver.
> > > > >
> > > > > In that case, the application needs to call TX_Burst()
> > periodically
> > > > with an empty array, for completion purposes.
> >
> > This is what the "defer work" does at the OVS thread-level, but instead
> > of
> > "brute-forcing" and *always* making the call, the defer work concept
> > tracks
> > *when* there is outstanding work (DMA copies) to be completed
> > ("deferred work")
> > and calls the generic completion function at that point.
> >
> > So "defer work" is generic infrastructure at the OVS thread level to
> > handle
> > work that needs to be done "later", e.g. DMA completion handling.
> >
> >
> > > > > Or some sort of TX_Keepalive() function can be added to the DPDK
> > > > library, to handle DMA completion. It might even handle multiple
> > DMA
> > > > channels, if convenient - and if possible without locking or other
> > > > weird complexity.
> >
> > That's exactly how it is done, the VHost library has a new API added,
> > which allows
> > for handling completions. And in the "Netdev layer" (~OVS ethdev
> > abstraction)
> > we add a function to allow the OVS thread to do those completions in a
> > new
> > Netdev-abstraction API called "async_process" where the completions can
> > be checked.
> >
> > The only method to abstract them is to "hide" them somewhere that will
> > always be
> > polled, e.g. an ethdev port's RX function.  Both V3 and V4 approaches
> > use this method.
> > This allows "completions" to be transparent to the app, at the tradeoff
> > to having bad
> > separation  of concerns as Rx and Tx are now tied-together.
> >
> > The point is, the Application layer must *somehow * handle of
> > completions.
> > So fundamentally there are 2 options for the Application level:
> >
> > A) Make the application periodically call a "handle completions"
> > function
> > 	A1) Defer work, call when needed, and track "needed" at app
> > layer, and calling into vhost txq complete as required.
> > 	        Elegant in that "no work" means "no cycles spent" on
> > checking DMA completions.
> > 	A2) Brute-force-always-call, and pay some overhead when not
> > required.
> > 	        Cycle-cost in "no work" scenarios. Depending on # of
> > vhost queues, this adds up as polling required *per vhost txq*.
> > 	        Also note that "checking DMA completions" means taking a
> > virtq-lock, so this "brute-force" can needlessly increase x-thread
> > contention!
> 
> A side note: I don't see why locking is required to test for DMA completions.
> rte_dma_vchan_status() is lockless, e.g.:
> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L56
> 0

Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lockfree manner from a single thread.

The locks I refer to are at the OVS-netdev level, as virtqs are shared across OVS's dataplane threads.
So the "M to N" comes from M dataplane threads to N virtqs, hence requiring some locking.


> > B) Hide completions and live with the complexity/architectural
> > sacrifice of mixed-RxTx.
> > 	Various downsides here in my opinion, see the slide deck
> > presented earlier today for a summary.
> >
> > In my opinion, A1 is the most elegant solution, as it has a clean
> > separation of concerns, does not  cause
> > avoidable contention on virtq locks, and spends no cycles when there is
> > no completion work to do.
> >
> 
> Thank you for elaborating, Harry.

Thanks for taking part in the discussion & providing your insight!

> I am strongly opposed to hiding any part of TX processing in an RX function. It is just
> wrong in so many ways!
> 
> I agree that A1 is the most elegant solution. And being the most elegant solution, it
> is probably also the most future proof solution. :-)

I think so too, yes.

> I would also like to stress that DMA completion handling belongs in the DPDK
> library, not in the application. And yes, the application will be required to call some
> "handle DMA completions" function in the DPDK library. But since the application
> already knows that it uses DMA, the application should also know that it needs to
> call this extra function - so I consider this requirement perfectly acceptable.

Agree here.

> I prefer if the DPDK vhost library can hide its inner workings from the application,
> and just expose the additional "handle completions" function. This also means that
> the inner workings can be implemented as "defer work", or by some other
> algorithm. And it can be tweaked and optimized later.

Yes, the choice of how to call the handle_completions function lies with the Application
layer. For OVS we designed Defer Work, V3 and V4. But it is an App-level choice, and every
application is free to choose its own method.

> Thinking about the long term perspective, this design pattern is common for both
> the vhost library and other DPDK libraries that could benefit from DMA (e.g.
> vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or a
> separate library. But for now, we should focus on the vhost use case, and just keep
> the long term roadmap for using DMA in mind.

Totally agree to keep the long-term roadmap in mind; but I'm not sure we can refactor
the logic out of vhost. When DMA completions arrive, the virtQ needs to be updated;
this causes a tight coupling between the DMA completion count and the vhost library.

As Ilya raised on the call yesterday, there is an "in_order" requirement in the vhost
library: per virtq, the packets are presented to the guest "in order" of enqueue.
(To be clear, *not* in order of DMA completion! As Jiayu mentioned, the Vhost library
handles this today by re-ordering the DMA completions.)
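
A small sketch of that re-ordering idea, using a hypothetical tracking ring
rather than the vhost library's actual data structures:

#include <stdbool.h>
#include <stdint.h>

/* Sketch only: DMA jobs may complete out of order, but entries are released
 * to the guest only as a contiguous prefix, in the order they were enqueued.
 */
#define RING_SZ 256

struct inorder_ring {
        bool done[RING_SZ];   /* per-slot DMA-completion flag */
        uint32_t head;        /* next slot to publish to the guest */
};

/* Called when a DMA completion is observed for a slot (in any order). */
static void
mark_done(struct inorder_ring *r, uint32_t slot)
{
        r->done[slot % RING_SZ] = true;
}

/* Publish as many in-order entries as are ready; returns how many. */
static uint32_t
publish_in_order(struct inorder_ring *r)
{
        uint32_t n = 0;

        while (r->done[r->head % RING_SZ]) {
                r->done[r->head % RING_SZ] = false;
                r->head++;
                n++;    /* the real code would update the guest's used ring here */
        }
        return n;
}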


> Rephrasing what I said on the conference call: This vhost design will become the
> common design pattern for using DMA in DPDK libraries. If we get it wrong, we are
> stuck with it.

Agree, and if we get it right, then we're stuck with it too! :)


> > > > > Here is another idea, inspired by a presentation at one of the
> > DPDK
> > > > Userspace conferences. It may be wishful thinking, though:
> > > > >
> > > > > Add an additional transaction to each DMA burst; a special
> > > > transaction containing the memory write operation that makes the
> > > > descriptors available to the Virtio driver.
> > > > >
> > > >
> > > > That is something that can work, so long as the receiver is
> > operating
> > > > in
> > > > polling mode. For cases where virtio interrupts are enabled, you
> > still
> > > > need
> > > > to do a write to the eventfd in the kernel in vhost to signal the
> > > > virtio
> > > > side. That's not something that can be offloaded to a DMA engine,
> > > > sadly, so
> > > > we still need some form of completion call.
> > >
> > > I guess that virtio interrupts is the most widely deployed scenario,
> > so let's ignore
> > > the DMA TX completion transaction for now - and call it a possible
> > future
> > > optimization for specific use cases. So it seems that some form of
> > completion call
> > > is unavoidable.
> >
> > Agree to leave this aside, there is in theory a potential optimization,
> > but
> > unlikely to be of large value.
> >
> 
> One more thing: When using DMA to pass packets into a guest, there could be a
> delay from when the DMA completes until the guest is signaled. Is there any CPU cache
> hotness regarding the guest's access to the packet data to consider here? I.e. if we
> delay signaling the guest, the packet data may get cold.

Interesting question; we can likely spawn a new thread around this topic!
In short, it depends on how/where the DMA hardware writes the copy.

With technologies like DDIO, the "dest" part of the copy will be in LLC. The core reading the
dest data will benefit from the LLC locality (instead of snooping it from a remote core's L1/L2).

Delays in notifying the guest could result in LLC capacity eviction, yes.
The application layer decides how often/promptly to check for completions
and notify the guest of them. Calling the function more often will result in less
delay in that portion of the pipeline.

Overall, there are caching benefits with DMA acceleration, and the application can control
the latency introduced between DMA completion in HW and the guest vring update.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 17:45                 ` Ilya Maximets
  2022-03-29 18:46                   ` Morten Brørup
@ 2022-03-30  2:02                   ` Hu, Jiayu
  2022-03-30  9:25                     ` Maxime Coquelin
  1 sibling, 1 reply; 58+ messages in thread
From: Hu, Jiayu @ 2022-03-30  2:02 UTC (permalink / raw)
  To: Ilya Maximets, Morten Brørup, Richardson, Bruce
  Cc: Maxime Coquelin, Van Haaren, Harry, Pai G, Sunil, Stokes, Ian,
	Ferriter, Cian, ovs-dev, dev, Mcnamara, John, O'Driscoll,
	Tim, Finn, Emma



> -----Original Message-----
> From: Ilya Maximets <i.maximets@ovn.org>
> Sent: Wednesday, March 30, 2022 1:45 AM
> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: i.maximets@ovn.org; Maxime Coquelin <maxime.coquelin@redhat.com>;
> Van Haaren, Harry <harry.van.haaren@intel.com>; Pai G, Sunil
> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
> <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> Finn, Emma <emma.finn@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> 
> On 3/29/22 19:13, Morten Brørup wrote:
> >> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >> Sent: Tuesday, 29 March 2022 19.03
> >>
> >> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> >>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>> Sent: Tuesday, 29 March 2022 18.24
> >>>>
> >>>> Hi Morten,
> >>>>
> >>>> On 3/29/22 16:44, Morten Brørup wrote:
> >>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> >>>>>> Sent: Tuesday, 29 March 2022 15.02
> >>>>>>
> >>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
> >>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
> >>>>>>>
> >>>>>>> Having thought more about it, I think that a completely
> >> different
> >>>> architectural approach is required:
> >>>>>>>
> >>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
> >>>> packet burst functions, each optimized for different CPU vector
> >>>> instruction sets. The availability of a DMA engine should be
> >> treated
> >>>> the same way. So I suggest that PMDs copying packet contents, e.g.
> >>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> >> packet
> >>>> burst functions.
> >>>>>>>
> >>>>>>> Similarly for the DPDK vhost library.
> >>>>>>>
> >>>>>>> In such an architecture, it would be the application's job to
> >>>> allocate DMA channels and assign them to the specific PMDs that
> >> should
> >>>> use them. But the actual use of the DMA channels would move down
> >> below
> >>>> the application and into the DPDK PMDs and libraries.
> >>>>>>>
> >>>>>>>
> >>>>>>> Med venlig hilsen / Kind regards, -Morten Brørup
> >>>>>>
> >>>>>> Hi Morten,
> >>>>>>
> >>>>>> That's *exactly* how this architecture is designed &
> >> implemented.
> >>>>>> 1.	The DMA configuration and initialization is up to the
> >> application
> >>>> (OVS).
> >>>>>> 2.	The VHost library is passed the DMA-dev ID, and its new
> >> async
> >>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
> >>>>>>
> >>>>>> Looking forward to talking on the call that just started.
> >> Regards, -
> >>>> Harry
> >>>>>>
> >>>>>
> >>>>> OK, thanks - as I said on the call, I haven't looked at the
> >> patches.
> >>>>>
> >>>>> Then, I suppose that the TX completions can be handled in the TX
> >>>> function, and the RX completions can be handled in the RX function,
> >>>> just like the Ethdev PMDs handle packet descriptors:
> >>>>>
> >>>>> TX_Burst(tx_packet_array):
> >>>>> 1.	Clean up descriptors processed by the NIC chip. --> Process
> >> TX
> >>>> DMA channel completions. (Effectively, the 2nd pipeline stage.)
> >>>>> 2.	Pass on the tx_packet_array to the NIC chip descriptors. --
> >>> Pass
> >>>> on the tx_packet_array to the TX DMA channel. (Effectively, the 1st
> >>>> pipeline stage.)
> >>>>
> >>>> The problem is Tx function might not be called again, so enqueued
> >>>> packets in 2. may never be completed from a Virtio point of view.
> >> IOW,
> >>>> the packets will be copied to the Virtio descriptors buffers, but
> >> the
> >>>> descriptors will not be made available to the Virtio driver.
> >>>
> >>> In that case, the application needs to call TX_Burst() periodically
> >> with an empty array, for completion purposes.
> >>>
> >>> Or some sort of TX_Keepalive() function can be added to the DPDK
> >> library, to handle DMA completion. It might even handle multiple DMA
> >> channels, if convenient - and if possible without locking or other
> >> weird complexity.
> >>>
> >>> Here is another idea, inspired by a presentation at one of the DPDK
> >> Userspace conferences. It may be wishful thinking, though:
> >>>
> >>> Add an additional transaction to each DMA burst; a special
> >> transaction containing the memory write operation that makes the
> >> descriptors available to the Virtio driver.
> 
> I was talking with Maxime after the call today about the same idea.
> And it looks fairly doable, I would say.

If the idea is to have the DMA update the used ring's index (2B) and the packed ring
descriptor's flag (2B), yes, it will work functionally. But considering the offloading
cost of DMA, it would hurt performance. In addition, the latency of a small DMA copy
is much higher than that of a CPU copy, so it will also increase latency.

> 
> >>>
> >>
> >> That is something that can work, so long as the receiver is operating
> >> in polling mode. For cases where virtio interrupts are enabled, you
> >> still need to do a write to the eventfd in the kernel in vhost to
> >> signal the virtio side. That's not something that can be offloaded to
> >> a DMA engine, sadly, so we still need some form of completion call.
> >
> > I guess that virtio interrupts is the most widely deployed scenario,
> > so let's ignore the DMA TX completion transaction for now - and call
> > it a possible future optimization for specific use cases. So it seems
> > that some form of completion call is unavoidable.
> >
> 
> We could separate the actual kick of the guest with the data transfer.
> If interrupts are enabled, this means that the guest is not actively polling, i.e.
> we can allow some extra latency by performing the actual kick from the rx
> context, or, as Maxime said, if DMA engine can generate interrupts when the
> DMA queue is empty, vhost thread may listen to them and kick the guest if
> needed.  This will additionally remove the extra system call from the fast
> path.

Separating the kick from the data transfer is a very good idea. But it requires a
dedicated control-plane thread to kick the guest after the DMA interrupt. Anyway,
we can try this optimization in the future.

Thanks,
Jiayu

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 17:46                 ` Van Haaren, Harry
@ 2022-03-29 19:59                   ` Morten Brørup
  2022-03-30  9:01                     ` Van Haaren, Harry
  0 siblings, 1 reply; 58+ messages in thread
From: Morten Brørup @ 2022-03-29 19:59 UTC (permalink / raw)
  To: Van Haaren, Harry, Richardson,  Bruce
  Cc: Maxime Coquelin, Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter,
	Cian, Ilya Maximets, ovs-dev, dev, Mcnamara, John,
	O'Driscoll, Tim, Finn, Emma

> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> Sent: Tuesday, 29 March 2022 19.46
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Tuesday, March 29, 2022 6:14 PM
> >
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Tuesday, 29 March 2022 19.03
> > >
> > > On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > > > > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > > > > Sent: Tuesday, 29 March 2022 18.24
> > > > >
> > > > > Hi Morten,
> > > > >
> > > > > On 3/29/22 16:44, Morten Brørup wrote:
> > > > > >> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > > > > >> Sent: Tuesday, 29 March 2022 15.02
> > > > > >>
> > > > > >>> From: Morten Brørup <mb@smartsharesystems.com>
> > > > > >>> Sent: Tuesday, March 29, 2022 1:51 PM
> > > > > >>>
> > > > > >>> Having thought more about it, I think that a completely
> > > different
> > > > > architectural approach is required:
> > > > > >>>
> > > > > >>> Many of the DPDK Ethernet PMDs implement a variety of RX
> and TX
> > > > > packet burst functions, each optimized for different CPU vector
> > > > > instruction sets. The availability of a DMA engine should be
> > > treated
> > > > > the same way. So I suggest that PMDs copying packet contents,
> e.g.
> > > > > memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> > > packet
> > > > > burst functions.
> > > > > >>>
> > > > > >>> Similarly for the DPDK vhost library.
> > > > > >>>
> > > > > >>> In such an architecture, it would be the application's job
> to
> > > > > allocate DMA channels and assign them to the specific PMDs that
> > > should
> > > > > use them. But the actual use of the DMA channels would move
> down
> > > below
> > > > > the application and into the DPDK PMDs and libraries.
> > > > > >>>
> > > > > >>>
> > > > > >>> Med venlig hilsen / Kind regards,
> > > > > >>> -Morten Brørup
> > > > > >>
> > > > > >> Hi Morten,
> > > > > >>
> > > > > >> That's *exactly* how this architecture is designed &
> > > implemented.
> > > > > >> 1.	The DMA configuration and initialization is up to the
> > > application
> > > > > (OVS).
> > > > > >> 2.	The VHost library is passed the DMA-dev ID, and its
> new
> > > async
> > > > > rx/tx APIs, and uses the DMA device to accelerate the copy.
> > > > > >>
> > > > > >> Looking forward to talking on the call that just started.
> > > Regards, -
> > > > > Harry
> > > > > >>
> > > > > >
> > > > > > OK, thanks - as I said on the call, I haven't looked at the
> > > patches.
> > > > > >
> > > > > > Then, I suppose that the TX completions can be handled in the
> TX
> > > > > function, and the RX completions can be handled in the RX
> function,
> > > > > just like the Ethdev PMDs handle packet descriptors:
> > > > > >
> > > > > > TX_Burst(tx_packet_array):
> > > > > > 1.	Clean up descriptors processed by the NIC chip. -->
> Process
> > > TX
> > > > > DMA channel completions. (Effectively, the 2nd pipeline stage.)
> > > > > > 2.	Pass on the tx_packet_array to the NIC chip
> descriptors. --
> > > > Pass
> > > > > on the tx_packet_array to the TX DMA channel. (Effectively, the
> 1st
> > > > > pipeline stage.)
> > > > >
> > > > > The problem is Tx function might not be called again, so
> enqueued
> > > > > packets in 2. may never be completed from a Virtio point of
> view.
> > > IOW,
> > > > > the packets will be copied to the Virtio descriptors buffers,
> but
> > > the
> > > > > descriptors will not be made available to the Virtio driver.
> > > >
> > > > In that case, the application needs to call TX_Burst()
> periodically
> > > with an empty array, for completion purposes.
> 
> This is what the "defer work" does at the OVS thread-level, but instead
> of
> "brute-forcing" and *always* making the call, the defer work concept
> tracks
> *when* there is outstanding work (DMA copies) to be completed
> ("deferred work")
> and calls the generic completion function at that point.
> 
> So "defer work" is generic infrastructure at the OVS thread level to
> handle
> work that needs to be done "later", e.g. DMA completion handling.
> 
> 
> > > > Or some sort of TX_Keepalive() function can be added to the DPDK
> > > library, to handle DMA completion. It might even handle multiple
> DMA
> > > channels, if convenient - and if possible without locking or other
> > > weird complexity.
> 
> That's exactly how it is done, the VHost library has a new API added,
> which allows
> for handling completions. And in the "Netdev layer" (~OVS ethdev
> abstraction)
> we add a function to allow the OVS thread to do those completions in a
> new
> Netdev-abstraction API called "async_process" where the completions can
> be checked.
> 
> The only method to abstract them is to "hide" them somewhere that will
> always be
> polled, e.g. an ethdev port's RX function.  Both V3 and V4 approaches
> use this method.
> This allows "completions" to be transparent to the app, at the tradeoff
> to having bad
> separation  of concerns as Rx and Tx are now tied-together.
> 
> The point is, the Application layer must *somehow * handle of
> completions.
> So fundamentally there are 2 options for the Application level:
> 
> A) Make the application periodically call a "handle completions"
> function
> 	A1) Defer work, call when needed, and track "needed" at app
> layer, and calling into vhost txq complete as required.
> 	        Elegant in that "no work" means "no cycles spent" on
> checking DMA completions.
> 	A2) Brute-force-always-call, and pay some overhead when not
> required.
> 	        Cycle-cost in "no work" scenarios. Depending on # of
> vhost queues, this adds up as polling required *per vhost txq*.
> 	        Also note that "checking DMA completions" means taking a
> virtq-lock, so this "brute-force" can needlessly increase x-thread
> contention!

A side note: I don't see why locking is required to test for DMA completions. rte_dma_vchan_status() is lockless, e.g.:
https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L560
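
For reference, a minimal sketch of such a lockless check (dev_id/vchan are
placeholders; a driver may not implement the status query, in which case
the call fails):

#include <stdbool.h>
#include <stdint.h>
#include <rte_dmadev.h>

/* Sketch only: ask the dmadev whether a vchan still has work outstanding,
 * without taking any lock.
 */
static bool
dma_vchan_idle(int16_t dev_id, uint16_t vchan)
{
        enum rte_dma_vchan_status st;

        if (rte_dma_vchan_status(dev_id, vchan, &st) != 0)
                return false;   /* status not supported; assume busy */
        return st == RTE_DMA_VCHAN_IDLE;
}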

> 
> B) Hide completions and live with the complexity/architectural
> sacrifice of mixed-RxTx.
> 	Various downsides here in my opinion, see the slide deck
> presented earlier today for a summary.
> 
> In my opinion, A1 is the most elegant solution, as it has a clean
> separation of concerns, does not  cause
> avoidable contention on virtq locks, and spends no cycles when there is
> no completion work to do.
> 

Thank you for elaborating, Harry.

I am strongly opposed to hiding any part of TX processing in an RX function. It is just wrong in so many ways!

I agree that A1 is the most elegant solution. And being the most elegant solution, it is probably also the most future proof solution. :-)

I would also like to stress that DMA completion handling belongs in the DPDK library, not in the application. And yes, the application will be required to call some "handle DMA completions" function in the DPDK library. But since the application already knows that it uses DMA, the application should also know that it needs to call this extra function - so I consider this requirement perfectly acceptable.

I prefer if the DPDK vhost library can hide its inner workings from the application, and just expose the additional "handle completions" function. This also means that the inner workings can be implemented as "defer work", or by some other algorithm. And it can be tweaked and optimized later.

Thinking about the long term perspective, this design pattern is common for both the vhost library and other DPDK libraries that could benefit from DMA (e.g. vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or a separate library. But for now, we should focus on the vhost use case, and just keep the long term roadmap for using DMA in mind.

Rephrasing what I said on the conference call: This vhost design will become the common design pattern for using DMA in DPDK libraries. If we get it wrong, we are stuck with it.

> 
> > > > Here is another idea, inspired by a presentation at one of the
> DPDK
> > > Userspace conferences. It may be wishful thinking, though:
> > > >
> > > > Add an additional transaction to each DMA burst; a special
> > > transaction containing the memory write operation that makes the
> > > descriptors available to the Virtio driver.
> > > >
> > >
> > > That is something that can work, so long as the receiver is
> operating
> > > in
> > > polling mode. For cases where virtio interrupts are enabled, you
> still
> > > need
> > > to do a write to the eventfd in the kernel in vhost to signal the
> > > virtio
> > > side. That's not something that can be offloaded to a DMA engine,
> > > sadly, so
> > > we still need some form of completion call.
> >
> > I guess that virtio interrupts is the most widely deployed scenario,
> so let's ignore
> > the DMA TX completion transaction for now - and call it a possible
> future
> > optimization for specific use cases. So it seems that some form of
> completion call
> > is unavoidable.
> 
> Agree to leave this aside, there is in theory a potential optimization,
> but
> unlikely to be of large value.
> 

One more thing: When using DMA to pass packets into a guest, there could be a delay from when the DMA completes until the guest is signaled. Is there any CPU cache hotness regarding the guest's access to the packet data to consider here? I.e. if we delay signaling the guest, the packet data may get cold.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 17:45                 ` Ilya Maximets
@ 2022-03-29 18:46                   ` Morten Brørup
  2022-03-30  2:02                   ` Hu, Jiayu
  1 sibling, 0 replies; 58+ messages in thread
From: Morten Brørup @ 2022-03-29 18:46 UTC (permalink / raw)
  To: Ilya Maximets, Bruce Richardson
  Cc: Maxime Coquelin, Van Haaren, Harry, Pai G, Sunil, Stokes, Ian,
	Hu, Jiayu, Ferriter, Cian, ovs-dev, dev, Mcnamara, John,
	O'Driscoll, Tim, Finn, Emma

> From: Ilya Maximets [mailto:i.maximets@ovn.org]
> Sent: Tuesday, 29 March 2022 19.45
> 
> On 3/29/22 19:13, Morten Brørup wrote:
> >> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >> Sent: Tuesday, 29 March 2022 19.03
> >>
> >> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> >>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>> Sent: Tuesday, 29 March 2022 18.24
> >>>>
> >>>> Hi Morten,
> >>>>
> >>>> On 3/29/22 16:44, Morten Brørup wrote:
> >>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> >>>>>> Sent: Tuesday, 29 March 2022 15.02
> >>>>>>
> >>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
> >>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
> >>>>>>>
> >>>>>>> Having thought more about it, I think that a completely
> >> different
> >>>> architectural approach is required:
> >>>>>>>
> >>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
> >>>> packet burst functions, each optimized for different CPU vector
> >>>> instruction sets. The availability of a DMA engine should be
> >> treated
> >>>> the same way. So I suggest that PMDs copying packet contents, e.g.
> >>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> >> packet
> >>>> burst functions.
> >>>>>>>
> >>>>>>> Similarly for the DPDK vhost library.
> >>>>>>>
> >>>>>>> In such an architecture, it would be the application's job to
> >>>> allocate DMA channels and assign them to the specific PMDs that
> >> should
> >>>> use them. But the actual use of the DMA channels would move down
> >> below
> >>>> the application and into the DPDK PMDs and libraries.
> >>>>>>>
> >>>>>>>
> >>>>>>> Med venlig hilsen / Kind regards,
> >>>>>>> -Morten Brørup
> >>>>>>
> >>>>>> Hi Morten,
> >>>>>>
> >>>>>> That's *exactly* how this architecture is designed &
> >> implemented.
> >>>>>> 1.	The DMA configuration and initialization is up to the
> >> application
> >>>> (OVS).
> >>>>>> 2.	The VHost library is passed the DMA-dev ID, and its new
> >> async
> >>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
> >>>>>>
> >>>>>> Looking forward to talking on the call that just started.
> >> Regards, -
> >>>> Harry
> >>>>>>
> >>>>>
> >>>>> OK, thanks - as I said on the call, I haven't looked at the
> >> patches.
> >>>>>
> >>>>> Then, I suppose that the TX completions can be handled in the TX
> >>>> function, and the RX completions can be handled in the RX
> function,
> >>>> just like the Ethdev PMDs handle packet descriptors:
> >>>>>
> >>>>> TX_Burst(tx_packet_array):
> >>>>> 1.	Clean up descriptors processed by the NIC chip. --> Process
> >> TX
> >>>> DMA channel completions. (Effectively, the 2nd pipeline stage.)
> >>>>> 2.	Pass on the tx_packet_array to the NIC chip descriptors. --
> >>> Pass
> >>>> on the tx_packet_array to the TX DMA channel. (Effectively, the
> 1st
> >>>> pipeline stage.)
> >>>>
> >>>> The problem is Tx function might not be called again, so enqueued
> >>>> packets in 2. may never be completed from a Virtio point of view.
> >> IOW,
> >>>> the packets will be copied to the Virtio descriptors buffers, but
> >> the
> >>>> descriptors will not be made available to the Virtio driver.
> >>>
> >>> In that case, the application needs to call TX_Burst() periodically
> >> with an empty array, for completion purposes.
> >>>
> >>> Or some sort of TX_Keepalive() function can be added to the DPDK
> >> library, to handle DMA completion. It might even handle multiple DMA
> >> channels, if convenient - and if possible without locking or other
> >> weird complexity.
> >>>
> >>> Here is another idea, inspired by a presentation at one of the DPDK
> >> Userspace conferences. It may be wishful thinking, though:
> >>>
> >>> Add an additional transaction to each DMA burst; a special
> >> transaction containing the memory write operation that makes the
> >> descriptors available to the Virtio driver.
> 
> I was talking with Maxime after the call today about the same idea.
> And it looks fairly doable, I would say.
> 
> >>>
> >>
> >> That is something that can work, so long as the receiver is
> operating
> >> in
> >> polling mode. For cases where virtio interrupts are enabled, you
> still
> >> need
> >> to do a write to the eventfd in the kernel in vhost to signal the
> >> virtio
> >> side. That's not something that can be offloaded to a DMA engine,
> >> sadly, so
> >> we still need some form of completion call.
> >
> > I guess that virtio interrupts is the most widely deployed scenario,
> so let's
> > ignore the DMA TX completion transaction for now - and call it a
> possible
> > future optimization for specific use cases. So it seems that some
> form of
> > completion call is unavoidable.
> >
> 
> We could separate the actual kick of the guest with the data transfer.
> If interrupts are enabled, this means that the guest is not actively
> polling, i.e. we can allow some extra latency by performing the actual
> kick from the rx context, or, as Maxime said, if DMA engine can
> generate
> interrupts when the DMA queue is empty, vhost thread may listen to them
> and kick the guest if needed.  This will additionally remove the extra
> system call from the fast path.

Excellent point about latency sensitivity!

It reminds me that we should consider what to optimize for...

Adding that extra "burst complete signal" DMA transaction at the end of a DMA burst has a cost. So we need to ask ourselves:
1. Is this the cheapest method to signal and handle completion, or would some other method (e.g. polling) be cheaper?
2. In the most important (i.e. most frequent or most latency sensitive) traffic scenarios, is this the cheapest method?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 17:13               ` Morten Brørup
  2022-03-29 17:45                 ` Ilya Maximets
@ 2022-03-29 17:46                 ` Van Haaren, Harry
  2022-03-29 19:59                   ` Morten Brørup
  1 sibling, 1 reply; 58+ messages in thread
From: Van Haaren, Harry @ 2022-03-29 17:46 UTC (permalink / raw)
  To: Morten Brørup, Richardson,  Bruce
  Cc: Maxime Coquelin, Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter,
	Cian, Ilya Maximets, ovs-dev, dev, Mcnamara, John,
	O'Driscoll, Tim, Finn, Emma

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Tuesday, March 29, 2022 6:14 PM
> To: Richardson, Bruce <bruce.richardson@intel.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
> <cian.ferriter@intel.com>; Ilya Maximets <i.maximets@ovn.org>; ovs-
> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>; Finn,
> Emma <emma.finn@intel.com>
> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> 
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Tuesday, 29 March 2022 19.03
> >
> > On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > > > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > > > Sent: Tuesday, 29 March 2022 18.24
> > > >
> > > > Hi Morten,
> > > >
> > > > On 3/29/22 16:44, Morten Brørup wrote:
> > > > >> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > > > >> Sent: Tuesday, 29 March 2022 15.02
> > > > >>
> > > > >>> From: Morten Brørup <mb@smartsharesystems.com>
> > > > >>> Sent: Tuesday, March 29, 2022 1:51 PM
> > > > >>>
> > > > >>> Having thought more about it, I think that a completely
> > different
> > > > architectural approach is required:
> > > > >>>
> > > > >>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
> > > > packet burst functions, each optimized for different CPU vector
> > > > instruction sets. The availability of a DMA engine should be
> > treated
> > > > the same way. So I suggest that PMDs copying packet contents, e.g.
> > > > memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> > packet
> > > > burst functions.
> > > > >>>
> > > > >>> Similarly for the DPDK vhost library.
> > > > >>>
> > > > >>> In such an architecture, it would be the application's job to
> > > > allocate DMA channels and assign them to the specific PMDs that
> > should
> > > > use them. But the actual use of the DMA channels would move down
> > below
> > > > the application and into the DPDK PMDs and libraries.
> > > > >>>
> > > > >>>
> > > > >>> Med venlig hilsen / Kind regards,
> > > > >>> -Morten Brørup
> > > > >>
> > > > >> Hi Morten,
> > > > >>
> > > > >> That's *exactly* how this architecture is designed &
> > implemented.
> > > > >> 1.	The DMA configuration and initialization is up to the
> > application
> > > > (OVS).
> > > > >> 2.	The VHost library is passed the DMA-dev ID, and its new
> > async
> > > > rx/tx APIs, and uses the DMA device to accelerate the copy.
> > > > >>
> > > > >> Looking forward to talking on the call that just started.
> > Regards, -
> > > > Harry
> > > > >>
> > > > >
> > > > > OK, thanks - as I said on the call, I haven't looked at the
> > patches.
> > > > >
> > > > > Then, I suppose that the TX completions can be handled in the TX
> > > > function, and the RX completions can be handled in the RX function,
> > > > just like the Ethdev PMDs handle packet descriptors:
> > > > >
> > > > > TX_Burst(tx_packet_array):
> > > > > 1.	Clean up descriptors processed by the NIC chip. --> Process
> > TX
> > > > DMA channel completions. (Effectively, the 2nd pipeline stage.)
> > > > > 2.	Pass on the tx_packet_array to the NIC chip descriptors. --
> > > Pass
> > > > on the tx_packet_array to the TX DMA channel. (Effectively, the 1st
> > > > pipeline stage.)
> > > >
> > > > The problem is Tx function might not be called again, so enqueued
> > > > packets in 2. may never be completed from a Virtio point of view.
> > IOW,
> > > > the packets will be copied to the Virtio descriptors buffers, but
> > the
> > > > descriptors will not be made available to the Virtio driver.
> > >
> > > In that case, the application needs to call TX_Burst() periodically
> > with an empty array, for completion purposes.

This is what the "defer work" does at the OVS thread-level, but instead of
"brute-forcing" and *always* making the call, the defer work concept tracks
*when* there is outstanding work (DMA copies) to be completed ("deferred work")
and calls the generic completion function at that point.

So "defer work" is generic infrastructure at the OVS thread level to handle
work that needs to be done "later", e.g. DMA completion handling.
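
To make this concrete, here is a minimal sketch of that per-txq tracking in C. All names (vhost_txq, vhost_txq_poll_completions, pmd_process_deferred_work) are illustrative stand-ins, not the actual OVS or DPDK vhost symbols:

#include <stdint.h>

struct vhost_txq {
    int vid;                  /* vhost device id */
    uint16_t qid;             /* virtqueue id */
    uint32_t dma_outstanding; /* async copies enqueued but not yet completed */
};

/* Stand-in for the vhost async completion-poll API discussed above;
 * returns the number of packets whose DMA copies have finished. */
extern uint16_t vhost_txq_poll_completions(int vid, uint16_t qid);

/* Transmit path: note the deferred work after enqueueing async copies. */
static inline void
txq_note_deferred_work(struct vhost_txq *txq, uint16_t n_enqueued)
{
    txq->dma_outstanding += n_enqueued;
}

/* Called once per PMD-thread iteration: only queues with outstanding DMA
 * work are touched, so "no work" takes no virtq lock and costs almost no cycles. */
static void
pmd_process_deferred_work(struct vhost_txq *txqs, unsigned int n_txqs)
{
    for (unsigned int i = 0; i < n_txqs; i++) {
        if (txqs[i].dma_outstanding == 0)
            continue;
        txqs[i].dma_outstanding -=
            vhost_txq_poll_completions(txqs[i].vid, txqs[i].qid);
    }
}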


> > > Or some sort of TX_Keepalive() function can be added to the DPDK
> > library, to handle DMA completion. It might even handle multiple DMA
> > channels, if convenient - and if possible without locking or other
> > weird complexity.

That's exactly how it is done: the VHost library has a new API added which allows
for handling completions. And in the "Netdev layer" (~OVS ethdev abstraction)
we add a function, a new Netdev-abstraction API called "async_process", which allows
the OVS thread to check and process those completions.

The only way to abstract them away is to "hide" them somewhere that will always be
polled, e.g. an ethdev port's RX function. Both the V3 and V4 approaches use this method.
This makes "completions" transparent to the app, at the cost of poor separation of
concerns, as Rx and Tx are now tied together.

The point is, the Application layer must *somehow* handle completions.
So fundamentally there are 2 options for the Application level:

A) Make the application periodically call a "handle completions" function
	A1) Defer work: call only when needed, tracking "needed" at the app layer and calling into the vhost txq completion as required.
	        Elegant in that "no work" means "no cycles spent" on checking DMA completions.
	A2) Brute force: always call, and pay some overhead when not required.
	        Cycle cost in "no work" scenarios; depending on the number of vhost queues, this adds up, as polling is required *per vhost txq*.
	        Also note that "checking DMA completions" means taking a virtq-lock, so this "brute-force" can needlessly increase x-thread contention!

B) Hide completions and live with the complexity/architectural sacrifice of mixed-RxTx. 
	Various downsides here in my opinion; see the slide deck presented earlier today for a summary.

In my opinion, A1 is the most elegant solution, as it has a clean separation of concerns, does not cause
avoidable contention on virtq locks, and spends no cycles when there is no completion work to do.


> > > Here is another idea, inspired by a presentation at one of the DPDK
> > Userspace conferences. It may be wishful thinking, though:
> > >
> > > Add an additional transaction to each DMA burst; a special
> > transaction containing the memory write operation that makes the
> > descriptors available to the Virtio driver.
> > >
> >
> > That is something that can work, so long as the receiver is operating
> > in
> > polling mode. For cases where virtio interrupts are enabled, you still
> > need
> > to do a write to the eventfd in the kernel in vhost to signal the
> > virtio
> > side. That's not something that can be offloaded to a DMA engine,
> > sadly, so
> > we still need some form of completion call.
> 
> I guess that virtio interrupts is the most widely deployed scenario, so let's ignore
> the DMA TX completion transaction for now - and call it a possible future
> optimization for specific use cases. So it seems that some form of completion call
> is unavoidable.

Agreed, let's leave this aside; there is in theory a potential optimization, but
it is unlikely to be of large value.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 17:13               ` Morten Brørup
@ 2022-03-29 17:45                 ` Ilya Maximets
  2022-03-29 18:46                   ` Morten Brørup
  2022-03-30  2:02                   ` Hu, Jiayu
  2022-03-29 17:46                 ` Van Haaren, Harry
  1 sibling, 2 replies; 58+ messages in thread
From: Ilya Maximets @ 2022-03-29 17:45 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson
  Cc: i.maximets, Maxime Coquelin, Van Haaren, Harry, Pai G, Sunil,
	Stokes, Ian, Hu, Jiayu, Ferriter, Cian, ovs-dev, dev, Mcnamara,
	John, O'Driscoll, Tim, Finn, Emma

On 3/29/22 19:13, Morten Brørup wrote:
>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>> Sent: Tuesday, 29 March 2022 19.03
>>
>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>
>>>> Hi Morten,
>>>>
>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>
>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>
>>>>>>> Having thought more about it, I think that a completely
>> different
>>>> architectural approach is required:
>>>>>>>
>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
>>>> packet burst functions, each optimized for different CPU vector
>>>> instruction sets. The availability of a DMA engine should be
>> treated
>>>> the same way. So I suggest that PMDs copying packet contents, e.g.
>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
>> packet
>>>> burst functions.
>>>>>>>
>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>
>>>>>>> In such an architecture, it would be the application's job to
>>>> allocate DMA channels and assign them to the specific PMDs that
>> should
>>>> use them. But the actual use of the DMA channels would move down
>> below
>>>> the application and into the DPDK PMDs and libraries.
>>>>>>>
>>>>>>>
>>>>>>> Med venlig hilsen / Kind regards,
>>>>>>> -Morten Brørup
>>>>>>
>>>>>> Hi Morten,
>>>>>>
>>>>>> That's *exactly* how this architecture is designed &
>> implemented.
>>>>>> 1.	The DMA configuration and initialization is up to the
>> application
>>>> (OVS).
>>>>>> 2.	The VHost library is passed the DMA-dev ID, and its new
>> async
>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>>>>
>>>>>> Looking forward to talking on the call that just started.
>> Regards, -
>>>> Harry
>>>>>>
>>>>>
>>>>> OK, thanks - as I said on the call, I haven't looked at the
>> patches.
>>>>>
>>>>> Then, I suppose that the TX completions can be handled in the TX
>>>> function, and the RX completions can be handled in the RX function,
>>>> just like the Ethdev PMDs handle packet descriptors:
>>>>>
>>>>> TX_Burst(tx_packet_array):
>>>>> 1.	Clean up descriptors processed by the NIC chip. --> Process
>> TX
>>>> DMA channel completions. (Effectively, the 2nd pipeline stage.)
>>>>> 2.	Pass on the tx_packet_array to the NIC chip descriptors. --
>>> Pass
>>>> on the tx_packet_array to the TX DMA channel. (Effectively, the 1st
>>>> pipeline stage.)
>>>>
>>>> The problem is Tx function might not be called again, so enqueued
>>>> packets in 2. may never be completed from a Virtio point of view.
>> IOW,
>>>> the packets will be copied to the Virtio descriptors buffers, but
>> the
>>>> descriptors will not be made available to the Virtio driver.
>>>
>>> In that case, the application needs to call TX_Burst() periodically
>> with an empty array, for completion purposes.
>>>
>>> Or some sort of TX_Keepalive() function can be added to the DPDK
>> library, to handle DMA completion. It might even handle multiple DMA
>> channels, if convenient - and if possible without locking or other
>> weird complexity.
>>>
>>> Here is another idea, inspired by a presentation at one of the DPDK
>> Userspace conferences. It may be wishful thinking, though:
>>>
>>> Add an additional transaction to each DMA burst; a special
>> transaction containing the memory write operation that makes the
>> descriptors available to the Virtio driver.

I was talking with Maxime after the call today about the same idea.
And it looks fairly doable, I would say.

>>>
>>
>> That is something that can work, so long as the receiver is operating
>> in
>> polling mode. For cases where virtio interrupts are enabled, you still
>> need
>> to do a write to the eventfd in the kernel in vhost to signal the
>> virtio
>> side. That's not something that can be offloaded to a DMA engine,
>> sadly, so
>> we still need some form of completion call.
> 
> I guess that virtio interrupts is the most widely deployed scenario, so let's
> ignore the DMA TX completion transaction for now - and call it a possible
> future optimization for specific use cases. So it seems that some form of
> completion call is unavoidable.
> 

We could separate the actual kick of the guest from the data transfer.
If interrupts are enabled, this means that the guest is not actively
polling, i.e. we can allow some extra latency by performing the actual
kick from the rx context, or, as Maxime said, if the DMA engine can generate
interrupts when the DMA queue is empty, the vhost thread may listen to them
and kick the guest if needed.  This will additionally remove the extra
system call from the fast path.
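
A rough sketch of that split, assuming the usual per-virtqueue callfd eventfd that the vhost backend holds; the struct and helper names are illustrative only:

#include <stdbool.h>
#include <sys/eventfd.h>

struct vq_kick_state {
    int callfd;           /* per-virtqueue eventfd received from the front-end */
    bool kick_pending;    /* descriptors published, guest not yet notified */
    bool notify_enabled;  /* guest is in interrupt (non-polling) mode */
};

/* Fast path: after completions publish descriptors, only note that a
 * notification is owed -- no system call here. */
static inline void
vq_defer_kick(struct vq_kick_state *vq)
{
    vq->kick_pending = true;
}

/* Outside the fast path (rx context, or a vhost thread woken by a DMA
 * "queue empty" interrupt): issue the real kick if the guest wants one. */
static void
vq_flush_kick(struct vq_kick_state *vq)
{
    if (vq->kick_pending && vq->notify_enabled)
        eventfd_write(vq->callfd, 1);  /* the system call, moved off the fast path */
    vq->kick_pending = false;
}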

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 17:03             ` Bruce Richardson
@ 2022-03-29 17:13               ` Morten Brørup
  2022-03-29 17:45                 ` Ilya Maximets
  2022-03-29 17:46                 ` Van Haaren, Harry
  0 siblings, 2 replies; 58+ messages in thread
From: Morten Brørup @ 2022-03-29 17:13 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Maxime Coquelin, Van Haaren, Harry, Pai G, Sunil, Stokes, Ian,
	Hu, Jiayu, Ferriter, Cian, Ilya Maximets, ovs-dev, dev, Mcnamara,
	John, O'Driscoll, Tim, Finn, Emma

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 29 March 2022 19.03
> 
> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > > Sent: Tuesday, 29 March 2022 18.24
> > >
> > > Hi Morten,
> > >
> > > On 3/29/22 16:44, Morten Brørup wrote:
> > > >> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > > >> Sent: Tuesday, 29 March 2022 15.02
> > > >>
> > > >>> From: Morten Brørup <mb@smartsharesystems.com>
> > > >>> Sent: Tuesday, March 29, 2022 1:51 PM
> > > >>>
> > > >>> Having thought more about it, I think that a completely
> different
> > > architectural approach is required:
> > > >>>
> > > >>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
> > > packet burst functions, each optimized for different CPU vector
> > > instruction sets. The availability of a DMA engine should be
> treated
> > > the same way. So I suggest that PMDs copying packet contents, e.g.
> > > memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> packet
> > > burst functions.
> > > >>>
> > > >>> Similarly for the DPDK vhost library.
> > > >>>
> > > >>> In such an architecture, it would be the application's job to
> > > allocate DMA channels and assign them to the specific PMDs that
> should
> > > use them. But the actual use of the DMA channels would move down
> below
> > > the application and into the DPDK PMDs and libraries.
> > > >>>
> > > >>>
> > > >>> Med venlig hilsen / Kind regards,
> > > >>> -Morten Brørup
> > > >>
> > > >> Hi Morten,
> > > >>
> > > >> That's *exactly* how this architecture is designed &
> implemented.
> > > >> 1.	The DMA configuration and initialization is up to the
> application
> > > (OVS).
> > > >> 2.	The VHost library is passed the DMA-dev ID, and its new
> async
> > > rx/tx APIs, and uses the DMA device to accelerate the copy.
> > > >>
> > > >> Looking forward to talking on the call that just started.
> Regards, -
> > > Harry
> > > >>
> > > >
> > > > OK, thanks - as I said on the call, I haven't looked at the
> patches.
> > > >
> > > > Then, I suppose that the TX completions can be handled in the TX
> > > function, and the RX completions can be handled in the RX function,
> > > just like the Ethdev PMDs handle packet descriptors:
> > > >
> > > > TX_Burst(tx_packet_array):
> > > > 1.	Clean up descriptors processed by the NIC chip. --> Process
> TX
> > > DMA channel completions. (Effectively, the 2nd pipeline stage.)
> > > > 2.	Pass on the tx_packet_array to the NIC chip descriptors. --
> > Pass
> > > on the tx_packet_array to the TX DMA channel. (Effectively, the 1st
> > > pipeline stage.)
> > >
> > > The problem is Tx function might not be called again, so enqueued
> > > packets in 2. may never be completed from a Virtio point of view.
> IOW,
> > > the packets will be copied to the Virtio descriptors buffers, but
> the
> > > descriptors will not be made available to the Virtio driver.
> >
> > In that case, the application needs to call TX_Burst() periodically
> with an empty array, for completion purposes.
> >
> > Or some sort of TX_Keepalive() function can be added to the DPDK
> library, to handle DMA completion. It might even handle multiple DMA
> channels, if convenient - and if possible without locking or other
> weird complexity.
> >
> > Here is another idea, inspired by a presentation at one of the DPDK
> Userspace conferences. It may be wishful thinking, though:
> >
> > Add an additional transaction to each DMA burst; a special
> transaction containing the memory write operation that makes the
> descriptors available to the Virtio driver.
> >
> 
> That is something that can work, so long as the receiver is operating
> in
> polling mode. For cases where virtio interrupts are enabled, you still
> need
> to do a write to the eventfd in the kernel in vhost to signal the
> virtio
> side. That's not something that can be offloaded to a DMA engine,
> sadly, so
> we still need some form of completion call.

I guess that virtio interrupt mode is the most widely deployed scenario, so let's ignore the DMA TX completion transaction for now - and call it a possible future optimization for specific use cases. So it seems that some form of completion call is unavoidable.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 16:45           ` Morten Brørup
@ 2022-03-29 17:03             ` Bruce Richardson
  2022-03-29 17:13               ` Morten Brørup
  0 siblings, 1 reply; 58+ messages in thread
From: Bruce Richardson @ 2022-03-29 17:03 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Maxime Coquelin, Van Haaren, Harry, Pai G, Sunil, Stokes, Ian,
	Hu, Jiayu, Ferriter, Cian, Ilya Maximets, ovs-dev, dev, Mcnamara,
	John, O'Driscoll, Tim, Finn, Emma

On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > Sent: Tuesday, 29 March 2022 18.24
> > 
> > Hi Morten,
> > 
> > On 3/29/22 16:44, Morten Brørup wrote:
> > >> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > >> Sent: Tuesday, 29 March 2022 15.02
> > >>
> > >>> From: Morten Brørup <mb@smartsharesystems.com>
> > >>> Sent: Tuesday, March 29, 2022 1:51 PM
> > >>>
> > >>> Having thought more about it, I think that a completely different
> > architectural approach is required:
> > >>>
> > >>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
> > packet burst functions, each optimized for different CPU vector
> > instruction sets. The availability of a DMA engine should be treated
> > the same way. So I suggest that PMDs copying packet contents, e.g.
> > memif, pcap, vmxnet3, should implement DMA optimized RX and TX packet
> > burst functions.
> > >>>
> > >>> Similarly for the DPDK vhost library.
> > >>>
> > >>> In such an architecture, it would be the application's job to
> > allocate DMA channels and assign them to the specific PMDs that should
> > use them. But the actual use of the DMA channels would move down below
> > the application and into the DPDK PMDs and libraries.
> > >>>
> > >>>
> > >>> Med venlig hilsen / Kind regards,
> > >>> -Morten Brørup
> > >>
> > >> Hi Morten,
> > >>
> > >> That's *exactly* how this architecture is designed & implemented.
> > >> 1.	The DMA configuration and initialization is up to the application
> > (OVS).
> > >> 2.	The VHost library is passed the DMA-dev ID, and its new async
> > rx/tx APIs, and uses the DMA device to accelerate the copy.
> > >>
> > >> Looking forward to talking on the call that just started. Regards, -
> > Harry
> > >>
> > >
> > > OK, thanks - as I said on the call, I haven't looked at the patches.
> > >
> > > Then, I suppose that the TX completions can be handled in the TX
> > function, and the RX completions can be handled in the RX function,
> > just like the Ethdev PMDs handle packet descriptors:
> > >
> > > TX_Burst(tx_packet_array):
> > > 1.	Clean up descriptors processed by the NIC chip. --> Process TX
> > DMA channel completions. (Effectively, the 2nd pipeline stage.)
> > > 2.	Pass on the tx_packet_array to the NIC chip descriptors. --> Pass
> > on the tx_packet_array to the TX DMA channel. (Effectively, the 1st
> > pipeline stage.)
> > 
> > The problem is Tx function might not be called again, so enqueued
> > packets in 2. may never be completed from a Virtio point of view. IOW,
> > the packets will be copied to the Virtio descriptors buffers, but the
> > descriptors will not be made available to the Virtio driver.
> 
> In that case, the application needs to call TX_Burst() periodically with an empty array, for completion purposes.
> 
> Or some sort of TX_Keepalive() function can be added to the DPDK library, to handle DMA completion. It might even handle multiple DMA channels, if convenient - and if possible without locking or other weird complexity.
> 
> Here is another idea, inspired by a presentation at one of the DPDK Userspace conferences. It may be wishful thinking, though:
> 
> Add an additional transaction to each DMA burst; a special transaction containing the memory write operation that makes the descriptors available to the Virtio driver.
> 

That is something that can work, so long as the receiver is operating in
polling mode. For cases where virtio interrupts are enabled, you still need
to do a write to the eventfd in the kernel in vhost to signal the virtio
side. That's not something that can be offloaded to a DMA engine, sadly, so
we still need some form of completion call.

/Bruce

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 16:24         ` Maxime Coquelin
@ 2022-03-29 16:45           ` Morten Brørup
  2022-03-29 17:03             ` Bruce Richardson
  0 siblings, 1 reply; 58+ messages in thread
From: Morten Brørup @ 2022-03-29 16:45 UTC (permalink / raw)
  To: Maxime Coquelin, Van Haaren, Harry, Pai G, Sunil, Stokes, Ian,
	Hu, Jiayu, Ferriter, Cian, Ilya Maximets, ovs-dev, dev
  Cc: Mcnamara, John, O'Driscoll, Tim, Finn, Emma

> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Tuesday, 29 March 2022 18.24
> 
> Hi Morten,
> 
> On 3/29/22 16:44, Morten Brørup wrote:
> >> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> >> Sent: Tuesday, 29 March 2022 15.02
> >>
> >>> From: Morten Brørup <mb@smartsharesystems.com>
> >>> Sent: Tuesday, March 29, 2022 1:51 PM
> >>>
> >>> Having thought more about it, I think that a completely different
> architectural approach is required:
> >>>
> >>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
> packet burst functions, each optimized for different CPU vector
> instruction sets. The availability of a DMA engine should be treated
> the same way. So I suggest that PMDs copying packet contents, e.g.
> memif, pcap, vmxnet3, should implement DMA optimized RX and TX packet
> burst functions.
> >>>
> >>> Similarly for the DPDK vhost library.
> >>>
> >>> In such an architecture, it would be the application's job to
> allocate DMA channels and assign them to the specific PMDs that should
> use them. But the actual use of the DMA channels would move down below
> the application and into the DPDK PMDs and libraries.
> >>>
> >>>
> >>> Med venlig hilsen / Kind regards,
> >>> -Morten Brørup
> >>
> >> Hi Morten,
> >>
> >> That's *exactly* how this architecture is designed & implemented.
> >> 1.	The DMA configuration and initialization is up to the application
> (OVS).
> >> 2.	The VHost library is passed the DMA-dev ID, and its new async
> rx/tx APIs, and uses the DMA device to accelerate the copy.
> >>
> >> Looking forward to talking on the call that just started. Regards, -
> Harry
> >>
> >
> > OK, thanks - as I said on the call, I haven't looked at the patches.
> >
> > Then, I suppose that the TX completions can be handled in the TX
> function, and the RX completions can be handled in the RX function,
> just like the Ethdev PMDs handle packet descriptors:
> >
> > TX_Burst(tx_packet_array):
> > 1.	Clean up descriptors processed by the NIC chip. --> Process TX
> DMA channel completions. (Effectively, the 2nd pipeline stage.)
> > 2.	Pass on the tx_packet_array to the NIC chip descriptors. --> Pass
> on the tx_packet_array to the TX DMA channel. (Effectively, the 1st
> pipeline stage.)
> 
> The problem is Tx function might not be called again, so enqueued
> packets in 2. may never be completed from a Virtio point of view. IOW,
> the packets will be copied to the Virtio descriptors buffers, but the
> descriptors will not be made available to the Virtio driver.

In that case, the application needs to call TX_Burst() periodically with an empty array, for completion purposes.

Or some sort of TX_Keepalive() function can be added to the DPDK library, to handle DMA completion. It might even handle multiple DMA channels, if convenient - and if possible without locking or other weird complexity.

Here is another idea, inspired by a presentation at one of the DPDK Userspace conferences. It may be wishful thinking, though:

Add an additional transaction to each DMA burst; a special transaction containing the memory write operation that makes the descriptors available to the Virtio driver.
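
For a split virtqueue, that special transaction could be a small fenced copy that overwrites the used ring's index once the payload copies are done. A sketch using the DPDK dmadev API; the staging buffer and iova arguments are assumptions for illustration, and (as discussed elsewhere in this thread) this only covers a guest that is polling:

#include <rte_dmadev.h>

/* Enqueue the "publish" transaction after the payload copies of the burst.
 * RTE_DMA_OP_FLAG_FENCE orders it after all previously enqueued copies;
 * RTE_DMA_OP_FLAG_SUBMIT rings the doorbell for the whole burst. */
static int
dma_publish_used_idx(int16_t dma_id, uint16_t vchan,
                     rte_iova_t staged_idx_iova, /* host buffer holding the new used->idx value */
                     rte_iova_t used_idx_iova)   /* guest-visible used ring index field */
{
    int ret = rte_dma_copy(dma_id, vchan,
                           staged_idx_iova, used_idx_iova,
                           sizeof(uint16_t),
                           RTE_DMA_OP_FLAG_FENCE | RTE_DMA_OP_FLAG_SUBMIT);

    return ret < 0 ? ret : 0; /* rte_dma_copy() returns the job index on success */
}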

> >
> > RX_burst(rx_packet_array):
> > 1.	Pass on the finished NIC chip RX descriptors to the
> rx_packet_array. --> Process RX DMA channel completions. (Effectively,
> the 2nd pipeline stage.)
> > 2.	Replenish NIC chip RX descriptors. --> Start RX DMA channel for
> any waiting packets. (Effectively, the 1st pipeline stage.)
> >
> > PMD_init():
> > -	Prepare NIC chip RX descriptors. (In other words: Replenish NIC
> chip RX descriptors. = RX pipeline stage 1.)
> >
> > PS: Rearranged the email, so we can avoid top posting.
> >
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 14:44       ` Morten Brørup
@ 2022-03-29 16:24         ` Maxime Coquelin
  2022-03-29 16:45           ` Morten Brørup
  0 siblings, 1 reply; 58+ messages in thread
From: Maxime Coquelin @ 2022-03-29 16:24 UTC (permalink / raw)
  To: Morten Brørup, Van Haaren, Harry, Pai G, Sunil, Stokes, Ian,
	Hu, Jiayu, Ferriter, Cian, Ilya Maximets, ovs-dev, dev
  Cc: Mcnamara, John, O'Driscoll, Tim, Finn, Emma

Hi Morten,

On 3/29/22 16:44, Morten Brørup wrote:
>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>> Sent: Tuesday, 29 March 2022 15.02
>>
>>> From: Morten Brørup <mb@smartsharesystems.com>
>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>
>>> Having thought more about it, I think that a completely different architectural approach is required:
>>>
>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX packet burst functions, each optimized for different CPU vector instruction sets. The availability of a DMA engine should be treated the same way. So I suggest that PMDs copying packet contents, e.g. memif, pcap, vmxnet3, should implement DMA optimized RX and TX packet burst functions.
>>>
>>> Similarly for the DPDK vhost library.
>>>
>>> In such an architecture, it would be the application's job to allocate DMA channels and assign them to the specific PMDs that should use them. But the actual use of the DMA channels would move down below the application and into the DPDK PMDs and libraries.
>>>
>>>
>>> Med venlig hilsen / Kind regards,
>>> -Morten Brørup
>>
>> Hi Morten,
>>
>> That's *exactly* how this architecture is designed & implemented.
>> 1.	The DMA configuration and initialization is up to the application (OVS).
>> 2.	The VHost library is passed the DMA-dev ID, and its new async rx/tx APIs, and uses the DMA device to accelerate the copy.
>>
>> Looking forward to talking on the call that just started. Regards, -Harry
>>
> 
> OK, thanks - as I said on the call, I haven't looked at the patches.
> 
> Then, I suppose that the TX completions can be handled in the TX function, and the RX completions can be handled in the RX function, just like the Ethdev PMDs handle packet descriptors:
> 
> TX_Burst(tx_packet_array):
> 1.	Clean up descriptors processed by the NIC chip. --> Process TX DMA channel completions. (Effectively, the 2nd pipeline stage.)
> 2.	Pass on the tx_packet_array to the NIC chip descriptors. --> Pass on the tx_packet_array to the TX DMA channel. (Effectively, the 1st pipeline stage.)

The problem is that the Tx function might not be called again, so packets
enqueued in step 2 may never be completed from a Virtio point of view. IOW,
the packets will be copied into the Virtio descriptors' buffers, but the
descriptors will not be made available to the Virtio driver.

> 
> RX_burst(rx_packet_array):
> 1.	Pass on the finished NIC chip RX descriptors to the rx_packet_array. --> Process RX DMA channel completions. (Effectively, the 2nd pipeline stage.)
> 2.	Replenish NIC chip RX descriptors. --> Start RX DMA channel for any waiting packets. (Effectively, the 1st pipeline stage.)
> 
> PMD_init():
> -	Prepare NIC chip RX descriptors. (In other words: Replenish NIC chip RX descriptors. = RX pipeline stage 1.)
> 
> PS: Rearranged the email, so we can avoid top posting.
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 13:01     ` Van Haaren, Harry
@ 2022-03-29 14:44       ` Morten Brørup
  2022-03-29 16:24         ` Maxime Coquelin
  0 siblings, 1 reply; 58+ messages in thread
From: Morten Brørup @ 2022-03-29 14:44 UTC (permalink / raw)
  To: Van Haaren, Harry, Pai G, Sunil, Stokes, Ian, Hu, Jiayu,
	Ferriter, Cian, Ilya Maximets, maxime.coquelin, ovs-dev, dev
  Cc: Mcnamara, John, O'Driscoll, Tim, Finn, Emma

> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com] 
> Sent: Tuesday, 29 March 2022 15.02
> 
> > From: Morten Brørup <mb@smartsharesystems.com> 
> > Sent: Tuesday, March 29, 2022 1:51 PM
> > 
> > Having thought more about it, I think that a completely different architectural approach is required:
> > 
> > Many of the DPDK Ethernet PMDs implement a variety of RX and TX packet burst functions, each optimized for different CPU vector instruction sets. The availability of a DMA engine should be treated the same way. So I suggest that PMDs copying packet contents, e.g. memif, pcap, vmxnet3, should implement DMA optimized RX and TX packet burst functions.
> > 
> > Similarly for the DPDK vhost library.
> > 
> > In such an architecture, it would be the application's job to allocate DMA channels and assign them to the specific PMDs that should use them. But the actual use of the DMA channels would move down below the application and into the DPDK PMDs and libraries.
> > 
> > 
> > Med venlig hilsen / Kind regards,
> > -Morten Brørup
> 
> Hi Morten,
> 
> That's *exactly* how this architecture is designed & implemented.
> 1.	The DMA configuration and initialization is up to the application (OVS).
> 2.	The VHost library is passed the DMA-dev ID, and its new async rx/tx APIs, and uses the DMA device to accelerate the copy.
> 
> Looking forward to talking on the call that just started. Regards, -Harry
> 

OK, thanks - as I said on the call, I haven't looked at the patches.

Then, I suppose that the TX completions can be handled in the TX function, and the RX completions can be handled in the RX function, just like the Ethdev PMDs handle packet descriptors:

TX_Burst(tx_packet_array):
1.	Clean up descriptors processed by the NIC chip. --> Process TX DMA channel completions. (Effectively, the 2nd pipeline stage.)
2.	Pass on the tx_packet_array to the NIC chip descriptors. --> Pass on the tx_packet_array to the TX DMA channel. (Effectively, the 1st pipeline stage.)

RX_burst(rx_packet_array):
1.	Pass on the finished NIC chip RX descriptors to the rx_packet_array. --> Process RX DMA channel completions. (Effectively, the 2nd pipeline stage.)
2.	Replenish NIC chip RX descriptors. --> Start RX DMA channel for any waiting packets. (Effectively, the 1st pipeline stage.)

PMD_init():
-	Prepare NIC chip RX descriptors. (In other words: Replenish NIC chip RX descriptors. = RX pipeline stage 1.)

PS: Rearranged the email, so we can avoid top posting.
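
A compact sketch of that two-stage TX/RX structure, with illustrative helper names standing in for driver-internal routines (none of these are real PMD or vhost symbols):

#include <stdint.h>

struct rte_mbuf;

/* Hypothetical driver-internal helpers. */
void     dma_tx_process_completions(void *txq);
uint16_t dma_tx_enqueue_copies(void *txq, struct rte_mbuf **pkts, uint16_t n);
uint16_t dma_rx_process_completions(void *rxq, struct rte_mbuf **pkts, uint16_t n);
void     dma_rx_enqueue_copies(void *rxq);

uint16_t
dma_tx_burst(void *txq, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
{
    /* 2nd pipeline stage: reap TX DMA completions from earlier calls,
     * freeing mbufs / publishing descriptors for finished copies. */
    dma_tx_process_completions(txq);

    /* 1st pipeline stage: hand the new packets to the TX DMA channel. */
    return dma_tx_enqueue_copies(txq, tx_pkts, nb_pkts);
}

uint16_t
dma_rx_burst(void *rxq, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
    /* 2nd pipeline stage: collect packets whose RX DMA copies completed. */
    uint16_t nb_rx = dma_rx_process_completions(rxq, rx_pkts, nb_pkts);

    /* 1st pipeline stage: start RX DMA copies for newly arrived packets. */
    dma_rx_enqueue_copies(rxq);

    return nb_rx;
}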


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-29 12:51   ` Morten Brørup
@ 2022-03-29 13:01     ` Van Haaren, Harry
  2022-03-29 14:44       ` Morten Brørup
  0 siblings, 1 reply; 58+ messages in thread
From: Van Haaren, Harry @ 2022-03-29 13:01 UTC (permalink / raw)
  To: Morten Brørup, Pai G, Sunil, Stokes, Ian, Hu, Jiayu,
	Ferriter, Cian, Ilya Maximets, maxime.coquelin, ovs-dev, dev
  Cc: Mcnamara, John, O'Driscoll, Tim, Finn, Emma

[-- Attachment #1: Type: text/plain, Size: 4323 bytes --]

Hi Morten,

That's *exactly* how this architecture is designed & implemented.

  1.  The DMA configuration and initialization is up to the application (OVS).
  2.  The VHost library is passed the DMA-dev ID, and its new async rx/tx APIs, and uses the DMA device to accelerate the copy.
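
A minimal sketch of step 1, i.e. the application-side dmadev bring-up, using the standard DPDK dmadev calls; device discovery and the vhost-side hand-off of the dev_id (step 2) are left out, and the channel count and ring depth below are illustrative choices only:

#include <rte_dmadev.h>

static int
app_setup_dma_channel(int16_t dev_id)
{
    const struct rte_dma_conf dev_conf = {
        .nb_vchans = 1,              /* one virtual channel for this queue */
    };
    const struct rte_dma_vchan_conf vchan_conf = {
        .direction = RTE_DMA_DIR_MEM_TO_MEM,
        .nb_desc = 1024,             /* descriptor ring depth (illustrative) */
    };
    int ret;

    ret = rte_dma_configure(dev_id, &dev_conf);
    if (ret != 0)
        return ret;

    ret = rte_dma_vchan_setup(dev_id, 0, &vchan_conf);
    if (ret != 0)
        return ret;

    /* After this, dev_id (and vchan 0) is what the application hands down
     * to the vhost library's async rx/tx path, per step 2 above. */
    return rte_dma_start(dev_id);
}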

Looking forward to talking on the call that just started. Regards, -Harry


From: Morten Brørup <mb@smartsharesystems.com>
Sent: Tuesday, March 29, 2022 1:51 PM
To: Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; Van Haaren, Harry <harry.van.haaren@intel.com>; Ilya Maximets <i.maximets@ovn.org>; maxime.coquelin@redhat.com; ovs-dev@openvswitch.org; dev@dpdk.org
Cc: Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
Subject: RE: OVS DPDK DMA-Dev library/Design Discussion

Having thought more about it, I think that a completely different architectural approach is required:

Many of the DPDK Ethernet PMDs implement a variety of RX and TX packet burst functions, each optimized for different CPU vector instruction sets. The availability of a DMA engine should be treated the same way. So I suggest that PMDs copying packet contents, e.g. memif, pcap, vmxnet3, should implement DMA optimized RX and TX packet burst functions.

Similarly for the DPDK vhost library.

In such an architecture, it would be the application's job to allocate DMA channels and assign them to the specific PMDs that should use them. But the actual use of the DMA channels would move down below the application and into the DPDK PMDs and libraries.


Med venlig hilsen / Kind regards,
-Morten Brørup

From: Pai G, Sunil [mailto:sunil.pai.g@intel.com]
Sent: Monday, 28 March 2022 20.19
To: Stokes, Ian; Hu, Jiayu; Ferriter, Cian; Van Haaren, Harry; Ilya Maximets; Maxime Coquelin (maxime.coquelin@redhat.com); ovs-dev@openvswitch.org; dev@dpdk.org
Cc: Mcnamara, John; O'Driscoll, Tim; Finn, Emma
Subject: RE: OVS DPDK DMA-Dev library/Design Discussion

Hi All,

Please see below PDF which will be presented in the call.
https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/blob/main/OVS%20vhost%20async%20datapath%20design%202022%20session%202.pdf

Thanks and Regards,
Sunil


-----Original Appointment-----
From: Stokes, Ian <ian.stokes@intel.com>
Sent: Thursday, March 24, 2022 9:07 PM
To: Pai G, Sunil; Hu, Jiayu; Ferriter, Cian; Van Haaren, Harry; Ilya Maximets; Maxime Coquelin (maxime.coquelin@redhat.com); ovs-dev@openvswitch.org; dev@dpdk.org
Cc: Mcnamara, John; O'Driscoll, Tim; Finn, Emma
Subject: OVS DPDK DMA-Dev library/Design Discussion
When: Tuesday, March 29, 2022 2:00 PM-3:00 PM (UTC+00:00) Dublin, Edinburgh, Lisbon, London.
Where: Google Meet


Hi All,

This meeting is a follow up to the call earlier this week.

This week Sunil presented 3 different approaches to integrating DMA-Dev with OVS along with the performance impacts.

https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/blob/main/OVS%20vhost%20async%20datapath%20design%202022.pdf

The approaches were as follows:

·        Defer work.
·        Tx completions from Rx context.
·        Tx completions from Rx context + lockless ring.

The pros and cons of each approach were discussed but there was no clear solution reached.

As such a follow up call was suggested to continue discussion and to reach a clear decision on the approach to take.

Please see agenda as it stands below:

Agenda
·        Opens
·        Continue discussion of 3x approaches from last week (Defer work, "V3", V4, links to patches in Sunil's slides above)
·        Design Feedback (please review solutions of above & slide-deck from last week before call to be informed)
·        Dynamic Allocation of DMA engine per queue
·        Code Availability (DPDK GitHub, OVS GitHub branches)

Please feel free to respond with any other items to be added to the agenda.

Google Meet: https://meet.google.com/hme-pygf-bfb

Regards
Ian


[-- Attachment #2: Type: text/html, Size: 28161 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-28 18:19 ` Pai G, Sunil
@ 2022-03-29 12:51   ` Morten Brørup
  2022-03-29 13:01     ` Van Haaren, Harry
  2022-03-30 10:41   ` Ilya Maximets
  1 sibling, 1 reply; 58+ messages in thread
From: Morten Brørup @ 2022-03-29 12:51 UTC (permalink / raw)
  To: Pai G, Sunil, Stokes, Ian, Hu, Jiayu, Ferriter, Cian, Van Haaren,
	Harry, Ilya Maximets, maxime.coquelin, ovs-dev, dev
  Cc: Mcnamara, John, O'Driscoll, Tim, Finn, Emma

[-- Attachment #1: Type: text/plain, Size: 3631 bytes --]

Having thought more about it, I think that a completely different architectural approach is required:

 

Many of the DPDK Ethernet PMDs implement a variety of RX and TX packet burst functions, each optimized for different CPU vector instruction sets. The availability of a DMA engine should be treated the same way. So I suggest that PMDs copying packet contents, e.g. memif, pcap, vmxnet3, should implement DMA optimized RX and TX packet burst functions.

 

Similarly for the DPDK vhost library.

 

In such an architecture, it would be the application's job to allocate DMA channels and assign them to the specific PMDs that should use them. But the actual use of the DMA channels would move down below the application and into the DPDK PMDs and libraries.
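
As a sketch of what that could look like inside a copying PMD, mirroring how vector RX/TX variants are selected today; every name here is hypothetical:

#include <stdint.h>

struct rte_mbuf;

typedef uint16_t (*eth_tx_burst_t)(void *txq, struct rte_mbuf **pkts, uint16_t n);

struct pmd_priv {
    int16_t dma_dev_id;   /* dmadev assigned by the application, or -1 if none */
};

/* Hypothetical TX burst variants: the DMA one offloads the packet copy to
 * the assigned dmadev, the others copy with the CPU. */
uint16_t pmd_tx_burst_dma(void *txq, struct rte_mbuf **pkts, uint16_t n);
uint16_t pmd_tx_burst_avx2(void *txq, struct rte_mbuf **pkts, uint16_t n);
uint16_t pmd_tx_burst_scalar(void *txq, struct rte_mbuf **pkts, uint16_t n);

/* At device-configure time, pick the TX burst function the same way the
 * vector variants are picked, treating DMA as just another variant. */
static eth_tx_burst_t
pmd_select_tx_burst(const struct pmd_priv *priv)
{
    if (priv->dma_dev_id >= 0)
        return pmd_tx_burst_dma;
#ifdef __AVX2__
    return pmd_tx_burst_avx2;
#else
    return pmd_tx_burst_scalar;
#endif
}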

 

 

Med venlig hilsen / Kind regards,

-Morten Brørup

 

From: Pai G, Sunil [mailto:sunil.pai.g@intel.com] 
Sent: Monday, 28 March 2022 20.19
To: Stokes, Ian; Hu, Jiayu; Ferriter, Cian; Van Haaren, Harry; Ilya Maximets; Maxime Coquelin (maxime.coquelin@redhat.com); ovs-dev@openvswitch.org; dev@dpdk.org
Cc: Mcnamara, John; O'Driscoll, Tim; Finn, Emma
Subject: RE: OVS DPDK DMA-Dev library/Design Discussion

 

Hi All, 

 

Please see below PDF which will be presented in the call.

https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/blob/main/OVS%20vhost%20async%20datapath%20design%202022%20session%202.pdf

 

Thanks and Regards,

Sunil

 

 

-----Original Appointment-----
From: Stokes, Ian <ian.stokes@intel.com> 
Sent: Thursday, March 24, 2022 9:07 PM
To: Pai G, Sunil; Hu, Jiayu; Ferriter, Cian; Van Haaren, Harry; Ilya Maximets; Maxime Coquelin (maxime.coquelin@redhat.com); ovs-dev@openvswitch.org; dev@dpdk.org
Cc: Mcnamara, John; O'Driscoll, Tim; Finn, Emma
Subject: OVS DPDK DMA-Dev library/Design Discussion
When: Tuesday, March 29, 2022 2:00 PM-3:00 PM (UTC+00:00) Dublin, Edinburgh, Lisbon, London.
Where: Google Meet

 

 

Hi All,

 

This meeting is a follow up to the call earlier this week.

 

This week Sunil presented 3 different approaches to integrating DMA-Dev with OVS along with the performance impacts.

 

https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/blob/main/OVS%20vhost%20async%20datapath%20design%202022.pdf

 

The approaches were as follows:

 

·         Defer work.

·         Tx completions from Rx context.

·         Tx completions from Rx context + lockless ring.

 

The pros and cons of each approach were discussed but there was no clear solution reached.

 

As such a follow up call was suggested to continue discussion and to reach a clear decision on the approach to take.

 

Please see agenda as it stands below:

 

Agenda

·         Opens

·         Continue discussion of 3x approaches from last week (Defer work, "V3", V4, links to patches in Sunil's slides above)

·         Design Feedback (please review solutions of above & slide-deck from last week before call to be informed)

·         Dynamic Allocation of DMA engine per queue

·         Code Availability (DPDK GitHub, OVS GitHub branches)

 

Please feel free to respond with any other items to be added to the agenda.

 

Google Meet: https://meet.google.com/hme-pygf-bfb <https://meet.google.com/hme-pygf-bfb> 

 

Regards

Ian

 


[-- Attachment #2: Type: text/html, Size: 16526 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
  2022-03-24 15:36 Stokes, Ian
@ 2022-03-28 18:19 ` Pai G, Sunil
  2022-03-29 12:51   ` Morten Brørup
  2022-03-30 10:41   ` Ilya Maximets
  0 siblings, 2 replies; 58+ messages in thread
From: Pai G, Sunil @ 2022-03-28 18:19 UTC (permalink / raw)
  To: Stokes, Ian, Hu, Jiayu, Ferriter, Cian, Van Haaren, Harry,
	Ilya Maximets, Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev
  Cc: Mcnamara, John, O'Driscoll, Tim, Finn, Emma

[-- Attachment #1: Type: text/plain, Size: 2118 bytes --]

Hi All,

Please see below PDF which will be presented in the call.
https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/blob/main/OVS%20vhost%20async%20datapath%20design%202022%20session%202.pdf

Thanks and Regards,
Sunil


      -----Original Appointment-----
      From: Stokes, Ian <ian.stokes@intel.com>
      Sent: Thursday, March 24, 2022 9:07 PM
      To: Pai G, Sunil; Hu, Jiayu; Ferriter, Cian; Van Haaren, Harry; Ilya Maximets; Maxime Coquelin (maxime.coquelin@redhat.com); ovs-dev@openvswitch.org; dev@dpdk.org
      Cc: Mcnamara, John; O'Driscoll, Tim; Finn, Emma
      Subject: OVS DPDK DMA-Dev library/Design Discussion
      When: Tuesday, March 29, 2022 2:00 PM-3:00 PM (UTC+00:00) Dublin, Edinburgh, Lisbon, London.
      Where: Google Meet


      Hi All,

      This meeting is a follow up to the call earlier this week.

      This week Sunil presented 3 different approaches to integrating DMA-Dev with OVS along with the performance impacts.

      https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/blob/main/OVS%20vhost%20async%20datapath%20design%202022.pdf

      The approaches were as follows:

*       Defer work.
*       Tx completions from Rx context.
*       Tx completions from Rx context + lockless ring.

      The pros and cons of each approach were discussed but there was no clear solution reached.

      As such a follow up call was suggested to continue discussion and to reach a clear decision on the approach to take.

      Please see agenda as it stands below:

      Agenda
*       Opens
*       Continue discussion of 3x approaches from last week (Defer work, "V3", V4, links to patches in Sunil's slides above)
*       Design Feedback (please review solutions of above & slide-deck from last week before call to be informed)
*       Dynamic Allocation of DMA engine per queue
*       Code Availability (DPDK GitHub, OVS GitHub branches)

      Please feel free to respond with any other items to be added to the agenda.

      Google Meet: https://meet.google.com/hme-pygf-bfb

      Regards
      Ian


[-- Attachment #2: Type: text/html, Size: 7518 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* OVS DPDK DMA-Dev library/Design Discussion
@ 2022-03-24 15:36 Stokes, Ian
  2022-03-28 18:19 ` Pai G, Sunil
  0 siblings, 1 reply; 58+ messages in thread
From: Stokes, Ian @ 2022-03-24 15:36 UTC (permalink / raw)
  To: Pai G, Sunil, Hu, Jiayu, Ferriter, Cian, Van Haaren, Harry,
	Ilya Maximets, Maxime Coquelin (maxime.coquelin@redhat.com),
	ovs-dev, dev
  Cc: Mcnamara, John, O'Driscoll, Tim, Finn, Emma

[-- Attachment #1: Type: text/plain, Size: 1255 bytes --]

Hi All,

This meeting is a follow up to the call earlier this week.

This week Sunil presented 3 different approaches to integrating DMA-Dev with OVS along with the performance impacts.

https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/blob/main/OVS%20vhost%20async%20datapath%20design%202022.pdf

The approaches were as follows:

*       Defer work.
*       Tx completions from Rx context.
*       Tx completions from Rx context + lockless ring.

The pros and cons of each approach were discussed but there was no clear solution reached.

As such a follow up call was suggested to continue discussion and to reach a clear decision on the approach to take.

Please see agenda as it stands below:

Agenda
*       Opens
*       Continue discussion of 3x approaches from last week (Defer work, "V3", V4, links to patches in Sunil's slides above)
*       Design Feedback (please review solutions of above & slide-deck from last week before call to be informed)
*       Dynamic Allocation of DMA engine per queue
*       Code Availability (DPDK GitHub, OVS GitHub branches)

Please feel free to respond with any other items to be added to the agenda.

Google Meet: https://meet.google.com/hme-pygf-bfb

Regards
Ian


[-- Attachment #2: Type: text/html, Size: 2873 bytes --]

[-- Attachment #3: Type: text/calendar, Size: 4003 bytes --]

BEGIN:VCALENDAR
METHOD:REQUEST
PRODID:Microsoft Exchange Server 2010
VERSION:2.0
BEGIN:VTIMEZONE
TZID:GMT Standard Time
BEGIN:STANDARD
DTSTART:16010101T020000
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
RRULE:FREQ=YEARLY;INTERVAL=1;BYDAY=-1SU;BYMONTH=10
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:16010101T010000
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
RRULE:FREQ=YEARLY;INTERVAL=1;BYDAY=-1SU;BYMONTH=3
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
ORGANIZER;CN="Stokes, Ian":MAILTO:ian.stokes@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Pai G, Sun
 il":MAILTO:sunil.pai.g@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Hu, Jiayu":
 MAILTO:jiayu.hu@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Ferriter, 
 Cian":MAILTO:cian.ferriter@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Van Haaren
 , Harry":MAILTO:harry.van.haaren@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Ilya Maxim
 ets:MAILTO:i.maximets@ovn.org
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Maxime Coq
 uelin (maxime.coquelin@redhat.com):MAILTO:maxime.coquelin@redhat.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=ovs-dev@op
 envswitch.org:MAILTO:ovs-dev@openvswitch.org
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=dev@dpdk.o
 rg:MAILTO:dev@dpdk.org
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Mcnamara, 
 John":MAILTO:john.mcnamara@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="O'Driscoll
 , Tim":MAILTO:tim.odriscoll@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Finn, Emma"
 :MAILTO:emma.finn@intel.com
DESCRIPTION;LANGUAGE=en-US:Hi All\,\n\nThis meeting is a follow up to the c
 all earlier this week.\n\nThis week Sunil presented 3 different approaches
  to integrating DMA-Dev with OVS along with the performance impacts.\n\nht
 tps://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/blob/main/OVS%20v
 host%20async%20datapath%20design%202022.pdf\n\nThe approaches were as foll
 ows:\n\n•       Defer work.\n•       Tx completions from Rx context.\n
 •       Tx completions from Rx context + lockless ring.\n\nThe pros and 
 cons of each approach were discussed but there was no clear solution reach
 ed.\n\nAs such a follow up call was suggested to continue discussion and t
 o reach a clear decision on the approach to take.\n\nPlease see agenda as 
 it stands below:\n\nAgenda\n•       Opens\n•       Continue discussion
  of 3x approaches from last week (Defer work\, “V3”\, V4\, links to pa
 tches in Sunil’s slides above)\n•       Design Feedback (please review
  solutions of above & slide-deck from last week before call to be informed
 )\n•       Dynamic Allocation of DMA engine per queue\n•       Code Av
 ailability (DPDK GitHub\, OVS GitHub branches)\n\nPlease feel free to resp
 ond with any other items to be added to the agenda.\n\nGoogle Meet: https:
 //meet.google.com/hme-pygf-bfb\n\nRegards\nIan\n\n
UID:040000008200E00074C5B7101A82E00800000000F008D4C98B3FD801000000000000000
 010000000E0CDBAAE0858D8498703580CBA3AE406
SUMMARY;LANGUAGE=en-US:OVS DPDK DMA-Dev library/Design Discussion
DTSTART;TZID=GMT Standard Time:20220329T140000
DTEND;TZID=GMT Standard Time:20220329T150000
CLASS:PUBLIC
PRIORITY:5
DTSTAMP:20220324T153630Z
TRANSP:OPAQUE
STATUS:CONFIRMED
SEQUENCE:0
LOCATION;LANGUAGE=en-US:Google Meet
X-MICROSOFT-CDO-APPT-SEQUENCE:0
X-MICROSOFT-CDO-OWNERAPPTID:-1994901530
X-MICROSOFT-CDO-BUSYSTATUS:TENTATIVE
X-MICROSOFT-CDO-INTENDEDSTATUS:BUSY
X-MICROSOFT-CDO-ALLDAYEVENT:FALSE
X-MICROSOFT-CDO-IMPORTANCE:1
X-MICROSOFT-CDO-INSTTYPE:0
X-MICROSOFT-DONOTFORWARDMEETING:FALSE
X-MICROSOFT-DISALLOW-COUNTER:FALSE
BEGIN:VALARM
DESCRIPTION:REMINDER
TRIGGER;RELATED=START:-PT15M
ACTION:DISPLAY
END:VALARM
END:VEVENT
END:VCALENDAR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: OVS DPDK DMA-Dev library/Design Discussion
       [not found] <DM8PR11MB5605B4A5DBD79FFDB4B1C3B2BD0A9@DM8PR11MB5605.namprd11.prod.outlook.com>
@ 2022-03-21 18:23 ` Pai G, Sunil
  0 siblings, 0 replies; 58+ messages in thread
From: Pai G, Sunil @ 2022-03-21 18:23 UTC (permalink / raw)
  To: Stokes, Ian, Ilya Maximets,
	Maxime Coquelin (maxime.coquelin@redhat.com),
	Hu, Jiayu, ovs-dev, dev
  Cc: Kevin Traynor, Flavio Leitner, Mcnamara, John, Van Haaren, Harry,
	Ferriter, Cian, mcoqueli, fleitner, Gooch, Stephen,
	murali.krishna, Nee, Yuan Kuok, Nobuhiro Miki, wanjunjie,
	Raghupatruni, Madhusudana R, Sinai, Asaf, Pei, Andy, liangma,
	u9012063, hemal.shah, Varghese, Vipin, Jean-Philippe Longeray,
	Bent Kuhre, dmarchan, Luse, Paul E, Phelan, Michael, Hunt,
	 David


[-- Attachment #1.1: Type: text/plain, Size: 1486 bytes --]

Hi all,

Please see attached PDF which will be presented in the call.



Thanks and Regards,
Sunil


      -----Original Appointment-----
      From: Stokes, Ian <ian.stokes@intel.com>
      Sent: Tuesday, March 15, 2022 6:47 PM
      To: Stokes, Ian; Ilya Maximets; Maxime Coquelin (maxime.coquelin@redhat.com); Pai G, Sunil; Hu, Jiayu; ovs-dev@openvswitch.org; dev@dpdk.org
      Cc: Kevin Traynor; Flavio Leitner; Mcnamara, John; Van Haaren, Harry; Ferriter, Cian; mcoqueli@redhat.com; fleitner@redhat.com; Gooch, Stephen; murali.krishna@broadcom.com; Nee, Yuan Kuok; Nobuhiro Miki; wanjunjie@bytedance.com; Raghupatruni, Madhusudana R; Sinai, Asaf; Pei, Andy; liangma@bytedance.com; u9012063@gmail.com; hemal.shah@broadcom.com; Varghese, Vipin; Jean-Philippe Longeray; Bent Kuhre; dmarchan@redhat.com; Luse, Paul E; Phelan, Michael; Hunt, David
      Subject: OVS DPDK DMA-Dev library/Design Discussion
      When: Tuesday, March 22, 2022 2:00 PM-3:00 PM (UTC+00:00) Dublin, Edinburgh, Lisbon, London.
      Where: Google Meet


      Hi All,

      The goal of this meeting is to ensure that all of DPDK DMA-dev library, DPDK Vhost library (consuming DMA-dev for acceleration) and OVS (as an end user of the DPDK DMA & VHost libraries) are working well together; and that the maintainers & contributors to those libraries are aware of the design & architecture in OVS consumption.

      https://meet.google.com/hme-pygf-bfb

      Thanks
      Ian



[-- Attachment #1.2: Type: text/html, Size: 4426 bytes --]

[-- Attachment #2: OVS vhost async datapath design 2022.pdf --]
[-- Type: application/pdf, Size: 216799 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* OVS DPDK DMA-Dev library/Design Discussion
@ 2022-03-15 15:48 Stokes, Ian
  0 siblings, 0 replies; 58+ messages in thread
From: Stokes, Ian @ 2022-03-15 15:48 UTC (permalink / raw)
  To: Ilya Maximets, Maxime Coquelin (maxime.coquelin@redhat.com),
	Pai G, Sunil, Hu, Jiayu, ovs-dev, dev
  Cc: Kevin Traynor, Flavio Leitner, Mcnamara, John, Van Haaren, Harry,
	Ferriter, Cian, mcoqueli, fleitner, Gooch, Stephen,
	murali.krishna, Nee, Yuan Kuok, Nobuhiro Miki, wanjunjie,
	Raghupatruni, Madhusudana R, Asaf Sinai, Pei, Andy, liangma,
	u9012063, hemal.shah, Varghese, Vipin, Jean-Philippe Longeray,
	Bent Kuhre, dmarchan, Luse, Paul E, Phelan, Michael, Hunt,
	 David

[-- Attachment #1: Type: text/plain, Size: 403 bytes --]

Hi All,

The goal of this meeting is to ensure that all of DPDK DMA-dev library, DPDK Vhost library (consuming DMA-dev for acceleration) and OVS (as an end user of the DPDK DMA & VHost libraries) are working well together; and that the maintainers & contributors to those libraries are aware of the design & architecture in OVS consumption.

https://meet.google.com/hme-pygf-bfb

Thanks
Ian



[-- Attachment #2: Type: text/html, Size: 1046 bytes --]

[-- Attachment #3: Type: text/calendar, Size: 5443 bytes --]

BEGIN:VCALENDAR
METHOD:REQUEST
PRODID:Microsoft Exchange Server 2010
VERSION:2.0
BEGIN:VTIMEZONE
TZID:GMT Standard Time
BEGIN:STANDARD
DTSTART:16010101T020000
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
RRULE:FREQ=YEARLY;INTERVAL=1;BYDAY=-1SU;BYMONTH=10
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:16010101T010000
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
RRULE:FREQ=YEARLY;INTERVAL=1;BYDAY=-1SU;BYMONTH=3
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
ORGANIZER;CN="Stokes, Ian":MAILTO:ian.stokes@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Ilya Maxim
 ets:MAILTO:i.maximets@redhat.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Maxime Coq
 uelin (maxime.coquelin@redhat.com):MAILTO:maxime.coquelin@redhat.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Pai G, Sun
 il":MAILTO:sunil.pai.g@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Hu, Jiayu":
 MAILTO:jiayu.hu@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=ovs-dev@op
 envswitch.org:MAILTO:ovs-dev@openvswitch.org
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=dev@dpdk.o
 rg:MAILTO:dev@dpdk.org
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Kevin Tray
 nor:MAILTO:ktraynor@redhat.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Flavio Lei
 tner:MAILTO:fbl@redhat.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Mcnamara, 
 John":MAILTO:john.mcnamara@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Van Haaren
 , Harry":MAILTO:harry.van.haaren@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Ferriter, 
 Cian":MAILTO:cian.ferriter@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=mcoqueli@r
 edhat.com:MAILTO:mcoqueli@redhat.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=fleitner@r
 edhat.com:MAILTO:fleitner@redhat.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Gooch, Ste
 phen":MAILTO:stephen.gooch@windriver.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=murali.kri
 shna@broadcom.com:MAILTO:murali.krishna@broadcom.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Nee, Yuan 
 Kuok":MAILTO:yuan.kuok.nee@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Nobuhiro M
 iki:MAILTO:nmiki@yahoo-corp.jp
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=wanjunjie@
 bytedance.com:MAILTO:wanjunjie@bytedance.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Raghupatru
 ni, Madhusudana R":MAILTO:madhu.raghupatruni@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Asaf Sinai
 :MAILTO:AsafSi@Radware.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Pei, Andy":
 MAILTO:andy.pei@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=liangma@by
 tedance.com:MAILTO:liangma@bytedance.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=u9012063@g
 mail.com:MAILTO:u9012063@gmail.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=hemal.shah
 @broadcom.com:MAILTO:hemal.shah@broadcom.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Varghese, 
 Vipin":MAILTO:Vipin.Varghese@amd.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Jean-Phili
 ppe Longeray:MAILTO:Jean-Philippe.Longeray@viavisolutions.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Bent Kuhre
 :MAILTO:bk@napatech.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=dmarchan@r
 edhat.com:MAILTO:dmarchan@redhat.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Luse, Paul
  E":MAILTO:paul.e.luse@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Phelan, Mi
 chael":MAILTO:michael.phelan@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Hunt, Davi
 d":MAILTO:david.hunt@intel.com
DESCRIPTION;LANGUAGE=en-US:Hi All\,\n\nThe goal of this meeting is to ensur
 e that all of DPDK DMA-dev library\, DPDK Vhost library (consuming DMA-dev
  for acceleration) and OVS (as an end user of the DPDK DMA & VHost librari
 es) are working well together\; and that the maintainers & contributors to
  those libraries are aware of the design & architecture in OVS consumption
 .\n\nhttps://meet.google.com/hme-pygf-bfb\n\nThanks\nIan\n\n\n
UID:040000008200E00074C5B7101A82E00800000000906D32F9D233D801000000000000000
 01000000095301FFB80D10A40B54524ACAF1B0BC7
SUMMARY;LANGUAGE=en-US:OVS DPDK DMA-Dev library/Design Discussion
DTSTART;TZID=GMT Standard Time:20220322T140000
DTEND;TZID=GMT Standard Time:20220322T150000
CLASS:PUBLIC
PRIORITY:5
DTSTAMP:20220315T154758Z
TRANSP:OPAQUE
STATUS:CONFIRMED
SEQUENCE:4
LOCATION;LANGUAGE=en-US:Google Meet
X-MICROSOFT-CDO-APPT-SEQUENCE:4
X-MICROSOFT-CDO-OWNERAPPTID:78198758
X-MICROSOFT-CDO-BUSYSTATUS:TENTATIVE
X-MICROSOFT-CDO-INTENDEDSTATUS:BUSY
X-MICROSOFT-CDO-ALLDAYEVENT:FALSE
X-MICROSOFT-CDO-IMPORTANCE:1
X-MICROSOFT-CDO-INSTTYPE:0
X-MICROSOFT-DONOTFORWARDMEETING:FALSE
X-MICROSOFT-DISALLOW-COUNTER:FALSE
BEGIN:VALARM
DESCRIPTION:REMINDER
TRIGGER;RELATED=START:-PT15M
ACTION:DISPLAY
END:VALARM
END:VEVENT
END:VCALENDAR


* OVS DPDK DMA-Dev library/Design Discussion
@ 2022-03-15 13:17 Stokes, Ian
  0 siblings, 0 replies; 58+ messages in thread
From: Stokes, Ian @ 2022-03-15 13:17 UTC (permalink / raw)
  To: Ilya Maximets, Maxime Coquelin (maxime.coquelin@redhat.com),
	Pai G, Sunil, Hu, Jiayu, ovs-dev, dev
  Cc: Kevin Traynor, Flavio Leitner, Mcnamara, John, Van Haaren, Harry,
	Ferriter, Cian, mcoqueli, fleitner, Gooch, Stephen,
	murali.krishna, Nee, Yuan Kuok, Nobuhiro Miki, wanjunjie,
	Raghupatruni, Madhusudana R, Asaf Sinai, Pei, Andy, liangma,
	u9012063, hemal.shah, Varghese, Vipin, Jean-Philippe Longeray,
	Bent Kuhre, dmarchan

[-- Attachment #1: Type: text/plain, Size: 671 bytes --]

Hi All,

We'd like to put a public meeting in place for the stakeholders of DPDK and OVS to discuss the next steps and design of the DMA-dev library, along with its integration in OVS.

There are a few different time zones involved, so we're trying to find the best fit.

Currently the suggestion is 2 PM on Tuesday the 22nd.

https://meet.google.com/hme-pygf-bfb

The plan is for this to be a public meeting that can be shared with both the DPDK and OVS communities. For the moment I've invited the direct stakeholders from both communities as a starting point, since we'd like a time that suits these folks primarily; all are welcome to join the discussion.

Thanks
Ian



[-- Attachment #2: Type: text/html, Size: 1473 bytes --]

[-- Attachment #3: Type: text/calendar, Size: 5385 bytes --]

BEGIN:VCALENDAR
METHOD:REQUEST
PRODID:Microsoft Exchange Server 2010
VERSION:2.0
BEGIN:VTIMEZONE
TZID:GMT Standard Time
BEGIN:STANDARD
DTSTART:16010101T020000
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
RRULE:FREQ=YEARLY;INTERVAL=1;BYDAY=-1SU;BYMONTH=10
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:16010101T010000
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
RRULE:FREQ=YEARLY;INTERVAL=1;BYDAY=-1SU;BYMONTH=3
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
ORGANIZER;CN="Stokes, Ian":MAILTO:ian.stokes@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Ilya Maxim
 ets:MAILTO:i.maximets@redhat.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Maxime Coq
 uelin (maxime.coquelin@redhat.com):MAILTO:maxime.coquelin@redhat.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Pai G, Sun
 il":MAILTO:sunil.pai.g@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Hu, Jiayu":
 MAILTO:jiayu.hu@intel.com
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=ovs-dev@op
 envswitch.org:MAILTO:ovs-dev@openvswitch.org
ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=dev@dpdk.o
 rg:MAILTO:dev@dpdk.org
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Kevin Tray
 nor:MAILTO:ktraynor@redhat.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Flavio Lei
 tner:MAILTO:fbl@redhat.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Mcnamara, 
 John":MAILTO:john.mcnamara@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Van Haaren
 , Harry":MAILTO:harry.van.haaren@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Ferriter, 
 Cian":MAILTO:cian.ferriter@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=mcoqueli@r
 edhat.com:MAILTO:mcoqueli@redhat.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=fleitner@r
 edhat.com:MAILTO:fleitner@redhat.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Gooch, Ste
 phen":MAILTO:stephen.gooch@windriver.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=murali.kri
 shna@broadcom.com:MAILTO:murali.krishna@broadcom.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Nee, Yuan 
 Kuok":MAILTO:yuan.kuok.nee@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Nobuhiro M
 iki:MAILTO:nmiki@yahoo-corp.jp
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=wanjunjie@
 bytedance.com:MAILTO:wanjunjie@bytedance.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Raghupatru
 ni, Madhusudana R":MAILTO:madhu.raghupatruni@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Asaf Sinai
 :MAILTO:AsafSi@Radware.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Pei, Andy":
 MAILTO:andy.pei@intel.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=liangma@by
 tedance.com:MAILTO:liangma@bytedance.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=u9012063@g
 mail.com:MAILTO:u9012063@gmail.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=hemal.shah
 @broadcom.com:MAILTO:hemal.shah@broadcom.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN="Varghese, 
 Vipin":MAILTO:Vipin.Varghese@amd.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Jean-Phili
 ppe Longeray:MAILTO:Jean-Philippe.Longeray@viavisolutions.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Bent Kuhre
 :MAILTO:bk@napatech.com
ATTENDEE;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=dmarchan@r
 edhat.com:MAILTO:dmarchan@redhat.com
DESCRIPTION;LANGUAGE=en-US:Hi All\,\n\nWe’d like to put a public meeting 
 in place for the stakeholders of DPDK and OVS to discuss the next steps an
 d design of the DMA-DEV library along with its integration in OVS.\n\nTher
 e are a few different time zones involved so trying to find a best fit.\n\
 nCurrently the suggestion is 2PM Tuesday the 22nd.\n\nhttps://meet.google.
 com/hme-pygf-bfb\n\nThe plan is for this to be a public meeting that can b
 e shared with both DPDK and OVS communities but for the moment I’ve invi
 ted the direct stakeholders from both communities as a starting point as w
 e’d like a time that suits these folks primarily\, all are welcome to jo
 in the discussion.\n\nThanks\nIan\n\n\n
UID:040000008200E00074C5B7101A82E00800000000906D32F9D233D801000000000000000
 01000000095301FFB80D10A40B54524ACAF1B0BC7
SUMMARY;LANGUAGE=en-US:OVS DPDK DMA-Dev library/Design Discussion
DTSTART;TZID=GMT Standard Time:20220322T140000
DTEND;TZID=GMT Standard Time:20220322T150000
CLASS:PUBLIC
PRIORITY:5
DTSTAMP:20220315T131723Z
TRANSP:OPAQUE
STATUS:CONFIRMED
SEQUENCE:3
LOCATION;LANGUAGE=en-US:Google Meet
X-MICROSOFT-CDO-APPT-SEQUENCE:3
X-MICROSOFT-CDO-OWNERAPPTID:78198758
X-MICROSOFT-CDO-BUSYSTATUS:TENTATIVE
X-MICROSOFT-CDO-INTENDEDSTATUS:BUSY
X-MICROSOFT-CDO-ALLDAYEVENT:FALSE
X-MICROSOFT-CDO-IMPORTANCE:1
X-MICROSOFT-CDO-INSTTYPE:0
X-MICROSOFT-DONOTFORWARDMEETING:FALSE
X-MICROSOFT-DISALLOW-COUNTER:FALSE
BEGIN:VALARM
DESCRIPTION:REMINDER
TRIGGER;RELATED=START:-PT15M
ACTION:DISPLAY
END:VALARM
END:VEVENT
END:VCALENDAR


end of thread, other threads:[~2022-05-24 12:12 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-15 11:15 OVS DPDK DMA-Dev library/Design Discussion Stokes, Ian
2022-03-15 13:17 Stokes, Ian
2022-03-15 15:48 Stokes, Ian
     [not found] <DM8PR11MB5605B4A5DBD79FFDB4B1C3B2BD0A9@DM8PR11MB5605.namprd11.prod.outlook.com>
2022-03-21 18:23 ` Pai G, Sunil
2022-03-24 15:36 Stokes, Ian
2022-03-28 18:19 ` Pai G, Sunil
2022-03-29 12:51   ` Morten Brørup
2022-03-29 13:01     ` Van Haaren, Harry
2022-03-29 14:44       ` Morten Brørup
2022-03-29 16:24         ` Maxime Coquelin
2022-03-29 16:45           ` Morten Brørup
2022-03-29 17:03             ` Bruce Richardson
2022-03-29 17:13               ` Morten Brørup
2022-03-29 17:45                 ` Ilya Maximets
2022-03-29 18:46                   ` Morten Brørup
2022-03-30  2:02                   ` Hu, Jiayu
2022-03-30  9:25                     ` Maxime Coquelin
2022-03-30 10:20                       ` Bruce Richardson
2022-03-30 14:27                       ` Hu, Jiayu
2022-03-29 17:46                 ` Van Haaren, Harry
2022-03-29 19:59                   ` Morten Brørup
2022-03-30  9:01                     ` Van Haaren, Harry
2022-04-07 14:04                       ` Van Haaren, Harry
2022-04-07 14:25                         ` Maxime Coquelin
2022-04-07 14:39                           ` Ilya Maximets
2022-04-07 14:42                             ` Van Haaren, Harry
2022-04-07 15:01                               ` Ilya Maximets
2022-04-07 15:46                                 ` Maxime Coquelin
2022-04-07 16:04                                   ` Bruce Richardson
2022-04-08  7:13                             ` Hu, Jiayu
2022-04-08  8:21                               ` Morten Brørup
2022-04-08  9:57                               ` Ilya Maximets
2022-04-20 15:39                                 ` Mcnamara, John
2022-04-20 16:41                                 ` Mcnamara, John
2022-04-25 21:46                                   ` Ilya Maximets
2022-04-27 14:55                                     ` Mcnamara, John
2022-04-27 20:34                                     ` Bruce Richardson
2022-04-28 12:59                                       ` Ilya Maximets
2022-04-28 13:55                                         ` Bruce Richardson
2022-05-03 19:38                                         ` Van Haaren, Harry
2022-05-10 14:39                                           ` Van Haaren, Harry
2022-05-24 12:12                                           ` Ilya Maximets
2022-03-30 10:41   ` Ilya Maximets
2022-03-30 10:52     ` Ilya Maximets
2022-03-30 11:12       ` Bruce Richardson
2022-03-30 11:41         ` Ilya Maximets
2022-03-30 14:09           ` Bruce Richardson
2022-04-05 11:29             ` Ilya Maximets
2022-04-05 12:07               ` Bruce Richardson
2022-04-08  6:29                 ` Pai G, Sunil
2022-05-13  8:52                   ` fengchengwen
2022-05-13  9:10                     ` Bruce Richardson
2022-05-13  9:48                       ` fengchengwen
2022-05-13 10:34                         ` Bruce Richardson
2022-05-16  9:04                           ` Morten Brørup
     [not found] <DM6PR11MB3227AC0014F321EB901BE385FC199@DM6PR11MB3227.namprd11.prod.outlook.com>
2022-04-21 11:51 ` Mcnamara, John
2022-04-21 14:57 Mcnamara, John
2022-04-25 15:19 Mcnamara, John
