DPDK patches and discussions
From: Maxime Coquelin <maxime.coquelin@redhat.com>
To: "Van Haaren, Harry" <harry.van.haaren@intel.com>,
	"Morten Brørup" <mb@smartsharesystems.com>,
	"Richardson, Bruce" <bruce.richardson@intel.com>
Cc: "Pai G, Sunil" <sunil.pai.g@intel.com>,
	"Stokes, Ian" <ian.stokes@intel.com>,
	"Hu, Jiayu" <jiayu.hu@intel.com>,
	"Ferriter, Cian" <cian.ferriter@intel.com>,
	Ilya Maximets <i.maximets@ovn.org>,
	"ovs-dev@openvswitch.org" <ovs-dev@openvswitch.org>,
	"dev@dpdk.org" <dev@dpdk.org>,
	"Mcnamara, John" <john.mcnamara@intel.com>,
	"O'Driscoll, Tim" <tim.odriscoll@intel.com>,
	"Finn, Emma" <emma.finn@intel.com>
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
Date: Thu, 7 Apr 2022 16:25:48 +0200
Message-ID: <94d817cb-8151-6644-c577-ed8b42d24337@redhat.com>
In-Reply-To: <BN0PR11MB57120B91DC9C2AEAFA61F6F0D7E69@BN0PR11MB5712.namprd11.prod.outlook.com>

Hi Harry,

On 4/7/22 16:04, Van Haaren, Harry wrote:
> Hi OVS & DPDK, Maintainers & Community,
> 
> Top posting overview of discussion as replies to thread become slower:
> perhaps it is a good time to review and plan for next steps?
> 
>  From my perspective, those most vocal in the thread seem to be in favour of the clean
> rx/tx split ("defer work"), with the tradeoff that the application must be aware of handling
> the async DMA completions. If there are any concerns opposing upstreaming of this method,
> please indicate this promptly, and we can continue technical discussions here now.

Weren't there some discussions about handling the Virtio completions with
the DMA engine? With that, we wouldn't need the deferral of work.

Thanks,
Maxime

> In absence of continued technical discussion here, I suggest Sunil and Ian collaborate on getting
> the OVS Defer-work approach, and DPDK VHost Async patchsets available on GitHub for easier
> consumption and future development (as suggested in slides presented on last call).
> 
> Regards, -Harry
> 
> No inline-replies below; message just for context.
> 
>> -----Original Message-----
>> From: Van Haaren, Harry
>> Sent: Wednesday, March 30, 2022 10:02 AM
>> To: Morten Brørup <mb@smartsharesystems.com>; Richardson, Bruce
>> <bruce.richardson@intel.com>
>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>> <Sunil.Pai.G@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
>> <Jiayu.Hu@intel.com>; Ferriter, Cian <Cian.Ferriter@intel.com>; Ilya Maximets
>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
>> John <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
>> Finn, Emma <Emma.Finn@intel.com>
>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>
>>> -----Original Message-----
>>> From: Morten Brørup <mb@smartsharesystems.com>
>>> Sent: Tuesday, March 29, 2022 8:59 PM
>>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Richardson, Bruce
>>> <bruce.richardson@intel.com>
>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Pai G, Sunil
>>> <sunil.pai.g@intel.com>; Stokes, Ian <ian.stokes@intel.com>; Hu, Jiayu
>>> <jiayu.hu@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; Ilya Maximets
>>> <i.maximets@ovn.org>; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara,
>> John
>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>; Finn,
>>> Emma <emma.finn@intel.com>
>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>
>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>> Sent: Tuesday, 29 March 2022 19.46
>>>>
>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
>>>>>
>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>> Sent: Tuesday, 29 March 2022 19.03
>>>>>>
>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>>>>>
>>>>>>>> Hi Morten,
>>>>>>>>
>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>>>>>
>>>>>>>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>>>>>
>>>>>>>>>>> Having thought more about it, I think that a completely different
>>>>>>>>>>> architectural approach is required:
>>>>>>>>>>>
>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
>>>>>>>>>>> packet burst functions, each optimized for different CPU vector
>>>>>>>>>>> instruction sets. The availability of a DMA engine should be treated
>>>>>>>>>>> the same way. So I suggest that PMDs copying packet contents, e.g.
>>>>>>>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX packet
>>>>>>>>>>> burst functions.
>>>>>>>>>>>
>>>>>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>>>>>
>>>>>>>>>>> In such an architecture, it would be the application's job to
>>>>>>>>>>> allocate DMA channels and assign them to the specific PMDs that should
>>>>>>>>>>> use them. But the actual use of the DMA channels would move down below
>>>>>>>>>>> the application and into the DPDK PMDs and libraries.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Med venlig hilsen / Kind regards,
>>>>>>>>>>> -Morten Brørup
>>>>>>>>>>
>>>>>>>>>> Hi Morten,
>>>>>>>>>>
>>>>>>>>>> That's *exactly* how this architecture is designed & implemented.
>>>>>>>>>> 1.	The DMA configuration and initialization is up to the application (OVS).
>>>>>>>>>> 2.	The VHost library is passed the DMA-dev ID, and its new async
>>>>>>>>>> 	rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>>>>>>>>
>>>>>>>>>> Looking forward to talking on the call that just started. Regards, -Harry
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the patches.
>>>>>>>>>
>>>>>>>>> Then, I suppose that the TX completions can be handled in the TX
>>>>>>>>> function, and the RX completions can be handled in the RX function,
>>>>>>>>> just like the Ethdev PMDs handle packet descriptors:
>>>>>>>>>
>>>>>>>>> TX_Burst(tx_packet_array):
>>>>>>>>> 1.	Clean up descriptors processed by the NIC chip. --> Process TX
>>>>>>>>> 	DMA channel completions. (Effectively, the 2nd pipeline stage.)
>>>>>>>>> 2.	Pass on the tx_packet_array to the NIC chip descriptors. --> Pass
>>>>>>>>> 	on the tx_packet_array to the TX DMA channel. (Effectively, the 1st
>>>>>>>>> 	pipeline stage.)
>>>>>>>>
>>>>>>>> The problem is that the Tx function might not be called again, so the
>>>>>>>> packets enqueued in 2. may never be completed from a Virtio point of view.
>>>>>>>> IOW, the packets will be copied to the Virtio descriptors' buffers, but
>>>>>>>> the descriptors will not be made available to the Virtio driver.
>>>>>>>
>>>>>>> In that case, the application needs to call TX_Burst() periodically
>>>>>>> with an empty array, for completion purposes.
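
[Illustration] A minimal sketch of the TX_Burst() ordering discussed above
(process completions first, then enqueue), with the TX DMA channel modelled
by plain counters. All names here (sim_dma_chan, sim_tx_burst, ...) are
hypothetical; nothing below is a DPDK or vhost API. It only shows why the
last burst stays un-reaped until one more call is made, possibly with an
empty array.

```c
#include <assert.h>

/* Hypothetical stand-in for a TX DMA channel (not a DPDK type). */
struct sim_dma_chan {
    unsigned submitted;  /* copies handed to the engine so far */
    unsigned hw_done;    /* copies the engine has finished so far */
    unsigned reaped;     /* completions already processed by software */
};

/* 2nd pipeline stage: process completions of earlier copies.
 * Returns how many packets were newly completed. */
static unsigned reap_completions(struct sim_dma_chan *c)
{
    unsigned n = c->hw_done - c->reaped;
    c->reaped = c->hw_done;
    return n;
}

/* 1st pipeline stage: hand a new burst to the DMA channel. */
static void submit_copies(struct sim_dma_chan *c, unsigned nb_pkts)
{
    c->submitted += nb_pkts;
}

/* Completions first, then enqueue. With nb_pkts == 0 this call only
 * reaps completions, i.e. the "empty array" case. */
static unsigned sim_tx_burst(struct sim_dma_chan *c, unsigned nb_pkts)
{
    unsigned completed = reap_completions(c);
    submit_copies(c, nb_pkts);
    return completed;
}

/* Simulation helper: pretend the hardware finished every submitted copy. */
static void sim_hw_finish_all(struct sim_dma_chan *c)
{
    assert(c->hw_done <= c->submitted);
    c->hw_done = c->submitted;
}
```

In this model, packets submitted by one sim_tx_burst() call are only reported
as completed by a later call, so a port that stops transmitting needs at least
one extra call to flush them.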
>>>>
>>>> This is what the "defer work" does at the OVS thread-level, but instead of
>>>> "brute-forcing" and *always* making the call, the defer work concept tracks
>>>> *when* there is outstanding work (DMA copies) to be completed ("deferred work")
>>>> and calls the generic completion function at that point.
>>>>
>>>> So "defer work" is generic infrastructure at the OVS thread level to handle
>>>> work that needs to be done "later", e.g. DMA completion handling.
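
[Illustration] A rough sketch of the "defer work" idea as described above: a
per-thread list records outstanding work, and the completion callbacks run
only while that list is non-empty. This is an assumption-laden toy, not the
OVS patchset; defer_list, defer_work(), run_deferred() and poll_fake_dma()
are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEFERRED 8  /* illustrative capacity */

/* A deferred work item: returns true when finished, false to be retried. */
typedef bool (*work_fn)(void *arg);

struct deferred {
    work_fn fn;
    void *arg;
};

/* Per-thread list of outstanding work (hypothetical, not an OVS struct). */
struct defer_list {
    struct deferred items[MAX_DEFERRED];
    size_t count;
};

/* Record that work is outstanding, e.g. right after enqueuing DMA copies.
 * Returns false when the list is full. */
static bool defer_work(struct defer_list *l, work_fn fn, void *arg)
{
    if (l->count == MAX_DEFERRED)
        return false;
    l->items[l->count].fn = fn;
    l->items[l->count].arg = arg;
    l->count++;
    return true;
}

/* Run outstanding work; unfinished items stay on the list. With an empty
 * list no callback is invoked, so an idle thread spends no cycles on
 * completion checks. Returns the number of callbacks invoked. */
static size_t run_deferred(struct defer_list *l)
{
    size_t calls = 0, kept = 0;

    for (size_t i = 0; i < l->count; i++) {
        calls++;
        if (!l->items[i].fn(l->items[i].arg))
            l->items[kept++] = l->items[i];
    }
    l->count = kept;
    return calls;
}

/* Example work item: reports completion once its poll budget hits zero. */
static bool poll_fake_dma(void *arg)
{
    unsigned *polls_left = arg;

    if (*polls_left > 0)
        (*polls_left)--;
    return *polls_left == 0;
}
```

A dataplane loop would call run_deferred() once per iteration; the cost when
nothing is outstanding is a single comparison against an empty list.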
>>>>
>>>>
>>>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK
>>>>>>> library, to handle DMA completion. It might even handle multiple DMA
>>>>>>> channels, if convenient - and if possible without locking or other
>>>>>>> weird complexity.
>>>>
>>>> That's exactly how it is done: the VHost library has a new API added, which allows
>>>> for handling completions. And in the "Netdev layer" (~OVS ethdev abstraction)
>>>> we add a function to allow the OVS thread to do those completions in a new
>>>> Netdev-abstraction API called "async_process" where the completions can be checked.
>>>>
>>>> The only method to abstract them is to "hide" them somewhere that will always be
>>>> polled, e.g. an ethdev port's RX function. Both V3 and V4 approaches use this method.
>>>> This allows "completions" to be transparent to the app, at the tradeoff of having bad
>>>> separation of concerns as Rx and Tx are now tied together.
>>>>
>>>> The point is, the Application layer must *somehow* handle completions.
>>>> So fundamentally there are 2 options for the Application level:
>>>>
>>>> A) Make the application periodically call a "handle completions" function
>>>> 	A1) Defer work, call when needed, and track "needed" at app layer, calling into vhost txq complete as required.
>>>> 	        Elegant in that "no work" means "no cycles spent" on checking DMA completions.
>>>> 	A2) Brute-force-always-call, and pay some overhead when not required.
>>>> 	        Cycle-cost in "no work" scenarios. Depending on # of vhost queues, this adds up as polling is required *per vhost txq*.
>>>> 	        Also note that "checking DMA completions" means taking a virtq-lock, so this "brute-force" can needlessly increase x-thread contention!
>>>
>>> A side note: I don't see why locking is required to test for DMA completions.
>>> rte_dma_vchan_status() is lockless, e.g.:
>>>
>>> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L560
>>
>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lockfree manner
>> from a single thread.
>>
>> The locks I refer to are at the OVS-netdev level, as virtq's are shared across OVS's
>> dataplane threads.
>> So the "M to N" comes from M dataplane threads to N virtqs, hence requiring
>> some locking.
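
[Illustration] To make the "M threads to N virtqs" point concrete, here is a
small sketch in which every completion check on a shared virtq has to take
that virtq's lock, even when there is nothing to complete. The sim_virtq type
and its functions are hypothetical, and the non-blocking trylock is an
assumption made for this example rather than something taken from the OVS or
vhost code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical virtqueue shared by several dataplane threads. */
struct sim_virtq {
    atomic_flag lock;     /* guards the two counters below */
    unsigned pending;     /* DMA copies not yet completed */
    unsigned completed;   /* copies whose completion has been processed */
};

static void sim_virtq_init(struct sim_virtq *vq)
{
    atomic_flag_clear(&vq->lock);
    vq->pending = 0;
    vq->completed = 0;
}

/* Try to take the per-virtq lock without spinning. */
static bool sim_virtq_trylock(struct sim_virtq *vq)
{
    return !atomic_flag_test_and_set_explicit(&vq->lock,
                                              memory_order_acquire);
}

static void sim_virtq_unlock(struct sim_virtq *vq)
{
    atomic_flag_clear_explicit(&vq->lock, memory_order_release);
}

/* Check completions on one virtq. Returns -1 if another thread holds the
 * lock, otherwise the number of completions processed (possibly 0). */
static int sim_virtq_poll(struct sim_virtq *vq)
{
    if (!sim_virtq_trylock(vq))
        return -1;

    int n = (int)vq->pending;
    vq->completed += vq->pending;
    vq->pending = 0;
    sim_virtq_unlock(vq);
    return n;
}
```

Note that a poll returning 0 still acquired and released the lock, which is
the overhead an always-poll scheme would pay on every idle virtq.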
>>
>>
>>>> B) Hide completions and live with the complexity/architectural sacrifice of mixed-RxTx.
>>>> 	Various downsides here in my opinion, see the slide deck presented earlier today for a summary.
>>>>
>>>> In my opinion, A1 is the most elegant solution, as it has a clean separation of concerns, does not cause
>>>> avoidable contention on virtq locks, and spends no cycles when there is no completion work to do.
>>>>
>>>
>>> Thank you for elaborating, Harry.
>>
>> Thanks for taking part in the discussion & providing your insight!
>>
>>> I strongly oppose hiding any part of TX processing in an RX function. It is just
>>> wrong in so many ways!
>>>
>>> I agree that A1 is the most elegant solution. And being the most elegant solution, it
>>> is probably also the most future proof solution. :-)
>>
>> I think so too, yes.
>>
>>> I would also like to stress that DMA completion handling belongs in the DPDK
>>> library, not in the application. And yes, the application will be required to call some
>>> "handle DMA completions" function in the DPDK library. But since the application
>>> already knows that it uses DMA, the application should also know that it needs to
>>> call this extra function - so I consider this requirement perfectly acceptable.
>>
>> Agree here.
>>
>>> I prefer if the DPDK vhost library can hide its inner workings from the application,
>>> and just expose the additional "handle completions" function. This also means that
>>> the inner workings can be implemented as "defer work", or by some other
>>> algorithm. And it can be tweaked and optimized later.
>>
>> Yes, the choice of how to call the handle_completions function is made at the
>> Application layer. For OVS we designed Defer Work, V3 and V4. But it is an App
>> level choice, and every application is free to choose its own method.
>>
>>> Thinking about the long term perspective, this design pattern is common for both
>>> the vhost library and other DPDK libraries that could benefit from DMA (e.g.
>>> vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or a
>>> separate library. But for now, we should focus on the vhost use case, and just keep
>>> the long term roadmap for using DMA in mind.
>>
>> Totally agree to keep long term roadmap in mind; but I'm not sure we can refactor
>> logic out of vhost. When DMA-completions arrive, the virtQ needs to be updated;
>> this causes a tight coupling between the DMA completion count, and the vhost library.
>>
>> As Ilya raised on the call yesterday, there is an "in_order" requirement in the vhost
>> library, that per virtq the packets are presented to the guest "in order" of enqueue.
>> (To be clear, *not* order of DMA-completion! As Jiayu mentioned, the Vhost library
>> handles this today by re-ordering the DMA completions.)
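
[Illustration] One possible way to picture the in-order requirement: DMA
copies may finish in any order, but packets are released to the guest only as
a contiguous run starting from the oldest enqueued one. The sketch below is an
assumption about how such re-ordering could be tracked, not a description of
the actual vhost implementation; inorder_ring and its functions are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 8  /* illustrative; a power of two so indices wrap cleanly */

/* Hypothetical per-virtq tracker of in-flight packets, in enqueue order. */
struct inorder_ring {
    bool done[RING_SIZE];  /* done[i]: the DMA copy for slot i has completed */
    unsigned head;         /* oldest packet not yet presented to the guest */
    unsigned tail;         /* next free position */
};

/* Enqueue one packet; returns its slot index, or -1 if the ring is full. */
static int inorder_enqueue(struct inorder_ring *r)
{
    if (r->tail - r->head == RING_SIZE)
        return -1;

    unsigned slot = r->tail % RING_SIZE;
    r->done[slot] = false;
    r->tail++;
    return (int)slot;
}

/* Record a DMA completion, which may arrive out of enqueue order. */
static void inorder_mark_done(struct inorder_ring *r, int slot)
{
    assert(slot >= 0 && slot < RING_SIZE);
    r->done[slot] = true;
}

/* Release the longest run of completed packets starting at head.
 * Returns how many packets can now be presented to the guest. */
static unsigned inorder_release(struct inorder_ring *r)
{
    unsigned n = 0;

    while (r->head != r->tail && r->done[r->head % RING_SIZE]) {
        r->head++;
        n++;
    }
    return n;
}
```

With three packets enqueued, a completion for the third alone releases
nothing; the completion for the first releases one packet, and the second
then releases the remaining two.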
>>
>>
>>> Rephrasing what I said on the conference call: This vhost design will become the
>>> common design pattern for using DMA in DPDK libraries. If we get it wrong, we are
>>> stuck with it.
>>
>> Agree, and if we get it right, then we're stuck with it too! :)
>>
>>
>>>>>>> Here is another idea, inspired by a presentation at one of the DPDK
>>>>>>> Userspace conferences. It may be wishful thinking, though:
>>>>>>>
>>>>>>> Add an additional transaction to each DMA burst; a special
>>>>>>> transaction containing the memory write operation that makes the
>>>>>>> descriptors available to the Virtio driver.
>>>>>>>
>>>>>>
>>>>>> That is something that can work, so long as the receiver is operating in
>>>>>> polling mode. For cases where virtio interrupts are enabled, you still need
>>>>>> to do a write to the eventfd in the kernel in vhost to signal the virtio
>>>>>> side. That's not something that can be offloaded to a DMA engine, sadly, so
>>>>>> we still need some form of completion call.
>>>>>
>>>>> I guess that virtio interrupts are the most widely deployed scenario, so let's ignore
>>>>> the DMA TX completion transaction for now - and call it a possible future
>>>>> optimization for specific use cases. So it seems that some form of completion call
>>>>> is unavoidable.
>>>>
>>>> Agree to leave this aside; there is in theory a potential optimization, but
>>>> it is unlikely to be of large value.
>>>>
>>>
>>> One more thing: When using DMA to pass on packets into a guest, there could be a
>>> delay from when the DMA completes until the guest is signaled. Is there any CPU cache
>>> hotness regarding the guest's access to the packet data to consider here? I.e. if we
>>> wait before signaling the guest, the packet data may get cold.
>>
>> Interesting question; we can likely spawn a new thread around this topic!
>> In short, it depends on how/where the DMA hardware writes the copy.
>>
>> With technologies like DDIO, the "dest" part of the copy will be in LLC. The core reading the
>> dest data will benefit from the LLC locality (instead of snooping it from a remote core's L1/L2).
>>
>> Delays in notifying the guest could result in LLC capacity eviction, yes.
>> The application layer decides how often/promptly to check for completions,
>> and notify the guest of them. Calling the function more often will result in less
>> delay in that portion of the pipeline.
>>
>> Overall, there are caching benefits with DMA acceleration, and the application can control
>> the latency introduced between dma-completion done in HW, and Guest vring update.
> 


