From: Ilya Maximets
Date: Thu, 7 Apr 2022 17:01:44 +0200
To: "Van Haaren, Harry", Maxime Coquelin, Morten Brørup, "Richardson, Bruce"
Cc: i.maximets@ovn.org, "Pai G, Sunil", "Stokes, Ian", "Hu, Jiayu", "Ferriter, Cian", ovs-dev@openvswitch.org, dev@dpdk.org, "Mcnamara, John", "O'Driscoll, Tim", "Finn, Emma"
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion

On 4/7/22 16:42, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Ilya Maximets
>> Sent: Thursday, April 7, 2022 3:40 PM
>> To: Maxime Coquelin; Van Haaren, Harry; Morten Brørup; Richardson, Bruce
>> Cc: i.maximets@ovn.org; Pai G, Sunil; Stokes, Ian; Hu, Jiayu; Ferriter, Cian; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John; O'Driscoll, Tim; Finn, Emma
>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>
>> On 4/7/22 16:25, Maxime Coquelin wrote:
>>> Hi Harry,
>>>
>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
>>>> Hi OVS & DPDK, Maintainers & Community,
>>>>
>>>> Top-posting an overview of the discussion as replies to the thread become slower: perhaps it is a good time to review and plan for next steps?
>>>>
>>>> From my perspective, those most vocal in the thread seem to be in favour of the clean rx/tx split ("defer work"), with the tradeoff that the application must be aware of handling the async DMA completions. If there are any concerns opposing upstreaming of this method, please indicate this promptly, and we can continue technical discussions here now.
>>>
>>> Wasn't there some discussion about handling the Virtio completions with the DMA engine? With that, we wouldn't need the deferral of work.
>>
>> +1
>
> Yes there was; the DMA/virtq completions thread is here for reference:
> https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html
>
> I do not believe that there is a viable path to actually implementing it, and particularly not in the more complex cases, e.g. virtio with guest interrupts enabled.
>
> The thread above mentions additional threads and various other options, none of which I believe to be a clean or workable solution. I'd like input from other folks more familiar with the exact implementations of VHost/vrings, as well as those with DMA engine expertise.

I tend to trust Maxime as a vhost maintainer in such questions. :)

In my own opinion though, the implementation is possible and the concerns don't sound deal-breaking, as solutions for them might work well enough. So I think the viability should be tested out before the solution is disregarded, especially because the decision will shape the API of the vhost library.

>
>> With the virtio completions handled by DMA itself, the vhost port turns almost into a real HW NIC. With that we will not need any extra manipulations from the OVS side, i.e. no need to defer any work while maintaining a clear split between rx and tx operations.
>>
>> I'd vote for that.
>>
>>> Thanks,
>>> Maxime
>
> Thanks for the prompt responses, and let's understand whether there is a viable, workable way to totally hide DMA completions from the application.
>
> Regards, -Harry
>
>>>> In the absence of continued technical discussion here, I suggest Sunil and Ian collaborate on getting the OVS defer-work approach and the DPDK VHost Async patchsets available on GitHub for easier consumption and future development (as suggested in the slides presented on the last call).
>>>>
>>>> Regards, -Harry
>>>>
>>>> No inline replies below; message just for context.
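As a reference point for the defer-work discussion above, here is a minimal sketch of the bookkeeping at the PMD-thread level: only tx queues that actually enqueued DMA copies get polled, so "no work" really does mean "no cycles spent". The names (pmd_defer_list, netdev_async_process) are illustrative only, not the actual symbols from the OVS patches.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-netdev hook: polls the DMA engine and, once the copies
 * are done, flushes the vhost used ring; returns true when nothing is left. */
extern bool netdev_async_process(void *netdev, uint16_t qid);

#define MAX_DEFERRED_TXQS 64

struct deferred_txq {
    void    *netdev;   /* port with outstanding DMA copies */
    uint16_t qid;      /* tx queue with completions still pending */
};

struct pmd_defer_list {
    struct deferred_txq q[MAX_DEFERRED_TXQS];
    unsigned int n;
};

/* Called from the PMD main loop: visits only queues that deferred work,
 * so an idle datapath spends no cycles on completion checks. */
static void
pmd_flush_deferred_work(struct pmd_defer_list *list)
{
    unsigned int i = 0;

    while (i < list->n) {
        if (netdev_async_process(list->q[i].netdev, list->q[i].qid)) {
            list->q[i] = list->q[--list->n];   /* done: drop the entry */
        } else {
            i++;                               /* still in flight */
        }
    }
}

Whether entries are added to that list from the tx path or hidden somewhere else is exactly the design point debated in the quoted history below.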
>>>>
>>>>> -----Original Message-----
>>>>> From: Van Haaren, Harry
>>>>> Sent: Wednesday, March 30, 2022 10:02 AM
>>>>> To: Morten Brørup; Richardson, Bruce
>>>>> Cc: Maxime Coquelin; Pai G, Sunil; Stokes, Ian; Hu, Jiayu; Ferriter, Cian; Ilya Maximets; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John; O'Driscoll, Tim; Finn, Emma
>>>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Morten Brørup
>>>>>> Sent: Tuesday, March 29, 2022 8:59 PM
>>>>>> To: Van Haaren, Harry; Richardson, Bruce
>>>>>> Cc: Maxime Coquelin; Pai G, Sunil; Stokes, Ian; Hu, Jiayu; Ferriter, Cian; Ilya Maximets; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John; O'Driscoll, Tim; Finn, Emma
>>>>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>
>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>> Sent: Tuesday, 29 March 2022 19.46
>>>>>>>
>>>>>>>> From: Morten Brørup
>>>>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
>>>>>>>>
>>>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>>>> Sent: Tuesday, 29 March 2022 19.03
>>>>>>>>>
>>>>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>>>>>>>>
>>>>>>>>>>> Hi Morten,
>>>>>>>>>>>
>>>>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Morten Brørup
>>>>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Having thought more about it, I think that a completely different architectural approach is required:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX packet burst functions, each optimized for different CPU vector instruction sets. The availability of a DMA engine should be treated the same way. So I suggest that PMDs copying packet contents, e.g. memif, pcap, vmxnet3, should implement DMA optimized RX and TX packet burst functions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In such an architecture, it would be the application's job to allocate DMA channels and assign them to the specific PMDs that should use them. But the actual use of the DMA channels would move down below the application and into the DPDK PMDs and libraries.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Med venlig hilsen / Kind regards,
>>>>>>>>>>>>>> -Morten Brørup
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Morten,
>>>>>>>>>>>>>
>>>>>>>>>>>>> That's *exactly* how this architecture is designed & implemented.
>>>>>>>>>>>>> 1. The DMA configuration and initialization is up to the application (OVS).
>>>>>>>>>>>>> 2. The VHost library is passed the DMA-dev ID, and its new async rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking forward to talking on the call that just started. Regards, -Harry
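To make point 1 in Harry's list above concrete, here is a minimal sketch of the application-side dmadev bring-up, using the standard DPDK 21.11+ dmadev calls (error handling trimmed; the hand-off to the vhost async API in point 2 is only indicated in a comment, since that interface is what this thread is still converging on):

#include <rte_dmadev.h>

static int
app_setup_dma_channel(int16_t dev_id)
{
    struct rte_dma_info info;
    struct rte_dma_conf dev_conf = { .nb_vchans = 1 };
    struct rte_dma_vchan_conf vchan_conf = {
        .direction = RTE_DMA_DIR_MEM_TO_MEM,  /* vhost enqueue is a mem-to-mem copy */
        .nb_desc = 1024,
    };

    if (rte_dma_info_get(dev_id, &info) != 0 ||
        rte_dma_configure(dev_id, &dev_conf) != 0 ||
        rte_dma_vchan_setup(dev_id, 0, &vchan_conf) != 0 ||
        rte_dma_start(dev_id) != 0) {
        return -1;
    }

    /* Point 2 (not shown): hand dev_id/vchan 0 to the vhost library via its
     * async registration calls so the enqueue path can use this channel. */
    return 0;
}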
>>>>>>>>>>>>
>>>>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the patches.
>>>>>>>>>>>>
>>>>>>>>>>>> Then, I suppose that the TX completions can be handled in the TX function, and the RX completions can be handled in the RX function, just like the Ethdev PMDs handle packet descriptors:
>>>>>>>>>>>>
>>>>>>>>>>>> TX_Burst(tx_packet_array):
>>>>>>>>>>>> 1. Clean up descriptors processed by the NIC chip. --> Process TX DMA channel completions. (Effectively, the 2nd pipeline stage.)
>>>>>>>>>>>> 2. Pass on the tx_packet_array to the NIC chip descriptors. --> Pass on the tx_packet_array to the TX DMA channel. (Effectively, the 1st pipeline stage.)
>>>>>>>>>>>
>>>>>>>>>>> The problem is the Tx function might not be called again, so packets enqueued in 2. may never be completed from a Virtio point of view. IOW, the packets will be copied to the Virtio descriptors' buffers, but the descriptors will not be made available to the Virtio driver.
>>>>>>>>>>
>>>>>>>>>> In that case, the application needs to call TX_Burst() periodically with an empty array, for completion purposes.
>>>>>>>
>>>>>>> This is what the "defer work" does at the OVS thread level, but instead of "brute-forcing" and *always* making the call, the defer-work concept tracks *when* there is outstanding work (DMA copies) to be completed ("deferred work") and calls the generic completion function at that point.
>>>>>>>
>>>>>>> So "defer work" is generic infrastructure at the OVS thread level to handle work that needs to be done "later", e.g. DMA completion handling.
>>>>>>>
>>>>>>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK library, to handle DMA completion. It might even handle multiple DMA channels, if convenient - and if possible without locking or other weird complexity.
>>>>>>>
>>>>>>> That's exactly how it is done: the VHost library has a new API added which allows for handling completions. And in the "Netdev layer" (~OVS ethdev abstraction) we add a function to allow the OVS thread to do those completions, in a new Netdev-abstraction API called "async_process" where the completions can be checked.
>>>>>>>
>>>>>>> The only method to abstract them is to "hide" them somewhere that will always be polled, e.g. an ethdev port's RX function. Both the V3 and V4 approaches use this method. This allows "completions" to be transparent to the app, at the tradeoff of having bad separation of concerns, as Rx and Tx are now tied together.
>>>>>>>
>>>>>>> The point is, the Application layer must *somehow* handle completions. So fundamentally there are 2 options for the Application level:
>>>>>>>
>>>>>>> A) Make the application periodically call a "handle completions" function.
>>>>>>>    A1) Defer work: call when needed, track "needed" at the app layer, and call into the vhost txq complete as required. Elegant in that "no work" means "no cycles spent" on checking DMA completions.
>>>>>>>    A2) Brute-force-always-call, and pay some overhead when not required. Cycle cost in "no work" scenarios. Depending on the # of vhost queues, this adds up, as polling is required *per vhost txq*. Also note that "checking DMA completions" means taking a virtq lock, so this "brute force" can needlessly increase cross-thread contention!
>>>>>>
>>>>>> A side note: I don't see why locking is required to test for DMA completions. rte_dma_vchan_status() is lockless, e.g.:
>>>>>> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L560
>>>>>
>>>>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lock-free manner from a single thread.
>>>>>
>>>>> The locks I refer to are at the OVS-netdev level, as virtqs are shared across OVS's dataplane threads. So the "M to N" comes from M dataplane threads to N virtqs, hence requiring some locking.
>>>>>
>>>>>>> B) Hide completions and live with the complexity/architectural sacrifice of mixed Rx/Tx. Various downsides here in my opinion; see the slide deck presented earlier today for a summary.
>>>>>>>
>>>>>>> In my opinion, A1 is the most elegant solution, as it has a clean separation of concerns, does not cause avoidable contention on virtq locks, and spends no cycles when there is no completion work to do.
>>>>>>
>>>>>> Thank you for elaborating, Harry.
>>>>>
>>>>> Thanks for partaking in the discussion & providing your insight!
>>>>>
>>>>>> I strongly oppose hiding any part of TX processing in an RX function. It is just wrong in so many ways!
>>>>>>
>>>>>> I agree that A1 is the most elegant solution. And being the most elegant solution, it is probably also the most future-proof solution. :-)
>>>>>
>>>>> I think so too, yes.
>>>>>
>>>>>> I would also like to stress that DMA completion handling belongs in the DPDK library, not in the application. And yes, the application will be required to call some "handle DMA completions" function in the DPDK library. But since the application already knows that it uses DMA, the application should also know that it needs to call this extra function - so I consider this requirement perfectly acceptable.
>>>>>
>>>>> Agree here.
>>>>>
>>>>>> I prefer if the DPDK vhost library can hide its inner workings from the application, and just expose the additional "handle completions" function. This also means that the inner workings can be implemented as "defer work", or by some other algorithm. And it can be tweaked and optimized later.
>>>>>
>>>>> Yes, the choice of how to call the handle_completions function is an Application-layer one. For OVS we designed Defer Work, V3 and V4. But it is an App-level choice, and every application is free to choose its own method.
>>>>>
>>>>>> Thinking about the long-term perspective, this design pattern is common to both the vhost library and other DPDK libraries that could benefit from DMA (e.g. the vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or a separate library. But for now, we should focus on the vhost use case, and just keep the long-term roadmap for using DMA in mind.
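A small sketch of the locking split described in the side note above: the dmadev status query itself is lock-free (one user per DMA-id/vchan), while the virtq state shared between M dataplane threads still needs the OVS-netdev-level lock. The vhost_txq/ovs_mutex naming is illustrative, not the actual OVS code.

#include <stdbool.h>
#include <rte_dmadev.h>

struct vhost_txq {
    /* struct ovs_mutex mutex;  <- protects the virtq shared by PMD threads */
    int16_t  dma_id;
    uint16_t vchan;
};

static bool
txq_dma_idle(const struct vhost_txq *txq)
{
    enum rte_dma_vchan_status st;

    /* Lock-free at the dmadev level, as long as this dma_id/vchan pair is
     * only driven from one thread at a time. */
    if (rte_dma_vchan_status(txq->dma_id, txq->vchan, &st) != 0) {
        return false;
    }
    return st == RTE_DMA_VCHAN_IDLE;
}

/* Caller pattern (pseudo):
 *   ovs_mutex_lock(&txq->mutex);          // virtq state is shared, lock it
 *   if (!txq_dma_idle(txq)) { poll completions, update the used ring }
 *   ovs_mutex_unlock(&txq->mutex);
 */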
>>>>>
>>>>> Totally agree to keep the long-term roadmap in mind; but I'm not sure we can refactor that logic out of vhost. When DMA completions arrive, the virtq needs to be updated; this causes a tight coupling between the DMA completion count and the vhost library.
>>>>>
>>>>> As Ilya raised on the call yesterday, there is an "in_order" requirement in the vhost library: per virtq, the packets are presented to the guest "in order" of enqueue. (To be clear, *not* in order of DMA completion! As Jiayu mentioned, the Vhost library handles this today by re-ordering the DMA completions.)
>>>>>
>>>>>> Rephrasing what I said on the conference call: this vhost design will become the common design pattern for using DMA in DPDK libraries. If we get it wrong, we are stuck with it.
>>>>>
>>>>> Agree, and if we get it right, then we're stuck with it too! :)
>>>>>
>>>>>>>>>> Here is another idea, inspired by a presentation at one of the DPDK Userspace conferences. It may be wishful thinking, though:
>>>>>>>>>>
>>>>>>>>>> Add an additional transaction to each DMA burst; a special transaction containing the memory write operation that makes the descriptors available to the Virtio driver.
>>>>>>>>>
>>>>>>>>> That is something that can work, so long as the receiver is operating in polling mode. For cases where virtio interrupts are enabled, you still need to do a write to the eventfd in the kernel in vhost to signal the virtio side. That's not something that can be offloaded to a DMA engine, sadly, so we still need some form of completion call.
>>>>>>>>
>>>>>>>> I guess that virtio interrupts are the most widely deployed scenario, so let's ignore the DMA TX completion transaction for now - and call it a possible future optimization for specific use cases. So it seems that some form of completion call is unavoidable.
>>>>>>>
>>>>>>> Agree to leave this aside; there is in theory a potential optimization, but it is unlikely to be of large value.
>>>>>>
>>>>>> One more thing: when using DMA to pass packets on into a guest, there could be a delay from when the DMA completes until the guest is signaled. Is there any CPU cache hotness regarding the guest's access to the packet data to consider here? I.e. if we delay signaling the guest, the packet data may get cold.
>>>>>
>>>>> Interesting question; we can likely spawn a new thread around this topic! In short, it depends on how/where the DMA hardware writes the copy.
>>>>>
>>>>> With technologies like DDIO, the "dest" part of the copy will be in LLC. The core reading the dest data will benefit from the LLC locality (instead of snooping it from a remote core's L1/L2).
>>>>>
>>>>> Delays in notifying the guest could result in LLC capacity eviction, yes. The application layer decides how often/promptly to check for completions and notify the guest of them. Calling the function more often will result in less delay in that portion of the pipeline.
>>>>>
>>>>> Overall, there are caching benefits with DMA acceleration, and the application can control the latency introduced between DMA completion in HW and the guest vring update.
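As a rough sketch of the latency knob described above, i.e. how often the application reaps DMA completions and then signals the guest, the following uses only public dmadev/vhost calls. In the real vhost async design the used-ring update and the in-order handling live inside the vhost library, so guest_update_used_ring() here is a hypothetical stand-in for that step, not an existing API.

#include <stdbool.h>
#include <rte_dmadev.h>
#include <rte_vhost.h>

#define COMPL_BURST 32

/* Hypothetical: stands in for the vhost-internal used-ring update. */
extern void guest_update_used_ring(int vid, uint16_t virtq, uint16_t n);

static void
drain_dma_and_notify(int16_t dma_id, uint16_t vchan, int vid, uint16_t virtq)
{
    uint16_t last_idx;
    bool error = false;
    uint16_t n;

    /* Reap whatever the DMA engine has finished since the last call. */
    n = rte_dma_completed(dma_id, vchan, COMPL_BURST, &last_idx, &error);
    if (n == 0 || error) {
        return;
    }

    /* Make the copied buffers visible to the guest... */
    guest_update_used_ring(vid, virtq, n);

    /* ...and kick it: with interrupts enabled this is the eventfd write
     * that, as noted above, cannot be offloaded to the DMA engine. */
    rte_vhost_vring_call(vid, virtq);
}

The sooner and more often this runs after the HW copy lands (still warm in LLC thanks to DDIO), the less chance the data has to go cold before the guest reads it.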