Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
Date: Tue, 29 Mar 2022 21:59:23 +0200
From: Morten Brørup
To: "Van Haaren, Harry", "Richardson, Bruce"
Cc: "Maxime Coquelin", "Pai G, Sunil", "Stokes, Ian", "Hu, Jiayu", "Ferriter, Cian", "Ilya Maximets", "Mcnamara, John", "O'Driscoll, Tim", "Finn, Emma"

> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> Sent: Tuesday, 29 March 2022 19.46
>
> > From: Morten Brørup
> > Sent: Tuesday, March 29, 2022 6:14 PM
> >
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Tuesday, 29 March 2022 19.03
> > >
> > > On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > > > > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > > > > Sent: Tuesday, 29 March 2022 18.24
> > > > >
> > > > > Hi Morten,
> > > > >
> > > > > On 3/29/22 16:44, Morten Brørup wrote:
> > > > > >> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > > > > >> Sent: Tuesday, 29 March 2022 15.02
> > > > > >>
> > > > > >>> From: Morten Brørup
> > > > > >>> Sent: Tuesday, March 29, 2022 1:51 PM
> > > > > >>>
> > > > > >>> Having thought more about it, I think that a completely different architectural approach is required:
> > > > > >>>
> > > > > >>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX packet burst functions, each optimized for different CPU vector instruction sets. The availability of a DMA engine should be treated the same way. So I suggest that PMDs copying packet contents, e.g. memif, pcap, vmxnet3, should implement DMA-optimized RX and TX packet burst functions.
> > > > > >>>
> > > > > >>> Similarly for the DPDK vhost library.
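
To make the quoted suggestion above a bit more concrete, here is a rough sketch of how a copying PMD could select a DMA-optimized TX path at queue setup time, the same way the scalar/vector paths are selected today. The pmd_* names are hypothetical; only the rte_dmadev.h calls are the existing dmadev API:

/* Sketch only: pmd_* names are hypothetical; rte_dmadev.h is the real dmadev API. */
#include <stdint.h>
#include <rte_dmadev.h>
#include <rte_mbuf.h>

typedef uint16_t (*pmd_tx_burst_t)(void *txq, struct rte_mbuf **pkts, uint16_t nb_pkts);

/* Existing CPU memcpy path and the proposed DMA path; both are PMD-internal. */
uint16_t pmd_tx_burst_scalar(void *txq, struct rte_mbuf **pkts, uint16_t nb_pkts);
uint16_t pmd_tx_burst_dma(void *txq, struct rte_mbuf **pkts, uint16_t nb_pkts);

struct pmd_tx_queue {
        int16_t dma_dev_id;     /* DMA device assigned by the application, -1 if none */
        uint16_t dma_vchan;
        pmd_tx_burst_t tx_burst;
};

/* Called at TX queue setup, just like the scalar/vector path selection is today. */
static inline void
pmd_select_tx_path(struct pmd_tx_queue *q)
{
        if (q->dma_dev_id >= 0 && rte_dma_is_valid(q->dma_dev_id))
                q->tx_burst = pmd_tx_burst_dma;     /* copies offloaded to the DMA engine */
        else
                q->tx_burst = pmd_tx_burst_scalar;  /* plain CPU copy */
}

The point is that everything below pmd_select_tx_path() stays inside the PMD; the application only decides which DMA device, if any, to assign to the queue.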
> > > > > >>>
> > > > > >>> In such an architecture, it would be the application's job to allocate DMA channels and assign them to the specific PMDs that should use them. But the actual use of the DMA channels would move down below the application and into the DPDK PMDs and libraries.
> > > > > >>>
> > > > > >>>
> > > > > >>> Med venlig hilsen / Kind regards,
> > > > > >>> -Morten Brørup
> > > > > >>
> > > > > >> Hi Morten,
> > > > > >>
> > > > > >> That's *exactly* how this architecture is designed & implemented.
> > > > > >> 1. The DMA configuration and initialization is up to the application (OVS).
> > > > > >> 2. The VHost library is passed the DMA-dev ID (via its new async rx/tx APIs), and uses the DMA device to accelerate the copy.
> > > > > >>
> > > > > >> Looking forward to talking on the call that just started. Regards, -Harry
> > > > > >>
> > > > > >
> > > > > > OK, thanks - as I said on the call, I haven't looked at the patches.
> > > > > >
> > > > > > Then, I suppose that the TX completions can be handled in the TX function, and the RX completions can be handled in the RX function, just like the Ethdev PMDs handle packet descriptors:
> > > > > >
> > > > > > TX_Burst(tx_packet_array):
> > > > > > 1. Clean up descriptors processed by the NIC chip. --> Process TX DMA channel completions. (Effectively, the 2nd pipeline stage.)
> > > > > > 2. Pass on the tx_packet_array to the NIC chip descriptors. --> Pass on the tx_packet_array to the TX DMA channel. (Effectively, the 1st pipeline stage.)
> > > > >
> > > > > The problem is that the Tx function might not be called again, so packets enqueued in 2. may never be completed from a Virtio point of view. IOW, the packets will be copied to the Virtio descriptor buffers, but the descriptors will not be made available to the Virtio driver.
> > > >
> > > > In that case, the application needs to call TX_Burst() periodically with an empty array, for completion purposes.
>
> This is what the "defer work" does at the OVS thread level, but instead of "brute-forcing" and *always* making the call, the defer work concept tracks *when* there is outstanding work (DMA copies) to be completed ("deferred work") and calls the generic completion function at that point.
>
> So "defer work" is generic infrastructure at the OVS thread level to handle work that needs to be done "later", e.g. DMA completion handling.
>
> > > > Or some sort of TX_Keepalive() function can be added to the DPDK library, to handle DMA completion. It might even handle multiple DMA channels, if convenient - and if possible without locking or other weird complexity.
>
> That's exactly how it is done: the VHost library has a new API added which allows for handling completions. And in the "Netdev layer" (~OVS ethdev abstraction) we add a function to allow the OVS thread to do those completions, in a new Netdev-abstraction API called "async_process" where the completions can be checked.
>
> The only method to abstract them is to "hide" them somewhere that will always be polled, e.g. an ethdev port's RX function. Both the V3 and V4 approaches use this method.
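
As an illustration of the two-stage TX_Burst() I described above, the completion handling and the copy enqueue can both live in the TX path, using only the dmadev API. Everything except the rte_dma_* and rte_pktmbuf_* calls below is hypothetical (the queue structure, the in-flight ring, dst_iova_for()):

#include <stdint.h>
#include <stdbool.h>
#include <rte_dmadev.h>
#include <rte_mbuf.h>

#define DMA_TXQ_INFLIGHT 1024   /* in-flight ring size, must be a power of two */

/* Hypothetical helper: destination address of the copy in the peer's buffer. */
rte_iova_t dst_iova_for(const struct rte_mbuf *m);

struct dma_tx_queue {
        int16_t dma_dev_id;
        uint16_t dma_vchan;
        uint16_t head, tail;    /* indices into the in-flight ring */
        struct rte_mbuf *inflight[DMA_TXQ_INFLIGHT];    /* owned until the copy completes */
};

static uint16_t
dma_tx_burst(struct dma_tx_queue *q, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
        uint16_t last_idx, done, i;
        bool error = false;

        /* Stage 2: reap completions from previous calls (like cleaning NIC TX
         * descriptors). For vhost, this is where the now-filled virtio
         * descriptors would be made available to the guest instead. */
        done = rte_dma_completed(q->dma_dev_id, q->dma_vchan, UINT16_MAX, &last_idx, &error);
        for (i = 0; i < done; i++)
                rte_pktmbuf_free(q->inflight[q->tail++ & (DMA_TXQ_INFLIGHT - 1)]);

        /* Stage 1: enqueue the new copies to the TX DMA channel. */
        for (i = 0; i < nb_pkts; i++) {
                if ((uint16_t)(q->head - q->tail) >= DMA_TXQ_INFLIGHT)
                        break;  /* in-flight ring full; try again on the next call */
                if (rte_dma_copy(q->dma_dev_id, q->dma_vchan,
                                 rte_pktmbuf_iova(pkts[i]), dst_iova_for(pkts[i]),
                                 rte_pktmbuf_data_len(pkts[i]), 0) < 0)
                        break;  /* DMA descriptor ring full */
                q->inflight[q->head++ & (DMA_TXQ_INFLIGHT - 1)] = pkts[i];
        }
        if (i > 0)
                rte_dma_submit(q->dma_dev_id, q->dma_vchan);    /* one doorbell per burst */

        return i;       /* number of packets accepted */
}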
> This allows "completions" to be transparent to the app, at the tradeoff of having bad separation of concerns, as Rx and Tx are now tied together.
>
> The point is, the Application layer must *somehow* handle completions. So fundamentally there are 2 options for the Application level:
>
> A) Make the application periodically call a "handle completions" function.
>    A1) Defer work: call when needed, track "needed" at the app layer, and call into the vhost txq completion as required. Elegant in that "no work" means "no cycles spent" on checking DMA completions.
>    A2) Brute force: always call, and pay some overhead when not required. There is a cycle cost in "no work" scenarios, and depending on the # of vhost queues, this adds up, as polling is required *per vhost txq*. Also note that "checking DMA completions" means taking a virtq lock, so this brute force can needlessly increase cross-thread contention!

A side note: I don't see why locking is required to test for DMA completions. rte_dma_vchan_status() is lockless, e.g.:
https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L560

> B) Hide completions and live with the complexity/architectural sacrifice of mixed Rx/Tx. Various downsides here in my opinion; see the slide deck presented earlier today for a summary.
>
> In my opinion, A1 is the most elegant solution, as it has a clean separation of concerns, does not cause avoidable contention on virtq locks, and spends no cycles when there is no completion work to do.
>

Thank you for elaborating, Harry.

I strongly oppose hiding any part of TX processing in an RX function. It is just wrong in so many ways!

I agree that A1 is the most elegant solution. And being the most elegant solution, it is probably also the most future-proof solution. :-)

I would also like to stress that DMA completion handling belongs in the DPDK library, not in the application. And yes, the application will be required to call some "handle DMA completions" function in the DPDK library. But since the application already knows that it uses DMA, the application should also know that it needs to call this extra function - so I consider this requirement perfectly acceptable.

I would prefer that the DPDK vhost library hide its inner workings from the application and just expose the additional "handle completions" function. This also means that the inner workings can be implemented as "defer work", or by some other algorithm. And it can be tweaked and optimized later.

Thinking about the long-term perspective, this design pattern is common to both the vhost library and other DPDK libraries that could benefit from DMA (e.g. the vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or a separate library. But for now, we should focus on the vhost use case, and just keep the long-term roadmap for using DMA in mind.

Rephrasing what I said on the conference call: This vhost design will become the common design pattern for using DMA in DPDK libraries. If we get it wrong, we are stuck with it.

>
> > > > Here is another idea, inspired by a presentation at one of the DPDK Userspace conferences. It may be wishful thinking, though:
> > > >
> > > > Add an additional transaction to each DMA burst; a special transaction containing the memory write operation that makes the descriptors available to the Virtio driver.
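
What I have in mind is that the last element of each DMA burst is a small, fenced copy that updates the guest-visible ring index, so it cannot land before the payload copies. A minimal sketch: the rte_dma_copy() call and the FENCE/SUBMIT flags are the real dmadev API, while the ring-index addresses are placeholders and the vring layout is reduced to a single 16-bit field:

/* Sketch of the "extra transaction" idea: the last copy in the burst publishes
 * the updated ring index that the guest polls. */
#include <stdint.h>
#include <rte_dmadev.h>

int
dma_publish_ring_index(int16_t dma_dev_id, uint16_t vchan,
                       rte_iova_t new_idx_src,   /* host-side copy of the updated index */
                       rte_iova_t ring_idx_dst)  /* guest-visible ring index location */
{
        /* ...payload copies for this burst were enqueued before this point... */

        /* Final transaction: a fenced 2-byte copy of the ring index. The fence
         * ensures all preceding payload copies have completed before the index
         * becomes visible to the (polling) guest; SUBMIT rings the doorbell. */
        int ret = rte_dma_copy(dma_dev_id, vchan, new_idx_src, ring_idx_dst,
                               sizeof(uint16_t),
                               RTE_DMA_OP_FLAG_FENCE | RTE_DMA_OP_FLAG_SUBMIT);
        return ret < 0 ? ret : 0;
}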
> > > >
> > >
> > > That is something that can work, so long as the receiver is operating in polling mode. For cases where virtio interrupts are enabled, you still need to do a write to the eventfd in the kernel in vhost to signal the virtio side. That's not something that can be offloaded to a DMA engine, sadly, so we still need some form of completion call.
> >
> > I guess that virtio interrupts are the most widely deployed scenario, so let's ignore the DMA TX completion transaction for now - and call it a possible future optimization for specific use cases. So it seems that some form of completion call is unavoidable.
>
> Agree to leave this aside; there is in theory a potential optimization, but it is unlikely to be of large value.
>

One more thing: When using DMA to pass packets on to a guest, there could be a delay from when the DMA completes until the guest is signaled. Is there any CPU cache hotness regarding the guest's access to the packet data to consider here? I.e. if we delay signaling the guest, the packet data may get cold.
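
Finally, to make option A1 concrete from the application's point of view, this is roughly how I picture the "defer work" loop. The dpdk_vhost_* and app_* names below are hypothetical placeholders (the real vhost completion API is exactly what is being discussed here); rte_dma_vchan_status() and RTE_DMA_VCHAN_IDLE are the existing lockless dmadev API I referred to above:

/* Sketch of the A1 "defer work" pattern at the application level. */
#include <stdbool.h>
#include <stdint.h>
#include <rte_dmadev.h>

/* Hypothetical library call that makes the completed virtio descriptors
 * available to the guest and signals it if needed (eventfd write when
 * interrupts are enabled). */
uint16_t dpdk_vhost_handle_dma_completions(int vid, uint16_t txq_id);

struct app_txq_state {
        int vid;
        uint16_t txq_id;
        int16_t dma_dev_id;
        uint16_t dma_vchan;
        bool work_deferred;     /* set by the send path when it enqueues DMA copies */
};

/* Called from the application's poll loop for each queue it owns. */
static inline void
app_poll_deferred_work(struct app_txq_state *s)
{
        enum rte_dma_vchan_status st;

        if (!s->work_deferred)
                return;         /* no outstanding copies: no cycles spent */

        /* Lockless peek at the DMA channel; only call into the (locking) vhost
         * completion path once the hardware has finished our copies. */
        if (rte_dma_vchan_status(s->dma_dev_id, s->dma_vchan, &st) == 0 &&
            st != RTE_DMA_VCHAN_IDLE)
                return;         /* copies still in flight; check again next iteration */

        dpdk_vhost_handle_dma_completions(s->vid, s->txq_id);
        s->work_deferred = false;
}

The work_deferred flag being set by the send path is the "track when there is outstanding work" part of A1, so an idle queue costs nothing but a branch per poll-loop iteration.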