Date: Wed, 30 Mar 2022 11:20:27 +0100
From: Bruce Richardson <bruce.richardson@intel.com>
To: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: "Hu, Jiayu", Ilya Maximets, Morten Brørup, "Van Haaren, Harry",
 "Pai G, Sunil", "Stokes, Ian", "Ferriter, Cian", ovs-dev@openvswitch.org,
 dev@dpdk.org, "Mcnamara, John", "O'Driscoll, Tim", "Finn, Emma"
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
List-Id: DPDK patches and discussions

On Wed, Mar 30, 2022 at 11:25:05AM +0200, Maxime Coquelin wrote:
>
>
> On 3/30/22 04:02, Hu, Jiayu wrote:
> >
> >
> > > -----Original Message-----
> > > From: Ilya Maximets
> > > Sent: Wednesday, March 30, 2022 1:45 AM
> > > Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > >
> > > On 3/29/22 19:13, Morten Brørup wrote:
> > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > Sent: Tuesday, 29 March 2022 19.03
> > > > >
> > > > > On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > > > > > > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > > > > > > Sent: Tuesday, 29 March 2022 18.24
> > > > > > >
> > > > > > > Hi Morten,
> > > > > > >
> > > > > > > On 3/29/22 16:44, Morten Brørup wrote:
> > > > > > > > > From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> > > > > > > > > Sent: Tuesday, 29 March 2022 15.02
> > > > > > > > >
> > > > > > > > > > From: Morten Brørup
> > > > > > > > > > Sent: Tuesday, March 29, 2022 1:51 PM
> > > > > > > > > >
> > > > > > > > > > Having thought more about it, I think that a completely
> > > > > > > > > > different architectural approach is required:
> > > > > > > > > >
> > > > > > > > > > Many of the DPDK Ethernet PMDs implement a variety of RX
> > > > > > > > > > and TX packet burst functions, each optimized for
> > > > > > > > > > different CPU vector instruction sets. The availability of
> > > > > > > > > > a DMA engine should be treated the same way. So I suggest
> > > > > > > > > > that PMDs copying packet contents, e.g. memif, pcap,
> > > > > > > > > > vmxnet3, should implement DMA optimized RX and TX packet
> > > > > > > > > > burst functions.
> > > > > > > > > >
> > > > > > > > > > Similarly for the DPDK vhost library.
> > > > > > > > > >
> > > > > > > > > > In such an architecture, it would be the application's job
> > > > > > > > > > to allocate DMA channels and assign them to the specific
> > > > > > > > > > PMDs that should use them. But the actual use of the DMA
> > > > > > > > > > channels would move down below the application and into
> > > > > > > > > > the DPDK PMDs and libraries.
> > > > > > > > > >
> > > > > > > > > > Med venlig hilsen / Kind regards,
> > > > > > > > > > -Morten Brørup
> > > > > > > > >
> > > > > > > > > Hi Morten,
> > > > > > > > >
> > > > > > > > > That's *exactly* how this architecture is designed &
> > > > > > > > > implemented.
> > > > > > > > > 1. The DMA configuration and initialization is up to the
> > > > > > > > > application (OVS).
> > > > > > > > > 2. The VHost library is passed the DMA-dev ID, and its new
> > > > > > > > > async rx/tx APIs, and uses the DMA device to accelerate the
> > > > > > > > > copy.
> > > > > > > > >
> > > > > > > > > Looking forward to talking on the call that just started.
> > > > > > > > > Regards, -Harry
> > > > > > > >
> > > > > > > > OK, thanks - as I said on the call, I haven't looked at the
> > > > > > > > patches.
> > > > > > > >
> > > > > > > > Then, I suppose that the TX completions can be handled in the
> > > > > > > > TX function, and the RX completions can be handled in the RX
> > > > > > > > function, just like the Ethdev PMDs handle packet descriptors:
> > > > > > > >
> > > > > > > > TX_Burst(tx_packet_array):
> > > > > > > > 1. Clean up descriptors processed by the NIC chip. --> Process
> > > > > > > > TX DMA channel completions. (Effectively, the 2nd pipeline
> > > > > > > > stage.)
> > > > > > > > 2. Pass on the tx_packet_array to the NIC chip descriptors.
> > > > > > > > --> Pass on the tx_packet_array to the TX DMA channel.
> > > > > > > > (Effectively, the 1st pipeline stage.)
> > > > > > >
> > > > > > > The problem is the Tx function might not be called again, so
> > > > > > > packets enqueued in 2. may never be completed from a Virtio
> > > > > > > point of view. IOW, the packets will be copied to the Virtio
> > > > > > > descriptors' buffers, but the descriptors will not be made
> > > > > > > available to the Virtio driver.
> > > > > >
> > > > > > In that case, the application needs to call TX_Burst()
> > > > > > periodically with an empty array, for completion purposes.
> > > > > >
> > > > > > Or some sort of TX_Keepalive() function can be added to the DPDK
> > > > > > library, to handle DMA completion. It might even handle multiple
> > > > > > DMA channels, if convenient - and if possible without locking or
> > > > > > other weird complexity.
> > > > > >
> > > > > > Here is another idea, inspired by a presentation at one of the
> > > > > > DPDK Userspace conferences.
> > > > > > It may be wishful thinking, though:
> > > > > >
> > > > > > Add an additional transaction to each DMA burst; a special
> > > > > > transaction containing the memory write operation that makes the
> > > > > > descriptors available to the Virtio driver.
> > >
> > > I was talking with Maxime after the call today about the same idea.
> > > And it looks fairly doable, I would say.
> >
> > If the idea is making the DMA engine update the used ring's index (2B)
> > and the packed ring descriptor's flags (2B), yes, it will work
> > functionally. But considering the offloading cost of DMA, it would hurt
> > performance. In addition, the latency of a small DMA copy is much higher
> > than that of a CPU copy, so it will also increase latency.
>
> I agree writing back descriptors using DMA can be sub-optimal,
> especially for packed ring where the head desc flags have to be written
> last.
> I think we'll have to try it out to check how it works.

If we are already doing hardware offload, adding one additional job to the
DMA list may be a very minor addition. [Incidentally, for something like a
head-pointer update, using a fill operation rather than a copy may be a
good choice, as it avoids the need for a memory read transaction from the
DMA engine, since the data to be written is already in the submitted
descriptor.]

> Are you sure about latency? With the current solution, the descriptor
> write-backs can happen quite some time after the DMA transfers are done,
> isn't it?

For a polling receiver, having the DMA engine automatically do any
head-pointer updates after the copies are done should indeed lead to the
lowest latency. For DMA engines that perform operations in parallel (such
as Intel DSA), we just need to ensure proper fencing of operations.

> > > > > That is something that can work, so long as the receiver is
> > > > > operating
For cases where virtio interrupts are enabled, you > > > > > still need to do a write to the eventfd in the kernel in vhost to > > > > > signal the virtio side. That's not something that can be offloaded to > > > > > a DMA engine, sadly, so we still need some form of completion call. > > > > > > > > I guess that virtio interrupts is the most widely deployed scenario, > > > > so let's ignore the DMA TX completion transaction for now - and call > > > > it a possible future optimization for specific use cases. So it seems > > > > that some form of completion call is unavoidable. > > > > > > > > > > We could separate the actual kick of the guest with the data transfer. > > > If interrupts are enabled, this means that the guest is not actively polling, i.e. > > > we can allow some extra latency by performing the actual kick from the rx > > > context, or, as Maxime said, if DMA engine can generate interrupts when the > > > DMA queue is empty, vhost thread may listen to them and kick the guest if > > > needed. This will additionally remove the extra system call from the fast > > > path. > > > > Separating kick with data transfer is a very good idea. But it requires a dedicated > > control plane thread to kick guest after DMA interrupt. Anyway, we can try this > > optimization in the future. > > Yes it requires a dedicated thread, but I don't think this is really an > issue. Interrupt mode can be considered as slow-path. > While not overly familiar with virtio/vhost, my main concern about interrupt mode is not the handling of interrupts themselves when interrupt mode is enabled, but rather ensuring that we have correct behaviour when interrupt mode is enabled while copy operations are in-flight. Is it possible for a guest to enable interrupt mode, receive packets but never be woken up? /Bruce