From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 30 Mar 2022 11:25:05 +0200
From: Maxime Coquelin
To: "Hu, Jiayu", Ilya Maximets, Morten Brørup, "Richardson, Bruce"
Cc: "Van Haaren, Harry", "Pai G, Sunil", "Stokes, Ian", "Ferriter, Cian", ovs-dev@openvswitch.org, dev@dpdk.org, "Mcnamara, John", "O'Driscoll, Tim", "Finn, Emma"
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
Message-ID: <4a66558c-d0ad-f41a-fef6-db670a330f21@redhat.com>
In-Reply-To: <431821f0b06d45958f67cb157029d306@intel.com>
List-Id: DPDK patches and discussions

On 3/30/22 04:02, Hu, Jiayu wrote:
>
>
>> -----Original Message-----
>> From: Ilya Maximets
>> Sent: Wednesday, March 30, 2022 1:45 AM
>> To: Morten Brørup; Richardson, Bruce
>> Cc: i.maximets@ovn.org; Maxime Coquelin; Van Haaren, Harry; Pai G, Sunil; Stokes, Ian; Hu, Jiayu; Ferriter, Cian; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John; O'Driscoll, Tim; Finn, Emma
>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>
>> On 3/29/22 19:13, Morten Brørup wrote:
>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>> Sent: Tuesday, 29 March 2022
>>>> 19.03
>>>>
>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>>>
>>>>>> Hi Morten,
>>>>>>
>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>>>
>>>>>>>>> From: Morten Brørup
>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>>>
>>>>>>>>> Having thought more about it, I think that a completely different architectural approach is required:
>>>>>>>>>
>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX packet burst functions, each optimized for different CPU vector instruction sets. The availability of a DMA engine should be treated the same way. So I suggest that PMDs copying packet contents, e.g. memif, pcap, vmxnet3, should implement DMA-optimized RX and TX packet burst functions.
>>>>>>>>>
>>>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>>>
>>>>>>>>> In such an architecture, it would be the application's job to allocate DMA channels and assign them to the specific PMDs that should use them. But the actual use of the DMA channels would move down below the application and into the DPDK PMDs and libraries.
>>>>>>>>>
>>>>>>>>> Med venlig hilsen / Kind regards,
>>>>>>>>> -Morten Brørup
>>>>>>>>
>>>>>>>> Hi Morten,
>>>>>>>>
>>>>>>>> That's *exactly* how this architecture is designed & implemented.
>>>>>>>> 1. The DMA configuration and initialization is up to the application (OVS).
>>>>>>>> 2. The VHost library is passed the DMA-dev ID, and its new async rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>>>>>>
>>>>>>>> Looking forward to talking on the call that just started.
>>>>>>>> Regards,
>>>>>>>> -Harry
>>>>>>>>
>>>>>>>
>>>>>>> OK, thanks - as I said on the call, I haven't looked at the patches.
>>>>>>>
>>>>>>> Then, I suppose that the TX completions can be handled in the TX function, and the RX completions can be handled in the RX function, just like the Ethdev PMDs handle packet descriptors:
>>>>>>>
>>>>>>> TX_Burst(tx_packet_array):
>>>>>>> 1. Clean up descriptors processed by the NIC chip. --> Process TX DMA channel completions. (Effectively, the 2nd pipeline stage.)
>>>>>>> 2. Pass on the tx_packet_array to the NIC chip descriptors. --> Pass on the tx_packet_array to the TX DMA channel. (Effectively, the 1st pipeline stage.)
>>>>>>
>>>>>> The problem is that the Tx function might not be called again, so packets enqueued in 2. may never be completed from a Virtio point of view. IOW, the packets will be copied to the Virtio descriptors' buffers, but the descriptors will not be made available to the Virtio driver.
>>>>>
>>>>> In that case, the application needs to call TX_Burst() periodically with an empty array, for completion purposes.
>>>>>
>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK library, to handle DMA completion. It might even handle multiple DMA channels, if convenient - and if possible without locking or other weird complexity.
>>>>>
>>>>> Here is another idea, inspired by a presentation at one of the DPDK Userspace conferences. It may be wishful thinking, though:
>>>>>
>>>>> Add an additional transaction to each DMA burst; a special transaction containing the memory write operation that makes the descriptors available to the Virtio driver.
>>
>> I was talking with Maxime after the call today about the same idea.
>> And it looks fairly doable, I would say.
>
> If the idea is making DMA update the used ring's index (2B) and the packed ring descriptor's flags (2B), yes, it will work functionally.
> But considering the offloading cost of DMA, it would hurt performance. In addition, the latency of a small DMA copy is much higher than that of a CPU copy, so it will also increase latency.

I agree that writing back descriptors using DMA can be sub-optimal, especially for the packed ring, where the head descriptor's flags have to be written last.

Are you sure about latency? With the current solution, the descriptor write-backs can happen quite some time after the DMA transfers are done, isn't it?

>>
>>>>>
>>>>
>>>> That is something that can work, so long as the receiver is operating in polling mode. For cases where virtio interrupts are enabled, you still need to do a write to the eventfd in the kernel in vhost to signal the virtio side. That's not something that can be offloaded to a DMA engine, sadly, so we still need some form of completion call.
>>>
>>> I guess that virtio interrupts is the most widely deployed scenario, so let's ignore the DMA TX completion transaction for now - and call it a possible future optimization for specific use cases. So it seems that some form of completion call is unavoidable.
>>>
>>
>> We could separate the actual kick of the guest from the data transfer.
>> If interrupts are enabled, this means that the guest is not actively polling, i.e. we can allow some extra latency by performing the actual kick from the rx context, or, as Maxime said, if the DMA engine can generate interrupts when the DMA queue is empty, the vhost thread may listen to them and kick the guest if needed. This will additionally remove the extra system call from the fast path.
>
> Separating the kick from the data transfer is a very good idea. But it requires a dedicated control plane thread to kick the guest after the DMA interrupt. Anyway, we can try this optimization in the future.

Yes, it requires a dedicated thread, but I don't think this is really an issue. Interrupt mode can be considered as slow-path.

>
> Thanks,
> Jiayu