Date: Thu, 28 Apr 2022 14:55:15 +0100
From: Bruce Richardson
To: Ilya Maximets
Cc: "Mcnamara, John", "Hu, Jiayu", Maxime Coquelin, "Van Haaren, Harry",
 Morten Brørup, "Pai G, Sunil", "Stokes, Ian", "Ferriter, Cian",
 ovs-dev@openvswitch.org, dev@dpdk.org, "O'Driscoll, Tim", "Finn, Emma"
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion

On Thu, Apr 28, 2022 at 02:59:37PM +0200, Ilya Maximets wrote:
> On 4/27/22 22:34, Bruce Richardson wrote:
> > On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
> >> On 4/20/22 18:41, Mcnamara, John wrote:
> >>>> -----Original Message-----
> >>>> From: Ilya Maximets
> >>>> Sent: Friday, April 8, 2022 10:58 AM
> >>>> To: Hu, Jiayu; Maxime Coquelin; Van Haaren, Harry; Morten Brørup;
> >>>> Richardson, Bruce
> >>>> Cc: i.maximets@ovn.org; Pai G, Sunil; Stokes, Ian; Ferriter, Cian;
> >>>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John; O'Driscoll, Tim;
> >>>> Finn, Emma
> >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>> On 4/8/22 09:13, Hu, Jiayu wrote:
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Ilya Maximets
> >>>>>> Sent: Thursday, April 7, 2022 10:40 PM
> >>>>>> To: Maxime Coquelin; Van Haaren, Harry; Morten Brørup;
> >>>>>> Richardson, Bruce
> >>>>>> Cc: i.maximets@ovn.org; Pai G, Sunil; Stokes, Ian; Hu, Jiayu;
> >>>>>> Ferriter, Cian; ovs-dev@openvswitch.org; dev@dpdk.org;
> >>>>>> Mcnamara, John; O'Driscoll, Tim; Finn, Emma
> >>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>>>
> >>>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
> >>>>>>> Hi Harry,
> >>>>>>>
> >>>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
> >>>>>>>> Hi OVS & DPDK, Maintainers & Community,
> >>>>>>>>
> >>>>>>>> Top-posting an overview of the discussion as replies to the thread
> >>>>>>>> become slower: perhaps it is a good time to review and plan for
> >>>>>>>> next steps?
> >>>>>>>>
> >>>>>>>> From my perspective, those most vocal in the thread seem to be
> >>>>>>>> in favour of the clean rx/tx split ("defer work"), with the
> >>>>>>>> tradeoff that the application must be aware of handling the async
> >>>>>>>> DMA completions. If there are any concerns opposing upstreaming of
> >>>>>>>> this method, please indicate this promptly, and we can continue
> >>>>>>>> technical discussions here now.
> >>>>>>>
> >>>>>>> Weren't there some discussions about handling the Virtio completions
> >>>>>>> with the DMA engine? With that, we wouldn't need the deferral of work.
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> With the virtio completions handled by the DMA itself, the vhost port
> >>>>>> turns almost into a real HW NIC. With that, we will not need any
> >>>>>> extra manipulations from the OVS side, i.e. no need to defer any work
> >>>>>> while maintaining a clear split between rx and tx operations.
> >>>>>
> >>>>> First, making DMA do the 2B copy would sacrifice performance, and I
> >>>>> think we all agree on that.
> >>>>
> >>>> I do not agree with that. Yes, a 2B copy by DMA will likely be slower
> >>>> than one done by the CPU; however, the CPU is going away for dozens or
> >>>> even hundreds of thousands of cycles to process a new packet batch or
> >>>> service other ports, hence the DMA will likely complete the
> >>>> transmission faster than waiting for the CPU thread to come back to
> >>>> that task. In any case, this has to be tested.
> >>>>
> >>>>> Second, this method comes with an issue of ordering.
> >>>>> For example, PMD thread0 enqueues 10 packets to vring0 first, then PMD
> >>>>> thread1 enqueues 20 packets to vring0. If PMD thread0 and thread1 have
> >>>>> their own dedicated DMA devices dma0 and dma1, the flag/index update
> >>>>> for the first 10 packets is done by dma0, and the flag/index update
> >>>>> for the remaining 20 packets is done by dma1. But there is no ordering
> >>>>> guarantee among different DMA devices, so the flag/index update may go
> >>>>> wrong. If PMD threads don't have dedicated DMA devices, which means
> >>>>> DMA devices are shared among threads, we need a lock and have to pay
> >>>>> for lock contention in the data path. Or we can allocate DMA devices
> >>>>> for vrings dynamically to avoid DMA sharing among threads. But what's
> >>>>> the overhead of the allocation mechanism? Who does it? Any thoughts?
> >>>>
> >>>> 1. DMA completion was discussed in the context of per-queue allocation,
> >>>> so there is no re-ordering in this case.
> >>>>
> >>>> 2. Overhead can be minimal if the allocated device can stick to the
> >>>> queue for a reasonable amount of time without re-allocation on every
> >>>> send. You may look at the XPS implementation in lib/dpif-netdev.c in
> >>>> OVS for an example of such a mechanism (rough sketch below, after this
> >>>> list). For sure, it cannot be the same, but ideas can be re-used.
> >>>>
> >>>> 3. Locking doesn't mean contention if resources are
> >>>> allocated/distributed thoughtfully.
> >>>>
> >>>> 4. Allocation can be done by either OVS or the vhost library itself.
> >>>> I'd vote for doing that inside the vhost library, so any DPDK
> >>>> application and vhost ethdev can use it without re-inventing it from
> >>>> scratch. It should also be simpler from the API point of view if
> >>>> allocation and usage are in the same place. But I don't have a strong
> >>>> opinion here for now, since no real code examples exist, so it's hard
> >>>> to evaluate how they would look.
> >>>>
> >>>> But I feel like we're starting to run in circles here, as I already
> >>>> said most of that before.
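> >>>>
> >>>> To illustrate point 2, a very rough sketch of the kind of "sticky"
> >>>> per-queue assignment I have in mind. Everything below is made up for
> >>>> illustration only (plain C, no dpif-netdev or dmadev calls; the
> >>>> dma_pool_get()/dma_pool_put() pair is hypothetical):
> >>>>
> >>>> #include <stdint.h>
> >>>> #include <time.h>
> >>>>
> >>>> /* Release an assigned channel only after 100 ms of queue idleness. */
> >>>> #define IDLE_REASSIGN_NS (100 * 1000 * 1000ULL)
> >>>>
> >>>> struct txq_dma_state {
> >>>>     int dma_id;             /* assigned DMA channel, -1 if none */
> >>>>     uint64_t last_used_ns;  /* last time this queue sent via DMA */
> >>>> };
> >>>>
> >>>> /* Hypothetical pool of free DMA channels; filling it is out of scope. */
> >>>> extern int dma_pool_get(void);
> >>>> extern void dma_pool_put(int dma_id);
> >>>>
> >>>> static uint64_t now_ns(void)
> >>>> {
> >>>>     struct timespec ts;
> >>>>     clock_gettime(CLOCK_MONOTONIC, &ts);
> >>>>     return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
> >>>> }
> >>>>
> >>>> /* Called on every send for the queue: re-uses the assigned channel,
> >>>>  * so there is no allocation cost in the common case and no sharing
> >>>>  * (hence no lock) between PMD threads. */
> >>>> static int txq_dma_acquire(struct txq_dma_state *q)
> >>>> {
> >>>>     if (q->dma_id < 0)
> >>>>         q->dma_id = dma_pool_get();  /* may stay -1: CPU-copy fallback */
> >>>>     q->last_used_ns = now_ns();
> >>>>     return q->dma_id;
> >>>> }
> >>>>
> >>>> /* Called occasionally (e.g. from the PMD idle path): returns the
> >>>>  * channel to the pool only if the queue has been quiet for a while. */
> >>>> static void txq_dma_maybe_release(struct txq_dma_state *q)
> >>>> {
> >>>>     if (q->dma_id >= 0 && now_ns() - q->last_used_ns > IDLE_REASSIGN_NS) {
> >>>>         dma_pool_put(q->dma_id);
> >>>>         q->dma_id = -1;
> >>>>     }
> >>>> }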
> >>>
> >>>
> >>
> >> Hi, John.
> >>
> >> Just reading this email as I was on PTO for the last 1.5 weeks
> >> and didn't get through all the emails yet.
> >>
> >>> This does seem to be going in circles, especially since there seemed
> >>> to be technical alignment on the last public call on March 29th.
> >>
> >> I guess there is a typo in the date here.
> >> It seems to be 26th, not 29th.
> >>
> >>> It is not feasible to do a real-world implementation/POC of every
> >>> design proposal.
> >>
> >> FWIW, I think it makes sense to PoC and test options that are
> >> going to be simply unavailable going forward if not explored now.
> >> Especially because we don't have any good solutions anyway
> >> ("Deferral of Work" is an architecturally wrong solution for OVS).
> >>
> >
> > Hi Ilya,
> >
> > for those of us who haven't spent a long time working on OVS, can you
> > perhaps explain a bit more as to why it is architecturally wrong? From my
> > experience with DPDK, use of any lookaside accelerator, not just DMA but
> > any crypto, compression or otherwise, requires asynchronous operation,
> > and therefore some form of setting work aside temporarily to do other
> > tasks.
>
> OVS doesn't use any lookaside accelerators and doesn't have any
> infrastructure for them.
>
>
> Let me create a DPDK analogy of what is proposed for OVS.
>
> DPDK has an ethdev API that abstracts different device drivers for
> the application. This API has a rte_eth_tx_burst() function that
> is supposed to send packets through a particular network interface.
>
> Imagine now that there is a network card that is not capable of
> sending packets right away and requires the application to come
> back later to finish the operation. That is an obvious problem,
> because rte_eth_tx_burst() doesn't require any extra actions and
> doesn't take ownership of packets that weren't consumed.
>
> The proposed solution for this problem is to change the ethdev API:
>
> 1. Allow rte_eth_tx_burst() to return -EINPROGRESS, which effectively
> means that the packets were acknowledged, but not actually sent yet.
>
> 2. Require the application to call the new rte_eth_process_async()
> function sometime later, until it no longer returns -EINPROGRESS,
> in case the original rte_eth_tx_burst() call returned -EINPROGRESS.
>
> The main reason why this proposal is questionable:
>
> It's only one specific device that requires this special handling;
> all other devices are capable of sending packets right away.
> However, every DPDK application now has to implement some kind
> of "Deferral of Work" mechanism in order to be compliant with
> the updated DPDK ethdev API.
>
> Will DPDK make this API change?
> I have no voice in DPDK API design decisions, but I'd argue against.
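>
> To make the cost concrete, here is roughly the kind of code every
> application's TX path would have to grow under that imaginary API.
> This is only a sketch: the prototypes below are the made-up variants
> from points 1 and 2 above, not the real ethdev API (the real
> rte_eth_tx_burst() returns a count and can never return -EINPROGRESS).
>
> #include <errno.h>
> #include <stdbool.h>
> #include <stdint.h>
>
> /* Imaginary API from the proposal above -- not the real signatures. */
> struct rte_mbuf;
> extern int rte_eth_tx_burst(uint16_t port, uint16_t queue,
>                             struct rte_mbuf **pkts, uint16_t n);
> extern int rte_eth_process_async(uint16_t port, uint16_t queue);
>
> /* Per-queue state every application would now have to carry around. */
> struct txq_state {
>     bool tx_pending;
> };
>
> static void app_send(struct txq_state *st, uint16_t port, uint16_t queue,
>                      struct rte_mbuf **pkts, uint16_t n)
> {
>     int ret = rte_eth_tx_burst(port, queue, pkts, n);
>     if (ret == -EINPROGRESS) {
>         st->tx_pending = true;   /* packets accepted, but not sent yet */
>     } else {
>         /* Normal handling of any unsent tail, as today. */
>     }
> }
>
> /* Extra step that would have to be wired into every application's main
>  * loop for every port and queue, even though only one device type
>  * actually needs it. */
> static void app_poll_tx_completions(struct txq_state *st, uint16_t port,
>                                     uint16_t queue)
> {
>     if (st->tx_pending
>         && rte_eth_process_async(port, queue) != -EINPROGRESS) {
>         st->tx_pending = false;  /* transmission finally completed */
>     }
> }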
>
> Interestingly, that's not really an imaginary proposal. That is exactly
> the change required to the DPDK ethdev API in order to add vhost async
> support to the vhost ethdev driver.
>
> Going back to OVS:
>
> An oversimplified architecture of OVS has 3 layers (top to bottom):
>
> 1. OFproto - the layer that handles OpenFlow.
> 2. Datapath Interface - packet processing.
> 3. Netdev - abstraction on top of all the different port types.
>
> Each layer has its own API that allows different implementations
> of the same layer to be used interchangeably without any modifications
> to higher layers. That's what APIs and encapsulation are for.
>
> So, the Netdev layer has its own API, and this API is actually very
> similar to DPDK's ethdev API, simply because they serve the same
> purpose - abstraction on top of different network interfaces.
> Besides the different types of DPDK ports, there are also several
> types of native Linux, BSD and Windows ports, and a variety of
> different tunnel ports.
>
> The Datapath Interface layer is the "application" from the ethdev
> analogy above.
>
> What is proposed by the "Deferral of Work" solution is to make pretty
> much the same API change that I described, but to the netdev layer API
> inside OVS, and to introduce a fairly complex (and questionable, but
> I'm not going into that right now) piece of machinery into the datapath
> interface layer to handle that API change.
>
> So, exactly the same problem exists here:
>
> If the API change is needed only for a single port type in a very
> specific hardware environment, why do we need to change the common
> API and rework a lot of the code in upper layers in order to
> accommodate that API change, while it makes no practical sense for any
> other port types or more generic hardware setups?
> And similar changes will have to be made in any other DPDK application
> that is not bound to specific hardware but wants to support vhost
> async.
>
> The right solution, IMO, is to make vhost async behave like any other
> physical NIC, since it is essentially a physical NIC now (we're not
> using DMA directly, it's a combined vhost+DMA solution), instead of
> propagating quirks of a single device to a common API.
>
> And going back to DPDK, this implementation doesn't allow use of
> vhost async in DPDK's own vhost ethdev driver.
>
> My initial reply to the "Deferral of Work" RFC with pretty much
> the same concerns:
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210907111725.43672-2-cian.ferriter@intel.com/#2751799
>
> Best regards, Ilya Maximets.

Thanks for the clear explanation. It gives me a much better idea of the
view from your side of things.

/Bruce