From: Ilya Maximets <i.maximets@ovn.org>
Date: Tue, 24 May 2022 14:12:38 +0200
Message-ID: <4b0e6de6-effc-401b-79db-28a9d90a739d@ovn.org>
To: "Van Haaren, Harry", "Richardson, Bruce"
Cc: i.maximets@ovn.org, "Mcnamara, John", "Hu, Jiayu", Maxime Coquelin, Morten Brørup, "Pai G, Sunil", "Stokes, Ian", "Ferriter, Cian", "ovs-dev@openvswitch.org", "dev@dpdk.org", "O'Driscoll, Tim", "Finn, Emma"
References: <98CBD80474FA8B44BF855DF32C47DC35D86F82@smartserver.smartshare.dk> <94d817cb-8151-6644-c577-ed8b42d24337@redhat.com> <55c5a37f-3ee8-394b-8cff-e7daecb59f73@ovn.org> <0a414313f07d4781b9bdd8523c2e06f5@intel.com> <5ba635f6-0e7f-a558-b599-674e272cfd1e@ovn.org>
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion

On 5/3/22 21:38, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Ilya Maximets
>> Sent: Thursday, April 28, 2022 2:00 PM
>> To: Richardson, Bruce
>> Cc: i.maximets@ovn.org; Mcnamara, John; Hu, Jiayu; Maxime Coquelin; Van Haaren, Harry; Morten Brørup; Pai G, Sunil; Stokes, Ian; Ferriter, Cian; ovs-dev@openvswitch.org; dev@dpdk.org; O'Driscoll, Tim; Finn, Emma
>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>
>> On 4/27/22 22:34, Bruce Richardson wrote:
>>> On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
>>>> On 4/20/22 18:41, Mcnamara, John wrote:
>>>>>> -----Original Message-----
>>>>>> From: Ilya Maximets
>>>>>> Sent: Friday, April 8, 2022 10:58 AM
>>>>>> To: Hu, Jiayu; Maxime Coquelin; Van Haaren, Harry; Morten Brørup; Richardson, Bruce
>>>>>> Cc: i.maximets@ovn.org; Pai G, Sunil; Stokes, Ian; Ferriter, Cian; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John; O'Driscoll, Tim; Finn, Emma
>>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>
>>>>>> On 4/8/22 09:13, Hu, Jiayu wrote:
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Ilya Maximets
>>>>>>>> Sent: Thursday, April 7, 2022 10:40 PM
>>>>>>>> To: Maxime Coquelin; Van Haaren, Harry; Morten Brørup; Richardson, Bruce
>>>>>>>> Cc: i.maximets@ovn.org; Pai G, Sunil; Stokes, Ian; Hu, Jiayu; Ferriter, Cian; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John; O'Driscoll, Tim; Finn, Emma
>>>>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>>>
>>>>>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
>>>>>>>>> Hi Harry,
>>>>>>>>>
>>>>>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
>>>>>>>>>> Hi OVS & DPDK, Maintainers & Community,
>>>>>>>>>>
>>>>>>>>>> Top posting an overview of the discussion as replies to the thread become slower: perhaps it is a good time to review and plan for next steps?
>>>>>>>>>>
>>>>>>>>>> From my perspective, those most vocal in the thread seem to be in favour of the clean rx/tx split ("defer work"), with the tradeoff that the application must be aware of handling the async DMA completions. If there are any concerns opposing upstreaming of this method, please indicate this promptly, and we can continue technical discussions here now.
>>>>>>>>>
>>>>>>>>> Wasn't there some discussion about handling the Virtio completions with the DMA engine? With that, we wouldn't need the deferral of work.
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> With the virtio completions handled by DMA itself, the vhost port turns almost into a real HW NIC. With that we would not need any extra manipulations on the OVS side, i.e. no need to defer any work while maintaining a clear split between rx and tx operations.
>>>>>>>
>>>>>>> First, making DMA do the 2B copy would sacrifice performance, and I think we all agree on that.
>>>>>>
>>>>>> I do not agree with that. Yes, a 2B copy by DMA will likely be slower than one done by the CPU, however the CPU is going away for dozens or even hundreds of thousands of cycles to process a new packet batch or service other ports, hence DMA will likely complete the transmission faster than waiting for the CPU thread to come back to that task. In any case, this has to be tested.
>>>>>>
>>>>>>> Second, this method comes with an issue of ordering. For example, PMD thread0 enqueues 10 packets to vring0 first, then PMD thread1 enqueues 20 packets to vring0. If PMD thread0 and thread1 have their own dedicated DMA devices dma0 and dma1, the flag/index update for the first 10 packets is done by dma0, and the flag/index update for the remaining 20 packets is done by dma1. But there is no ordering guarantee among different DMA devices, so the flag/index updates may be applied in the wrong order. If PMD threads don't have dedicated DMA devices, which means DMA devices are shared among threads, we need a lock and pay for lock contention in the data path. Or we can allocate DMA devices to vrings dynamically to avoid DMA sharing among threads. But what's the overhead of the allocation mechanism? Who does it? Any thoughts?
>>>>>>
>>>>>> 1. DMA completion was discussed in the context of per-queue allocation, so there is no re-ordering in this case.
>>>>>>
>>>>>> 2. Overhead can be minimal if the allocated device can stick to the queue for a reasonable amount of time without re-allocation on every send. You may look at the XPS implementation in lib/dpif-netdev.c in OVS for an example of such a mechanism. For sure it can not be the same, but ideas can be re-used.
>>>>>>
>>>>>> 3. Locking doesn't mean contention if resources are allocated/distributed thoughtfully.
>>>>>>
>>>>>> 4. Allocation can be done by either OVS or the vhost library itself. I'd vote for doing that inside the vhost library, so any DPDK application and vhost ethdev can use it without re-inventing it from scratch. It also should be simpler from the API point of view if allocation and usage are in the same place. But I don't have a strong opinion here for now, since no real code examples exist, so it's hard to evaluate how they would look.
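To make point 2 above a bit more concrete, here is a minimal sketch of such an XPS-like assignment scheme. It is not taken from OVS or DPDK, and all names in it (dma_pool, txq_dma_state, and so on) are hypothetical; it only shows that the lock can stay out of the per-packet fast path and is touched only on the rare re-assignment.

/* Conceptual sketch only -- not OVS or DPDK code.  A tx queue keeps
 * using the DMA channel it was given, and the lock is taken only when
 * the assignment goes stale, not on every send. */
#include <stdint.h>
#include <pthread.h>

#define DMA_REUSE_TIMEOUT_US  500000   /* Re-balance at most every 0.5s. */
#define DMA_MAX_CHANNELS      64

struct dma_pool {
    pthread_mutex_t mutex;
    int n_channels;                        /* <= DMA_MAX_CHANNELS. */
    int refcount[DMA_MAX_CHANNELS];        /* How many queues use each channel. */
};

struct txq_dma_state {
    int channel;                           /* Currently assigned channel, -1 if none. */
    uint64_t last_assign_us;               /* When the assignment was (re)made. */
};

/* Returns a DMA channel for this tx queue, sticking to the previous
 * assignment while it is still fresh. */
static int
txq_get_dma_channel(struct dma_pool *pool, struct txq_dma_state *s,
                    uint64_t now_us)
{
    if (s->channel >= 0 && now_us - s->last_assign_us < DMA_REUSE_TIMEOUT_US) {
        return s->channel;                 /* Fast path: no locking at all. */
    }

    pthread_mutex_lock(&pool->mutex);
    if (s->channel >= 0) {
        pool->refcount[s->channel]--;
    }
    int best = 0;
    for (int c = 1; c < pool->n_channels; c++) {
        if (pool->refcount[c] < pool->refcount[best]) {
            best = c;                      /* Pick the least loaded channel. */
        }
    }
    pool->refcount[best]++;
    s->channel = best;
    s->last_assign_us = now_us;
    pthread_mutex_unlock(&pool->mutex);

    return s->channel;
}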
>>>>>>
>>>>>> But I feel like we're starting to run in circles here, as I did already say most of that before.
>>>>>
>>>>
>>>> Hi, John.
>>>>
>>>> Just reading this email as I was on PTO for the last 1.5 weeks and didn't get through all the emails yet.
>>>>
>>>>> This does seem to be going in circles, especially since there seemed to be technical alignment on the last public call on March 29th.
>>>>
>>>> I guess there is a typo in the date here. It seems to be the 26th, not the 29th.
>>>>
>>>>> It is not feasible to do a real world implementation/POC of every design proposal.
>>>>
>>>> FWIW, I think it makes sense to PoC and test options that are going to be simply unavailable going forward if not explored now. Especially because we don't have any good solutions anyway ("Deferral of Work" is an architecturally wrong solution for OVS).
>>>>
>>>
>>> Hi Ilya,
>>>
>>> for those of us who haven't spent a long time working on OVS, can you perhaps explain a bit more as to why it is architecturally wrong? From my experience with DPDK, use of any lookaside accelerator, not just DMA but any crypto, compression or otherwise, requires asynchronous operation, and therefore some form of setting work aside temporarily to do other tasks.
>>
>> OVS doesn't use any lookaside accelerators and doesn't have any infrastructure for them.
>>
>> Let me create a DPDK analogy of what is proposed for OVS.
>>
>> DPDK has an ethdev API that abstracts different device drivers for the application. This API has a rte_eth_tx_burst() function that is supposed to send packets through the particular network interface.
>>
>> Imagine now that there is a network card that is not capable of sending packets right away and requires the application to come back later to finish the operation. That is an obvious problem, because rte_eth_tx_burst() doesn't require any extra actions and doesn't take ownership of packets that weren't consumed.
>>
>> The proposed solution for this problem is to change the ethdev API:
>>
>> 1. Allow rte_eth_tx_burst() to return -EINPROGRESS, which effectively means that packets were acknowledged, but not actually sent yet.
>>
>> 2. Require the application to call a new rte_eth_process_async() function sometime later, until it no longer returns -EINPROGRESS, in case the original rte_eth_tx_burst() call returned -EINPROGRESS.
>>
>> The main reason why this proposal is questionable:
>>
>> It's only one specific device that requires this special handling; all other devices are capable of sending packets right away. However, every DPDK application now has to implement some kind of "Deferral of Work" mechanism in order to be compliant with the updated DPDK ethdev API.
>>
>> Will DPDK make this API change?
>> I have no voice in DPDK API design decisions, but I'd argue against it.
>>
>> Interestingly, that's not really an imaginary proposal. That is the exact change required for the DPDK ethdev API in order to add vhost async support to the vhost ethdev driver.
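For illustration, this is roughly what the transmit path of every application would have to look like under such an API. The rte_eth_tx_burst_async() and rte_eth_process_async() calls below are hypothetical, as is the rest of the sketch; the point is only the extra bookkeeping and polling that every application would carry.

/* Hypothetical prototypes -- these functions do not exist in DPDK. */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <rte_mbuf.h>

int rte_eth_tx_burst_async(uint16_t port_id, uint16_t queue_id,
                           struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
int rte_eth_process_async(uint16_t port_id, uint16_t queue_id);

#define MAX_PORTS 32

static bool tx_in_progress[MAX_PORTS];     /* Per-port deferred-work flag. */

/* Transmit path: on -EINPROGRESS the packets are accepted but not yet
 * sent, so the port has to be remembered and polled later. */
static void
port_send(uint16_t port_id, uint16_t queue_id,
          struct rte_mbuf **pkts, uint16_t n_pkts)
{
    if (rte_eth_tx_burst_async(port_id, queue_id, pkts, n_pkts)
        == -EINPROGRESS) {
        tx_in_progress[port_id] = true;
    }
}

/* Extra step that the main loop now has to run on every iteration,
 * even though only one port type ever needs it. */
static void
poll_deferred_work(uint16_t queue_id)
{
    for (uint16_t port_id = 0; port_id < MAX_PORTS; port_id++) {
        if (tx_in_progress[port_id]
            && rte_eth_process_async(port_id, queue_id) != -EINPROGRESS) {
            tx_in_progress[port_id] = false;
        }
    }
}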
>>
>> Going back to OVS:
>>
>> An oversimplified architecture of OVS has 3 layers (top to bottom):
>>
>> 1. OFproto - the layer that handles OpenFlow.
>> 2. Datapath Interface - packet processing.
>> 3. Netdev - abstraction on top of all the different port types.
>>
>> Each layer has its own API that allows different implementations of the same layer to be used interchangeably without any modifications to higher layers. That's what APIs and encapsulation are for.
>>
>> So, the Netdev layer has its own API, and this API is actually very similar to DPDK's ethdev API. Simply because they are serving the same purpose - abstraction on top of different network interfaces. Besides different types of DPDK ports, there are also several types of native linux, bsd and windows ports, and a variety of different tunnel ports.
>>
>> The Datapath Interface layer is an "application" from the ethdev analogy above.
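Schematically, and only as a simplified sketch with hypothetical names (the real interface is the netdev provider class in OVS, which is much richer), the datapath calls through a generic send hook and does not know which port type sits behind it:

/* Simplified illustration of the layering -- not the actual OVS code. */
struct dp_packet_batch;                                 /* Opaque here. */

/* Netdev layer: one implementation per port type, all behind one API. */
struct netdev_ops {
    const char *type;                                   /* "dpdk", "vhostuser", "tap", ... */
    int (*send)(void *netdev, int qid,
                struct dp_packet_batch *batch);         /* Synchronous contract: batch is consumed. */
};

struct netdev {
    const struct netdev_ops *ops;
    void *priv;
};

/* Datapath Interface layer: calls through the abstraction without
 * knowing the port type.  An async netdev API would force a
 * completion-polling step into every datapath implementation, even
 * those that never see a vhost port. */
static int
dpif_output(struct netdev *dev, int qid, struct dp_packet_batch *batch)
{
    return dev->ops->send(dev->priv, qid, batch);
}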
>>
>> What is proposed by the "Deferral of Work" solution is to make pretty much the same API change that I described, but to the netdev layer API inside OVS, and to introduce a fairly complex (and questionable, but I'm not going into that right now) machinery to handle that API change in the datapath interface layer.
>>
>> So, exactly the same problem is here:
>>
>> If the API change is needed only for a single port type in a very specific hardware environment, why do we need to change the common API and rework a lot of the code in upper layers in order to accommodate that API change, while it makes no practical sense for any other port types or more generic hardware setups? And similar changes will have to be done in any other DPDK application that is not bound to specific hardware but wants to support vhost async.
>>
>> The right solution, IMO, is to make vhost async behave as any other physical NIC, since it is essentially a physical NIC now (we're not using DMA directly, it's a combined vhost+DMA solution), instead of propagating quirks of a single device to a common API.
>>
>> And going back to DPDK, this implementation doesn't allow use of vhost async in DPDK's own vhost ethdev driver.
>>
>> My initial reply to the "Deferral of Work" RFC with pretty much the same concerns:
>> https://patchwork.ozlabs.org/project/openvswitch/patch/20210907111725.43672-2-cian.ferriter@intel.com/#2751799
>>
>> Best regards, Ilya Maximets.
>
> Hi Ilya,
>
> Thanks for replying in more detail; understanding your perspective here helps to communicate the benefits and drawbacks of the various solutions. Agreed, the OfProto/Dpif/Netdev abstraction layers are strong abstractions in OVS, and in general they serve their purpose.
>
> A key difference between OVS's usage of DPDK Ethdev TX and VHost TX is that the performance of each is very different: as you know, sending a 1500 byte packet over a physical NIC, or via VHost into a guest, has a very different CPU cycle cost. Typically DPDK Tx takes ~5% of CPU cycles while vhost copies are often ~30%, but can be > 50% in certain packet-sizes/configurations.

I understand the performance difference; it's general knowledge that vhost rx/tx is costly. At the same time, ethdev TX and vhost TX are different in that the first one works with asynchronous devices, while the second one does not. And the goal here is to make vhost asynchronous too. So, you're trying to increase the similarity between them by differentiating the API... !?

> Let's view the performance of the above example from the perspective of an actual deployment: OVS is very often deployed to provide an accelerated packet interface to a guest/VM via Vhost/Virtio.
> Surely improving performance of this primary use-case is a valid reason to consider changes and improvements to an internal abstraction layer in OVS?

It was considered. But it also makes sense to consider solutions that don't require API changes. Regarding that being a primary use-case: yes, OVS <-> VM communication is a primary use-case. However, usage of accelerators requires a specific hardware platform, which may or may not be a primary use-case in the future. And DMA is not the only option even in the space of hardware accelerators (vdpa? I know that the vdpa concept is fairly different from DMA accelerators, but that doesn't invalidate my point).

>
> Today DPDK tx and vhost tx are called via the same netdev abstraction, but we must ask the questions:
> - Is the netdev abstraction really the best it can be?

It is sufficient for what it needs to do.

> - Does adding an optional "async" feature to the abstraction improve performance significantly? (positive from including?)

Yes, but so do other possible implementations that don't require API changes. Possibly, performance can be even higher.

> - Does adding the optional async feature cause actual degradation in DPIF implementations that don't support/use it? (negative due to including?)

It reduces the API clarity and increases the maintenance cost.

>
> Of course strong abstractions are valuable, and of course changing them requires careful thought.
> But let's be clear - it is probably fair to say that OVS is not deployed because it has good abstractions internally.
> It is deployed because it is useful, and serves the need of an end-user. And part of the end-user needs is performance.

Good internal abstractions are the key to code maintainability. We can make a complete mess of the code and it will still work fast. But that will cost us the ability to fix issues in the future, so end users will stop using OVS over time because it's full of bugs that we are unable to isolate and fix. Little things add up.

>
> The suggestion of integrating the "Defer Work" method of exposing async in the OVS Datapath is well thought out, and a clean way of handling async work in a per-thread manner at the application layer. It is the most common way of integrating lookaside acceleration in software pipelines, and handling the async work at application thread level is the only logical place where the programmer can reason about tradeoffs for a specific use-case. Adding "dma acceleration to Vhost" will inevitably lead to compromises in the DPDK implementation, and ones that might (or might not) work for OVS and other apps.

I'm not convinced about "inevitably lead to compromises", and I'm most certainly not convinced that they will actually lead to degraded performance in real-world cases. Moving "dma acceleration to Vhost" into the library and completing transfers by the hardware may even make it faster, due to lower delays between submitting the DMA job and the packet becoming available to the VM on the other side. And I'm pretty sure that moving the implementation into the vhost library will allow many different applications to use that functionality much more easily.
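As a rough sketch of that idea, using the public dmadev API rather than the actual vhost library internals: the descriptor flags update is enqueued as the last element on the same virtual channel, so the guest sees the packet as soon as the DMA engine finishes, with no second visit from the PMD thread. It assumes the device preserves ordering of operations within a vchan (otherwise an explicit fence between the two copies would be needed); all function names other than rte_dma_copy() and its flags are illustrative.

/* Rough sketch only -- not the vhost library implementation. */
#include <rte_dmadev.h>

static int
async_xmit_one(int16_t dma_id, uint16_t vchan,
               rte_iova_t payload_src, rte_iova_t payload_dst, uint32_t len,
               rte_iova_t flags_src, rte_iova_t flags_dst)
{
    int ret;

    /* 1. Copy the packet payload into the guest buffer. */
    ret = rte_dma_copy(dma_id, vchan, payload_src, payload_dst, len, 0);
    if (ret < 0) {
        return ret;     /* Ring full: caller falls back to a CPU copy. */
    }

    /* 2. Copy the 2-byte flags/index update for the used descriptor and
     *    kick the hardware.  Once this lands, the guest owns the buffer,
     *    with no further involvement from the PMD thread. */
    return rte_dma_copy(dma_id, vchan, flags_src, flags_dst,
                        sizeof(uint16_t), RTE_DMA_OP_FLAG_SUBMIT);
}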
>
> As you know, there have been OVS Conference presentations[1][2], RFCs and POCs[3][4][5][6], and community calls[7][8][9] on the topic.
> In the various presentations, the benefits of using application-level deferral of work are highlighted, and compared to other implementations which have non-desirable side-effects.

The presentations were one-sided (mostly pure marketing, but that is expected taking into account their purpose) and I objected to a lot of statements made during them. I didn't get any solid arguments against my objections. Most of the stated "non-desirable side-effects" can be avoided or may not be a problem at all with a thoughtful implementation. But I'm again repeating myself.

> We haven't heard any objections that people won't use OVS if the netdev abstraction is changed.

The same statement can be made about every internal code change in every other project. End users don't know what the code looks like, so they do not care.

>
> It seems there is a trade-off decision to be made;
> A) Change/improve the netdev abstraction to allow for async accelerations, and pay the cost of added app layer complexity

s/improve//

> B) Demand dma-acceleration is pushed down into vhost & below (as netdev abstraction is not going to be changed), resulting in sub-par and potentially unusable code for any given app, as lower-layers cannot reason about app level specifics

s/Demand/Reasonably ask/

"sub-par and potentially unusable code"... I can say the same about the current async vhost API - it's unusable for any given app. And I can say that because I saw what is required in order to use it in OVS. You found a way to hack it into OVS; that doesn't mean it's a good way of integrating it, and it doesn't mean that it's possible to do the same for any given app. It will definitely be a lot of work for many apps to integrate this functionality, since a lot of new code has to be written just for it. Try and implement support for async vhost in the vhost ethdev.

"potentially unusable code"... That is what experimental APIs are for. You're creating a new API, waiting for several applications to try to use it, gathering feedback, making API changes, and then stabilizing it when everyone is happy.

"lower-layers cannot reason about app level specifics"... But that is the whole point of DPDK - to be an abstraction layer, so that common stable APIs can be used and application developers do not need to rewrite half of their application every time.

>
> How can we the OVS/DPDK developers and users make a decision here?

There is an experimental API in DPDK, and it doesn't integrate well with OVS. The OVS community is asking to change that API (I didn't see a single positive review on the RFCs), which is a normal process for experimental APIs. You're refusing to do so.

I don't know many DPDK applications, but at least VPP will not be able to use vhost async, as it doesn't seem to use the vhost library, and vhost async is not available via the vhost ethdev in its current form. If applications can't use an API, such an API is probably not good.

I don't see a clear path forward here.

Best regards, Ilya Maximets.
>
> Regards, -Harry
>
> [1] https://www.openvswitch.org/support/ovscon2020/#C3
> [2] https://www.openvswitch.org/support/ovscon2021/#T12
> [3] rawdev; https://patchwork.ozlabs.org/project/openvswitch/patch/20201023094845.35652-2-sunil.pai.g@intel.com/
> [4] defer work; http://patchwork.ozlabs.org/project/openvswitch/list/?series=261267&state=*
> [5] v3; http://patchwork.ozlabs.org/project/openvswitch/patch/20220104125242.1064162-2-sunil.pai.g@intel.com/
> [6] v4; http://patchwork.ozlabs.org/project/openvswitch/patch/20220321173640.326795-2-sunil.pai.g@intel.com/
> [7] Slides session 1; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/OVS%20vhost%20async%20datapath%20design%202022.pdf
> [8] Slides session 2; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/OVS%20vhost%20async%20datapath%20design%202022%20session%202.pdf
> [9] Slides session 3; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/ovs_datapath_design_2022%20session%203.pdf