From mboxrd@z Thu Jan  1 00:00:00 1970
From: Maxime Coquelin <maxime.coquelin@redhat.com>
Date: Thu, 7 Apr 2022 17:46:32 +0200
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
To: Ilya Maximets, "Van Haaren, Harry", Morten Brørup, "Richardson, Bruce"
Cc: "Pai G, Sunil", "Stokes, Ian", "Hu, Jiayu", "Ferriter, Cian", ovs-dev@openvswitch.org, dev@dpdk.org, "Mcnamara, John", "O'Driscoll, Tim", "Finn, Emma"
Message-ID: <7162eebf-c62b-c16e-92e6-ea350ac0d647@redhat.com>
References: <98CBD80474FA8B44BF855DF32C47DC35D86F7D@smartserver.smartshare.dk>
 <7968dd0b-8647-8d7b-786f-dc876bcbf3f0@redhat.com>
 <98CBD80474FA8B44BF855DF32C47DC35D86F7E@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35D86F80@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35D86F82@smartserver.smartshare.dk>
 <94d817cb-8151-6644-c577-ed8b42d24337@redhat.com>
 <55c5a37f-3ee8-394b-8cff-e7daecb59f73@ovn.org>

On 4/7/22 17:01, Ilya Maximets wrote:
> On 4/7/22 16:42, Van Haaren, Harry wrote:
>>> -----Original Message-----
>>> From: Ilya Maximets
>>> Sent: Thursday, April 7, 2022 3:40 PM
>>> To: Maxime Coquelin; Van Haaren, Harry; Morten Brørup; Richardson, Bruce
>>> Cc: i.maximets@ovn.org; Pai G, Sunil; Stokes, Ian; Hu, Jiayu; Ferriter, Cian; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John; O'Driscoll, Tim; Finn, Emma
>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>
>>> On 4/7/22 16:25, Maxime Coquelin wrote:
>>>> Hi Harry,
>>>>
>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
>>>>> Hi OVS & DPDK, Maintainers & Community,
>>>>>
>>>>> Top-posting an overview of the discussion, as replies to the thread have become slower: perhaps it is a good time to review and plan the next steps?
>>>>>
>>>>> From my perspective, those most vocal in the thread seem to be in favour of the clean rx/tx split ("defer work"), with the tradeoff that the application must be aware of handling the async DMA completions. If there are any concerns opposing upstreaming of this method, please indicate this promptly, and we can continue the technical discussions here now.
>>>>
>>>> Wasn't there some discussion about handling the Virtio completions with the DMA engine? With that, we wouldn't need the deferral of work.
>>>
>>> +1
>>
>> Yes there was; the DMA/virtq completions thread is here for reference:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html
>>
>> I do not believe that there is a viable path to actually implementing it, and particularly not in the more complex cases, e.g. virtio with guest interrupts enabled.
>>
>> The thread above mentions additional threads and various other options, none of which I believe to be a clean or workable solution. I'd like input from other folks more familiar with the exact implementations of vhost/vrings, as well as those with DMA engine expertise.
>
> I tend to trust Maxime as a vhost maintainer in such questions. :)
>
> In my own opinion, though, the implementation is possible and the concerns don't sound deal-breaking, as solutions for them might work well enough. So I think the viability should be tested out before the solution is disregarded, especially because the decision will form the API of the vhost library.

I agree. We need a PoC adding interrupt support to the dmadev API using eventfd, and adding a thread in the Vhost library that polls for DMA interrupts and calls vhost_vring_call if needed.
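
Roughly, such a thread could look like the sketch below (illustration only; the dmadev interrupt/eventfd support is the missing piece, so rte_dma_get_event_fd() mentioned in the comment is a hypothetical new API, not something that exists in DPDK today):

#include <stdint.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <rte_vhost.h>

/* One entry per (DMA vchan -> vhost vring) mapping to service. */
struct dma_intr_ctx {
    int vid;            /* vhost device */
    uint16_t vring_idx; /* vring to notify */
    int efd;            /* eventfd signalled on DMA completion interrupt;
                         * would come from a hypothetical new dmadev call,
                         * e.g. rte_dma_get_event_fd(dma_id, vchan). */
};

static void *
dma_intr_thread(void *arg)
{
    struct dma_intr_ctx *ctx = arg;
    struct epoll_event ev = { .events = EPOLLIN, .data.ptr = ctx };
    int epfd = epoll_create1(0);

    epoll_ctl(epfd, EPOLL_CTL_ADD, ctx->efd, &ev);

    for (;;) {
        struct epoll_event out;
        uint64_t cnt;

        if (epoll_wait(epfd, &out, 1, -1) <= 0)
            continue;

        struct dma_intr_ctx *c = out.data.ptr;
        if (read(c->efd, &cnt, sizeof(cnt)) != sizeof(cnt))
            continue;

        /* The DMA engine already made the descriptors available to the
         * guest; only the guest notification is left, and
         * rte_vhost_vring_call() only kicks if the guest asked for it. */
        rte_vhost_vring_call(c->vid, c->vring_idx);
    }
    return NULL;
}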
>>
>>
>>> With the virtio completions handled by DMA itself, the vhost port
>>> turns almost into a real HW NIC. With that we will not need any
>>> extra manipulations from the OVS side, i.e. no need to defer any
>>> work while maintaining a clear split between rx and tx operations.
>>>
>>> I'd vote for that.
>>>
>>>>
>>>> Thanks,
>>>> Maxime
>>
>> Thanks for the prompt responses, and let's understand whether there is a viable, workable way to totally hide DMA completions from the application.
>>
>> Regards, -Harry
>>
>>
>>>>> In absence of continued technical discussion here, I suggest Sunil and Ian collaborate on getting the OVS defer-work approach and the DPDK vhost async patchsets available on GitHub for easier consumption and future development (as suggested in the slides presented on the last call).
>>>>>
>>>>> Regards, -Harry
>>>>>
>>>>> No inline-replies below; message just for context.
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Van Haaren, Harry
>>>>>> Sent: Wednesday, March 30, 2022 10:02 AM
>>>>>> To: Morten Brørup; Richardson, Bruce
>>>>>> Cc: Maxime Coquelin; Pai G, Sunil; Stokes, Ian; Hu, Jiayu; Ferriter, Cian; Ilya Maximets; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John; O'Driscoll, Tim; Finn, Emma
>>>>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Morten Brørup
>>>>>>> Sent: Tuesday, March 29, 2022 8:59 PM
>>>>>>> To: Van Haaren, Harry; Richardson, Bruce
>>>>>>> Cc: Maxime Coquelin; Pai G, Sunil; Stokes, Ian; Hu, Jiayu; Ferriter, Cian; Ilya Maximets; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John; O'Driscoll, Tim; Finn, Emma
>>>>>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>>>>>
>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>> Sent: Tuesday, 29 March 2022 19.46
>>>>>>>>
>>>>>>>>> From: Morten Brørup
>>>>>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
>>>>>>>>>
>>>>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>>>>> Sent: Tuesday, 29 March 2022 19.03
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>>>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Morten,
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
>>>>>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Morten Brørup
>>>>>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Having thought more about it, I think that a completely different architectural approach is required:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX packet burst functions, each optimized for different CPU vector instruction sets. The availability of a DMA engine should be treated the same way. So I suggest that PMDs copying packet contents, e.g. memif, pcap, vmxnet3, should implement DMA-optimized RX and TX packet burst functions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In such an architecture, it would be the application's job to allocate DMA channels and assign them to the specific PMDs that should use them. But the actual use of the DMA channels would move down below the application and into the DPDK PMDs and libraries.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Med venlig hilsen / Kind regards,
>>>>>>>>>>>>>>> -Morten Brørup
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Morten,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's *exactly* how this architecture is designed & implemented.
>>>>>>>>>>>>>> 1. The DMA configuration and initialization is up to the application (OVS).
>>>>>>>>>>>>>> 2. The VHost library is passed the DMA-dev ID and, via its new async rx/tx APIs, uses the DMA device to accelerate the copy.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looking forward to talking on the call that just started.
>>>>>>>>>>>>>> Regards, -Harry
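
For illustration, the split Harry describes might look roughly like this from the application's side (a sketch against the DPDK 22.03-era dmadev and vhost async APIs; signatures are from memory, error handling is omitted, and the app_-prefixed names are made up):

#include <rte_dmadev.h>
#include <rte_mbuf.h>
#include <rte_vhost_async.h>

/* 1. DMA configuration and initialization is owned by the application (OVS). */
static int
app_setup_dma(int16_t dma_id)
{
    struct rte_dma_conf dev_conf = { .nb_vchans = 1 };
    struct rte_dma_vchan_conf vchan_conf = {
        .direction = RTE_DMA_DIR_MEM_TO_MEM,
        .nb_desc = 1024,
    };

    rte_dma_configure(dma_id, &dev_conf);
    rte_dma_vchan_setup(dma_id, 0, &vchan_conf);
    return rte_dma_start(dma_id);
}

/* 2. The vhost library is only told which queue is async and which dma_id to
 *    use per burst; the copies themselves happen inside the library. */
static void
app_vhost_queue_setup(int vid, uint16_t virtqueue_id)
{
    rte_vhost_async_channel_register(vid, virtqueue_id);
}

static uint16_t
app_vhost_tx_burst(int vid, uint16_t virtqueue_id, struct rte_mbuf **pkts,
                   uint16_t count, int16_t dma_id)
{
    /* Copies are offloaded to dma_id / vchan 0; completion is deferred. */
    return rte_vhost_submit_enqueue_burst(vid, virtqueue_id, pkts, count,
                                          dma_id, 0);
}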
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the patches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then, I suppose that the TX completions can be handled in the TX function, and the RX completions can be handled in the RX function, just like the Ethdev PMDs handle packet descriptors:
>>>>>>>>>>>>>
>>>>>>>>>>>>> TX_Burst(tx_packet_array):
>>>>>>>>>>>>> 1. Clean up descriptors processed by the NIC chip. --> Process TX DMA channel completions. (Effectively, the 2nd pipeline stage.)
>>>>>>>>>>>>> 2. Pass on the tx_packet_array to the NIC chip descriptors. --> Pass on the tx_packet_array to the TX DMA channel. (Effectively, the 1st pipeline stage.)
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is that the Tx function might not be called again, so the packets enqueued in 2. may never be completed from a Virtio point of view. IOW, the packets will be copied to the Virtio descriptors' buffers, but the descriptors will not be made available to the Virtio driver.
>>>>>>>>>>>
>>>>>>>>>>> In that case, the application needs to call TX_Burst() periodically with an empty array, for completion purposes.
>>>>>>>>
>>>>>>>> This is what the "defer work" does at the OVS thread level, but instead of "brute-forcing" and *always* making the call, the defer-work concept tracks *when* there is outstanding work (DMA copies) to be completed ("deferred work") and calls the generic completion function at that point.
>>>>>>>>
>>>>>>>> So "defer work" is generic infrastructure at the OVS thread level to handle work that needs to be done "later", e.g. DMA completion handling.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK library, to handle DMA completion. It might even handle multiple DMA channels, if convenient - and if possible without locking or other weird complexity.
>>>>>>>>
>>>>>>>> That's exactly how it is done: the VHost library has a new API added which allows for handling completions. And in the "Netdev layer" (~OVS ethdev abstraction) we add a function to allow the OVS thread to do those completions, in a new Netdev-abstraction API called "async_process" where the completions can be checked.
>>>>>>>>
>>>>>>>> The only method to abstract them is to "hide" them somewhere that will always be polled, e.g. an ethdev port's RX function. Both the V3 and V4 approaches use this method. This allows "completions" to be transparent to the app, at the tradeoff of having bad separation of concerns, as Rx and Tx are now tied together.
>>>>>>>>
>>>>>>>> The point is, the Application layer must *somehow* handle completions. So fundamentally there are 2 options for the Application level:
>>>>>>>>
>>>>>>>> A) Make the application periodically call a "handle completions" function.
>>>>>>>>    A1) Defer work: call when needed, track "needed" at the app layer, and call into the vhost txq completion as required.
>>>>>>>>        Elegant in that "no work" means "no cycles spent" on checking DMA completions.
>>>>>>>>    A2) Brute-force: always call, and pay some overhead when not required.
>>>>>>>>        Cycle cost in "no work" scenarios. Depending on the number of vhost queues, this adds up, as polling is required *per vhost txq*.
>>>>>>>>        Also note that "checking DMA completions" means taking a virtq lock, so this "brute force" can needlessly increase cross-thread contention!
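
A rough sketch of what A1 could look like at the application/netdev layer (names and structure here are made up for illustration, not taken from the OVS patches; the rte_vhost_* async calls are the DPDK ones, but treat the details as approximate):

#include <rte_mbuf.h>
#include <rte_vhost_async.h>

/* Per vhost txq state kept by the application. */
struct vhost_txq_state {
    int vid;
    uint16_t virtqueue_id;
    int16_t dma_id;
    uint32_t nr_inflight;   /* async copies submitted but not yet completed */
};

/* TX path: submit copies and remember that completion work is now pending. */
static void
vhost_txq_send(struct vhost_txq_state *txq, struct rte_mbuf **pkts, uint16_t count)
{
    uint16_t n = rte_vhost_submit_enqueue_burst(txq->vid, txq->virtqueue_id,
                                                pkts, count, txq->dma_id, 0);
    txq->nr_inflight += n;
}

/* Deferred work, called from the thread's main loop: only touches the virtq
 * (and its lock) when there actually is outstanding work. */
static void
vhost_txq_try_complete(struct vhost_txq_state *txq)
{
    struct rte_mbuf *done[32];
    uint16_t n;

    if (txq->nr_inflight == 0)
        return; /* no work: no cycles spent, no virtq lock taken */

    n = rte_vhost_poll_enqueue_completed(txq->vid, txq->virtqueue_id,
                                         done, 32, txq->dma_id, 0);
    txq->nr_inflight -= n;
    rte_pktmbuf_free_bulk(done, n);
}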
>>>>>>>
>>>>>>> A side note: I don't see why locking is required to test for DMA completions. rte_dma_vchan_status() is lockless, e.g.:
>>>>>>> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L560
>>>>>>
>>>>>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lock-free manner from a single thread.
>>>>>>
>>>>>> The locks I refer to are at the OVS-netdev level, as virtqs are shared across OVS's dataplane threads. So the "M to N" comes from M dataplane threads to N virtqs, hence requiring some locking.
>>>>>>
>>>>>>
>>>>>>>> B) Hide completions and live with the complexity/architectural sacrifice of mixed Rx/Tx.
>>>>>>>>    Various downsides here in my opinion; see the slide deck presented earlier today for a summary.
>>>>>>>>
>>>>>>>> In my opinion, A1 is the most elegant solution, as it has a clean separation of concerns, does not cause avoidable contention on virtq locks, and spends no cycles when there is no completion work to do.
>>>>>>>>
>>>>>>>
>>>>>>> Thank you for elaborating, Harry.
>>>>>>
>>>>>> Thanks for partaking in the discussion & providing your insight!
>>>>>>
>>>>>>> I strongly oppose hiding any part of TX processing in an RX function. It is just wrong in so many ways!
>>>>>>>
>>>>>>> I agree that A1 is the most elegant solution. And being the most elegant solution, it is probably also the most future-proof solution. :-)
>>>>>>
>>>>>> I think so too, yes.
>>>>>>
>>>>>>> I would also like to stress that DMA completion handling belongs in the DPDK library, not in the application. And yes, the application will be required to call some "handle DMA completions" function in the DPDK library. But since the application already knows that it uses DMA, the application should also know that it needs to call this extra function - so I consider this requirement perfectly acceptable.
>>>>>>
>>>>>> Agree here.
>>>>>>
>>>>>>> I prefer if the DPDK vhost library can hide its inner workings from the application, and just expose the additional "handle completions" function. This also means that the inner workings can be implemented as "defer work", or by some other algorithm. And it can be tweaked and optimized later.
>>>>>>
>>>>>> Yes, the choice in how to call the handle_completions function is at the Application layer. For OVS we designed Defer Work, V3 and V4. But it is an App-level choice, and every application is free to choose its own method.
>>>>>>
>>>>>>> Thinking about the long-term perspective, this design pattern is common for both the vhost library and other DPDK libraries that could benefit from DMA (e.g.
>>>>>>> vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or a separate library. But for now, we should focus on the vhost use case, and just keep the long-term roadmap for using DMA in mind.
>>>>>>
>>>>>> Totally agree to keep the long-term roadmap in mind; but I'm not sure we can refactor the logic out of vhost. When DMA completions arrive, the virtq needs to be updated; this causes a tight coupling between the DMA completion count and the vhost library.
>>>>>>
>>>>>> As Ilya raised on the call yesterday, there is an "in_order" requirement in the vhost library: per virtq, the packets are presented to the guest "in order" of enqueue. (To be clear, *not* in order of DMA completion! As Jiayu mentioned, the Vhost library handles this today by re-ordering the DMA completions.)
>>>>>>
>>>>>>
>>>>>>> Rephrasing what I said on the conference call: This vhost design will become the common design pattern for using DMA in DPDK libraries. If we get it wrong, we are stuck with it.
>>>>>>
>>>>>> Agree, and if we get it right, then we're stuck with it too! :)
>>>>>>
>>>>>>
>>>>>>>>>>> Here is another idea, inspired by a presentation at one of the DPDK Userspace conferences. It may be wishful thinking, though:
>>>>>>>>>>>
>>>>>>>>>>> Add an additional transaction to each DMA burst; a special transaction containing the memory write operation that makes the descriptors available to the Virtio driver.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> That is something that can work, so long as the receiver is operating in polling mode. For cases where virtio interrupts are enabled, you still need to do a write to the eventfd in the kernel in vhost to signal the virtio side. That's not something that can be offloaded to a DMA engine, sadly, so we still need some form of completion call.
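
For context, that "write to the eventfd" amounts to something like the sketch below (illustrative; callfd stands for the vring's call eventfd that vhost receives from the VMM). It is a syscall into the kernel/KVM, not a plain memory write, which is why it cannot simply be expressed as one more descriptor in the DMA burst:

#include <stdint.h>
#include <unistd.h>

static void
vring_kick_guest(int callfd)
{
    uint64_t kick = 1;

    /* Writing to the call eventfd is what triggers the guest interrupt
     * (via KVM irqfd). If the guest is polling, no kick is needed. */
    if (write(callfd, &kick, sizeof(kick)) < 0) {
        /* Not much to do on failure; a later kick will catch up. */
    }
}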
>>>>>>>>>
>>>>>>>>> I guess that virtio interrupts is the most widely deployed scenario, so let's ignore the DMA TX completion transaction for now - and call it a possible future optimization for specific use cases. So it seems that some form of completion call is unavoidable.
>>>>>>>>
>>>>>>>> Agree to leave this aside; there is in theory a potential optimization, but it is unlikely to be of large value.
>>>>>>>>
>>>>>>>
>>>>>>> One more thing: When using DMA to pass packets on into a guest, there could be a delay from when the DMA completes until the guest is signaled. Is there any CPU cache hotness regarding the guest's access to the packet data to consider here? I.e. if we wait before signaling the guest, the packet data may get cold.
>>>>>>
>>>>>> Interesting question; we can likely spawn a new thread around this topic! In short, it depends on how/where the DMA hardware writes the copy.
>>>>>>
>>>>>> With technologies like DDIO, the "dest" part of the copy will be in the LLC. The core reading the dest data will benefit from the LLC locality (instead of snooping it from a remote core's L1/L2).
>>>>>>
>>>>>> Delays in notifying the guest could result in LLC capacity eviction, yes. The application layer decides how often/promptly to check for completions and notify the guest of them. Calling the function more often will result in less delay in that portion of the pipeline.
>>>>>>
>>>>>> Overall, there are caching benefits with DMA acceleration, and the application can control the latency introduced between DMA completion in HW and the guest vring update.
>>>>>
>>>>
>>
>