Subject: Re: Eventdev dequeue-enqueue event correlation
Date: Wed, 25 Oct 2023 09:40:54 +0200
From: Mattias Rönnblom
To: Bruce Richardson
Cc: dev@dpdk.org, Jerin Jacob, Peter Nilsson, svante.jarvstrat@ericsson.com,
 Harry van Haaren, Abdullah Sevincer, Mattias Rönnblom
Message-ID: <1b36dd59-d9d0-434c-922f-6e0df8cafe7d@lysator.liu.se>

On 2023-10-24 11:10, Bruce Richardson wrote:
> On Tue, Oct 24, 2023 at 09:10:30AM +0100, Bruce Richardson wrote:
>> On Mon, Oct 23, 2023 at 06:10:54PM +0200, Mattias Rönnblom wrote:
>>> Hi.
>>>
>>> Consider an Eventdev app using atomic-type scheduling doing something like:
>>>
>>> struct rte_event events[3];
>>>
>>> rte_event_dequeue_burst(dev_id, port_id, events, 3, 0);
>>>
>>> /* Assume three events were dequeued, and the application decides
>>>  * it's best off processing event 0 and 2 consecutively */
>>>
>>> process(&events[0]);
>>> process(&events[2]);
>>>
>>> events[0].queue_id++;
>>> events[0].op = RTE_EVENT_OP_FORWARD;
>>> events[2].queue_id++;
>>> events[2].op = RTE_EVENT_OP_FORWARD;
>>>
>>> rte_event_enqueue_burst(dev_id, port_id, &events[0], 1);
>>> rte_event_enqueue_burst(dev_id, port_id, &events[2], 1);
>>>
>>> process(&events[1]);
>>> events[1].queue_id++;
>>> events[1].op = RTE_EVENT_OP_FORWARD;
>>>
>>> rte_event_enqueue_burst(dev_id, port_id, &events[1], 1);
>>>
>>> If one were to just read the Eventdev API spec, one might expect this
>>> to work (especially since impl_opaque is hinted at as potentially
>>> being useful for the purpose of identifying events).
>>>
>>> However, on certain event devices, it doesn't (and maybe rightly so).
>>> If events 0 and 2 belong to the same flow (queue id + flow id pair),
>>> and event 1 belongs to some other flow, then this other flow would be
>>> "unlocked" at the point of the second enqueue operation (and thus be
>>> processed on some other core, in parallel). The first flow would
>>> still be needlessly "locked".
>>>
>>> Such event devices require the order of the enqueued events to be the
>>> same as that of the dequeued events, using RTE_EVENT_OP_RELEASE type
>>> events as "fillers" for dropped events.
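
(For concreteness, a minimal sketch of what such a device accepts,
reusing the example above; process() remains a placeholder, and, as in
the original example, return values are not checked. The processing
order may still be changed, but the enqueue order may not:)

struct rte_event events[3];
uint16_t i, n;

n = rte_event_dequeue_burst(dev_id, port_id, events, 3, 0);

/* The application is still free to pick the processing order... */
process(&events[0]);
process(&events[2]);
process(&events[1]);

for (i = 0; i < n; i++) {
        events[i].queue_id++;
        events[i].op = RTE_EVENT_OP_FORWARD;
}

/* ...but the events must go back in dequeue order. Had the app
 * decided to drop one of the events, that slot would instead have
 * carried an RTE_EVENT_OP_RELEASE event. */
rte_event_enqueue_burst(dev_id, port_id, events, n);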
>>>
>>> Am I missing something in the Eventdev API documentation?
>>>
>>
>> Much more likely is that the documentation is missing something. We
>> should explicitly clarify this behaviour, as it's required by a number
>> of drivers.
>>
>>> Could an event device use the impl_opaque field to track the identity
>>> of an event (and thus relax ordering requirements) and still be
>>> compliant toward the API?
>>>
>>
>> Possibly, but the documentation also doesn't state that the
>> impl_opaque field must be preserved between dequeue and enqueue. When
>> forwarding a packet it's well possible for an app to extract an mbuf
>> from a dequeued event and create a new event for sending it back in to
>> the eventdev. For

Such a behavior would be in violation of a part of the Eventdev API
contract that actually is specified. The rte_event struct documentation
says about impl_opaque that

"An implementation may use this field to hold implementation specific
value to share between dequeue and enqueue operation. The application
should not modify this field."

I see no other way to read this than that "an implementation" here
refers to an event device PMD. The requirement that the application must
not modify this field only makes sense in the context of "from dequeue
to enqueue".

>> example, if the first stage post-RX is doing classify, it's entirely
>> possible for every single field in the event header to be different
>> for the event returned compared to dequeue (flow_id recomputed, event
>> type/source adjusted, target queue_id and priority updated, op type
>> changed to forward from new, etc. etc.).
>>
>>> What happens if a RTE_EVENT_OP_NEW event is inserted into the mix of
>>> OP_FORWARD and OP_RELEASE type events being enqueued? Again I'm not
>>> clear on what the API says, if anything.
>>>
>> OP_NEW should have no effect on the "history-list" of events
>> previously dequeued. Again, our docs should clarify that explicitly.
>> Thanks for calling all this out.
>>
> Looking at the docs we have, I would propose adding a new subsection
> "Event Operations", as section 49.1.6 to [1]. There we could explain
> "New", "Forward" and "Release" events - what they mean for the
> different queue types and how to use them. That section could also
> cover the enqueue ordering rules, as the use of event "history" is
> necessary to explain releases and forwards.
>
> Does this seem reasonable? If nobody else has already started on
> updating docs for this, I'm happy enough to give it a stab.
>

Batch dequeues not only provide an opportunity to amortize
per-interaction overhead with the event device, they also allow the
application to reshuffle the order in which it decides to process the
events. Such reshuffling may have a very significant impact on
performance. At a minimum, cache locality improves, and in case the app
is able to do "vector processing" (e.g., something akin to what fd.io
VPP does), the gains may be further increased.

One may argue the app/core should just "do what it's told" by the event
device. After all, an event device is a work scheduler, and reshuffling
items of work certainly counts as (micro-)scheduling work. However, it
is too much to expect a fairly generic function, especially one that
comes in the form of hardware with a design frozen years ago, to be able
to arrange the work in whatever order is currently optimal for one
particular application.

What such an app can do (or must do, if it has efficiency constraints)
is to buffer the events on the output side, rearranging them in
accordance with the yet-seemingly-undocumented Eventdev API contract.
That's certainly possible, and not very difficult, but it seems to me
that this really is the job of something in the platform (e.g., Eventdev
itself or the event device PMD).
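
To sketch what I mean (a rough sketch only; the out_buffer names,
MAX_BURST, and the exact division of labor between the app and the
helper are all made up for the purpose of this mail):

#include <string.h>

#include <rte_eventdev.h>

#define MAX_BURST 32 /* assumed app-level maximum dequeue burst size */

struct out_buffer {
        struct rte_event events[MAX_BURST];
        uint16_t num;
};

/* Snapshot the dequeued burst, in dequeue order. */
static void
out_buffer_init(struct out_buffer *buf, const struct rte_event *deq,
                uint16_t num)
{
        memcpy(buf->events, deq, num * sizeof(struct rte_event));
        buf->num = num;
}

/* Called by the app in whatever order it chooses to process the
 * events; idx is the event's index in the original dequeue burst. */
static void
out_buffer_forward(struct out_buffer *buf, uint16_t idx,
                   const struct rte_event *ev)
{
        uint8_t impl_opaque = buf->events[idx].impl_opaque;

        buf->events[idx] = *ev;
        buf->events[idx].op = RTE_EVENT_OP_FORWARD;
        /* keep whatever the PMD stored at dequeue time */
        buf->events[idx].impl_opaque = impl_opaque;
}

static void
out_buffer_drop(struct out_buffer *buf, uint16_t idx)
{
        buf->events[idx].op = RTE_EVENT_OP_RELEASE;
}

/* Hand everything back to the event device, in dequeue order, with
 * RELEASEs serving as fillers for dropped events. */
static void
out_buffer_flush(struct out_buffer *buf, uint8_t dev_id, uint8_t port_id)
{
        uint16_t sent = 0;

        while (sent < buf->num)
                sent += rte_event_enqueue_burst(dev_id, port_id,
                                                &buf->events[sent],
                                                buf->num - sent);
}

The app would call out_buffer_init() right after
rte_event_dequeue_burst(), out_buffer_forward()/out_buffer_drop() as it
goes, and out_buffer_flush() before the next dequeue.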

One way out of this could be to add an "implicit release-*only*" mode of
operation for eventdev. In such a mode, the RTE_SCHED_TYPE_ATOMIC
per-flow "lock" (and its ORDERED equivalent, if there is one) would be
held until the next dequeue. The difference between OP_FORWARD and
OP_NEW events would then just be the back-pressure watermark
(new_event_threshold).

That pre-rte_event_enqueue_burst() buffering would prevent the event
device from releasing "locks" that could otherwise be released, but the
typical cost of event device interaction is so high that I have my
doubts about how useful that feature is. If you are worried about
"locks" being held for a long time, one may need to use short bursts
anyway (since worst-case critical section length is not reduced by such
RELEASEs).

Another option would be to have the current RTE_EVENT_DEV_CAP_BURST_MODE
capable PMDs start using the "impl_opaque" field for the purpose of
matching in and out events. That would require applications to actually
start adhering to the "don't touch impl_opaque" requirement of the
Eventdev API.

Those "fixes" are not mutually exclusive.

A side note: it's unfortunate there are no bits in the rte_event struct
that can be used for "event id"/"event SN"/"event dequeue idx" type
information, in case an app would like to work around this issue with
current PMDs.

> /Bruce
>
> [1] https://doc.dpdk.org/guides-23.07/prog_guide/eventdev.html