From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 5395042AF7; Wed, 17 May 2023 14:29:00 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id D8427406B7; Wed, 17 May 2023 14:28:59 +0200 (CEST) Received: from mail-vs1-f41.google.com (mail-vs1-f41.google.com [209.85.217.41]) by mails.dpdk.org (Postfix) with ESMTP id 22B614067B for ; Wed, 17 May 2023 14:28:58 +0200 (CEST) Received: by mail-vs1-f41.google.com with SMTP id ada2fe7eead31-436161f2cf2so200183137.2 for ; Wed, 17 May 2023 05:28:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1684326537; x=1686918537; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=HlihUfwWWrcOQIqMlH7UxGLCPh5o+GAfaB0SetjIyUA=; b=poTtd+kvHjTYJ/t2BTUmHaevt6Pk90pKImG5ngFEtVCQXY6VfuCU5I4HnVJo9jEmz1 +FmZQUKJr4+6m7xP9kgkWZVhHkqrriIINu4GBn6Ze1ACOu7CWjnHonIs0tKaqc7K1NK/ YRqAq87K9xOJQ2pu/dqBfFnZnOZPwVimDRbLRj/KHVqqbrSLK5nC3IWNBJYFR2iTmdpY FARAqInR4Fdo+DusIh5ZPxpyr6ahmF8mSKygedSoEZ+bugG7LlNznDAwvxztLqiN124b rkDxAuFwJ8oZR3vaH7NZzc+KMyf9gMNxPJMXDObetRQ5N8H7iPawHmNs+MdVKvGg7Ezk QUeg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1684326537; x=1686918537; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=HlihUfwWWrcOQIqMlH7UxGLCPh5o+GAfaB0SetjIyUA=; b=YF6AeQQY29VZfI23Bkx1XbS/sSyG+RidXp/Ct+9RYoT4b9h4BXWFlpxwDD2ROvpiQo M1pdOCzW3RMt/kv8hJ7SuSKe/qpiL0cF8dcWQ3jmwGde9p4dSn1In4Z/7zwki8u6/SEV ueQHfsV0ZwueSgGRUStCZwWk+bmBzebVNh2BMFFFK7Yys6Lz7y4zmaqvHpUeitsjJDby GTp9Cf1SHNkTUOCc2/TgKjY/VpGgqNYP7PECe8aEEnvqmPRgj+a37KqThVeD+SAzWR91 RIeBV1R57ldNOIG1af8znNxqn6tZe9ksL842ZdahAlDO85VAaqV96NuZGKJRz+fd8MR4 fXjw== X-Gm-Message-State: AC+VfDwOdcW/5SpjHTTzqgA6nzPcoCXCKcB3mfPCNgD8ArO8SX+JJCaS AC6wBxgXvdJMWbhaO/U4U/UlASVR3M2YP8JTMow= X-Google-Smtp-Source: ACHHUZ4JIHpyWfGn/E63v7tGcowUmII4pv7opwKS4cPO5PseJqR+WDB8zk2J71rDfDAu7mHcWdCB26cVp674U/NAzM8= X-Received: by 2002:a05:6102:382:b0:436:160e:9d56 with SMTP id m2-20020a056102038200b00436160e9d56mr13220783vsq.10.1684326537168; Wed, 17 May 2023 05:28:57 -0700 (PDT) MIME-Version: 1.0 References: <20230511081641.6693-1-mattias.ronnblom@ericsson.com> <20230511082415.6720-1-mattias.ronnblom@ericsson.com> <7c0013e1-d11b-2ad0-04b9-73be426d4719@ericsson.com> <1a5704dd-48be-c3e3-18e4-714c25c0edfa@lysator.liu.se> In-Reply-To: <1a5704dd-48be-c3e3-18e4-714c25c0edfa@lysator.liu.se> From: Jerin Jacob Date: Wed, 17 May 2023 17:58:30 +0530 Message-ID: Subject: Re: [PATCH v3] eventdev: avoid non-burst shortcut for variable-size bursts To: =?UTF-8?Q?Mattias_R=C3=B6nnblom?= Cc: =?UTF-8?Q?Mattias_R=C3=B6nnblom?= , "jerinj@marvell.com" , "dev@dpdk.org" , =?UTF-8?Q?Morten_Br=C3=B8rup?= Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org On Wed, May 17, 2023 at 12:46=E2=80=AFPM Mattias R=C3=B6nnblom wrote: > > On 2023-05-16 15:08, Jerin Jacob wrote: > > On Tue, May 16, 2023 at 2:22=E2=80=AFAM Mattias R=C3=B6nnblom wrote: > >> > >> On 2023-05-15 14:38, Jerin Jacob wrote: > >>> On Fri, May 12, 2023 at 6:45=E2=80=AFPM Mattias R=C3=B6nnblom > >>> wrote: > >>>> > >>>> On 2023-05-12 13:59, Jerin Jacob wrote: > >>>>> On Thu, May 11, 2023 at 2:00=E2=80=AFPM Mattias R=C3=B6nnblom > >>>>> wrote: > >>>>>> > >>>>>> Use non-burst event enqueue and dequeue calls from burst enqueue a= nd > >>>>>> dequeue only when the burst size is compile-time constant (and equ= al > >>>>>> to one). > >>>>>> > >>>>>> Signed-off-by: Mattias R=C3=B6nnblom > >>>>>> > >>>>>> --- > >>>>>> > >>>>>> v3: Actually include the change v2 claimed to contain. > >>>>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings= if > >>>>>> application is compiled with -pedantic. (Morten Br=C3=B8rup= ) > >>>>>> --- > >>>>>> lib/eventdev/rte_eventdev.h | 4 ++-- > >>>>>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>>>>> > >>>>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventd= ev.h > >>>>>> index a90e23ac8b..a471caeb6d 100644 > >>>>>> --- a/lib/eventdev/rte_eventdev.h > >>>>>> +++ b/lib/eventdev/rte_eventdev.h > >>>>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, ui= nt8_t port_id, > >>>>>> * Allow zero cost non burst mode routine invocation if= application > >>>>>> * requests nb_events as const one > >>>>>> */ > >>>>>> - if (nb_events =3D=3D 1) > >>>>>> + if (__extension__(__builtin_constant_p(nb_events)) && nb_e= vents =3D=3D 1) > >>>>> > >>>>> "Why" part is not clear from the commit message. Is this to avoid > >>>>> nb_events read if it is built-in const. > >>>> > >>>> The __builtin_constant_p() is introduced to avoid having the compile= r > >>>> generate a conditional branch and two different code paths in case > >>>> nb_elem is a run-time variable. > >>>> > >>>> In particular, this matters if nb_elems is run-time variable and var= ies > >>>> between 1 and some larger value. > >>>> > >>>> I should have mention this in the commit message. > >>>> > >>>> A very slight performance improvement. It also makes the code better > >>>> match the comment, imo. Zero cost for const one enqueues, but no imp= act > >>>> non-compile-time-constant-length enqueues. > >>>> > >>>> Feel free to ignore. > >>> > >>> > >>> I did some performance comparison of the patch. > >>> A low-end ARM machines shows 0.7% drop with single event case. No > >>> regression see with high-end ARM cores with single event case. > >>> > >>> IMO, optimizing the check for burst mode(the new patch) may not show > >>> any real improvement as the cost is divided by number of event. > >>> Whereas optimizing the check for single event case(The current code) > >>> shows better performance with single event case and no regression > >>> with burst mode as cost is divided by number of events. > >> > >> I ran some tests on an AMD Zen 3 with DSW. > >> In the below tests the enqueue burst size is not compile-time constant= . > >> > >> Enqueue burst size Performance improvement > >> Run-time constant 1 ~5% > >> Run-time constant 2 ~0% > >> Run-time variable 1-2 ~9% > >> Run-time variable 1-16 ~0% > >> > >> The run-time variable enqueue sizes randomly (uniformly) distributed i= n > >> the specified range. > >> > >> The first result may come as a surprise. The benchmark is using > >> RTE_EVENT_OP_FORWARD type events (which likely is the dominating op ty= pe > >> in most apps). The single-event enqueue function only exists in a > >> generic variant (i.e., no rte_event_enqueue_forward_burst() equivalent= ). > >> I suspect that is the reason for the performance improvement. > >> > >> This effect is large-enough to make it somewhat beneficial (+~1%) to u= se > >> run-time variable single-event enqueue compared to keeping the burst > >> size compile-time constant. > > > > # Interesting, Could you share your testeventdev command to test it. > > I'm using a proprietary benchmark to evaluate the effect of these > changes. There's certainly nothing secret about that program, and also > nothing very DSW-specific either. I hope to at some point both extend > DPDK eventdev tests to include DSW, and also to contribute > benchmarks/characteristics tests (perf unit tests or as a separate > program), if there seems to be a value in this. Yes. Please extend the testeventdev for your use case so that all drivers can test and help to optimize _real world_ cases. Testeventdev already has plugin kind of interface, it should pretty easy to add new MODES. > > > # By having quick glance on DSW code, following change can be added(or > > similar cases). > > Not sure such change in DSW driver is making a difference or nor? > > > > > > diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_even= t.c > > index e84b65d99f..455470997b 100644 > > --- a/drivers/event/dsw/dsw_event.c > > +++ b/drivers/event/dsw/dsw_event.c > > @@ -1251,7 +1251,7 @@ dsw_port_flush_out_buffers(struct dsw_evdev > > *dsw, struct dsw_port *source_port) > > uint16_t > > dsw_event_enqueue(void *port, const struct rte_event *ev) > > { > > - return dsw_event_enqueue_burst(port, ev, unlikely(ev =3D=3D NUL= L) ? 0 : 1); > > + return dsw_event_enqueue_burst(port, ev, 1); > > Good point. > > Historical note: I think that comparison is old cruft borne out of a > misconception, that the single-event enqueue could be called directly > from application code, combined with the fact that producer-only ports > needed some way to "maintain" a port, prior to the introduction of > rte_event_maintain(). > > > } > > > > static __rte_always_inline uint16_t > > @@ -1340,7 +1340,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port > > *source_port, > > return (num_non_release + num_release); > > } > > > > -uint16_t > > +inline uint16_t > > From what it seems, this does not have the desired effect, at least not > on GCC 11.3 (w/ the default DPDK compiler configuration). > > I reached this conclusion when I noticed that if I reshuffle the code so > to force (not hint) the inlining of the burst (and generic burst) > enqueue function into dsw_event_enqueue(), your change performs better. > > > dsw_event_enqueue_burst(void *port, const struct rte_event events[], > > uint16_t events_len) > > { > > > > # I am testing with command like this "/build/app/dpdk-test-eventdev > > -l 0-23 -a 0002:0e:00.0 -- --test=3Dperf_atq --plcores 1 --wlcores 8 > > --stlist p --nb_pkts=3D10000000000" > > > > > I re-ran the compile-time variable, run-time constant enqueue size of 1, > and I got the following: > > Jerin's change: +4% > Jerin's change + ensure inlining: +6% > RFC v3: +7% > > (Here I use a more different setup that produces more deterministic > results, hence the different numbers compared to the previous runs. They > were using a pipeline spread over two chiplets, and these runs are using > only a single chiplet.) > > It seems like with your suggested changes you eliminate most of the > single-enqueue-special case performance degradation (for DSW), but not > all of it. The remaining degradation is very small (for the above case, Cores like AMD Zen 3, I was not expecting 1% diff with such check. e.s.p if it has proper branch predictors. Even pretty low-end arm cores, had around 0.7% diff and new arm cores shows no difference. > larger for small by run-time variable enqueue sizes), but it's a little > sad that a supposedly performance-enhancing special case (that drives > complexity in the code, although not much) actually degrades performance. OK. Let's get rid of fp_ops->dequeue callback. Initial RFC of eventdev has public non burst API, that was the reason for that callback. > > >> > >> The performance gain is counted toward both enqueue and dequeue costs > >> (+benchmark app overhead), so an under-estimation if see this as an > >> enqueue performance improvement. > >> > >>> If you agree, then we can skip this patch. > >>> > >> > >> I have no strong opinion if this should be included or not. > >> > >> It was up to me, I would drop the single-enqueue special case handling > >> altogether in the next ABI update. > > > > That's a reasonable path. If we are willing to push a patch, we can > > test it and give feedback. > > Or in our spare time, We can do that as well. > > > > Sure, I'll give it a try. > > The next release is an ABI-breaking one? Yes (23.11). Please plan to send deprecation notice before 23.07 release. I will mark this patch as rejected in patchwork. Thanks for your time.