From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 9CE1F42B25; Wed, 17 May 2023 09:16:28 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 377234114A; Wed, 17 May 2023 09:16:28 +0200 (CEST) Received: from mail.lysator.liu.se (mail.lysator.liu.se [130.236.254.3]) by mails.dpdk.org (Postfix) with ESMTP id 3C12B4067B for ; Wed, 17 May 2023 09:16:27 +0200 (CEST) Received: from mail.lysator.liu.se (localhost [127.0.0.1]) by mail.lysator.liu.se (Postfix) with ESMTP id B06371F3F0 for ; Wed, 17 May 2023 09:16:26 +0200 (CEST) Received: by mail.lysator.liu.se (Postfix, from userid 1004) id AF5511F3EF; Wed, 17 May 2023 09:16:26 +0200 (CEST) X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on hermod.lysator.liu.se X-Spam-Level: X-Spam-Status: No, score=-2.2 required=5.0 tests=ALL_TRUSTED, AWL, NICE_REPLY_A, T_SCC_BODY_TEXT_LINE autolearn=disabled version=3.4.6 X-Spam-Score: -2.2 Received: from [192.168.1.59] (h-62-63-215-114.A163.priv.bahnhof.se [62.63.215.114]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-256) server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mail.lysator.liu.se (Postfix) with ESMTPSA id 44B381F374; Wed, 17 May 2023 09:16:25 +0200 (CEST) Message-ID: <1a5704dd-48be-c3e3-18e4-714c25c0edfa@lysator.liu.se> Date: Wed, 17 May 2023 09:16:24 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0 Subject: Re: [PATCH v3] eventdev: avoid non-burst shortcut for variable-size bursts To: Jerin Jacob Cc: =?UTF-8?Q?Mattias_R=c3=b6nnblom?= , "jerinj@marvell.com" , "dev@dpdk.org" , =?UTF-8?Q?Morten_Br=c3=b8rup?= References: <20230511081641.6693-1-mattias.ronnblom@ericsson.com> <20230511082415.6720-1-mattias.ronnblom@ericsson.com> <7c0013e1-d11b-2ad0-04b9-73be426d4719@ericsson.com> Content-Language: en-US From: =?UTF-8?Q?Mattias_R=c3=b6nnblom?= In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Virus-Scanned: ClamAV using ClamSMTP X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org On 2023-05-16 15:08, Jerin Jacob wrote: > On Tue, May 16, 2023 at 2:22 AM Mattias Rönnblom wrote: >> >> On 2023-05-15 14:38, Jerin Jacob wrote: >>> On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom >>> wrote: >>>> >>>> On 2023-05-12 13:59, Jerin Jacob wrote: >>>>> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom >>>>> wrote: >>>>>> >>>>>> Use non-burst event enqueue and dequeue calls from burst enqueue and >>>>>> dequeue only when the burst size is compile-time constant (and equal >>>>>> to one). >>>>>> >>>>>> Signed-off-by: Mattias Rönnblom >>>>>> >>>>>> --- >>>>>> >>>>>> v3: Actually include the change v2 claimed to contain. >>>>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if >>>>>> application is compiled with -pedantic. (Morten Brørup) >>>>>> --- >>>>>> lib/eventdev/rte_eventdev.h | 4 ++-- >>>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h >>>>>> index a90e23ac8b..a471caeb6d 100644 >>>>>> --- a/lib/eventdev/rte_eventdev.h >>>>>> +++ b/lib/eventdev/rte_eventdev.h >>>>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id, >>>>>> * Allow zero cost non burst mode routine invocation if application >>>>>> * requests nb_events as const one >>>>>> */ >>>>>> - if (nb_events == 1) >>>>>> + if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1) >>>>> >>>>> "Why" part is not clear from the commit message. Is this to avoid >>>>> nb_events read if it is built-in const. >>>> >>>> The __builtin_constant_p() is introduced to avoid having the compiler >>>> generate a conditional branch and two different code paths in case >>>> nb_elem is a run-time variable. >>>> >>>> In particular, this matters if nb_elems is run-time variable and varies >>>> between 1 and some larger value. >>>> >>>> I should have mention this in the commit message. >>>> >>>> A very slight performance improvement. It also makes the code better >>>> match the comment, imo. Zero cost for const one enqueues, but no impact >>>> non-compile-time-constant-length enqueues. >>>> >>>> Feel free to ignore. >>> >>> >>> I did some performance comparison of the patch. >>> A low-end ARM machines shows 0.7% drop with single event case. No >>> regression see with high-end ARM cores with single event case. >>> >>> IMO, optimizing the check for burst mode(the new patch) may not show >>> any real improvement as the cost is divided by number of event. >>> Whereas optimizing the check for single event case(The current code) >>> shows better performance with single event case and no regression >>> with burst mode as cost is divided by number of events. >> >> I ran some tests on an AMD Zen 3 with DSW. >> In the below tests the enqueue burst size is not compile-time constant. >> >> Enqueue burst size Performance improvement >> Run-time constant 1 ~5% >> Run-time constant 2 ~0% >> Run-time variable 1-2 ~9% >> Run-time variable 1-16 ~0% >> >> The run-time variable enqueue sizes randomly (uniformly) distributed in >> the specified range. >> >> The first result may come as a surprise. The benchmark is using >> RTE_EVENT_OP_FORWARD type events (which likely is the dominating op type >> in most apps). The single-event enqueue function only exists in a >> generic variant (i.e., no rte_event_enqueue_forward_burst() equivalent). >> I suspect that is the reason for the performance improvement. >> >> This effect is large-enough to make it somewhat beneficial (+~1%) to use >> run-time variable single-event enqueue compared to keeping the burst >> size compile-time constant. > > # Interesting, Could you share your testeventdev command to test it. I'm using a proprietary benchmark to evaluate the effect of these changes. There's certainly nothing secret about that program, and also nothing very DSW-specific either. I hope to at some point both extend DPDK eventdev tests to include DSW, and also to contribute benchmarks/characteristics tests (perf unit tests or as a separate program), if there seems to be a value in this. > # By having quick glance on DSW code, following change can be added(or > similar cases). > Not sure such change in DSW driver is making a difference or nor? > > > diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c > index e84b65d99f..455470997b 100644 > --- a/drivers/event/dsw/dsw_event.c > +++ b/drivers/event/dsw/dsw_event.c > @@ -1251,7 +1251,7 @@ dsw_port_flush_out_buffers(struct dsw_evdev > *dsw, struct dsw_port *source_port) > uint16_t > dsw_event_enqueue(void *port, const struct rte_event *ev) > { > - return dsw_event_enqueue_burst(port, ev, unlikely(ev == NULL) ? 0 : 1); > + return dsw_event_enqueue_burst(port, ev, 1); Good point. Historical note: I think that comparison is old cruft borne out of a misconception, that the single-event enqueue could be called directly from application code, combined with the fact that producer-only ports needed some way to "maintain" a port, prior to the introduction of rte_event_maintain(). > } > > static __rte_always_inline uint16_t > @@ -1340,7 +1340,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port > *source_port, > return (num_non_release + num_release); > } > > -uint16_t > +inline uint16_t From what it seems, this does not have the desired effect, at least not on GCC 11.3 (w/ the default DPDK compiler configuration). I reached this conclusion when I noticed that if I reshuffle the code so to force (not hint) the inlining of the burst (and generic burst) enqueue function into dsw_event_enqueue(), your change performs better. > dsw_event_enqueue_burst(void *port, const struct rte_event events[], > uint16_t events_len) > { > > # I am testing with command like this "/build/app/dpdk-test-eventdev > -l 0-23 -a 0002:0e:00.0 -- --test=perf_atq --plcores 1 --wlcores 8 > --stlist p --nb_pkts=10000000000" > I re-ran the compile-time variable, run-time constant enqueue size of 1, and I got the following: Jerin's change: +4% Jerin's change + ensure inlining: +6% RFC v3: +7% (Here I use a more different setup that produces more deterministic results, hence the different numbers compared to the previous runs. They were using a pipeline spread over two chiplets, and these runs are using only a single chiplet.) It seems like with your suggested changes you eliminate most of the single-enqueue-special case performance degradation (for DSW), but not all of it. The remaining degradation is very small (for the above case, larger for small by run-time variable enqueue sizes), but it's a little sad that a supposedly performance-enhancing special case (that drives complexity in the code, although not much) actually degrades performance. >> >> The performance gain is counted toward both enqueue and dequeue costs >> (+benchmark app overhead), so an under-estimation if see this as an >> enqueue performance improvement. >> >>> If you agree, then we can skip this patch. >>> >> >> I have no strong opinion if this should be included or not. >> >> It was up to me, I would drop the single-enqueue special case handling >> altogether in the next ABI update. > > That's a reasonable path. If we are willing to push a patch, we can > test it and give feedback. > Or in our spare time, We can do that as well. > Sure, I'll give it a try. The next release is an ABI-breaking one? >> >>> >>>> >>>>> If so, check should be following. Right? >>>>> >>>>> if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1) >>>>> || nb_events == 1) >>>>> >>>>> At least, It was my original intention in the code. >>>>> >>>>> >>>>> >>>>>> return (fp_ops->enqueue)(port, ev); >>>>>> else >>>>>> return fn(port, ev, nb_events); >>>>>> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[], >>>>>> * Allow zero cost non burst mode routine invocation if application >>>>>> * requests nb_events as const one >>>>>> */ >>>>>> - if (nb_events == 1) >>>>>> + if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1) >>>>>> return (fp_ops->dequeue)(port, ev, timeout_ticks); >>>>>> else >>>>>> return (fp_ops->dequeue_burst)(port, ev, nb_events, >>>>>> -- >>>>>> 2.34.1 >>>>>> >>>>