DPDK patches and discussions
 help / color / mirror / Atom feed
From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Bruce Richardson" <bruce.richardson@intel.com>
Cc: "Honnappa Nagarahalli" <Honnappa.Nagarahalli@arm.com>,
	"Tom Barbette" <barbette@kth.se>, <dev@dpdk.org>,
	"nd" <nd@arm.com>, "Alireza Farshin" <farshin@kth.se>,
	"Van Haaren, Harry" <harry.van.haaren@intel.com>
Subject: Re: [dpdk-dev] Minutes of Technical Board Meeting, 2021-03-10
Date: Wed, 7 Apr 2021 12:29:21 +0200	[thread overview]
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35C616C3@smartserver.smartshare.dk> (raw)
In-Reply-To: <20210407095853.GA1644@bricha3-MOBL.ger.corp.intel.com>

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, April 7, 2021 11:59 AM
> 
> On Wed, Apr 07, 2021 at 09:11:23AM +0200, Morten Brørup wrote:
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Honnappa
> > > Nagarahalli
> > > Sent: Wednesday, April 7, 2021 2:48 AM
> > >
> > > <snip>
> > > >
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tom
> Barbette
> > > > > Sent: Wednesday, March 31, 2021 10:53 AM
> > > > >
> > > > > Le 31-03-21 à 02:44, Honnappa Nagarahalli a écrit :
> > > > > > 	- Ability to tune the values of #defines
> > > > > >     * Few prominent points discussed
> > > > > > 	- This will result in #ifdefs in the code (for ex: in
> > > testpmd)
> > > > > > 	- One option is for all the PMDs to document their
> > > configurable
> > > > > #defines in PMD specific header files. Having these distributed
> is
> > > > > much easier to search.
> > > > > > 	- Can some of the existing #defines be converted to runtime
> > > > > configurations? For ex: RTE_MAX_LCORE? This might impact ABI.
> > > > > >     * Bruce to think about converting the doc to a blog or an
> > > email
> > > > > on the mailing list. But soliciting feedback is most important.
> > > > >
> > > > > One alternative path worth looking at is to encourage the use
> of
> > > LTO,
> > > > > and modify APIs so the configuration can be provided at linking
> > > time,
> > > > > and propagated by the compiler.
> > > > >
> > > > > E.g. one can define rte_max_lcore as a weak constant symbol,
> equal
> > > to
> > > > > 128. At linking time the user may provide a rte_max_lcore that
> is
> > > more
> > > > > tailored, and still, dynamic arrays[rte_max_lcore] will be
> > > allocatable
> > > > > on the .bss section, avoiding an indirection. The compiler will
> be
> > > > > able to optimize loops etc which is impossible with pure
> runtime
> > > > > configuration.
> > > > >
> > > > > In packetmill.io we actually pushed this to the next level
> where
> > > the
> > > > > driver can completely change its behavior without recompiling
> DPDK
> > > > > itself and spawning ifdefs everywhere.
> > > > >
> > > > > However the price is the slowiness of LTO...
> > > > >
> > > > > My 2 cents.
> > > > >
> > > > > Tom
> > > > >
> > > >
> > > > If we are moving away from Compile Time parameters, I certainly
> > > prefer Tom's
> > > > suggestion of Link Time parameters, rather than Run Time
> parameters.
> > > I think compile time constants are fine if they are not used in
> #ifdef.
> > > For ex: if they are used in 'if (...)', it will help eliminate code
> and
> > > branches.
> >
> > Yes!
> >
> > And "if (...)" is more flexible than #ifdef/#if because it allows the
> expression to be mixed with non-constants.
> >
> > Then perhaps Bruce's script to automatically make C constants out of
> #defines was not so silly anyway. :-)
> >
> > >
> > > >
> > > > This might also provide a middle ground for optimizations where
> > > Compile
> > > > Time parameters are considered unacceptable by the DPDK
> community.
> > > I'm
> > > > thinking about something along the lines of the "constant size"
> > > rte_event
> > > > array presented at the 2020 Userspace Summit by Harry
> > > >
> > >
> (https://static.sched.com/hosted_files/dpdkuserspace2020/d3/dpdk_usersp
> > > ac
> > > > e_20_api_performance_hvh.pdf). Taking this thinking even further
> out,
> > > a Link
> > > > Time parameter could perhaps replace the nb_pkts parameter in on
> > > > optimized rte_eth_rx_burst() function.
> > > >
> >
> > Optimally, I would like to see e.g. the RX burst size being so
> constant that the PMD's RX function knows it and can use vector
> functions and possibly loop unrolling, without having to implement a
> pre-check on nb_pkts and a trailing non-vector loop for receiving any
> remaining odd nb_pkts. All the DPDK examples use #define MAX_PKT_BURST
> 32 or similar, and I assume most DPDK applications do too.
> >
> > I do not trust the compiler to be clever enough to realize that the
> PMD's RX function is always called with a specific nb_pkts and optimize
> all this cruft away at compile time (or at link time), unless it is a
> #define or a compile time constant.
> >
> 
> It certainly is not possible to do at compile time, because the calls
> are
> in a different compilation unit from the functions themselves, not to
> mention that a link-time the RX functions are called via a function
> pointer.

Exactly. It would only work with a global #define or global compile time constant.

> Therefore the only way to do this that I am aware of, is to
> have a
> wrapper function use for the common values inside the drivers
> themselves.
> 
> For example, inside the i40e driver (which I'm using because it's the
> one
> I'm most familiar with), the main receive function is already a wrapper
> around a raw receive function, using constant-expansion by the compiler
> of
> the final parameter (NULL) to automatically remove the code for
> tracking
> scattered packets.
> 
> uint16_t
> i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
>                    uint16_t nb_pkts)
> {
> 	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
> }
> 
> We can produce a version of this optimized for 32-element dequeues by
> special-casing where nb_pkts == 32:
> 
> uint16_t
> i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
>                    uint16_t nb_pkts)
> {
> 	if (nb_pkts == 32)
> 		return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, 32,
> NULL);
> 	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
> }
> 
> Now the compiler when inlining the _raw_ function, can see that it
> needs
> two copies, and for the first, that nb_pkts is compile-time constant of
> 32.

Great example! Now, consider this modification to your example, where rte_eth_rx_burst_size is a global constant variable that can be evaluated at compile time:

uint16_t
i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
                   uint16_t nb_pkts)
{
-	if (nb_pkts == 32)
-		return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, 32, NULL);
+	if (rte_eth_rx_burst_size)
+		return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, rte_eth_rx_burst_size, NULL);
	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
}

> However, I'm not sure how useful an optimization like this is, and I'd
> be interested to see what benefits testing shows.

Yes, performance test results would be beneficial.

> Beyond the loop iteration
> count of 32, there is also the check after each burst of 8 dequeues in
> the
> driver to check that we have a full set of 8 - and abort the loop if
> not.

That loop could be optimized too, going from 8 to 32 - or going all the way to global_rx_burst_size.

> It's also the case that unless an app is already at maximum load (or
> overloaded), one would probably not expect to always get a full set of
> 32
> packets each time, as you have no additional headroom for more.
> 

Except if the application is designed to call rte_eth_rx_burst() less frequently to get a full burst as often as possible, using the NIC's many RX descriptors as burst buffer. Working with full bursts throughout the application provides higher total performance per clock cycle.

> /Bruce


      reply	other threads:[~2021-04-07 10:29 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-31  0:44 Honnappa Nagarahalli
2021-03-31  8:52 ` Tom Barbette
2021-04-06 13:13   ` Morten Brørup
2021-04-07  0:47     ` Honnappa Nagarahalli
2021-04-07  7:11       ` Morten Brørup
2021-04-07  9:58         ` Bruce Richardson
2021-04-07 10:29           ` Morten Brørup [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=98CBD80474FA8B44BF855DF32C47DC35C616C3@smartserver.smartshare.dk \
    --to=mb@smartsharesystems.com \
    --cc=Honnappa.Nagarahalli@arm.com \
    --cc=barbette@kth.se \
    --cc=bruce.richardson@intel.com \
    --cc=dev@dpdk.org \
    --cc=farshin@kth.se \
    --cc=harry.van.haaren@intel.com \
    --cc=nd@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).