DPDK patches and discussions
From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
To: "Konstantin Ananyev" <konstantin.v.ananyev@yandex.ru>,
	"Andrew Rybchenko" <andrew.rybchenko@oktetlabs.ru>,
	"Morten Brørup" <mb@smartsharesystems.com>,
	"Jerin Jacob" <jerinjacobk@gmail.com>
Cc: dpdk-dev <dev@dpdk.org>,
	"techboard@dpdk.org" <techboard@dpdk.org>, nd <nd@arm.com>,
	nd <nd@arm.com>
Subject: RE: Optimizations are not features
Date: Wed, 29 Jun 2022 20:44:29 +0000	[thread overview]
Message-ID: <DBAPR08MB5814C8A0A365FA629EE3B26398BB9@DBAPR08MB5814.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <497ed526-2fb7-805a-f72c-5909b85a62c1@yandex.ru>

<snip>

> 
> 04/06/2022 13:51, Andrew Rybchenko пишет:
> > On 6/4/22 15:19, Morten Brørup wrote:
> >>> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> >>> Sent: Saturday, 4 June 2022 13.10
> >>>
> >>> On Sat, Jun 4, 2022 at 3:30 PM Andrew Rybchenko
> >>> <andrew.rybchenko@oktetlabs.ru> wrote:
> >>>>
> >>>> On 6/4/22 12:33, Jerin Jacob wrote:
> >>>>> On Sat, Jun 4, 2022 at 2:39 PM Morten Brørup
> >>> <mb@smartsharesystems.com> wrote:
> >>>>>>
> >>>>>> I would like the DPDK community to change its view on compile
> >>>>>> time
> >>> options. Here is why:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Application specific performance micro-optimizations like “fast
> >>> mbuf free” and “mbuf direct re-arm” are being added to DPDK and
> >>> presented as features.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> They are not features, but optimizations, and I don’t understand
> >>> the need for them to be available at run-time!
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Instead of adding a bunch of exotic exceptions to the fast path
> >>>>>> of
> >>> the PMDs, they should be compile time options. This will improve
> >>> performance by avoiding branches in the fast path, both for the
> >>> applications using them, and for generic applications (where the
> >>> exotic code is omitted).
> >>>>>
> >>>>> Agree. I think keeping the best of both worlds would be:
> >>>>>
> >>>>> - Enable the feature/optimization at runtime.
> >>>>> - Have a compile-time option to disable the feature/optimization as
> >>>>>   an override.
> >>>>
> >>>> It is hard to find the right balance, but in general compile time
> >>>> options are a nightmare for maintenance. The number of required
> >>>> builds grows exponentially.
> >>
> >> Test combinations are exponential for N features, regardless of whether
> >> the N are runtime or compile time options.
> >
> > But since I'm talking about build checks, I don't care about
> > exponential growth at run time. Yes, testing should care, but that is a
> separate story.
> >
> >>
> >>>> Of course, we can
> >>>> limit the number of checked combinations, but that will result in a flow
> >>>> of patches to fix the build in the other cases.
> >>>
> >>> The build breakage can be fixed if we use (2) vs (1)
> >>>
> >>> 1)
> >>> #ifdef ...
> >>> My feature
> >>> #endif
> >>>
> >>> 2)
> >>> static __rte_always_inline int
> >>> rte_has_xyz_feature(void)
> >>> {
> >>> #ifdef RTE_LIBRTE_XYZ_FEATURE
> >>>         return RTE_LIBRTE_XYZ_FEATURE;
> >>> #else
> >>>         return 0;
> >>> #endif
> >>> }
> >>>
> >>> if (rte_has_xyz_feature()) {
> >>>         My feature code
> >>> }
> >>>
> >
> > Jerin, thanks, very good example.
> >
> >> I'm not sure all the features can be covered by that, e.g. added
> >> fields in structures.
> >
> > +1
> >
> >>
> >> Also, I would consider such features "opt in" at compile time only.
> >> As such, they could be allowed to break the ABI/API.
> >>
> >>>
> >>>
> >>>> Also compile time options tend to make code less readable which
> >>>> makes all aspects of the development harder.
> >>>>
> >>>> Yes, compile time is nice for micro-optimizations, but I have great
> >>>> concerns about whether it is the right way to go.
> >>>>
> >>>>>> Please note that I am only talking about the performance
> >>> optimizations that are limited to application specific use cases. I
> >>> think it makes sense to require that performance optimizing an
> >>> application also requires recompiling the performance critical
> >>> libraries used by it.
> >>>>>>
> >>>>>>
> >>>>>> Allowing compile time options for application specific
> >>>>>> performance
> >>> optimizations in DPDK would also open a path for other
> >>> optimizations, which can only be achieved at compile time, such as
> >>> “no fragmented packets”, “no attached mbufs” and “single mbuf pool”.
> >>> And even more exotic optimizations, such as the “indexed mempool
> >>> cache”, which was rejected due to ABI violations – they could be
> >>> marked as “risky and untested” or similar, but still be part of the DPDK main
> repository.
> >>>>>>
> 
> 
> Thanks Morten for bringing it up, it is an interesting topic.
> Though I look at it from a different angle.
> All the optimizations you mentioned above introduce new limitations:
> MBUF_FAST_FREE - no indirect mbufs and no multiple mempools; mempool object
> indexes - mempool size is limited to 4GB; direct rearm - drops the ability to
> stop/reconfigure the TX queue while the RX queue is still running; etc.
> Note that none of these limitations are forced by HW.
> All of them are pure SW limitations that developers forced in (or tried to)
> to gain a little extra performance.
> That's a concerning tendency.
> 
> As more and more such 'optimizations via limitation' come in:
> - The DPDK feature list will become more and more fragmented.
> - It will cause more and more confusion for users.
> - Unmet expectations - the difference in performance between the 'default'
>    and 'optimized' versions of DPDK will become bigger and bigger.
> - As Andrew already mentioned, maintaining all these 'sub-flavours'
>    of DPDK will become more and more difficult.
The point we need to remember is that these features/optimizations are introduced after seeing performance issues in practical use cases.
DPDK is not used in just one use case; it is used in several use cases, each with its own unique requirements. Is 4GB enough for packet buffers? Yes, it is enough in certain use cases. Are there NICs with a single port? Yes, there are. HW is created because use cases and business cases exist. It is obvious that as DPDK gets adopted on more platforms that differ widely, the number of features will increase and DPDK will become more complex. Complexity should not be used as a criterion to reject patches.

There is a different perspective on what you are calling 'limitations'. I can argue that multiple mempools, and stopping/reconfiguring a TX queue while the RX queue is still running, are exotic. Just because those are allowed currently (probably accidentally) does not mean they are being used. Are there use cases that make use of these features?

The base/existing design for DPDK was done with one particular HW architecture in mind, one with an abundance of resources. Unfortunately, HW architectures are evolving fast, and DPDK is being adopted in use cases where that kind of resources is not available. For example, efficiency cores are being introduced by every CPU vendor now. Soon enough, we will see big-little architectures in networking as well. The existing PMD design incurs 512B of stores (256B for copying to a stack variable and 256B to store to the lcore cache), plus 256B of loads/stores on the RX side, for every 32 packets, back to back. It does not make sense to have that kind of memcpy on little/efficiency cores just for the driver code.
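To make the overhead concrete, the double copy described above amounts to moving the same burst of 32 mbuf pointers twice. The following is only a rough, self-contained sketch of that pattern (the function and variable names are hypothetical, not actual PMD or mempool code):

```c
#include <string.h>
#include <stddef.h>

#define BURST 32

/* Sketch of one refill/free step: the driver bulk-moves BURST mbuf
 * pointers from the mempool's per-lcore cache into a temporary stack
 * array (BURST * sizeof(void *) bytes of stores, i.e. 256B on a
 * 64-bit machine), then writes them again into the queue's SW ring
 * (another 256B of stores). Returns the total bytes stored. */
static size_t refill_copy_bytes(void *cache[BURST], void *sw_ring[BURST])
{
    void *stack_tmp[BURST];                         /* 256B on the stack */

    memcpy(stack_tmp, cache, sizeof(stack_tmp));    /* cache -> stack    */
    memcpy(sw_ring, stack_tmp, sizeof(stack_tmp));  /* stack -> SW ring  */

    return 2 * sizeof(stack_tmp);                   /* bytes stored      */
}
```

Per burst of 32 packets that is already 2 * 32 * sizeof(void *) bytes of stores before any descriptor work, which is exactly the kind of memory traffic an efficiency core pays for disproportionately.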

> 
> So, probably instead of making such changes easier, we need somehow to
> persuade developers to think more about optimizations that would be generic
> and transparent to the user.
Or maybe we need to think of creating alternate ways of programming.

> I do realize that it is not always possible due to various reasons (HW limitations,
> external dependencies, etc.) but that's another story.
> 
> Let's take MBUF_FAST_FREE as an example.
> In fact, I am not sure that we need it as a tx offload flag at all.
> The PMD TX-path has all the necessary information to decide at run-time
> whether it can do fast_free() or not:
> At tx_burst() the PMD can check whether all mbufs satisfy these conditions
> (same mempool, refcnt==1) and update some fields and/or counters inside the
> TXQ to reflect it.
> Then, at tx_free() we can use this info to decide between fast_free() and
> normal_free().
> As tx_burst() reads the mbuf fields anyway, I guess the impact of this extra
> step would be minimal.
> Yes, most likely it wouldn't be as fast as the current TX offload flag or the
> conditional compilation approach.
> But it might still be significantly faster than normal_free(), plus such an
> approach would be generic and transparent to the user.
IMO, this depends on the philosophy we want to adopt. I would prefer to make the control plane complex in exchange for performance gains on the data plane. Performance on the data plane has a multiplying effect, due to the ratio of the number of cores assigned to the data plane vs the control plane.

I am not against evaluating alternatives, but the alternative approaches need to have similar (though not necessarily the same) performance.
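For reference, the runtime check Konstantin describes above could be sketched roughly as follows. All structure and function names here are hypothetical placeholders, not DPDK API; the point is only where the per-burst check and the free-time decision would sit:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified mbuf and TX queue. */
struct mbuf {
    void    *pool;      /* mempool this mbuf came from */
    uint16_t refcnt;
};

struct txq {
    bool  fast_free_ok; /* updated at tx_burst() time */
    void *pool;         /* the single mempool seen so far */
};

/* At tx_burst(): while the PMD touches each mbuf anyway, record whether
 * the burst still qualifies for the fast path (single mempool, refcnt
 * of exactly 1 on every mbuf). */
static void
txq_update_fast_free(struct txq *q, struct mbuf **pkts, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pkts[i]->refcnt != 1 ||
            (q->pool != NULL && pkts[i]->pool != q->pool)) {
            q->fast_free_ok = false;
            return;
        }
        q->pool = pkts[i]->pool;
    }
    q->fast_free_ok = true;
}

/* At tx_free(): choose between fast_free() and normal_free(). */
static bool
txq_use_fast_free(const struct txq *q)
{
    return q->fast_free_ok;
}
```

Even in this sketch, the extra per-mbuf comparison sits on the data path of every burst, which is exactly the branch-per-packet cost the compile-time and offload-flag approaches avoid.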

> 
> Konstantin


Thread overview: 14+ messages
2022-06-04  9:09 Morten Brørup
2022-06-04  9:33 ` Jerin Jacob
2022-06-04 10:00   ` Andrew Rybchenko
2022-06-04 11:10     ` Jerin Jacob
2022-06-04 12:19       ` Morten Brørup
2022-06-04 12:51         ` Andrew Rybchenko
2022-06-05  8:15           ` Morten Brørup
2022-06-05 16:05           ` Stephen Hemminger
2022-06-06  9:35           ` Konstantin Ananyev
2022-06-29 20:44             ` Honnappa Nagarahalli [this message]
2022-06-30 15:39               ` Morten Brørup
2022-07-03 19:38               ` Konstantin Ananyev
2022-07-04 16:33                 ` Stephen Hemminger
2022-07-04 22:06                   ` Morten Brørup
