DPDK patches and discussions
 help / color / mirror / Atom feed
From: Bruce Richardson <bruce.richardson@intel.com>
To: jigsaw <jigsaw@gmail.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
Date: Wed, 12 Nov 2014 15:51:46 +0000	[thread overview]
Message-ID: <20141112155145.GA7952@bricha3-MOBL3> (raw)
In-Reply-To: <CAHVfvh7wgkRGT=SrE-e6h=VGHh-japqQKOmmj7aP1THh=aLb3A@mail.gmail.com>

On Tue, Nov 11, 2014 at 10:53:40AM +0200, jigsaw wrote:
> Hi Bruce,
> 
> This patch has little, if any, performance impact.
> See the perf stat -d for original and patched version
> of test_distributor_perf.

Thanks for running the test. Results look ok to me. I confirm my previous ack.

/Bruce

> 
> Original version
> ============
> perf stat -d ./test_orig  -cffff -n2
> ============
> ==== Cache line switch test ===
> 
> [4/4590]
> Time for 1048576 iterations = 362349056 ticks
> Ticks per iteration = 345
> 
> === Performance test of distributor ===
> Time per burst:  7043
> Time per packet: 220
> 
> Worker 0 handled 3377372 packets
> Worker 1 handled 2857280 packets
> Worker 2 handled 2120982 packets
> Worker 3 handled 2112720 packets
> Worker 4 handled 2102014 packets
> Worker 5 handled 2101314 packets
> Worker 6 handled 2099248 packets
> Worker 7 handled 2098560 packets
> Worker 8 handled 2098114 packets
> Worker 9 handled 2097962 packets
> Worker 10 handled 2097892 packets
> Worker 11 handled 2097856 packets
> Worker 12 handled 2097762 packets
> Worker 13 handled 2097726 packets
> Worker 14 handled 2097630 packets
> Total packets: 33554432 (2000000)
> === Perf test done ===
> 
> 
>  Performance counter stats for './test_orig -cffff -n2 --no-huge':
> 
>       45784.935475 task-clock                #   12.847 CPUs utilized
>               4822 context-switches          #    0.105 K/sec
>                  7 CPU-migrations            #    0.000 K/sec
>               1017 page-faults               #    0.022 K/sec
>       118181375409 cycles                    #    2.581 GHz
>     [40.00%]
>        88915381373 stalled-cycles-frontend   #   75.24% frontend cycles
> idle    [39.98%]
>        77015854611 stalled-cycles-backend    #   65.17% backend  cycles
> idle    [39.96%]
>        30469536186 instructions              #    0.26  insns per cycle
>                                              #    2.92  stalled cycles per
> insn [49.96%]
>         6481829773 branches                  #  141.571 M/sec
>     [49.97%]
>           45283365 branch-misses             #    0.70% of all branches
>     [50.00%]
>         6537115556 L1-dcache-loads           #  142.779 M/sec
>     [50.05%]
>          249533128 L1-dcache-load-misses     #    3.82% of all L1-dcache
> hits   [50.08%]
>          144469663 LLC-loads                 #    3.155 M/sec
>     [40.05%]
>           69897760 LLC-load-misses           #   48.38% of all LL-cache
> hits    [40.03%]
> 
>        3.563886431 seconds time elapsed
> 
> 
> Patched version
> =============
> perf stat -d ./test_patch  -cffff -n2
> ==== Cache line switch test ===
> 
> [4/4747]
> Time for 1048576 iterations = 456390606 ticks
> Ticks per iteration = 435
> 
> === Performance test of distributor ===
> Time per burst:  6566
> Time per packet: 205
> 
> Worker 0 handled 2528418 packets
> Worker 1 handled 2531906 packets
> Worker 2 handled 2524133 packets
> Worker 3 handled 2507720 packets
> Worker 4 handled 2332648 packets
> Worker 5 handled 2300046 packets
> Worker 6 handled 2213035 packets
> Worker 7 handled 2125486 packets
> Worker 8 handled 2096726 packets
> Worker 9 handled 2089150 packets
> Worker 10 handled 2079626 packets
> Worker 11 handled 2074512 packets
> Worker 12 handled 2073560 packets
> Worker 13 handled 2054838 packets
> Worker 14 handled 2022628 packets
> Total packets: 33554432 (2000000)
> === Perf test done ===
> 
> 
>  Performance counter stats for './test_patch -cffff -n2 --no-huge':
> 
>       42784.064267 task-clock                #   12.546 CPUs utilized
>               4517 context-switches          #    0.106 K/sec
>                  5 CPU-migrations            #    0.000 K/sec
>               1020 page-faults               #    0.024 K/sec
>       110531300647 cycles                    #    2.583 GHz
>     [40.06%]
>        83135557406 stalled-cycles-frontend   #   75.21% frontend cycles
> idle    [40.02%]
>        71862086289 stalled-cycles-backend    #   65.02% backend  cycles
> idle    [39.99%]
>        28847200612 instructions              #    0.26  insns per cycle
>                                              #    2.88  stalled cycles per
> insn [49.98%]
>         6059139104 branches                  #  141.621 M/sec
>     [49.94%]
>           38455463 branch-misses             #    0.63% of all branches
>     [49.95%]
>         6090169906 L1-dcache-loads           #  142.347 M/sec
>     [49.99%]
>          235094118 L1-dcache-load-misses     #    3.86% of all L1-dcache
> hits   [50.03%]
>          129974492 LLC-loads                 #    3.038 M/sec
>     [40.07%]
>           56701883 LLC-load-misses           #   43.63% of all LL-cache
> hits    [40.08%]
> 
>        3.410231367 seconds time elapsed
> 
> 
> The patched version even runs a bit faster.
> 
> Perf annotation shows that 92% time is spent in the while loop
> of rte_distributor_get_pkt.
> Anothoer 6% spent in the calculation of match
> inside rte_distributor_process.
> 
> Also we can see that there is a big amount of LLC load miss, which is
> perhaps due to cache conflict
> in the while-loop of rte_distributor_get_pkt?
> 
> I'm no going to draw any conclusion here but evidently the
> in_flight_bitmask is not in the hot path, and
> is irrelevant to the performance of distributor.
> 
> Thanks a lot for your code review and ack.
> 
> thx &
> rgds,
> -qinglai
> 
> 
> On Mon, Nov 10, 2014 at 6:55 PM, Bruce Richardson <
> bruce.richardson@intel.com> wrote:
> 
> > On Mon, Nov 10, 2014 at 05:52:26PM +0200, jigsaw wrote:
> > > Hi Bruce,
> > >
> > > Sorry I didn't. I will run a performance test tomorrow.
> > >
> > > thx &
> > > rgds,
> > > -qinglai
> > >
> >
> > On the assumption that no performance regressions show up...
> > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> >
> >
> > > On Mon, Nov 10, 2014 at 5:13 PM, Bruce Richardson <
> > > bruce.richardson@intel.com> wrote:
> > >
> > > > On Mon, Nov 10, 2014 at 04:44:02PM +0200, Qinglai Xiao wrote:
> > > > > With introduction of in_flight_bitmask, the whole 32 bits of tag can
> > be
> > > > > used. Further more, this patch fixed the integer overflow when
> > finding
> > > > > the matched tags.
> > > > > The maximum number workers is now defined as 64, which is length of
> > > > > double-word. The link between number of workers and RTE_MAX_LCORE is
> > > > > now removed. Compile time check is added to ensure the
> > > > > RTE_DISTRIB_MAX_WORKERS is less than or equal to size of double-word.
> > > > >
> > > > > Signed-off-by: Qinglai Xiao <jigsaw@gmail.com>
> > > >
> > > > Looks good to me.
> > > > Just before I ack this, have you checked to see if there is any
> > > > performance impact?
> > > >
> > > > /Bruce
> > > >
> > > > > ---
> > > > >  lib/librte_distributor/rte_distributor.c |   64
> > > > ++++++++++++++++++++++--------
> > > > >  lib/librte_distributor/rte_distributor.h |    4 ++
> > > > >  2 files changed, 51 insertions(+), 17 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_distributor/rte_distributor.c
> > > > b/lib/librte_distributor/rte_distributor.c
> > > > > index 3dfec4a..2c5d61c 100644
> > > > > --- a/lib/librte_distributor/rte_distributor.c
> > > > > +++ b/lib/librte_distributor/rte_distributor.c
> > > > > @@ -62,6 +62,13 @@
> > > > >  #define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1)
> > > > >
> > > > >  /**
> > > > > + * Maximum number of workers allowed.
> > > > > + * Be aware of increasing the limit, becaus it is limited by how we
> > > > track
> > > > > + * in-flight tags. See @in_flight_bitmask and
> > @rte_distributor_process
> > > > > + */
> > > > > +#define RTE_DISTRIB_MAX_WORKERS      64
> > > > > +
> > > > > +/**
> > > > >   * Buffer structure used to pass the pointer data between cores.
> > This
> > > > is cache
> > > > >   * line aligned, but to improve performance and prevent adjacent
> > > > cache-line
> > > > >   * prefetches of buffers for other workers, e.g. when worker 1's
> > buffer
> > > > is on
> > > > > @@ -91,11 +98,17 @@ struct rte_distributor {
> > > > >       char name[RTE_DISTRIBUTOR_NAMESIZE];  /**< Name of the ring. */
> > > > >       unsigned num_workers;                 /**< Number of workers
> > > > polling */
> > > > >
> > > > > -     uint32_t in_flight_tags[RTE_MAX_LCORE];
> > > > > -             /**< Tracks the tag being processed per core, 0 == no
> > pkt
> > > > */
> > > > > -     struct rte_distributor_backlog backlog[RTE_MAX_LCORE];
> > > > > +     uint32_t in_flight_tags[RTE_DISTRIB_MAX_WORKERS];
> > > > > +             /**< Tracks the tag being processed per core */
> > > > > +     uint64_t in_flight_bitmask;
> > > > > +             /**< on/off bits for in-flight tags.
> > > > > +              * Note that if RTE_DISTRIB_MAX_WORKERS is larger than
> > 64
> > > > then
> > > > > +              * the bitmask has to expand.
> > > > > +              */
> > > > > +
> > > > > +     struct rte_distributor_backlog
> > backlog[RTE_DISTRIB_MAX_WORKERS];
> > > > >
> > > > > -     union rte_distributor_buffer bufs[RTE_MAX_LCORE];
> > > > > +     union rte_distributor_buffer bufs[RTE_DISTRIB_MAX_WORKERS];
> > > > >
> > > > >       struct rte_distributor_returned_pkts returns;
> > > > >  };
> > > > > @@ -189,6 +202,7 @@ static inline void
> > > > >  handle_worker_shutdown(struct rte_distributor *d, unsigned wkr)
> > > > >  {
> > > > >       d->in_flight_tags[wkr] = 0;
> > > > > +     d->in_flight_bitmask &= ~(1UL << wkr);
> > > > >       d->bufs[wkr].bufptr64 = 0;
> > > > >       if (unlikely(d->backlog[wkr].count != 0)) {
> > > > >               /* On return of a packet, we need to move the
> > > > > @@ -211,7 +225,10 @@ handle_worker_shutdown(struct rte_distributor
> > *d,
> > > > unsigned wkr)
> > > > >                       pkts[i] = (void *)((uintptr_t)(bl->pkts[idx] >>
> > > > >                                       RTE_DISTRIB_FLAG_BITS));
> > > > >               }
> > > > > -             /* recursive call */
> > > > > +             /* recursive call.
> > > > > +              * Note that the tags were set before first level call
> > > > > +              * to rte_distributor_process.
> > > > > +              */
> > > > >               rte_distributor_process(d, pkts, i);
> > > > >               bl->count = bl->start = 0;
> > > > >       }
> > > > > @@ -242,6 +259,7 @@ process_returns(struct rte_distributor *d)
> > > > >                       else {
> > > > >                               d->bufs[wkr].bufptr64 =
> > > > RTE_DISTRIB_GET_BUF;
> > > > >                               d->in_flight_tags[wkr] = 0;
> > > > > +                             d->in_flight_bitmask &= ~(1UL << wkr);
> > > > >                       }
> > > > >                       oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > > > >               } else if (data & RTE_DISTRIB_RETURN_BUF) {
> > > > > @@ -284,14 +302,18 @@ rte_distributor_process(struct rte_distributor
> > *d,
> > > > >                       next_value = (((int64_t)(uintptr_t)next_mb)
> > > > >                                       << RTE_DISTRIB_FLAG_BITS);
> > > > >                       /*
> > > > > -                      * Set the low bit on the tag, so we can
> > guarantee
> > > > that
> > > > > -                      * we never store a tag value of zero. That
> > means
> > > > we can
> > > > > -                      * use the zero-value to indicate that no
> > packet is
> > > > > -                      * being processed by a worker.
> > > > > +                      * User is advocated to set tag vaue for each
> > > > > +                      * mbuf before calling rte_distributor_process.
> > > > > +                      * User defined tags are used to identify
> > flows,
> > > > > +                      * or sessions.
> > > > >                        */
> > > > > -                     new_tag = (next_mb->hash.usr | 1);
> > > > > +                     new_tag = next_mb->hash.usr;
> > > > >
> > > > > -                     uint32_t match = 0;
> > > > > +                     /*
> > > > > +                      * Note that if RTE_DISTRIB_MAX_WORKERS is
> > larger
> > > > than 64
> > > > > +                      * then the size of match has to be expanded.
> > > > > +                      */
> > > > > +                     uint64_t match = 0;
> > > > >                       unsigned i;
> > > > >                       /*
> > > > >                        * to scan for a match use "xor" and "not" to
> > get
> > > > a 0/1
> > > > > @@ -303,9 +325,12 @@ rte_distributor_process(struct rte_distributor
> > *d,
> > > > >                               match |= (!(d->in_flight_tags[i] ^
> > new_tag)
> > > > >                                       << i);
> > > > >
> > > > > +                     /* Only turned-on bits are considered as match
> > */
> > > > > +                     match &= d->in_flight_bitmask;
> > > > > +
> > > > >                       if (match) {
> > > > >                               next_mb = NULL;
> > > > > -                             unsigned worker = __builtin_ctz(match);
> > > > > +                             unsigned worker =
> > __builtin_ctzl(match);
> > > > >                               if (add_to_backlog(&d->backlog[worker],
> > > > >                                               next_value) < 0)
> > > > >                                       next_idx--;
> > > > > @@ -322,6 +347,7 @@ rte_distributor_process(struct rte_distributor
> > *d,
> > > > >                       else {
> > > > >                               d->bufs[wkr].bufptr64 = next_value;
> > > > >                               d->in_flight_tags[wkr] = new_tag;
> > > > > +                             d->in_flight_bitmask |= (1UL << wkr);
> > > > >                               next_mb = NULL;
> > > > >                       }
> > > > >                       oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > > > > @@ -379,11 +405,13 @@ rte_distributor_returned_pkts(struct
> > > > rte_distributor *d,
> > > > >  static inline unsigned
> > > > >  total_outstanding(const struct rte_distributor *d)
> > > > >  {
> > > > > -     unsigned wkr, total_outstanding = 0;
> > > > > +     unsigned wkr, total_outstanding;
> > > > > +
> > > > > +     total_outstanding = __builtin_popcountl(d->in_flight_bitmask);
> > > > >
> > > > >       for (wkr = 0; wkr < d->num_workers; wkr++)
> > > > > -             total_outstanding += d->backlog[wkr].count +
> > > > > -                             !!(d->in_flight_tags[wkr]);
> > > > > +             total_outstanding += d->backlog[wkr].count;
> > > > > +
> > > > >       return total_outstanding;
> > > > >  }
> > > > >
> > > > > @@ -423,9 +451,11 @@ rte_distributor_create(const char *name,
> > > > >
> > > > >       /* compilation-time checks */
> > > > >       RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0);
> > > > > -     RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0);
> > > > > +     RTE_BUILD_BUG_ON((RTE_DISTRIB_MAX_WORKERS & 7) != 0);
> > > > > +     RTE_BUILD_BUG_ON(RTE_DISTRIB_MAX_WORKERS >
> > > > > +                             sizeof(d->in_flight_bitmask) *
> > CHAR_BIT);
> > > > >
> > > > > -     if (name == NULL || num_workers >= RTE_MAX_LCORE) {
> > > > > +     if (name == NULL || num_workers >= RTE_DISTRIB_MAX_WORKERS) {
> > > > >               rte_errno = EINVAL;
> > > > >               return NULL;
> > > > >       }
> > > > > diff --git a/lib/librte_distributor/rte_distributor.h
> > > > b/lib/librte_distributor/rte_distributor.h
> > > > > index ec0d74a..cc1d559 100644
> > > > > --- a/lib/librte_distributor/rte_distributor.h
> > > > > +++ b/lib/librte_distributor/rte_distributor.h
> > > > > @@ -88,6 +88,10 @@ rte_distributor_create(const char *name, unsigned
> > > > socket_id,
> > > > >   * packets. The distributor will ensure that no two packets that
> > have
> > > > the
> > > > >   * same flow id, or tag, in the mbuf will be procesed at the same
> > time.
> > > > >   *
> > > > > + * The user is advocated to set tag for each mbuf before calling
> > this
> > > > function.
> > > > > + * If user doesn't set the tag, the tag value can be various values
> > > > depending on
> > > > > + * driver implementation and configuration.
> > > > > + *
> > > > >   * This is not multi-thread safe and should only be called on a
> > single
> > > > lcore.
> > > > >   *
> > > > >   * @param d
> > > > > --
> > > > > 1.7.1
> > > > >
> > > >
> >

  reply	other threads:[~2014-11-12 15:42 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-11-10 14:44 Qinglai Xiao
2014-11-10 15:13 ` Bruce Richardson
2014-11-10 15:52   ` jigsaw
2014-11-10 16:55     ` Bruce Richardson
2014-11-11  8:53       ` jigsaw
2014-11-12 15:51         ` Bruce Richardson [this message]
2014-11-13 11:26           ` Thomas Monjalon
2014-11-13  0:50 ` Thomas Monjalon
2014-11-13  6:56   ` jigsaw
2014-11-13  9:18     ` Thomas Monjalon
2014-11-13  9:24       ` jigsaw

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141112155145.GA7952@bricha3-MOBL3 \
    --to=bruce.richardson@intel.com \
    --cc=dev@dpdk.org \
    --cc=jigsaw@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).