DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
@ 2014-11-10 14:44 Qinglai Xiao
  2014-11-10 15:13 ` Bruce Richardson
  2014-11-13  0:50 ` Thomas Monjalon
  0 siblings, 2 replies; 11+ messages in thread
From: Qinglai Xiao @ 2014-11-10 14:44 UTC (permalink / raw)
  To: dev; +Cc: Qinglai Xiao

With introduction of in_flight_bitmask, the whole 32 bits of tag can be
used. Further more, this patch fixed the integer overflow when finding
the matched tags.
The maximum number workers is now defined as 64, which is length of
double-word. The link between number of workers and RTE_MAX_LCORE is
now removed. Compile time check is added to ensure the
RTE_DISTRIB_MAX_WORKERS is less than or equal to size of double-word.

Signed-off-by: Qinglai Xiao <jigsaw@gmail.com>
---
 lib/librte_distributor/rte_distributor.c |   64 ++++++++++++++++++++++--------
 lib/librte_distributor/rte_distributor.h |    4 ++
 2 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/lib/librte_distributor/rte_distributor.c b/lib/librte_distributor/rte_distributor.c
index 3dfec4a..2c5d61c 100644
--- a/lib/librte_distributor/rte_distributor.c
+++ b/lib/librte_distributor/rte_distributor.c
@@ -62,6 +62,13 @@
 #define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1)
 
 /**
+ * Maximum number of workers allowed.
+ * Be aware of increasing the limit, becaus it is limited by how we track
+ * in-flight tags. See @in_flight_bitmask and @rte_distributor_process
+ */
+#define RTE_DISTRIB_MAX_WORKERS	64
+
+/**
  * Buffer structure used to pass the pointer data between cores. This is cache
  * line aligned, but to improve performance and prevent adjacent cache-line
  * prefetches of buffers for other workers, e.g. when worker 1's buffer is on
@@ -91,11 +98,17 @@ struct rte_distributor {
 	char name[RTE_DISTRIBUTOR_NAMESIZE];  /**< Name of the ring. */
 	unsigned num_workers;                 /**< Number of workers polling */
 
-	uint32_t in_flight_tags[RTE_MAX_LCORE];
-		/**< Tracks the tag being processed per core, 0 == no pkt */
-	struct rte_distributor_backlog backlog[RTE_MAX_LCORE];
+	uint32_t in_flight_tags[RTE_DISTRIB_MAX_WORKERS];
+		/**< Tracks the tag being processed per core */
+	uint64_t in_flight_bitmask;
+		/**< on/off bits for in-flight tags.
+		 * Note that if RTE_DISTRIB_MAX_WORKERS is larger than 64 then
+		 * the bitmask has to expand.
+		 */
+
+	struct rte_distributor_backlog backlog[RTE_DISTRIB_MAX_WORKERS];
 
-	union rte_distributor_buffer bufs[RTE_MAX_LCORE];
+	union rte_distributor_buffer bufs[RTE_DISTRIB_MAX_WORKERS];
 
 	struct rte_distributor_returned_pkts returns;
 };
@@ -189,6 +202,7 @@ static inline void
 handle_worker_shutdown(struct rte_distributor *d, unsigned wkr)
 {
 	d->in_flight_tags[wkr] = 0;
+	d->in_flight_bitmask &= ~(1UL << wkr);
 	d->bufs[wkr].bufptr64 = 0;
 	if (unlikely(d->backlog[wkr].count != 0)) {
 		/* On return of a packet, we need to move the
@@ -211,7 +225,10 @@ handle_worker_shutdown(struct rte_distributor *d, unsigned wkr)
 			pkts[i] = (void *)((uintptr_t)(bl->pkts[idx] >>
 					RTE_DISTRIB_FLAG_BITS));
 		}
-		/* recursive call */
+		/* recursive call.
+		 * Note that the tags were set before first level call
+		 * to rte_distributor_process.
+		 */
 		rte_distributor_process(d, pkts, i);
 		bl->count = bl->start = 0;
 	}
@@ -242,6 +259,7 @@ process_returns(struct rte_distributor *d)
 			else {
 				d->bufs[wkr].bufptr64 = RTE_DISTRIB_GET_BUF;
 				d->in_flight_tags[wkr] = 0;
+				d->in_flight_bitmask &= ~(1UL << wkr);
 			}
 			oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
 		} else if (data & RTE_DISTRIB_RETURN_BUF) {
@@ -284,14 +302,18 @@ rte_distributor_process(struct rte_distributor *d,
 			next_value = (((int64_t)(uintptr_t)next_mb)
 					<< RTE_DISTRIB_FLAG_BITS);
 			/*
-			 * Set the low bit on the tag, so we can guarantee that
-			 * we never store a tag value of zero. That means we can
-			 * use the zero-value to indicate that no packet is
-			 * being processed by a worker.
+			 * User is advocated to set tag vaue for each
+			 * mbuf before calling rte_distributor_process.
+			 * User defined tags are used to identify flows,
+			 * or sessions.
 			 */
-			new_tag = (next_mb->hash.usr | 1);
+			new_tag = next_mb->hash.usr;
 
-			uint32_t match = 0;
+			/*
+			 * Note that if RTE_DISTRIB_MAX_WORKERS is larger than 64
+			 * then the size of match has to be expanded.
+			 */
+			uint64_t match = 0;
 			unsigned i;
 			/*
 			 * to scan for a match use "xor" and "not" to get a 0/1
@@ -303,9 +325,12 @@ rte_distributor_process(struct rte_distributor *d,
 				match |= (!(d->in_flight_tags[i] ^ new_tag)
 					<< i);
 
+			/* Only turned-on bits are considered as match */
+			match &= d->in_flight_bitmask;
+
 			if (match) {
 				next_mb = NULL;
-				unsigned worker = __builtin_ctz(match);
+				unsigned worker = __builtin_ctzl(match);
 				if (add_to_backlog(&d->backlog[worker],
 						next_value) < 0)
 					next_idx--;
@@ -322,6 +347,7 @@ rte_distributor_process(struct rte_distributor *d,
 			else {
 				d->bufs[wkr].bufptr64 = next_value;
 				d->in_flight_tags[wkr] = new_tag;
+				d->in_flight_bitmask |= (1UL << wkr);
 				next_mb = NULL;
 			}
 			oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
@@ -379,11 +405,13 @@ rte_distributor_returned_pkts(struct rte_distributor *d,
 static inline unsigned
 total_outstanding(const struct rte_distributor *d)
 {
-	unsigned wkr, total_outstanding = 0;
+	unsigned wkr, total_outstanding;
+
+	total_outstanding = __builtin_popcountl(d->in_flight_bitmask);
 
 	for (wkr = 0; wkr < d->num_workers; wkr++)
-		total_outstanding += d->backlog[wkr].count +
-				!!(d->in_flight_tags[wkr]);
+		total_outstanding += d->backlog[wkr].count;
+
 	return total_outstanding;
 }
 
@@ -423,9 +451,11 @@ rte_distributor_create(const char *name,
 
 	/* compilation-time checks */
 	RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0);
-	RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0);
+	RTE_BUILD_BUG_ON((RTE_DISTRIB_MAX_WORKERS & 7) != 0);
+	RTE_BUILD_BUG_ON(RTE_DISTRIB_MAX_WORKERS >
+				sizeof(d->in_flight_bitmask) * CHAR_BIT);
 
-	if (name == NULL || num_workers >= RTE_MAX_LCORE) {
+	if (name == NULL || num_workers >= RTE_DISTRIB_MAX_WORKERS) {
 		rte_errno = EINVAL;
 		return NULL;
 	}
diff --git a/lib/librte_distributor/rte_distributor.h b/lib/librte_distributor/rte_distributor.h
index ec0d74a..cc1d559 100644
--- a/lib/librte_distributor/rte_distributor.h
+++ b/lib/librte_distributor/rte_distributor.h
@@ -88,6 +88,10 @@ rte_distributor_create(const char *name, unsigned socket_id,
  * packets. The distributor will ensure that no two packets that have the
  * same flow id, or tag, in the mbuf will be procesed at the same time.
  *
+ * The user is advocated to set tag for each mbuf before calling this function.
+ * If user doesn't set the tag, the tag value can be various values depending on
+ * driver implementation and configuration.
+ *
  * This is not multi-thread safe and should only be called on a single lcore.
  *
  * @param d
-- 
1.7.1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
  2014-11-10 14:44 [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag Qinglai Xiao
@ 2014-11-10 15:13 ` Bruce Richardson
  2014-11-10 15:52   ` jigsaw
  2014-11-13  0:50 ` Thomas Monjalon
  1 sibling, 1 reply; 11+ messages in thread
From: Bruce Richardson @ 2014-11-10 15:13 UTC (permalink / raw)
  To: Qinglai Xiao; +Cc: dev

On Mon, Nov 10, 2014 at 04:44:02PM +0200, Qinglai Xiao wrote:
> With introduction of in_flight_bitmask, the whole 32 bits of tag can be
> used. Further more, this patch fixed the integer overflow when finding
> the matched tags.
> The maximum number workers is now defined as 64, which is length of
> double-word. The link between number of workers and RTE_MAX_LCORE is
> now removed. Compile time check is added to ensure the
> RTE_DISTRIB_MAX_WORKERS is less than or equal to size of double-word.
> 
> Signed-off-by: Qinglai Xiao <jigsaw@gmail.com>

Looks good to me.
Just before I ack this, have you checked to see if there is any performance impact?

/Bruce

> ---
>  lib/librte_distributor/rte_distributor.c |   64 ++++++++++++++++++++++--------
>  lib/librte_distributor/rte_distributor.h |    4 ++
>  2 files changed, 51 insertions(+), 17 deletions(-)
> 
> diff --git a/lib/librte_distributor/rte_distributor.c b/lib/librte_distributor/rte_distributor.c
> index 3dfec4a..2c5d61c 100644
> --- a/lib/librte_distributor/rte_distributor.c
> +++ b/lib/librte_distributor/rte_distributor.c
> @@ -62,6 +62,13 @@
>  #define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1)
>  
>  /**
> + * Maximum number of workers allowed.
> + * Be aware of increasing the limit, becaus it is limited by how we track
> + * in-flight tags. See @in_flight_bitmask and @rte_distributor_process
> + */
> +#define RTE_DISTRIB_MAX_WORKERS	64
> +
> +/**
>   * Buffer structure used to pass the pointer data between cores. This is cache
>   * line aligned, but to improve performance and prevent adjacent cache-line
>   * prefetches of buffers for other workers, e.g. when worker 1's buffer is on
> @@ -91,11 +98,17 @@ struct rte_distributor {
>  	char name[RTE_DISTRIBUTOR_NAMESIZE];  /**< Name of the ring. */
>  	unsigned num_workers;                 /**< Number of workers polling */
>  
> -	uint32_t in_flight_tags[RTE_MAX_LCORE];
> -		/**< Tracks the tag being processed per core, 0 == no pkt */
> -	struct rte_distributor_backlog backlog[RTE_MAX_LCORE];
> +	uint32_t in_flight_tags[RTE_DISTRIB_MAX_WORKERS];
> +		/**< Tracks the tag being processed per core */
> +	uint64_t in_flight_bitmask;
> +		/**< on/off bits for in-flight tags.
> +		 * Note that if RTE_DISTRIB_MAX_WORKERS is larger than 64 then
> +		 * the bitmask has to expand.
> +		 */
> +
> +	struct rte_distributor_backlog backlog[RTE_DISTRIB_MAX_WORKERS];
>  
> -	union rte_distributor_buffer bufs[RTE_MAX_LCORE];
> +	union rte_distributor_buffer bufs[RTE_DISTRIB_MAX_WORKERS];
>  
>  	struct rte_distributor_returned_pkts returns;
>  };
> @@ -189,6 +202,7 @@ static inline void
>  handle_worker_shutdown(struct rte_distributor *d, unsigned wkr)
>  {
>  	d->in_flight_tags[wkr] = 0;
> +	d->in_flight_bitmask &= ~(1UL << wkr);
>  	d->bufs[wkr].bufptr64 = 0;
>  	if (unlikely(d->backlog[wkr].count != 0)) {
>  		/* On return of a packet, we need to move the
> @@ -211,7 +225,10 @@ handle_worker_shutdown(struct rte_distributor *d, unsigned wkr)
>  			pkts[i] = (void *)((uintptr_t)(bl->pkts[idx] >>
>  					RTE_DISTRIB_FLAG_BITS));
>  		}
> -		/* recursive call */
> +		/* recursive call.
> +		 * Note that the tags were set before first level call
> +		 * to rte_distributor_process.
> +		 */
>  		rte_distributor_process(d, pkts, i);
>  		bl->count = bl->start = 0;
>  	}
> @@ -242,6 +259,7 @@ process_returns(struct rte_distributor *d)
>  			else {
>  				d->bufs[wkr].bufptr64 = RTE_DISTRIB_GET_BUF;
>  				d->in_flight_tags[wkr] = 0;
> +				d->in_flight_bitmask &= ~(1UL << wkr);
>  			}
>  			oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
>  		} else if (data & RTE_DISTRIB_RETURN_BUF) {
> @@ -284,14 +302,18 @@ rte_distributor_process(struct rte_distributor *d,
>  			next_value = (((int64_t)(uintptr_t)next_mb)
>  					<< RTE_DISTRIB_FLAG_BITS);
>  			/*
> -			 * Set the low bit on the tag, so we can guarantee that
> -			 * we never store a tag value of zero. That means we can
> -			 * use the zero-value to indicate that no packet is
> -			 * being processed by a worker.
> +			 * User is advocated to set tag vaue for each
> +			 * mbuf before calling rte_distributor_process.
> +			 * User defined tags are used to identify flows,
> +			 * or sessions.
>  			 */
> -			new_tag = (next_mb->hash.usr | 1);
> +			new_tag = next_mb->hash.usr;
>  
> -			uint32_t match = 0;
> +			/*
> +			 * Note that if RTE_DISTRIB_MAX_WORKERS is larger than 64
> +			 * then the size of match has to be expanded.
> +			 */
> +			uint64_t match = 0;
>  			unsigned i;
>  			/*
>  			 * to scan for a match use "xor" and "not" to get a 0/1
> @@ -303,9 +325,12 @@ rte_distributor_process(struct rte_distributor *d,
>  				match |= (!(d->in_flight_tags[i] ^ new_tag)
>  					<< i);
>  
> +			/* Only turned-on bits are considered as match */
> +			match &= d->in_flight_bitmask;
> +
>  			if (match) {
>  				next_mb = NULL;
> -				unsigned worker = __builtin_ctz(match);
> +				unsigned worker = __builtin_ctzl(match);
>  				if (add_to_backlog(&d->backlog[worker],
>  						next_value) < 0)
>  					next_idx--;
> @@ -322,6 +347,7 @@ rte_distributor_process(struct rte_distributor *d,
>  			else {
>  				d->bufs[wkr].bufptr64 = next_value;
>  				d->in_flight_tags[wkr] = new_tag;
> +				d->in_flight_bitmask |= (1UL << wkr);
>  				next_mb = NULL;
>  			}
>  			oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> @@ -379,11 +405,13 @@ rte_distributor_returned_pkts(struct rte_distributor *d,
>  static inline unsigned
>  total_outstanding(const struct rte_distributor *d)
>  {
> -	unsigned wkr, total_outstanding = 0;
> +	unsigned wkr, total_outstanding;
> +
> +	total_outstanding = __builtin_popcountl(d->in_flight_bitmask);
>  
>  	for (wkr = 0; wkr < d->num_workers; wkr++)
> -		total_outstanding += d->backlog[wkr].count +
> -				!!(d->in_flight_tags[wkr]);
> +		total_outstanding += d->backlog[wkr].count;
> +
>  	return total_outstanding;
>  }
>  
> @@ -423,9 +451,11 @@ rte_distributor_create(const char *name,
>  
>  	/* compilation-time checks */
>  	RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0);
> -	RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0);
> +	RTE_BUILD_BUG_ON((RTE_DISTRIB_MAX_WORKERS & 7) != 0);
> +	RTE_BUILD_BUG_ON(RTE_DISTRIB_MAX_WORKERS >
> +				sizeof(d->in_flight_bitmask) * CHAR_BIT);
>  
> -	if (name == NULL || num_workers >= RTE_MAX_LCORE) {
> +	if (name == NULL || num_workers >= RTE_DISTRIB_MAX_WORKERS) {
>  		rte_errno = EINVAL;
>  		return NULL;
>  	}
> diff --git a/lib/librte_distributor/rte_distributor.h b/lib/librte_distributor/rte_distributor.h
> index ec0d74a..cc1d559 100644
> --- a/lib/librte_distributor/rte_distributor.h
> +++ b/lib/librte_distributor/rte_distributor.h
> @@ -88,6 +88,10 @@ rte_distributor_create(const char *name, unsigned socket_id,
>   * packets. The distributor will ensure that no two packets that have the
>   * same flow id, or tag, in the mbuf will be procesed at the same time.
>   *
> + * The user is advocated to set tag for each mbuf before calling this function.
> + * If user doesn't set the tag, the tag value can be various values depending on
> + * driver implementation and configuration.
> + *
>   * This is not multi-thread safe and should only be called on a single lcore.
>   *
>   * @param d
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
  2014-11-10 15:13 ` Bruce Richardson
@ 2014-11-10 15:52   ` jigsaw
  2014-11-10 16:55     ` Bruce Richardson
  0 siblings, 1 reply; 11+ messages in thread
From: jigsaw @ 2014-11-10 15:52 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Hi Bruce,

Sorry I didn't. I will run a performance test tomorrow.

thx &
rgds,
-qinglai

On Mon, Nov 10, 2014 at 5:13 PM, Bruce Richardson <
bruce.richardson@intel.com> wrote:

> On Mon, Nov 10, 2014 at 04:44:02PM +0200, Qinglai Xiao wrote:
> > With introduction of in_flight_bitmask, the whole 32 bits of tag can be
> > used. Further more, this patch fixed the integer overflow when finding
> > the matched tags.
> > The maximum number workers is now defined as 64, which is length of
> > double-word. The link between number of workers and RTE_MAX_LCORE is
> > now removed. Compile time check is added to ensure the
> > RTE_DISTRIB_MAX_WORKERS is less than or equal to size of double-word.
> >
> > Signed-off-by: Qinglai Xiao <jigsaw@gmail.com>
>
> Looks good to me.
> Just before I ack this, have you checked to see if there is any
> performance impact?
>
> /Bruce
>
> > ---
> >  lib/librte_distributor/rte_distributor.c |   64
> ++++++++++++++++++++++--------
> >  lib/librte_distributor/rte_distributor.h |    4 ++
> >  2 files changed, 51 insertions(+), 17 deletions(-)
> >
> > diff --git a/lib/librte_distributor/rte_distributor.c
> b/lib/librte_distributor/rte_distributor.c
> > index 3dfec4a..2c5d61c 100644
> > --- a/lib/librte_distributor/rte_distributor.c
> > +++ b/lib/librte_distributor/rte_distributor.c
> > @@ -62,6 +62,13 @@
> >  #define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1)
> >
> >  /**
> > + * Maximum number of workers allowed.
> > + * Be aware of increasing the limit, becaus it is limited by how we
> track
> > + * in-flight tags. See @in_flight_bitmask and @rte_distributor_process
> > + */
> > +#define RTE_DISTRIB_MAX_WORKERS      64
> > +
> > +/**
> >   * Buffer structure used to pass the pointer data between cores. This
> is cache
> >   * line aligned, but to improve performance and prevent adjacent
> cache-line
> >   * prefetches of buffers for other workers, e.g. when worker 1's buffer
> is on
> > @@ -91,11 +98,17 @@ struct rte_distributor {
> >       char name[RTE_DISTRIBUTOR_NAMESIZE];  /**< Name of the ring. */
> >       unsigned num_workers;                 /**< Number of workers
> polling */
> >
> > -     uint32_t in_flight_tags[RTE_MAX_LCORE];
> > -             /**< Tracks the tag being processed per core, 0 == no pkt
> */
> > -     struct rte_distributor_backlog backlog[RTE_MAX_LCORE];
> > +     uint32_t in_flight_tags[RTE_DISTRIB_MAX_WORKERS];
> > +             /**< Tracks the tag being processed per core */
> > +     uint64_t in_flight_bitmask;
> > +             /**< on/off bits for in-flight tags.
> > +              * Note that if RTE_DISTRIB_MAX_WORKERS is larger than 64
> then
> > +              * the bitmask has to expand.
> > +              */
> > +
> > +     struct rte_distributor_backlog backlog[RTE_DISTRIB_MAX_WORKERS];
> >
> > -     union rte_distributor_buffer bufs[RTE_MAX_LCORE];
> > +     union rte_distributor_buffer bufs[RTE_DISTRIB_MAX_WORKERS];
> >
> >       struct rte_distributor_returned_pkts returns;
> >  };
> > @@ -189,6 +202,7 @@ static inline void
> >  handle_worker_shutdown(struct rte_distributor *d, unsigned wkr)
> >  {
> >       d->in_flight_tags[wkr] = 0;
> > +     d->in_flight_bitmask &= ~(1UL << wkr);
> >       d->bufs[wkr].bufptr64 = 0;
> >       if (unlikely(d->backlog[wkr].count != 0)) {
> >               /* On return of a packet, we need to move the
> > @@ -211,7 +225,10 @@ handle_worker_shutdown(struct rte_distributor *d,
> unsigned wkr)
> >                       pkts[i] = (void *)((uintptr_t)(bl->pkts[idx] >>
> >                                       RTE_DISTRIB_FLAG_BITS));
> >               }
> > -             /* recursive call */
> > +             /* recursive call.
> > +              * Note that the tags were set before first level call
> > +              * to rte_distributor_process.
> > +              */
> >               rte_distributor_process(d, pkts, i);
> >               bl->count = bl->start = 0;
> >       }
> > @@ -242,6 +259,7 @@ process_returns(struct rte_distributor *d)
> >                       else {
> >                               d->bufs[wkr].bufptr64 =
> RTE_DISTRIB_GET_BUF;
> >                               d->in_flight_tags[wkr] = 0;
> > +                             d->in_flight_bitmask &= ~(1UL << wkr);
> >                       }
> >                       oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> >               } else if (data & RTE_DISTRIB_RETURN_BUF) {
> > @@ -284,14 +302,18 @@ rte_distributor_process(struct rte_distributor *d,
> >                       next_value = (((int64_t)(uintptr_t)next_mb)
> >                                       << RTE_DISTRIB_FLAG_BITS);
> >                       /*
> > -                      * Set the low bit on the tag, so we can guarantee
> that
> > -                      * we never store a tag value of zero. That means
> we can
> > -                      * use the zero-value to indicate that no packet is
> > -                      * being processed by a worker.
> > +                      * User is advocated to set tag vaue for each
> > +                      * mbuf before calling rte_distributor_process.
> > +                      * User defined tags are used to identify flows,
> > +                      * or sessions.
> >                        */
> > -                     new_tag = (next_mb->hash.usr | 1);
> > +                     new_tag = next_mb->hash.usr;
> >
> > -                     uint32_t match = 0;
> > +                     /*
> > +                      * Note that if RTE_DISTRIB_MAX_WORKERS is larger
> than 64
> > +                      * then the size of match has to be expanded.
> > +                      */
> > +                     uint64_t match = 0;
> >                       unsigned i;
> >                       /*
> >                        * to scan for a match use "xor" and "not" to get
> a 0/1
> > @@ -303,9 +325,12 @@ rte_distributor_process(struct rte_distributor *d,
> >                               match |= (!(d->in_flight_tags[i] ^ new_tag)
> >                                       << i);
> >
> > +                     /* Only turned-on bits are considered as match */
> > +                     match &= d->in_flight_bitmask;
> > +
> >                       if (match) {
> >                               next_mb = NULL;
> > -                             unsigned worker = __builtin_ctz(match);
> > +                             unsigned worker = __builtin_ctzl(match);
> >                               if (add_to_backlog(&d->backlog[worker],
> >                                               next_value) < 0)
> >                                       next_idx--;
> > @@ -322,6 +347,7 @@ rte_distributor_process(struct rte_distributor *d,
> >                       else {
> >                               d->bufs[wkr].bufptr64 = next_value;
> >                               d->in_flight_tags[wkr] = new_tag;
> > +                             d->in_flight_bitmask |= (1UL << wkr);
> >                               next_mb = NULL;
> >                       }
> >                       oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > @@ -379,11 +405,13 @@ rte_distributor_returned_pkts(struct
> rte_distributor *d,
> >  static inline unsigned
> >  total_outstanding(const struct rte_distributor *d)
> >  {
> > -     unsigned wkr, total_outstanding = 0;
> > +     unsigned wkr, total_outstanding;
> > +
> > +     total_outstanding = __builtin_popcountl(d->in_flight_bitmask);
> >
> >       for (wkr = 0; wkr < d->num_workers; wkr++)
> > -             total_outstanding += d->backlog[wkr].count +
> > -                             !!(d->in_flight_tags[wkr]);
> > +             total_outstanding += d->backlog[wkr].count;
> > +
> >       return total_outstanding;
> >  }
> >
> > @@ -423,9 +451,11 @@ rte_distributor_create(const char *name,
> >
> >       /* compilation-time checks */
> >       RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0);
> > -     RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0);
> > +     RTE_BUILD_BUG_ON((RTE_DISTRIB_MAX_WORKERS & 7) != 0);
> > +     RTE_BUILD_BUG_ON(RTE_DISTRIB_MAX_WORKERS >
> > +                             sizeof(d->in_flight_bitmask) * CHAR_BIT);
> >
> > -     if (name == NULL || num_workers >= RTE_MAX_LCORE) {
> > +     if (name == NULL || num_workers >= RTE_DISTRIB_MAX_WORKERS) {
> >               rte_errno = EINVAL;
> >               return NULL;
> >       }
> > diff --git a/lib/librte_distributor/rte_distributor.h
> b/lib/librte_distributor/rte_distributor.h
> > index ec0d74a..cc1d559 100644
> > --- a/lib/librte_distributor/rte_distributor.h
> > +++ b/lib/librte_distributor/rte_distributor.h
> > @@ -88,6 +88,10 @@ rte_distributor_create(const char *name, unsigned
> socket_id,
> >   * packets. The distributor will ensure that no two packets that have
> the
> >   * same flow id, or tag, in the mbuf will be procesed at the same time.
> >   *
> > + * The user is advocated to set tag for each mbuf before calling this
> function.
> > + * If user doesn't set the tag, the tag value can be various values
> depending on
> > + * driver implementation and configuration.
> > + *
> >   * This is not multi-thread safe and should only be called on a single
> lcore.
> >   *
> >   * @param d
> > --
> > 1.7.1
> >
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
  2014-11-10 15:52   ` jigsaw
@ 2014-11-10 16:55     ` Bruce Richardson
  2014-11-11  8:53       ` jigsaw
  0 siblings, 1 reply; 11+ messages in thread
From: Bruce Richardson @ 2014-11-10 16:55 UTC (permalink / raw)
  To: jigsaw; +Cc: dev

On Mon, Nov 10, 2014 at 05:52:26PM +0200, jigsaw wrote:
> Hi Bruce,
> 
> Sorry I didn't. I will run a performance test tomorrow.
> 
> thx &
> rgds,
> -qinglai
> 

On the assumption that no performance regressions show up...
Acked-by: Bruce Richardson <bruce.richardson@intel.com>


> On Mon, Nov 10, 2014 at 5:13 PM, Bruce Richardson <
> bruce.richardson@intel.com> wrote:
> 
> > On Mon, Nov 10, 2014 at 04:44:02PM +0200, Qinglai Xiao wrote:
> > > With introduction of in_flight_bitmask, the whole 32 bits of tag can be
> > > used. Further more, this patch fixed the integer overflow when finding
> > > the matched tags.
> > > The maximum number workers is now defined as 64, which is length of
> > > double-word. The link between number of workers and RTE_MAX_LCORE is
> > > now removed. Compile time check is added to ensure the
> > > RTE_DISTRIB_MAX_WORKERS is less than or equal to size of double-word.
> > >
> > > Signed-off-by: Qinglai Xiao <jigsaw@gmail.com>
> >
> > Looks good to me.
> > Just before I ack this, have you checked to see if there is any
> > performance impact?
> >
> > /Bruce
> >
> > > ---
> > >  lib/librte_distributor/rte_distributor.c |   64
> > ++++++++++++++++++++++--------
> > >  lib/librte_distributor/rte_distributor.h |    4 ++
> > >  2 files changed, 51 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/lib/librte_distributor/rte_distributor.c
> > b/lib/librte_distributor/rte_distributor.c
> > > index 3dfec4a..2c5d61c 100644
> > > --- a/lib/librte_distributor/rte_distributor.c
> > > +++ b/lib/librte_distributor/rte_distributor.c
> > > @@ -62,6 +62,13 @@
> > >  #define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1)
> > >
> > >  /**
> > > + * Maximum number of workers allowed.
> > > + * Be aware of increasing the limit, becaus it is limited by how we
> > track
> > > + * in-flight tags. See @in_flight_bitmask and @rte_distributor_process
> > > + */
> > > +#define RTE_DISTRIB_MAX_WORKERS      64
> > > +
> > > +/**
> > >   * Buffer structure used to pass the pointer data between cores. This
> > is cache
> > >   * line aligned, but to improve performance and prevent adjacent
> > cache-line
> > >   * prefetches of buffers for other workers, e.g. when worker 1's buffer
> > is on
> > > @@ -91,11 +98,17 @@ struct rte_distributor {
> > >       char name[RTE_DISTRIBUTOR_NAMESIZE];  /**< Name of the ring. */
> > >       unsigned num_workers;                 /**< Number of workers
> > polling */
> > >
> > > -     uint32_t in_flight_tags[RTE_MAX_LCORE];
> > > -             /**< Tracks the tag being processed per core, 0 == no pkt
> > */
> > > -     struct rte_distributor_backlog backlog[RTE_MAX_LCORE];
> > > +     uint32_t in_flight_tags[RTE_DISTRIB_MAX_WORKERS];
> > > +             /**< Tracks the tag being processed per core */
> > > +     uint64_t in_flight_bitmask;
> > > +             /**< on/off bits for in-flight tags.
> > > +              * Note that if RTE_DISTRIB_MAX_WORKERS is larger than 64
> > then
> > > +              * the bitmask has to expand.
> > > +              */
> > > +
> > > +     struct rte_distributor_backlog backlog[RTE_DISTRIB_MAX_WORKERS];
> > >
> > > -     union rte_distributor_buffer bufs[RTE_MAX_LCORE];
> > > +     union rte_distributor_buffer bufs[RTE_DISTRIB_MAX_WORKERS];
> > >
> > >       struct rte_distributor_returned_pkts returns;
> > >  };
> > > @@ -189,6 +202,7 @@ static inline void
> > >  handle_worker_shutdown(struct rte_distributor *d, unsigned wkr)
> > >  {
> > >       d->in_flight_tags[wkr] = 0;
> > > +     d->in_flight_bitmask &= ~(1UL << wkr);
> > >       d->bufs[wkr].bufptr64 = 0;
> > >       if (unlikely(d->backlog[wkr].count != 0)) {
> > >               /* On return of a packet, we need to move the
> > > @@ -211,7 +225,10 @@ handle_worker_shutdown(struct rte_distributor *d,
> > unsigned wkr)
> > >                       pkts[i] = (void *)((uintptr_t)(bl->pkts[idx] >>
> > >                                       RTE_DISTRIB_FLAG_BITS));
> > >               }
> > > -             /* recursive call */
> > > +             /* recursive call.
> > > +              * Note that the tags were set before first level call
> > > +              * to rte_distributor_process.
> > > +              */
> > >               rte_distributor_process(d, pkts, i);
> > >               bl->count = bl->start = 0;
> > >       }
> > > @@ -242,6 +259,7 @@ process_returns(struct rte_distributor *d)
> > >                       else {
> > >                               d->bufs[wkr].bufptr64 =
> > RTE_DISTRIB_GET_BUF;
> > >                               d->in_flight_tags[wkr] = 0;
> > > +                             d->in_flight_bitmask &= ~(1UL << wkr);
> > >                       }
> > >                       oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > >               } else if (data & RTE_DISTRIB_RETURN_BUF) {
> > > @@ -284,14 +302,18 @@ rte_distributor_process(struct rte_distributor *d,
> > >                       next_value = (((int64_t)(uintptr_t)next_mb)
> > >                                       << RTE_DISTRIB_FLAG_BITS);
> > >                       /*
> > > -                      * Set the low bit on the tag, so we can guarantee
> > that
> > > -                      * we never store a tag value of zero. That means
> > we can
> > > -                      * use the zero-value to indicate that no packet is
> > > -                      * being processed by a worker.
> > > +                      * User is advocated to set tag vaue for each
> > > +                      * mbuf before calling rte_distributor_process.
> > > +                      * User defined tags are used to identify flows,
> > > +                      * or sessions.
> > >                        */
> > > -                     new_tag = (next_mb->hash.usr | 1);
> > > +                     new_tag = next_mb->hash.usr;
> > >
> > > -                     uint32_t match = 0;
> > > +                     /*
> > > +                      * Note that if RTE_DISTRIB_MAX_WORKERS is larger
> > than 64
> > > +                      * then the size of match has to be expanded.
> > > +                      */
> > > +                     uint64_t match = 0;
> > >                       unsigned i;
> > >                       /*
> > >                        * to scan for a match use "xor" and "not" to get
> > a 0/1
> > > @@ -303,9 +325,12 @@ rte_distributor_process(struct rte_distributor *d,
> > >                               match |= (!(d->in_flight_tags[i] ^ new_tag)
> > >                                       << i);
> > >
> > > +                     /* Only turned-on bits are considered as match */
> > > +                     match &= d->in_flight_bitmask;
> > > +
> > >                       if (match) {
> > >                               next_mb = NULL;
> > > -                             unsigned worker = __builtin_ctz(match);
> > > +                             unsigned worker = __builtin_ctzl(match);
> > >                               if (add_to_backlog(&d->backlog[worker],
> > >                                               next_value) < 0)
> > >                                       next_idx--;
> > > @@ -322,6 +347,7 @@ rte_distributor_process(struct rte_distributor *d,
> > >                       else {
> > >                               d->bufs[wkr].bufptr64 = next_value;
> > >                               d->in_flight_tags[wkr] = new_tag;
> > > +                             d->in_flight_bitmask |= (1UL << wkr);
> > >                               next_mb = NULL;
> > >                       }
> > >                       oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > > @@ -379,11 +405,13 @@ rte_distributor_returned_pkts(struct
> > rte_distributor *d,
> > >  static inline unsigned
> > >  total_outstanding(const struct rte_distributor *d)
> > >  {
> > > -     unsigned wkr, total_outstanding = 0;
> > > +     unsigned wkr, total_outstanding;
> > > +
> > > +     total_outstanding = __builtin_popcountl(d->in_flight_bitmask);
> > >
> > >       for (wkr = 0; wkr < d->num_workers; wkr++)
> > > -             total_outstanding += d->backlog[wkr].count +
> > > -                             !!(d->in_flight_tags[wkr]);
> > > +             total_outstanding += d->backlog[wkr].count;
> > > +
> > >       return total_outstanding;
> > >  }
> > >
> > > @@ -423,9 +451,11 @@ rte_distributor_create(const char *name,
> > >
> > >       /* compilation-time checks */
> > >       RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0);
> > > -     RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0);
> > > +     RTE_BUILD_BUG_ON((RTE_DISTRIB_MAX_WORKERS & 7) != 0);
> > > +     RTE_BUILD_BUG_ON(RTE_DISTRIB_MAX_WORKERS >
> > > +                             sizeof(d->in_flight_bitmask) * CHAR_BIT);
> > >
> > > -     if (name == NULL || num_workers >= RTE_MAX_LCORE) {
> > > +     if (name == NULL || num_workers >= RTE_DISTRIB_MAX_WORKERS) {
> > >               rte_errno = EINVAL;
> > >               return NULL;
> > >       }
> > > diff --git a/lib/librte_distributor/rte_distributor.h
> > b/lib/librte_distributor/rte_distributor.h
> > > index ec0d74a..cc1d559 100644
> > > --- a/lib/librte_distributor/rte_distributor.h
> > > +++ b/lib/librte_distributor/rte_distributor.h
> > > @@ -88,6 +88,10 @@ rte_distributor_create(const char *name, unsigned
> > socket_id,
> > >   * packets. The distributor will ensure that no two packets that have
> > the
> > >   * same flow id, or tag, in the mbuf will be procesed at the same time.
> > >   *
> > > + * The user is advocated to set tag for each mbuf before calling this
> > function.
> > > + * If user doesn't set the tag, the tag value can be various values
> > depending on
> > > + * driver implementation and configuration.
> > > + *
> > >   * This is not multi-thread safe and should only be called on a single
> > lcore.
> > >   *
> > >   * @param d
> > > --
> > > 1.7.1
> > >
> >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
  2014-11-10 16:55     ` Bruce Richardson
@ 2014-11-11  8:53       ` jigsaw
  2014-11-12 15:51         ` Bruce Richardson
  0 siblings, 1 reply; 11+ messages in thread
From: jigsaw @ 2014-11-11  8:53 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Hi Bruce,

This patch has little, if any, performance impact.
See the perf stat -d for original and patched version
of test_distributor_perf.

Original version
============
perf stat -d ./test_orig  -cffff -n2
============
==== Cache line switch test ===

[4/4590]
Time for 1048576 iterations = 362349056 ticks
Ticks per iteration = 345

=== Performance test of distributor ===
Time per burst:  7043
Time per packet: 220

Worker 0 handled 3377372 packets
Worker 1 handled 2857280 packets
Worker 2 handled 2120982 packets
Worker 3 handled 2112720 packets
Worker 4 handled 2102014 packets
Worker 5 handled 2101314 packets
Worker 6 handled 2099248 packets
Worker 7 handled 2098560 packets
Worker 8 handled 2098114 packets
Worker 9 handled 2097962 packets
Worker 10 handled 2097892 packets
Worker 11 handled 2097856 packets
Worker 12 handled 2097762 packets
Worker 13 handled 2097726 packets
Worker 14 handled 2097630 packets
Total packets: 33554432 (2000000)
=== Perf test done ===


 Performance counter stats for './test_orig -cffff -n2 --no-huge':

      45784.935475 task-clock                #   12.847 CPUs utilized
              4822 context-switches          #    0.105 K/sec
                 7 CPU-migrations            #    0.000 K/sec
              1017 page-faults               #    0.022 K/sec
      118181375409 cycles                    #    2.581 GHz
    [40.00%]
       88915381373 stalled-cycles-frontend   #   75.24% frontend cycles
idle    [39.98%]
       77015854611 stalled-cycles-backend    #   65.17% backend  cycles
idle    [39.96%]
       30469536186 instructions              #    0.26  insns per cycle
                                             #    2.92  stalled cycles per
insn [49.96%]
        6481829773 branches                  #  141.571 M/sec
    [49.97%]
          45283365 branch-misses             #    0.70% of all branches
    [50.00%]
        6537115556 L1-dcache-loads           #  142.779 M/sec
    [50.05%]
         249533128 L1-dcache-load-misses     #    3.82% of all L1-dcache
hits   [50.08%]
         144469663 LLC-loads                 #    3.155 M/sec
    [40.05%]
          69897760 LLC-load-misses           #   48.38% of all LL-cache
hits    [40.03%]

       3.563886431 seconds time elapsed


Patched version
=============
perf stat -d ./test_patch  -cffff -n2
==== Cache line switch test ===

[4/4747]
Time for 1048576 iterations = 456390606 ticks
Ticks per iteration = 435

=== Performance test of distributor ===
Time per burst:  6566
Time per packet: 205

Worker 0 handled 2528418 packets
Worker 1 handled 2531906 packets
Worker 2 handled 2524133 packets
Worker 3 handled 2507720 packets
Worker 4 handled 2332648 packets
Worker 5 handled 2300046 packets
Worker 6 handled 2213035 packets
Worker 7 handled 2125486 packets
Worker 8 handled 2096726 packets
Worker 9 handled 2089150 packets
Worker 10 handled 2079626 packets
Worker 11 handled 2074512 packets
Worker 12 handled 2073560 packets
Worker 13 handled 2054838 packets
Worker 14 handled 2022628 packets
Total packets: 33554432 (2000000)
=== Perf test done ===


 Performance counter stats for './test_patch -cffff -n2 --no-huge':

      42784.064267 task-clock                #   12.546 CPUs utilized
              4517 context-switches          #    0.106 K/sec
                 5 CPU-migrations            #    0.000 K/sec
              1020 page-faults               #    0.024 K/sec
      110531300647 cycles                    #    2.583 GHz
    [40.06%]
       83135557406 stalled-cycles-frontend   #   75.21% frontend cycles
idle    [40.02%]
       71862086289 stalled-cycles-backend    #   65.02% backend  cycles
idle    [39.99%]
       28847200612 instructions              #    0.26  insns per cycle
                                             #    2.88  stalled cycles per
insn [49.98%]
        6059139104 branches                  #  141.621 M/sec
    [49.94%]
          38455463 branch-misses             #    0.63% of all branches
    [49.95%]
        6090169906 L1-dcache-loads           #  142.347 M/sec
    [49.99%]
         235094118 L1-dcache-load-misses     #    3.86% of all L1-dcache
hits   [50.03%]
         129974492 LLC-loads                 #    3.038 M/sec
    [40.07%]
          56701883 LLC-load-misses           #   43.63% of all LL-cache
hits    [40.08%]

       3.410231367 seconds time elapsed


The patched version even runs a bit faster.

Perf annotation shows that 92% time is spent in the while loop
of rte_distributor_get_pkt.
Anothoer 6% spent in the calculation of match
inside rte_distributor_process.

Also we can see that there is a big amount of LLC load miss, which is
perhaps due to cache conflict
in the while-loop of rte_distributor_get_pkt?

I'm no going to draw any conclusion here but evidently the
in_flight_bitmask is not in the hot path, and
is irrelevant to the performance of distributor.

Thanks a lot for your code review and ack.

thx &
rgds,
-qinglai


On Mon, Nov 10, 2014 at 6:55 PM, Bruce Richardson <
bruce.richardson@intel.com> wrote:

> On Mon, Nov 10, 2014 at 05:52:26PM +0200, jigsaw wrote:
> > Hi Bruce,
> >
> > Sorry I didn't. I will run a performance test tomorrow.
> >
> > thx &
> > rgds,
> > -qinglai
> >
>
> On the assumption that no performance regressions show up...
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
>
>
> > On Mon, Nov 10, 2014 at 5:13 PM, Bruce Richardson <
> > bruce.richardson@intel.com> wrote:
> >
> > > On Mon, Nov 10, 2014 at 04:44:02PM +0200, Qinglai Xiao wrote:
> > > > With introduction of in_flight_bitmask, the whole 32 bits of tag can
> be
> > > > used. Further more, this patch fixed the integer overflow when
> finding
> > > > the matched tags.
> > > > The maximum number workers is now defined as 64, which is length of
> > > > double-word. The link between number of workers and RTE_MAX_LCORE is
> > > > now removed. Compile time check is added to ensure the
> > > > RTE_DISTRIB_MAX_WORKERS is less than or equal to size of double-word.
> > > >
> > > > Signed-off-by: Qinglai Xiao <jigsaw@gmail.com>
> > >
> > > Looks good to me.
> > > Just before I ack this, have you checked to see if there is any
> > > performance impact?
> > >
> > > /Bruce
> > >
> > > > ---
> > > >  lib/librte_distributor/rte_distributor.c |   64
> > > ++++++++++++++++++++++--------
> > > >  lib/librte_distributor/rte_distributor.h |    4 ++
> > > >  2 files changed, 51 insertions(+), 17 deletions(-)
> > > >
> > > > diff --git a/lib/librte_distributor/rte_distributor.c
> > > b/lib/librte_distributor/rte_distributor.c
> > > > index 3dfec4a..2c5d61c 100644
> > > > --- a/lib/librte_distributor/rte_distributor.c
> > > > +++ b/lib/librte_distributor/rte_distributor.c
> > > > @@ -62,6 +62,13 @@
> > > >  #define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1)
> > > >
> > > >  /**
> > > > + * Maximum number of workers allowed.
> > > > + * Be aware of increasing the limit, becaus it is limited by how we
> > > track
> > > > + * in-flight tags. See @in_flight_bitmask and
> @rte_distributor_process
> > > > + */
> > > > +#define RTE_DISTRIB_MAX_WORKERS      64
> > > > +
> > > > +/**
> > > >   * Buffer structure used to pass the pointer data between cores.
> This
> > > is cache
> > > >   * line aligned, but to improve performance and prevent adjacent
> > > cache-line
> > > >   * prefetches of buffers for other workers, e.g. when worker 1's
> buffer
> > > is on
> > > > @@ -91,11 +98,17 @@ struct rte_distributor {
> > > >       char name[RTE_DISTRIBUTOR_NAMESIZE];  /**< Name of the ring. */
> > > >       unsigned num_workers;                 /**< Number of workers
> > > polling */
> > > >
> > > > -     uint32_t in_flight_tags[RTE_MAX_LCORE];
> > > > -             /**< Tracks the tag being processed per core, 0 == no
> pkt
> > > */
> > > > -     struct rte_distributor_backlog backlog[RTE_MAX_LCORE];
> > > > +     uint32_t in_flight_tags[RTE_DISTRIB_MAX_WORKERS];
> > > > +             /**< Tracks the tag being processed per core */
> > > > +     uint64_t in_flight_bitmask;
> > > > +             /**< on/off bits for in-flight tags.
> > > > +              * Note that if RTE_DISTRIB_MAX_WORKERS is larger than
> 64
> > > then
> > > > +              * the bitmask has to expand.
> > > > +              */
> > > > +
> > > > +     struct rte_distributor_backlog
> backlog[RTE_DISTRIB_MAX_WORKERS];
> > > >
> > > > -     union rte_distributor_buffer bufs[RTE_MAX_LCORE];
> > > > +     union rte_distributor_buffer bufs[RTE_DISTRIB_MAX_WORKERS];
> > > >
> > > >       struct rte_distributor_returned_pkts returns;
> > > >  };
> > > > @@ -189,6 +202,7 @@ static inline void
> > > >  handle_worker_shutdown(struct rte_distributor *d, unsigned wkr)
> > > >  {
> > > >       d->in_flight_tags[wkr] = 0;
> > > > +     d->in_flight_bitmask &= ~(1UL << wkr);
> > > >       d->bufs[wkr].bufptr64 = 0;
> > > >       if (unlikely(d->backlog[wkr].count != 0)) {
> > > >               /* On return of a packet, we need to move the
> > > > @@ -211,7 +225,10 @@ handle_worker_shutdown(struct rte_distributor
> *d,
> > > unsigned wkr)
> > > >                       pkts[i] = (void *)((uintptr_t)(bl->pkts[idx] >>
> > > >                                       RTE_DISTRIB_FLAG_BITS));
> > > >               }
> > > > -             /* recursive call */
> > > > +             /* recursive call.
> > > > +              * Note that the tags were set before first level call
> > > > +              * to rte_distributor_process.
> > > > +              */
> > > >               rte_distributor_process(d, pkts, i);
> > > >               bl->count = bl->start = 0;
> > > >       }
> > > > @@ -242,6 +259,7 @@ process_returns(struct rte_distributor *d)
> > > >                       else {
> > > >                               d->bufs[wkr].bufptr64 =
> > > RTE_DISTRIB_GET_BUF;
> > > >                               d->in_flight_tags[wkr] = 0;
> > > > +                             d->in_flight_bitmask &= ~(1UL << wkr);
> > > >                       }
> > > >                       oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > > >               } else if (data & RTE_DISTRIB_RETURN_BUF) {
> > > > @@ -284,14 +302,18 @@ rte_distributor_process(struct rte_distributor
> *d,
> > > >                       next_value = (((int64_t)(uintptr_t)next_mb)
> > > >                                       << RTE_DISTRIB_FLAG_BITS);
> > > >                       /*
> > > > -                      * Set the low bit on the tag, so we can
> guarantee
> > > that
> > > > -                      * we never store a tag value of zero. That
> means
> > > we can
> > > > -                      * use the zero-value to indicate that no
> packet is
> > > > -                      * being processed by a worker.
> > > > +                      * User is advocated to set tag vaue for each
> > > > +                      * mbuf before calling rte_distributor_process.
> > > > +                      * User defined tags are used to identify
> flows,
> > > > +                      * or sessions.
> > > >                        */
> > > > -                     new_tag = (next_mb->hash.usr | 1);
> > > > +                     new_tag = next_mb->hash.usr;
> > > >
> > > > -                     uint32_t match = 0;
> > > > +                     /*
> > > > +                      * Note that if RTE_DISTRIB_MAX_WORKERS is
> larger
> > > than 64
> > > > +                      * then the size of match has to be expanded.
> > > > +                      */
> > > > +                     uint64_t match = 0;
> > > >                       unsigned i;
> > > >                       /*
> > > >                        * to scan for a match use "xor" and "not" to
> get
> > > a 0/1
> > > > @@ -303,9 +325,12 @@ rte_distributor_process(struct rte_distributor
> *d,
> > > >                               match |= (!(d->in_flight_tags[i] ^
> new_tag)
> > > >                                       << i);
> > > >
> > > > +                     /* Only turned-on bits are considered as match
> */
> > > > +                     match &= d->in_flight_bitmask;
> > > > +
> > > >                       if (match) {
> > > >                               next_mb = NULL;
> > > > -                             unsigned worker = __builtin_ctz(match);
> > > > +                             unsigned worker =
> __builtin_ctzl(match);
> > > >                               if (add_to_backlog(&d->backlog[worker],
> > > >                                               next_value) < 0)
> > > >                                       next_idx--;
> > > > @@ -322,6 +347,7 @@ rte_distributor_process(struct rte_distributor
> *d,
> > > >                       else {
> > > >                               d->bufs[wkr].bufptr64 = next_value;
> > > >                               d->in_flight_tags[wkr] = new_tag;
> > > > +                             d->in_flight_bitmask |= (1UL << wkr);
> > > >                               next_mb = NULL;
> > > >                       }
> > > >                       oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > > > @@ -379,11 +405,13 @@ rte_distributor_returned_pkts(struct
> > > rte_distributor *d,
> > > >  static inline unsigned
> > > >  total_outstanding(const struct rte_distributor *d)
> > > >  {
> > > > -     unsigned wkr, total_outstanding = 0;
> > > > +     unsigned wkr, total_outstanding;
> > > > +
> > > > +     total_outstanding = __builtin_popcountl(d->in_flight_bitmask);
> > > >
> > > >       for (wkr = 0; wkr < d->num_workers; wkr++)
> > > > -             total_outstanding += d->backlog[wkr].count +
> > > > -                             !!(d->in_flight_tags[wkr]);
> > > > +             total_outstanding += d->backlog[wkr].count;
> > > > +
> > > >       return total_outstanding;
> > > >  }
> > > >
> > > > @@ -423,9 +451,11 @@ rte_distributor_create(const char *name,
> > > >
> > > >       /* compilation-time checks */
> > > >       RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0);
> > > > -     RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0);
> > > > +     RTE_BUILD_BUG_ON((RTE_DISTRIB_MAX_WORKERS & 7) != 0);
> > > > +     RTE_BUILD_BUG_ON(RTE_DISTRIB_MAX_WORKERS >
> > > > +                             sizeof(d->in_flight_bitmask) *
> CHAR_BIT);
> > > >
> > > > -     if (name == NULL || num_workers >= RTE_MAX_LCORE) {
> > > > +     if (name == NULL || num_workers >= RTE_DISTRIB_MAX_WORKERS) {
> > > >               rte_errno = EINVAL;
> > > >               return NULL;
> > > >       }
> > > > diff --git a/lib/librte_distributor/rte_distributor.h
> > > b/lib/librte_distributor/rte_distributor.h
> > > > index ec0d74a..cc1d559 100644
> > > > --- a/lib/librte_distributor/rte_distributor.h
> > > > +++ b/lib/librte_distributor/rte_distributor.h
> > > > @@ -88,6 +88,10 @@ rte_distributor_create(const char *name, unsigned
> > > socket_id,
> > > >   * packets. The distributor will ensure that no two packets that
> have
> > > the
> > > >   * same flow id, or tag, in the mbuf will be procesed at the same
> time.
> > > >   *
> > > > + * The user is advocated to set tag for each mbuf before calling
> this
> > > function.
> > > > + * If user doesn't set the tag, the tag value can be various values
> > > depending on
> > > > + * driver implementation and configuration.
> > > > + *
> > > >   * This is not multi-thread safe and should only be called on a
> single
> > > lcore.
> > > >   *
> > > >   * @param d
> > > > --
> > > > 1.7.1
> > > >
> > >
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
  2014-11-11  8:53       ` jigsaw
@ 2014-11-12 15:51         ` Bruce Richardson
  2014-11-13 11:26           ` Thomas Monjalon
  0 siblings, 1 reply; 11+ messages in thread
From: Bruce Richardson @ 2014-11-12 15:51 UTC (permalink / raw)
  To: jigsaw; +Cc: dev

On Tue, Nov 11, 2014 at 10:53:40AM +0200, jigsaw wrote:
> Hi Bruce,
> 
> This patch has little, if any, performance impact.
> See the perf stat -d for original and patched version
> of test_distributor_perf.

Thanks for running the test. Results look ok to me. I confirm my previous ack.

/Bruce

> 
> Original version
> ============
> perf stat -d ./test_orig  -cffff -n2
> ============
> ==== Cache line switch test ===
> 
> [4/4590]
> Time for 1048576 iterations = 362349056 ticks
> Ticks per iteration = 345
> 
> === Performance test of distributor ===
> Time per burst:  7043
> Time per packet: 220
> 
> Worker 0 handled 3377372 packets
> Worker 1 handled 2857280 packets
> Worker 2 handled 2120982 packets
> Worker 3 handled 2112720 packets
> Worker 4 handled 2102014 packets
> Worker 5 handled 2101314 packets
> Worker 6 handled 2099248 packets
> Worker 7 handled 2098560 packets
> Worker 8 handled 2098114 packets
> Worker 9 handled 2097962 packets
> Worker 10 handled 2097892 packets
> Worker 11 handled 2097856 packets
> Worker 12 handled 2097762 packets
> Worker 13 handled 2097726 packets
> Worker 14 handled 2097630 packets
> Total packets: 33554432 (2000000)
> === Perf test done ===
> 
> 
>  Performance counter stats for './test_orig -cffff -n2 --no-huge':
> 
>       45784.935475 task-clock                #   12.847 CPUs utilized
>               4822 context-switches          #    0.105 K/sec
>                  7 CPU-migrations            #    0.000 K/sec
>               1017 page-faults               #    0.022 K/sec
>       118181375409 cycles                    #    2.581 GHz
>     [40.00%]
>        88915381373 stalled-cycles-frontend   #   75.24% frontend cycles
> idle    [39.98%]
>        77015854611 stalled-cycles-backend    #   65.17% backend  cycles
> idle    [39.96%]
>        30469536186 instructions              #    0.26  insns per cycle
>                                              #    2.92  stalled cycles per
> insn [49.96%]
>         6481829773 branches                  #  141.571 M/sec
>     [49.97%]
>           45283365 branch-misses             #    0.70% of all branches
>     [50.00%]
>         6537115556 L1-dcache-loads           #  142.779 M/sec
>     [50.05%]
>          249533128 L1-dcache-load-misses     #    3.82% of all L1-dcache
> hits   [50.08%]
>          144469663 LLC-loads                 #    3.155 M/sec
>     [40.05%]
>           69897760 LLC-load-misses           #   48.38% of all LL-cache
> hits    [40.03%]
> 
>        3.563886431 seconds time elapsed
> 
> 
> Patched version
> =============
> perf stat -d ./test_patch  -cffff -n2
> ==== Cache line switch test ===
> 
> [4/4747]
> Time for 1048576 iterations = 456390606 ticks
> Ticks per iteration = 435
> 
> === Performance test of distributor ===
> Time per burst:  6566
> Time per packet: 205
> 
> Worker 0 handled 2528418 packets
> Worker 1 handled 2531906 packets
> Worker 2 handled 2524133 packets
> Worker 3 handled 2507720 packets
> Worker 4 handled 2332648 packets
> Worker 5 handled 2300046 packets
> Worker 6 handled 2213035 packets
> Worker 7 handled 2125486 packets
> Worker 8 handled 2096726 packets
> Worker 9 handled 2089150 packets
> Worker 10 handled 2079626 packets
> Worker 11 handled 2074512 packets
> Worker 12 handled 2073560 packets
> Worker 13 handled 2054838 packets
> Worker 14 handled 2022628 packets
> Total packets: 33554432 (2000000)
> === Perf test done ===
> 
> 
>  Performance counter stats for './test_patch -cffff -n2 --no-huge':
> 
>       42784.064267 task-clock                #   12.546 CPUs utilized
>               4517 context-switches          #    0.106 K/sec
>                  5 CPU-migrations            #    0.000 K/sec
>               1020 page-faults               #    0.024 K/sec
>       110531300647 cycles                    #    2.583 GHz
>     [40.06%]
>        83135557406 stalled-cycles-frontend   #   75.21% frontend cycles
> idle    [40.02%]
>        71862086289 stalled-cycles-backend    #   65.02% backend  cycles
> idle    [39.99%]
>        28847200612 instructions              #    0.26  insns per cycle
>                                              #    2.88  stalled cycles per
> insn [49.98%]
>         6059139104 branches                  #  141.621 M/sec
>     [49.94%]
>           38455463 branch-misses             #    0.63% of all branches
>     [49.95%]
>         6090169906 L1-dcache-loads           #  142.347 M/sec
>     [49.99%]
>          235094118 L1-dcache-load-misses     #    3.86% of all L1-dcache
> hits   [50.03%]
>          129974492 LLC-loads                 #    3.038 M/sec
>     [40.07%]
>           56701883 LLC-load-misses           #   43.63% of all LL-cache
> hits    [40.08%]
> 
>        3.410231367 seconds time elapsed
> 
> 
> The patched version even runs a bit faster.
> 
> Perf annotation shows that 92% time is spent in the while loop
> of rte_distributor_get_pkt.
> Anothoer 6% spent in the calculation of match
> inside rte_distributor_process.
> 
> Also we can see that there is a big amount of LLC load miss, which is
> perhaps due to cache conflict
> in the while-loop of rte_distributor_get_pkt?
> 
> I'm no going to draw any conclusion here but evidently the
> in_flight_bitmask is not in the hot path, and
> is irrelevant to the performance of distributor.
> 
> Thanks a lot for your code review and ack.
> 
> thx &
> rgds,
> -qinglai
> 
> 
> On Mon, Nov 10, 2014 at 6:55 PM, Bruce Richardson <
> bruce.richardson@intel.com> wrote:
> 
> > On Mon, Nov 10, 2014 at 05:52:26PM +0200, jigsaw wrote:
> > > Hi Bruce,
> > >
> > > Sorry I didn't. I will run a performance test tomorrow.
> > >
> > > thx &
> > > rgds,
> > > -qinglai
> > >
> >
> > On the assumption that no performance regressions show up...
> > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> >
> >
> > > On Mon, Nov 10, 2014 at 5:13 PM, Bruce Richardson <
> > > bruce.richardson@intel.com> wrote:
> > >
> > > > On Mon, Nov 10, 2014 at 04:44:02PM +0200, Qinglai Xiao wrote:
> > > > > With introduction of in_flight_bitmask, the whole 32 bits of tag can
> > be
> > > > > used. Further more, this patch fixed the integer overflow when
> > finding
> > > > > the matched tags.
> > > > > The maximum number workers is now defined as 64, which is length of
> > > > > double-word. The link between number of workers and RTE_MAX_LCORE is
> > > > > now removed. Compile time check is added to ensure the
> > > > > RTE_DISTRIB_MAX_WORKERS is less than or equal to size of double-word.
> > > > >
> > > > > Signed-off-by: Qinglai Xiao <jigsaw@gmail.com>
> > > >
> > > > Looks good to me.
> > > > Just before I ack this, have you checked to see if there is any
> > > > performance impact?
> > > >
> > > > /Bruce
> > > >
> > > > > ---
> > > > >  lib/librte_distributor/rte_distributor.c |   64
> > > > ++++++++++++++++++++++--------
> > > > >  lib/librte_distributor/rte_distributor.h |    4 ++
> > > > >  2 files changed, 51 insertions(+), 17 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_distributor/rte_distributor.c
> > > > b/lib/librte_distributor/rte_distributor.c
> > > > > index 3dfec4a..2c5d61c 100644
> > > > > --- a/lib/librte_distributor/rte_distributor.c
> > > > > +++ b/lib/librte_distributor/rte_distributor.c
> > > > > @@ -62,6 +62,13 @@
> > > > >  #define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1)
> > > > >
> > > > >  /**
> > > > > + * Maximum number of workers allowed.
> > > > > + * Be aware of increasing the limit, becaus it is limited by how we
> > > > track
> > > > > + * in-flight tags. See @in_flight_bitmask and
> > @rte_distributor_process
> > > > > + */
> > > > > +#define RTE_DISTRIB_MAX_WORKERS      64
> > > > > +
> > > > > +/**
> > > > >   * Buffer structure used to pass the pointer data between cores.
> > This
> > > > is cache
> > > > >   * line aligned, but to improve performance and prevent adjacent
> > > > cache-line
> > > > >   * prefetches of buffers for other workers, e.g. when worker 1's
> > buffer
> > > > is on
> > > > > @@ -91,11 +98,17 @@ struct rte_distributor {
> > > > >       char name[RTE_DISTRIBUTOR_NAMESIZE];  /**< Name of the ring. */
> > > > >       unsigned num_workers;                 /**< Number of workers
> > > > polling */
> > > > >
> > > > > -     uint32_t in_flight_tags[RTE_MAX_LCORE];
> > > > > -             /**< Tracks the tag being processed per core, 0 == no
> > pkt
> > > > */
> > > > > -     struct rte_distributor_backlog backlog[RTE_MAX_LCORE];
> > > > > +     uint32_t in_flight_tags[RTE_DISTRIB_MAX_WORKERS];
> > > > > +             /**< Tracks the tag being processed per core */
> > > > > +     uint64_t in_flight_bitmask;
> > > > > +             /**< on/off bits for in-flight tags.
> > > > > +              * Note that if RTE_DISTRIB_MAX_WORKERS is larger than
> > 64
> > > > then
> > > > > +              * the bitmask has to expand.
> > > > > +              */
> > > > > +
> > > > > +     struct rte_distributor_backlog
> > backlog[RTE_DISTRIB_MAX_WORKERS];
> > > > >
> > > > > -     union rte_distributor_buffer bufs[RTE_MAX_LCORE];
> > > > > +     union rte_distributor_buffer bufs[RTE_DISTRIB_MAX_WORKERS];
> > > > >
> > > > >       struct rte_distributor_returned_pkts returns;
> > > > >  };
> > > > > @@ -189,6 +202,7 @@ static inline void
> > > > >  handle_worker_shutdown(struct rte_distributor *d, unsigned wkr)
> > > > >  {
> > > > >       d->in_flight_tags[wkr] = 0;
> > > > > +     d->in_flight_bitmask &= ~(1UL << wkr);
> > > > >       d->bufs[wkr].bufptr64 = 0;
> > > > >       if (unlikely(d->backlog[wkr].count != 0)) {
> > > > >               /* On return of a packet, we need to move the
> > > > > @@ -211,7 +225,10 @@ handle_worker_shutdown(struct rte_distributor
> > *d,
> > > > unsigned wkr)
> > > > >                       pkts[i] = (void *)((uintptr_t)(bl->pkts[idx] >>
> > > > >                                       RTE_DISTRIB_FLAG_BITS));
> > > > >               }
> > > > > -             /* recursive call */
> > > > > +             /* recursive call.
> > > > > +              * Note that the tags were set before first level call
> > > > > +              * to rte_distributor_process.
> > > > > +              */
> > > > >               rte_distributor_process(d, pkts, i);
> > > > >               bl->count = bl->start = 0;
> > > > >       }
> > > > > @@ -242,6 +259,7 @@ process_returns(struct rte_distributor *d)
> > > > >                       else {
> > > > >                               d->bufs[wkr].bufptr64 =
> > > > RTE_DISTRIB_GET_BUF;
> > > > >                               d->in_flight_tags[wkr] = 0;
> > > > > +                             d->in_flight_bitmask &= ~(1UL << wkr);
> > > > >                       }
> > > > >                       oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > > > >               } else if (data & RTE_DISTRIB_RETURN_BUF) {
> > > > > @@ -284,14 +302,18 @@ rte_distributor_process(struct rte_distributor
> > *d,
> > > > >                       next_value = (((int64_t)(uintptr_t)next_mb)
> > > > >                                       << RTE_DISTRIB_FLAG_BITS);
> > > > >                       /*
> > > > > -                      * Set the low bit on the tag, so we can
> > guarantee
> > > > that
> > > > > -                      * we never store a tag value of zero. That
> > means
> > > > we can
> > > > > -                      * use the zero-value to indicate that no
> > packet is
> > > > > -                      * being processed by a worker.
> > > > > +                      * User is advocated to set tag vaue for each
> > > > > +                      * mbuf before calling rte_distributor_process.
> > > > > +                      * User defined tags are used to identify
> > flows,
> > > > > +                      * or sessions.
> > > > >                        */
> > > > > -                     new_tag = (next_mb->hash.usr | 1);
> > > > > +                     new_tag = next_mb->hash.usr;
> > > > >
> > > > > -                     uint32_t match = 0;
> > > > > +                     /*
> > > > > +                      * Note that if RTE_DISTRIB_MAX_WORKERS is
> > larger
> > > > than 64
> > > > > +                      * then the size of match has to be expanded.
> > > > > +                      */
> > > > > +                     uint64_t match = 0;
> > > > >                       unsigned i;
> > > > >                       /*
> > > > >                        * to scan for a match use "xor" and "not" to
> > get
> > > > a 0/1
> > > > > @@ -303,9 +325,12 @@ rte_distributor_process(struct rte_distributor
> > *d,
> > > > >                               match |= (!(d->in_flight_tags[i] ^
> > new_tag)
> > > > >                                       << i);
> > > > >
> > > > > +                     /* Only turned-on bits are considered as match
> > */
> > > > > +                     match &= d->in_flight_bitmask;
> > > > > +
> > > > >                       if (match) {
> > > > >                               next_mb = NULL;
> > > > > -                             unsigned worker = __builtin_ctz(match);
> > > > > +                             unsigned worker =
> > __builtin_ctzl(match);
> > > > >                               if (add_to_backlog(&d->backlog[worker],
> > > > >                                               next_value) < 0)
> > > > >                                       next_idx--;
> > > > > @@ -322,6 +347,7 @@ rte_distributor_process(struct rte_distributor
> > *d,
> > > > >                       else {
> > > > >                               d->bufs[wkr].bufptr64 = next_value;
> > > > >                               d->in_flight_tags[wkr] = new_tag;
> > > > > +                             d->in_flight_bitmask |= (1UL << wkr);
> > > > >                               next_mb = NULL;
> > > > >                       }
> > > > >                       oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > > > > @@ -379,11 +405,13 @@ rte_distributor_returned_pkts(struct
> > > > rte_distributor *d,
> > > > >  static inline unsigned
> > > > >  total_outstanding(const struct rte_distributor *d)
> > > > >  {
> > > > > -     unsigned wkr, total_outstanding = 0;
> > > > > +     unsigned wkr, total_outstanding;
> > > > > +
> > > > > +     total_outstanding = __builtin_popcountl(d->in_flight_bitmask);
> > > > >
> > > > >       for (wkr = 0; wkr < d->num_workers; wkr++)
> > > > > -             total_outstanding += d->backlog[wkr].count +
> > > > > -                             !!(d->in_flight_tags[wkr]);
> > > > > +             total_outstanding += d->backlog[wkr].count;
> > > > > +
> > > > >       return total_outstanding;
> > > > >  }
> > > > >
> > > > > @@ -423,9 +451,11 @@ rte_distributor_create(const char *name,
> > > > >
> > > > >       /* compilation-time checks */
> > > > >       RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0);
> > > > > -     RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0);
> > > > > +     RTE_BUILD_BUG_ON((RTE_DISTRIB_MAX_WORKERS & 7) != 0);
> > > > > +     RTE_BUILD_BUG_ON(RTE_DISTRIB_MAX_WORKERS >
> > > > > +                             sizeof(d->in_flight_bitmask) *
> > CHAR_BIT);
> > > > >
> > > > > -     if (name == NULL || num_workers >= RTE_MAX_LCORE) {
> > > > > +     if (name == NULL || num_workers >= RTE_DISTRIB_MAX_WORKERS) {
> > > > >               rte_errno = EINVAL;
> > > > >               return NULL;
> > > > >       }
> > > > > diff --git a/lib/librte_distributor/rte_distributor.h
> > > > b/lib/librte_distributor/rte_distributor.h
> > > > > index ec0d74a..cc1d559 100644
> > > > > --- a/lib/librte_distributor/rte_distributor.h
> > > > > +++ b/lib/librte_distributor/rte_distributor.h
> > > > > @@ -88,6 +88,10 @@ rte_distributor_create(const char *name, unsigned
> > > > socket_id,
> > > > >   * packets. The distributor will ensure that no two packets that
> > have
> > > > the
> > > > >   * same flow id, or tag, in the mbuf will be procesed at the same
> > time.
> > > > >   *
> > > > > + * The user is advocated to set tag for each mbuf before calling
> > this
> > > > function.
> > > > > + * If user doesn't set the tag, the tag value can be various values
> > > > depending on
> > > > > + * driver implementation and configuration.
> > > > > + *
> > > > >   * This is not multi-thread safe and should only be called on a
> > single
> > > > lcore.
> > > > >   *
> > > > >   * @param d
> > > > > --
> > > > > 1.7.1
> > > > >
> > > >
> >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
  2014-11-10 14:44 [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag Qinglai Xiao
  2014-11-10 15:13 ` Bruce Richardson
@ 2014-11-13  0:50 ` Thomas Monjalon
  2014-11-13  6:56   ` jigsaw
  1 sibling, 1 reply; 11+ messages in thread
From: Thomas Monjalon @ 2014-11-13  0:50 UTC (permalink / raw)
  To: Qinglai Xiao; +Cc: dev

Hi,

2014-11-10 16:44, Qinglai Xiao:
> With introduction of in_flight_bitmask, the whole 32 bits of tag can be
> used. Further more, this patch fixed the integer overflow when finding
> the matched tags.
> The maximum number workers is now defined as 64, which is length of
> double-word. The link between number of workers and RTE_MAX_LCORE is
> now removed. Compile time check is added to ensure the
> RTE_DISTRIB_MAX_WORKERS is less than or equal to size of double-word.
> 
> Signed-off-by: Qinglai Xiao <jigsaw@gmail.com>

The patch doesn't apply cleanly and fail to compile:
lib/librte_distributor/rte_distributor.c:310:27: error: ‘union <anonymous>’ has no member named ‘usr’

Do you have another commit before this one in your tree?

-- 
Thomas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
  2014-11-13  0:50 ` Thomas Monjalon
@ 2014-11-13  6:56   ` jigsaw
  2014-11-13  9:18     ` Thomas Monjalon
  0 siblings, 1 reply; 11+ messages in thread
From: jigsaw @ 2014-11-13  6:56 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

Hi Thomas,

>>Do you have another commit before this one in your tree?

Yes this patch relies on this one:
http://dpdk.org/ml/archives/dev/2014-November/007943.html

Sorry I didn't make it clear. The new field usr in rte_mbuf was under same
cover letter in v2 of the in_flight_bitmask patch.
Then in_flight_bitmask has a v3 patch, but I didn't include the rte_mbuf in
the same cover letter, coz the usr patch has been ACKed.

thx &
rgds,
-ql

On Thu, Nov 13, 2014 at 2:50 AM, Thomas Monjalon <thomas.monjalon@6wind.com>
wrote:

> Hi,
>
> 2014-11-10 16:44, Qinglai Xiao:
> > With introduction of in_flight_bitmask, the whole 32 bits of tag can be
> > used. Further more, this patch fixed the integer overflow when finding
> > the matched tags.
> > The maximum number workers is now defined as 64, which is length of
> > double-word. The link between number of workers and RTE_MAX_LCORE is
> > now removed. Compile time check is added to ensure the
> > RTE_DISTRIB_MAX_WORKERS is less than or equal to size of double-word.
> >
> > Signed-off-by: Qinglai Xiao <jigsaw@gmail.com>
>
> The patch doesn't apply cleanly and fail to compile:
> lib/librte_distributor/rte_distributor.c:310:27: error: ‘union
> <anonymous>’ has no member named ‘usr’
>
> Do you have another commit before this one in your tree?
>
> --
> Thomas
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
  2014-11-13  6:56   ` jigsaw
@ 2014-11-13  9:18     ` Thomas Monjalon
  2014-11-13  9:24       ` jigsaw
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Monjalon @ 2014-11-13  9:18 UTC (permalink / raw)
  To: jigsaw; +Cc: dev

2014-11-13 08:56, jigsaw:
> >>Do you have another commit before this one in your tree?
> 
> Yes this patch relies on this one:
> http://dpdk.org/ml/archives/dev/2014-November/007943.html
> 
> Sorry I didn't make it clear. The new field usr in rte_mbuf was under same
> cover letter in v2 of the in_flight_bitmask patch.
> Then in_flight_bitmask has a v3 patch, but I didn't include the rte_mbuf in
> the same cover letter, coz the usr patch has been ACKed.

OK. In this case, you should re-submit the whole serie and add the acked-by
line in the patch, and/or use the --in-reply-to option to place the v3 patch in
the original thread.

No need to resend it, I can manage to get the whole patchset now.

Thanks
-- 
Thomas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
  2014-11-13  9:18     ` Thomas Monjalon
@ 2014-11-13  9:24       ` jigsaw
  0 siblings, 0 replies; 11+ messages in thread
From: jigsaw @ 2014-11-13  9:24 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

OK thanks. Sorry my bad. -ql

On Thu, Nov 13, 2014 at 11:18 AM, Thomas Monjalon <thomas.monjalon@6wind.com
> wrote:

> 2014-11-13 08:56, jigsaw:
> > >>Do you have another commit before this one in your tree?
> >
> > Yes this patch relies on this one:
> > http://dpdk.org/ml/archives/dev/2014-November/007943.html
> >
> > Sorry I didn't make it clear. The new field usr in rte_mbuf was under
> same
> > cover letter in v2 of the in_flight_bitmask patch.
> > Then in_flight_bitmask has a v3 patch, but I didn't include the rte_mbuf
> in
> > the same cover letter, coz the usr patch has been ACKed.
>
> OK. In this case, you should re-submit the whole serie and add the acked-by
> line in the patch, and/or use the --in-reply-to option to place the v3
> patch in
> the original thread.
>
> No need to resend it, I can manage to get the whole patchset now.
>
> Thanks
> --
> Thomas
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag.
  2014-11-12 15:51         ` Bruce Richardson
@ 2014-11-13 11:26           ` Thomas Monjalon
  0 siblings, 0 replies; 11+ messages in thread
From: Thomas Monjalon @ 2014-11-13 11:26 UTC (permalink / raw)
  To: jigsaw; +Cc: dev

2014-11-12 15:51, Bruce Richardson:
> On Tue, Nov 11, 2014 at 10:53:40AM +0200, jigsaw wrote:
> > This patch has little, if any, performance impact.
> > See the perf stat -d for original and patched version
> > of test_distributor_perf.
> 
> Thanks for running the test. Results look ok to me. I confirm my previous ack.

Applied

Thanks
-- 
Thomas

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-11-13 11:16 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-10 14:44 [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag Qinglai Xiao
2014-11-10 15:13 ` Bruce Richardson
2014-11-10 15:52   ` jigsaw
2014-11-10 16:55     ` Bruce Richardson
2014-11-11  8:53       ` jigsaw
2014-11-12 15:51         ` Bruce Richardson
2014-11-13 11:26           ` Thomas Monjalon
2014-11-13  0:50 ` Thomas Monjalon
2014-11-13  6:56   ` jigsaw
2014-11-13  9:18     ` Thomas Monjalon
2014-11-13  9:24       ` jigsaw

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).