DPDK patches and discussions
* [PATCH 0/3] DLB2 Performance Optimizations
@ 2022-08-20  0:59 Timothy McDaniel
  2022-08-20  0:59 ` [PATCH 1/3] event/dlb2: add producer port probing optimization Timothy McDaniel
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Timothy McDaniel @ 2022-08-20  0:59 UTC
  To: jerinj; +Cc: dev

This patchset contains performance optimizations
for the DLB2 eventdev PMD.

The port probing patch is dependent on patch 115303
"eal: make eal_parse_coremask external",
since the application can pass in a coremask to aid in
discovering the best performing port/core combinations.

The fence bypass patch uses a #define intentionally.
The performance improvement is significant, but
only customers with very specific use cases can benefit
from this. We don't want to expose a command line option
that when used incorrectly could lead to data corruption.

Depends-on: patch-115303 ("eal: make eal_parse_coremask external")

Timothy McDaniel (3):
  event/dlb2: add producer port probing optimization
  event/dlb2: add fence bypass option for producer ports
  event/dlb2: optimize credit allocations

 drivers/event/dlb2/dlb2.c                  |  90 +++++++-
 drivers/event/dlb2/dlb2_priv.h             |   5 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 248 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  13 ++
 drivers/event/dlb2/pf/dlb2_main.c          |   7 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 9 files changed, 382 insertions(+), 33 deletions(-)

-- 
2.23.0



* [PATCH 1/3] event/dlb2: add producer port probing optimization
  2022-08-20  0:59 [PATCH 0/3] DLB2 Performance Optimizations Timothy McDaniel
@ 2022-08-20  0:59 ` Timothy McDaniel
  2022-09-03 13:16   ` Jerin Jacob
                     ` (5 more replies)
  2022-08-20  0:59 ` [PATCH 2/3] event/dlb2: add fence bypass option for producer ports Timothy McDaniel
  2022-08-20  0:59 ` [PATCH 3/3] event/dlb2: optimize credit allocations Timothy McDaniel
  2 siblings, 6 replies; 37+ messages in thread
From: Timothy McDaniel @ 2022-08-20  0:59 UTC
  To: jerinj; +Cc: dev

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, each producer port (PP) is
probed from a given CPU, and the best performing ports are allocated to
producers. The CPU used for probing is either the first core of the
producer coremask (if present) or the second core of the EAL coremask.
This will be extended later to probe for all CPUs in the producer
coremask or EAL coremask.

The producer coremask can be passed along with the BDF of the DLB device:
"-a xx:y.z,producer_coremask=<core_mask>"
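
As a rough illustration of how the first core in such a mask could be
found, here is a minimal sketch. first_core_in_mask() is a hypothetical
helper, not the EAL parser this patch depends on, and it assumes a hex
mask covering at most 64 cores.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the patch): return the index of the
 * lowest set bit in a hex coremask string such as "0x6", or -1 if the
 * string is NULL, empty, malformed, out of range, or has no bits set.
 */
static int first_core_in_mask(const char *mask)
{
	char *end;
	unsigned long long bits;
	int core;

	if (mask == NULL || *mask == '\0')
		return -1;

	errno = 0;
	bits = strtoull(mask, &end, 16); /* base 16 accepts an optional 0x */
	if (errno != 0 || *end != '\0' || bits == 0)
		return -1;

	/* bits is non-zero here, so the loop always terminates */
	for (core = 0; (bits & 1ULL) == 0; core++)
		bits >>= 1;

	return core;
}
```

With producer_coremask=0x6 this sketch would select core 1 as the
probing CPU.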

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.

When events are dropped by workers or consumers that use LDB ports,
completions are sent. These are just ENQs and may impact latency.
To address this, probing is done for LDB ports as well. Probing is
done on ports per class of service ('cos'). When the default cos is
used, ports are allocated from the best ports of the best 'cos';
otherwise they are allocated from the best ports of the specified cos.
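
The ranking step can be sketched roughly as follows. This is a
simplified illustration of the single-sort (DIR) case only, not the code
in the diff: struct pp_probe and rank_ports() are hypothetical names, and
the comparator uses a three-way comparison rather than subtraction so it
cannot overflow.

```c
#include <stdlib.h>

/* Hypothetical record: one probed producer port and its measured cost. */
struct pp_probe {
	int pp;               /* producer port id */
	unsigned long cycles; /* cycles spent on the probe enqueues */
};

/* Order by ascending cycle count without risking integer overflow. */
static int pp_probe_cmp(const void *a, const void *b)
{
	const struct pp_probe *x = a;
	const struct pp_probe *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/*
 * Sort the probe results in place and write the port ids, cheapest
 * first, into alloc[]. alloc must have room for n entries.
 */
static void rank_ports(struct pp_probe *res, size_t n, int *alloc)
{
	size_t i;

	qsort(res, n, sizeof(*res), pp_probe_cmp);
	for (i = 0; i < n; i++)
		alloc[i] = res[i].pp;
}
```

For LDB ports the same sort would be applied within each cos first, and
then across the per-cos totals when the default cos is in use.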

Signed-off-by: Timothy McDaniel <timothy.mcdaniel@intel.com>
---
 drivers/event/dlb2/dlb2.c                  |  40 +++-
 drivers/event/dlb2/dlb2_priv.h             |   5 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 248 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  13 ++
 drivers/event/dlb2/pf/dlb2_main.c          |   7 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 9 files changed, 341 insertions(+), 24 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 5a443acff8..a9a174e136 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -1785,6 +1802,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +1999,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2029,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4721,7 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4903,18 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..e1743c9c7e 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,7 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +387,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +671,7 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +725,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..745b3d59ad 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -122,6 +123,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+#ifdef DLB2_DEFAULT_LDB_PORT_ALLOCATION
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -129,6 +132,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		48, 55, 62, 53, 60, 51, 58, 49, 56, 63, 54, 61, 52, 59, 50, 57,
 	};
 
+#endif
+
 	hw->ver = ver;
 
 	dlb2_init_fn_rsrc_lists(&hw->pf);
@@ -164,7 +169,11 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
+#ifdef DLB2_DEFAULT_LDB_PORT_ALLOCATION
 		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+#else
+		port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
+#endif
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +181,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +602,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +622,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +757,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports first sort within a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4570,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4781,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5100,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..285a0c18c7 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -25,6 +25,19 @@
  */
 int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
 
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
+
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
  * @hw: dlb2_hw handle for a particular device.
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..f36c77b24a 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 9eeda482a3..a951ec7a2a 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index dd3f2b8ece..35be5d1401 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
-- 
2.23.0



* [PATCH 2/3] event/dlb2: add fence bypass option for producer ports
  2022-08-20  0:59 [PATCH 0/3] DLB2 Performance Optimizations Timothy McDaniel
  2022-08-20  0:59 ` [PATCH 1/3] event/dlb2: add producer port probing optimization Timothy McDaniel
@ 2022-08-20  0:59 ` Timothy McDaniel
  2022-08-20  0:59 ` [PATCH 3/3] event/dlb2: optimize credit allocations Timothy McDaniel
  2 siblings, 0 replies; 37+ messages in thread
From: Timothy McDaniel @ 2022-08-20  0:59 UTC
  To: jerinj; +Cc: dev

If a producer thread is only acting as a bridge between the NIC and DLB,
performance can be greatly improved by bypassing the fence instruction.
The DLB enqueue API issues a memory fence once per enqueue burst. If the
producer thread is just reading from the NIC and sending to DLB without
updating the read buffers or buffer headers, or the producer is not
writing to data structures with dependencies on the enqueue write order,
then fencing can be safely disabled.
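
The decision itself is small. A minimal sketch, assuming a compile-time
switch as in this patch (BYPASS_FENCE_ON_PP and needs_fence() are
hypothetical stand-ins, and the switch is set to 1 here only to show the
bypass path; the patch defaults to 0):

```c
#include <stdbool.h>

/* Hypothetical stand-in for the compile-time switch; 1 == bypass. */
#define BYPASS_FENCE_ON_PP 1

/*
 * Decide whether an enqueue needs a memory fence. A fence is issued
 * once per burst, on the first enqueue. With the bypass compiled in,
 * producer ports skip it entirely.
 */
static bool needs_fence(bool first_in_burst, bool is_producer)
{
#if BYPASS_FENCE_ON_PP == 1
	return first_in_burst && !is_producer;
#else
	(void)is_producer; /* unused when the bypass is compiled out */
	return first_in_burst;
#endif
}
```

Worker and consumer ports keep the fence in either configuration; only
ports flagged as producers are affected.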

Signed-off-by: Timothy McDaniel <timothy.mcdaniel@intel.com>
---
 drivers/event/dlb2/dlb2.c | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index a9a174e136..76f51736c4 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -35,6 +35,16 @@
 #include "dlb2_iface.h"
 #include "dlb2_inline_fns.h"
 
+/*
+ * Bypass memory fencing instructions when port is of Producer type.
+ * This should be enabled very carefully with understanding that producer
+ * is not doing any writes which need fencing. The movdir64 instruction used to
+ * enqueue events to DLB is a weakly-ordered instruction and movdir64 write
+ * to DLB can go ahead of relevant application writes like updates to buffers
+ * being sent with event
+ */
+#define DLB2_BYPASS_FENCE_ON_PP 0  /* 1 == Bypass fence, 0 == do not bypass */
+
 /*
  * Resources exposed to eventdev. Some values overridden at runtime using
  * values returned by the DLB kernel driver.
@@ -1965,21 +1975,15 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	sw_credit_quanta = dlb2->sw_credit_quanta;
 	hw_credit_quanta = dlb2->hw_credit_quanta;
 
+	ev_port->qm_port.is_producer = false;
 	ev_port->qm_port.is_directed = port_conf->event_port_cfg &
 		RTE_EVENT_PORT_CFG_SINGLE_LINK;
 
-	/*
-	 * Validate credit config before creating port
-	 */
-
-	/* Default for worker ports */
-	sw_credit_quanta = dlb2->sw_credit_quanta;
-	hw_credit_quanta = dlb2->hw_credit_quanta;
-
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
 		/* Producer type ports. Mostly enqueue */
 		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
 		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+		ev_port->qm_port.is_producer = true;
 	}
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
 		/* Consumer type ports. Mostly dequeue */
@@ -1989,6 +1993,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->credit_update_quanta = sw_credit_quanta;
 	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
 
+	/*
+	 * Validate credit config before creating port
+	 */
+
 	if (port_conf->enqueue_depth > sw_credit_quanta ||
 	    port_conf->enqueue_depth > hw_credit_quanta) {
 		DLB2_LOG_ERR("Invalid port config. Enqueue depth %d must be <= credit quanta %d and batch size %d\n",
@@ -3055,6 +3063,13 @@ __dlb2_event_enqueue_burst(void *event_port,
 
 		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
 
+#if DLB2_BYPASS_FENCE_ON_PP == 1
+		/* Bypass fence instruction for producer ports */
+		dlb2_hw_do_enqueue(qm_port, i == 0 && !qm_port->is_producer, port_data);
+#else
+		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
+#endif
+
 		/* Don't include the token pop QE in the enqueue count */
 		i += j - pop_offs;
 
-- 
2.23.0



* [PATCH 3/3] event/dlb2: optimize credit allocations
  2022-08-20  0:59 [PATCH 0/3] DLB2 Performance Optimizations Timothy McDaniel
  2022-08-20  0:59 ` [PATCH 1/3] event/dlb2: add producer port probing optimization Timothy McDaniel
  2022-08-20  0:59 ` [PATCH 2/3] event/dlb2: add fence bypass option for producer ports Timothy McDaniel
@ 2022-08-20  0:59 ` Timothy McDaniel
  2 siblings, 0 replies; 37+ messages in thread
From: Timothy McDaniel @ 2022-08-20  0:59 UTC
  To: jerinj; +Cc: dev

This commit implements the changes required for using the suggested
port type hint feature. Each port uses a different credit quanta
based on the port type specified using port configuration flags.

Each port type has a separate quanta defined in dlb2_priv.h.
Producer and consumer ports need a larger quanta value to reduce the
number of credit calls they make. Workers can use a small quanta as they
mostly work out of locally cached credits and don't request/return
credits often.
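
A rough sketch of the selection logic follows. The flag bits and quanta
values are illustrative placeholders, not the definitions from
rte_eventdev.h or dlb2_priv.h, and pick_credit_quanta() is a
hypothetical name.

```c
#include <stdint.h>

/* Placeholder flag bits; the real hints are RTE_EVENT_PORT_CFG_HINT_*. */
#define HINT_PRODUCER (1u << 0)
#define HINT_CONSUMER (1u << 1)

/* Placeholder quanta values, chosen only for illustration. */
#define WORKER_QUANTA   32u
#define PRODUCER_QUANTA 256u
#define CONSUMER_QUANTA 128u

/*
 * Pick a credit quanta from the port hint flags. Workers get the small
 * default; producers and consumers get larger values. As in the diff,
 * the consumer check runs last, so it wins if both hints are set.
 */
static uint32_t pick_credit_quanta(uint32_t port_cfg)
{
	uint32_t quanta = WORKER_QUANTA;

	if (port_cfg & HINT_PRODUCER)
		quanta = PRODUCER_QUANTA;
	if (port_cfg & HINT_CONSUMER)
		quanta = CONSUMER_QUANTA;

	return quanta;
}
```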

Signed-off-by: Timothy McDaniel <timothy.mcdaniel@intel.com>
---
 drivers/event/dlb2/dlb2.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 76f51736c4..438184bb27 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -1945,8 +1945,8 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 {
 	struct dlb2_eventdev *dlb2;
 	struct dlb2_eventdev_port *ev_port;
-	int ret;
 	uint32_t hw_credit_quanta, sw_credit_quanta;
+	int ret;
 
 	if (dev == NULL || port_conf == NULL) {
 		DLB2_LOG_ERR("Null parameter\n");
@@ -2047,6 +2047,23 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->inflight_credits = 0;
 	ev_port->dlb2 = dlb2; /* reverse link */
 
+	/* Default for worker ports */
+	sw_credit_quanta = dlb2->sw_credit_quanta;
+	hw_credit_quanta = dlb2->hw_credit_quanta;
+
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
+		/* Producer type ports. Mostly enqueue */
+		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+	}
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
+		/* Consumer type ports. Mostly dequeue */
+		sw_credit_quanta = DLB2_SW_CREDIT_C_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_C_BATCH_SZ;
+	}
+	ev_port->credit_update_quanta = sw_credit_quanta;
+	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
+
 	/* Tear down pre-existing port->queue links */
 	if (dlb2->run_state == DLB2_RUN_STATE_STOPPED)
 		dlb2_port_link_teardown(dlb2, &dlb2->ev_ports[ev_port_id]);
-- 
2.23.0



* Re: [PATCH 1/3] event/dlb2: add producer port probing optimization
  2022-08-20  0:59 ` [PATCH 1/3] event/dlb2: add producer port probing optimization Timothy McDaniel
@ 2022-09-03 13:16   ` Jerin Jacob
  2022-09-26 22:55   ` [PATCH v3 " Abdullah Sevincer
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Jerin Jacob @ 2022-09-03 13:16 UTC
  To: Timothy McDaniel; +Cc: Jerin Jacob, dpdk-dev

On Sat, Aug 20, 2022 at 6:30 AM Timothy McDaniel
<timothy.mcdaniel@intel.com> wrote:
>
> For best performance, applications running on certain cores should use
> the DLB device locally available on the same tile along with other
> resources. To allocate optimal resources, probing is done for each
> producer port (PP) for a given CPU and the best performing ports are
> allocated to producers. The cpu used for probing is either the first
> core of producer coremask (if present) or the second core of EAL
> coremask. This will be extended later to probe for all CPUs in the
> producer coremask or EAL coremask.
>
> Producer coremask can be passed along with the BDF of the DLB devices.
> "-a xx:y.z,producer_coremask=<core_mask>"
>
> Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
> rte_event_port_setup() for producer ports for optimal port allocation.
>
> When events are dropped by workers or consumers that use LDB ports,
> completions are sent which are just ENQs and may impact the latency.
> To address this,  probing is done for LDB ports as well. Probing is
> done on ports per 'cos'. When default cos is used, ports will be
> allocated from best ports from the best 'cos', else from best ports of
> the specific cos.
>
> Signed-off-by: Timothy McDaniel <timothy.mcdaniel@intel.com>
> ---
>  drivers/event/dlb2/dlb2.c                  |  40 +++-
>  drivers/event/dlb2/dlb2_priv.h             |   5 +
>  drivers/event/dlb2/dlb2_user.h             |   1 +
>  drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
>  drivers/event/dlb2/pf/base/dlb2_resource.c | 248 ++++++++++++++++++++-
>  drivers/event/dlb2/pf/base/dlb2_resource.h |  13 ++
>  drivers/event/dlb2/pf/dlb2_main.c          |   7 +-
>  drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
>  drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
>  9 files changed, 341 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
> index 5a443acff8..a9a174e136 100644
> --- a/drivers/event/dlb2/dlb2.c
> +++ b/drivers/event/dlb2/dlb2.c
> @@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
>         return 0;
>  }
>
> +/* Extern functions */
> +extern int rte_eal_parse_coremask(const char *coremask, int *cores);

Include eal header file.

I will wait for the dependent patch to merge to the main tree before taking
the next version.

> +#ifdef DLB2_DEFAULT_LDB_PORT_ALLOCATION

Introduce a new devargs to make it runtime.

Also update the PMD doc for the existing and new devargs.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v3 1/3] event/dlb2: add producer port probing optimization
  2022-08-20  0:59 ` [PATCH 1/3] event/dlb2: add producer port probing optimization Timothy McDaniel
  2022-09-03 13:16   ` Jerin Jacob
@ 2022-09-26 22:55   ` Abdullah Sevincer
  2022-09-26 22:55     ` [PATCH v3 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
  2022-09-26 22:55     ` [PATCH v3 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-26 22:55 UTC (permalink / raw)
  To: dev
  Cc: jerinj, rashmi.shetty, pravin.pathak, mike.ximing.chen,
	timothy.mcdaniel, shivani.doneria, tirthendu.sarkar,
	Abdullah Sevincer

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, probing is done for each
producer port (PP) for a given CPU and the best performing ports are
allocated to producers. The cpu used for probing is either the first
core of producer coremask (if present) or the second core of EAL
coremask. This will be extended later to probe for all CPUs in the
producer coremask or EAL coremask.

Producer coremask can be passed along with the BDF of the DLB devices.
"-a xx:y.z,producer_coremask=<core_mask>"

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).

Hence, DLB uses an initial allocation of port IDs that maximizes the
average distance between an ID and its immediate neighbors. This
initial port allocation can be selected through the devarg
"default_port_allocation=y(or Y)".

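A minimal sketch of that interleaving, for illustration only and not part
of the patch: the first two rows of init_ldb_port_allocation[] in
dlb2_resource.c follow a stride-7 walk inside each group of 16 port IDs.
The stride and group size are inferred from those two rows, and
interleaved_port_id() is a hypothetical helper, not a driver function.

```c
#include <assert.h>

#define PORTS_PER_GROUP 16 /* one row of the table shown in the diff */
#define STRIDE 7           /* assumption: inferred from the first two rows */

/* Return the port ID handed out at allocation slot 'slot'. */
static int interleaved_port_id(int slot)
{
	int group, pos;

	assert(slot >= 0);
	group = slot / PORTS_PER_GROUP;
	pos = slot % PORTS_PER_GROUP;

	/* Walk the group with a fixed stride, wrapping at the group size. */
	return group * PORTS_PER_GROUP + (pos * STRIDE) % PORTS_PER_GROUP;
}
```

With this pattern, numerically adjacent IDs such as 0 and 1 sit 7
allocation slots apart inside their group (modulo the group size).
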
When events are dropped by workers or consumers that use LDB ports,
completions are sent. These are plain ENQs and may impact latency.
To address this, probing is done for LDB ports as well. Probing is
done on ports per 'cos'. When the default cos is used, ports are
allocated from the best ports of the best 'cos'; otherwise they are
allocated from the best ports of the specified cos.
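
The ranking described above can be sketched roughly as follows. This is an
illustration under assumptions, not the driver code: struct probe_result,
cmp_cycles() and rank_ports() are hypothetical names, and the comparator
uses a three-way comparison rather than the subtraction used in
dlb2_pp_cycle_comp().

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_COS 8 /* hypothetical upper bound for this sketch */

/* One probe measurement: a port (or the first slot of a cos) and its cost. */
struct probe_result {
	int pp;
	long cycles;
};

static int cmp_cycles(const void *a, const void *b)
{
	const struct probe_result *x = a;
	const struct probe_result *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/*
 * res[] holds num_ports measurements grouped into num_cos equal classes.
 * out[] receives port IDs, best port first within each class. Classes are
 * reordered by total cycles only when reorder_cos is non-zero (default cos).
 */
static void rank_ports(struct probe_result *res, int num_ports, int num_cos,
		       int reorder_cos, int *out)
{
	struct probe_result cos[MAX_COS];
	int per_cos, c, i;

	assert(num_cos > 0 && num_cos <= MAX_COS);
	assert(num_ports % num_cos == 0);
	per_cos = num_ports / num_cos;

	for (c = 0; c < num_cos; c++) {
		struct probe_result *grp = &res[c * per_cos];

		cos[c].pp = c * per_cos; /* first slot of this class */
		cos[c].cycles = 0;
		for (i = 0; i < per_cos; i++)
			cos[c].cycles += grp[i].cycles;
		qsort(grp, per_cos, sizeof(*grp), cmp_cycles);
	}

	if (reorder_cos)
		qsort(cos, num_cos, sizeof(cos[0]), cmp_cycles);

	for (i = 0; i < num_ports; i++)
		out[i] = res[cos[i / per_cos].pp + i % per_cos].pp;
}
```

For DIR ports the same routine applies with num_cos set to 1, which
reduces it to a single sort across all ports.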

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c                  |  72 +++++-
 drivers/event/dlb2/dlb2_priv.h             |   7 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
 drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 9 files changed, 377 insertions(+), 28 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 759578378f..6a9db4b642 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
 	return 0;
 }
 
+static int
+set_default_ldb_port_allocation(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	bool *default_ldb_port_allocation = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	if ((*value == 'y') || (*value == 'Y'))
+		*default_ldb_port_allocation = true;
+	else
+		*default_ldb_port_allocation = false;
+
+	return 0;
+}
+
 static int
 set_qid_depth_thresh(const char *key __rte_unused,
 		     const char *value,
@@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
+					     DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
+						 set_default_ldb_port_allocation,
+						 &dlb2_args->default_ldb_port_allocation);
+			if (ret != 0) {
+				DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
+					     name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..9ef5bcb901 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,8 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
+#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +388,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +672,8 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
+	bool default_ldb_port_allocation;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..280a8e51b1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
 {
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
 	struct dlb2_list_entry *list;
 	unsigned int i;
 	int ret;
@@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
-		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		if (ldb_port_default == true)
+			port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		else
+			port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports first sort within a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..71bd6148f1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -23,7 +23,20 @@
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
+
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
 
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..717aa4fc08 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto init_driver_state_fail;
 
-	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
+	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
 	if (ret)
 		goto resource_init_fail;
 
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 5aa51b1616..4c64d72e9c 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index 71ac141b66..3d15250e11 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v3 2/3] event/dlb2: add fence bypass option for producer ports
  2022-09-26 22:55   ` [PATCH v3 " Abdullah Sevincer
@ 2022-09-26 22:55     ` Abdullah Sevincer
  2022-09-26 22:55     ` [PATCH v3 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-26 22:55 UTC (permalink / raw)
  To: dev
  Cc: jerinj, rashmi.shetty, pravin.pathak, mike.ximing.chen,
	timothy.mcdaniel, shivani.doneria, tirthendu.sarkar,
	Abdullah Sevincer

If a producer thread is only acting as a bridge between the NIC and DLB,
performance can be greatly improved by bypassing the fence instruction.
The DLB enqueue API issues a memory fence once per enqueue burst. If the
producer thread is just reading from the NIC and sending to DLB without
updating the read buffers or buffer headers, or the producer is not writing
to data structures with dependencies on the enqueue write order, then
fencing can be safely disabled.
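
As a rough sketch of the decision this patch encodes (illustrative only):
fence_before_enqueue() is a hypothetical helper, and bypass_on_pp stands in
for the compile-time DLB2_BYPASS_FENCE_ON_PP switch. It is a runtime
parameter here only so that both settings can be exercised in one build.

```c
#include <stdbool.h>

/*
 * Decide whether a memory fence is issued before writing one group of
 * HCWs. Without the bypass, only the first group of a burst is fenced.
 * With the bypass, producer ports are never fenced.
 */
static bool fence_before_enqueue(bool bypass_on_pp, int first_event_index,
				 bool is_producer)
{
	bool first_in_burst = (first_event_index == 0);

	if (bypass_on_pp && is_producer)
		return false;

	return first_in_burst;
}
```

Non-producer ports behave the same with or without the bypass.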

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 6a9db4b642..4dd1d55ddc 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -35,6 +35,16 @@
 #include "dlb2_iface.h"
 #include "dlb2_inline_fns.h"
 
+/*
+ * Bypass memory fencing instructions when port is of Producer type.
+ * This should be enabled very carefully, with the understanding that the
+ * producer is not doing any writes which need fencing. The MOVDIR64B
+ * instruction used to enqueue events to DLB is weakly ordered, so a MOVDIR64B
+ * write to DLB can go ahead of relevant application writes, such as updates
+ * to buffers being sent with the event.
+ */
+#define DLB2_BYPASS_FENCE_ON_PP 0  /* 1 == Bypass fence, 0 == do not bypass */
+
 /*
  * Resources exposed to eventdev. Some values overridden at runtime using
  * values returned by the DLB kernel driver.
@@ -1985,21 +1995,15 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	sw_credit_quanta = dlb2->sw_credit_quanta;
 	hw_credit_quanta = dlb2->hw_credit_quanta;
 
+	ev_port->qm_port.is_producer = false;
 	ev_port->qm_port.is_directed = port_conf->event_port_cfg &
 		RTE_EVENT_PORT_CFG_SINGLE_LINK;
 
-	/*
-	 * Validate credit config before creating port
-	 */
-
-	/* Default for worker ports */
-	sw_credit_quanta = dlb2->sw_credit_quanta;
-	hw_credit_quanta = dlb2->hw_credit_quanta;
-
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
 		/* Producer type ports. Mostly enqueue */
 		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
 		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+		ev_port->qm_port.is_producer = true;
 	}
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
 		/* Consumer type ports. Mostly dequeue */
@@ -2009,6 +2013,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->credit_update_quanta = sw_credit_quanta;
 	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
 
+	/*
+	 * Validate credit config before creating port
+	 */
+
 	if (port_conf->enqueue_depth > sw_credit_quanta ||
 	    port_conf->enqueue_depth > hw_credit_quanta) {
 		DLB2_LOG_ERR("Invalid port config. Enqueue depth %d must be <= credit quanta %d and batch size %d\n",
@@ -3073,7 +3081,12 @@ __dlb2_event_enqueue_burst(void *event_port,
 		dlb2_event_build_hcws(qm_port, &events[i], j - pop_offs,
 				      sched_types, queue_ids);
 
+#if DLB2_BYPASS_FENCE_ON_PP == 1
+		/* Bypass fence instruction for producer ports */
+		dlb2_hw_do_enqueue(qm_port, i == 0 && !qm_port->is_producer, port_data);
+#else
 		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
+#endif
 
 		/* Don't include the token pop QE in the enqueue count */
 		i += j - pop_offs;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v3 3/3] event/dlb2: optimize credit allocations
  2022-09-26 22:55   ` [PATCH v3 " Abdullah Sevincer
  2022-09-26 22:55     ` [PATCH v3 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
@ 2022-09-26 22:55     ` Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-26 22:55 UTC (permalink / raw)
  To: dev
  Cc: jerinj, rashmi.shetty, pravin.pathak, mike.ximing.chen,
	timothy.mcdaniel, shivani.doneria, tirthendu.sarkar,
	Abdullah Sevincer

This commit implements the changes required for using the suggested
port type hint feature. Each port uses a different credit quanta
based on the port type specified using port configuration flags.

Each port type has a separate quanta defined in dlb2_priv.h.
Producer and consumer ports need a larger quanta value to reduce the number
of credit calls they make. Workers can use a small quanta as they mostly
work out of locally cached credits and don't request/return credits often.
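
The selection logic can be sketched as below, as an illustration only. The
HINT_* flags stand in for RTE_EVENT_PORT_CFG_HINT_PRODUCER and
RTE_EVENT_PORT_CFG_HINT_CONSUMER, the numeric quanta are placeholders
rather than the values in dlb2_priv.h, and pick_credit_quanta() is a
hypothetical helper.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the eventdev port hint flags. */
#define HINT_PRODUCER (1u << 0)
#define HINT_CONSUMER (1u << 1)

/* Placeholder quanta; the real defaults are defined in dlb2_priv.h. */
#define P_SW_QUANTA 256u
#define P_HW_QUANTA 256u
#define C_SW_QUANTA 256u
#define C_HW_QUANTA 32u

struct credit_quanta {
	uint32_t sw; /* software credit update quanta */
	uint32_t hw; /* hardware credit batch size */
};

/*
 * Start from the worker default and override it per hint. As in the
 * patch, the consumer check runs last, so it wins if both hints are set.
 */
static struct credit_quanta
pick_credit_quanta(uint32_t port_cfg, struct credit_quanta worker_default)
{
	struct credit_quanta q = worker_default;

	if (port_cfg & HINT_PRODUCER) {
		q.sw = P_SW_QUANTA;
		q.hw = P_HW_QUANTA;
	}
	if (port_cfg & HINT_CONSUMER) {
		q.sw = C_SW_QUANTA;
		q.hw = C_HW_QUANTA;
	}

	return q;
}
```

Ports created without either hint keep the device-wide worker defaults.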

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 4dd1d55ddc..164ebbcfe2 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -1965,8 +1965,8 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 {
 	struct dlb2_eventdev *dlb2;
 	struct dlb2_eventdev_port *ev_port;
-	int ret;
 	uint32_t hw_credit_quanta, sw_credit_quanta;
+	int ret;
 
 	if (dev == NULL || port_conf == NULL) {
 		DLB2_LOG_ERR("Null parameter\n");
@@ -2067,6 +2067,24 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->inflight_credits = 0;
 	ev_port->dlb2 = dlb2; /* reverse link */
 
+	/* Default for worker ports */
+	sw_credit_quanta = dlb2->sw_credit_quanta;
+	hw_credit_quanta = dlb2->hw_credit_quanta;
+
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
+		/* Producer type ports. Mostly enqueue */
+		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+	}
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
+		/* Consumer type ports. Mostly dequeue */
+		sw_credit_quanta = DLB2_SW_CREDIT_C_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_C_BATCH_SZ;
+	}
+	ev_port->credit_update_quanta = sw_credit_quanta;
+	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
+
+
 	/* Tear down pre-existing port->queue links */
 	if (dlb2->run_state == DLB2_RUN_STATE_STOPPED)
 		dlb2_port_link_teardown(dlb2, &dlb2->ev_ports[ev_port_id]);
-- 
2.25.1


* [PATCH v4 1/3] event/dlb2: add producer port probing optimization
  2022-08-20  0:59 ` [PATCH 1/3] event/dlb2: add producer port probing optimization Timothy McDaniel
  2022-09-03 13:16   ` Jerin Jacob
  2022-09-26 22:55   ` [PATCH v3 " Abdullah Sevincer
@ 2022-09-27  1:42   ` Abdullah Sevincer
  2022-09-27  1:42     ` [PATCH v4 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
                       ` (8 more replies)
  2022-09-29  5:03   ` [PATCH v10 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
                     ` (2 subsequent siblings)
  5 siblings, 9 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-27  1:42 UTC (permalink / raw)
  To: dev
  Cc: jerinj, rashmi.shetty, pravin.pathak, mike.ximing.chen,
	timothy.mcdaniel, shivani.doneria, tirthendu.sarkar,
	Abdullah Sevincer

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, probing is done for each
producer port (PP) for a given CPU and the best performing ports are
allocated to producers. The CPU used for probing is either the first
core of the producer coremask (if present) or the second core of the EAL
coremask. This will be extended later to probe for all CPUs in the
producer coremask or EAL coremask.
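
The ranking step can be sketched as follows, assuming each port has already
been timed for a fixed number of probe enqueues. struct pp_sample and
rank_ports() are hypothetical names for this illustration only and are not
part of the driver.

```c
#include <stdlib.h>

/* Hypothetical per-port probe result. */
struct pp_sample {
	int pp;      /* producer port ID */
	long cycles; /* cycles measured for a fixed number of probe enqueues */
};

/* Ascending by cycles; compares explicitly instead of subtracting. */
static int pp_sample_cmp(const void *a, const void *b)
{
	const struct pp_sample *x = a;
	const struct pp_sample *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/* Sort the samples in place and write port IDs to order[], fastest first. */
static void rank_ports(struct pp_sample *s, size_t n, int *order)
{
	size_t i;

	qsort(s, n, sizeof(*s), pp_sample_cmp);
	for (i = 0; i < n; i++)
		order[i] = s[i].pp;
}
```

Ports would then be handed out to producers from the front of order[].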

The producer coremask can be passed along with the BDF of the DLB device:
"-a xx:y.z,producer_coremask=<core_mask>"

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).
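
One way to picture the interleaving: the first two rows of
init_ldb_port_allocation[] visible in the diff context (0, 7, 14, 5, ... and
16, 23, 30, 21, ...) match a stride-7 walk modulo 16 within each group of 16
ports. The helper below reproduces those two rows; it is an observation about
the visible table entries, not a formula taken from the driver, and
interleaved_port_id() is a made-up name.

```c
/* Map allocation slot i to a port ID using a stride of 7 inside each
 * group of 16 ports. Because 7 and 16 are coprime, each group is a
 * permutation of its 16 IDs with no two neighbors adjacent in value.
 */
static int interleaved_port_id(int i)
{
	int group = i / 16;
	int slot = i % 16;

	return group * 16 + (slot * 7) % 16;
}
```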

Hence, DLB uses an initial allocation of port IDs to maximize the
average distance between an ID and its immediate neighbors. The
initial port allocation option can be selected through the devarg
"default_port_allocation=y(or Y)".

When events are dropped by workers or consumers that use LDB ports,
completions are sent, which are just ENQs and may impact latency.
To address this, probing is done for LDB ports as well. Probing is
done on ports per 'cos'. When the default cos is used, ports are
allocated from the best ports of the best 'cos'; otherwise they are
allocated from the best ports of the specified cos.

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c                  |  72 +++++-
 drivers/event/dlb2/dlb2_priv.h             |   7 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
 drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 9 files changed, 377 insertions(+), 28 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 759578378f..6a9db4b642 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
 	return 0;
 }
 
+static int
+set_default_ldb_port_allocation(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	bool *default_ldb_port_allocation = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	if ((*value == 'y') || (*value == 'Y'))
+		*default_ldb_port_allocation = true;
+	else
+		*default_ldb_port_allocation = false;
+
+	return 0;
+}
+
 static int
 set_qid_depth_thresh(const char *key __rte_unused,
 		     const char *value,
@@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
+					     DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
+						 set_default_ldb_port_allocation,
+						 &dlb2_args->default_ldb_port_allocation);
+			if (ret != 0) {
+				DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
+					     name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..9ef5bcb901 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,8 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
+#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +388,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +672,8 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
+	bool default_ldb_port_allocation;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..280a8e51b1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
 {
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
 	struct dlb2_list_entry *list;
 	unsigned int i;
 	int ret;
@@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
-		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		if (ldb_port_default == true)
+			port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		else
+			port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports first sort within a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..71bd6148f1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -23,7 +23,20 @@
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
+
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
 
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..717aa4fc08 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto init_driver_state_fail;
 
-	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
+	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
 	if (ret)
 		goto resource_init_fail;
 
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 5aa51b1616..4c64d72e9c 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index 71ac141b66..3d15250e11 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
-- 
2.25.1


* [PATCH v4 2/3] event/dlb2: add fence bypass option for producer ports
  2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-27  1:42     ` Abdullah Sevincer
  2022-09-27  1:42     ` [PATCH v4 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
                       ` (7 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-27  1:42 UTC (permalink / raw)
  To: dev
  Cc: jerinj, rashmi.shetty, pravin.pathak, mike.ximing.chen,
	timothy.mcdaniel, shivani.doneria, tirthendu.sarkar,
	Abdullah Sevincer

If the producer thread is only acting as a bridge between NIC and DLB, then
performance can be greatly improved by bypassing the fence instruction.
The DLB enqueue API issues a memory fence once per enqueue burst. If the
producer thread is just reading from the NIC and sending to DLB without
updating the read buffers or buffer headers, OR the producer is not writing
to data structures with dependencies on the enqueue write order, then
fencing can be safely disabled.
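
The compile-time switch boils down to the decision sketched below.
BYPASS_FENCE_ON_PP and need_fence() are illustrative stand-ins for
DLB2_BYPASS_FENCE_ON_PP and the second argument passed to
dlb2_hw_do_enqueue(); the switch is set to 1 here only to show the bypass
path, whereas the patch ships with it set to 0.

```c
#include <stdbool.h>

/* 1 == bypass the fence on producer ports, 0 == never bypass. */
#define BYPASS_FENCE_ON_PP 1

/* Decide whether the enqueue at position burst_index needs a fence.
 * Only the first enqueue of a burst is ever fenced; with the bypass
 * enabled, producer ports skip even that one.
 */
static bool need_fence(int burst_index, bool is_producer)
{
#if BYPASS_FENCE_ON_PP == 1
	return burst_index == 0 && !is_producer;
#else
	(void)is_producer;
	return burst_index == 0;
#endif
}
```

Non-producer ports behave identically under both settings, so the risk of
reordering is confined to ports explicitly hinted as producers.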

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 6a9db4b642..4dd1d55ddc 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -35,6 +35,16 @@
 #include "dlb2_iface.h"
 #include "dlb2_inline_fns.h"
 
+/*
+ * Bypass memory fencing instructions when port is of Producer type.
+ * This should be enabled very carefully with understanding that producer
+ * is not doing any writes which need fencing. The movdir64 instruction used to
+ * enqueue events to DLB is a weakly-ordered instruction and movdir64 write
+ * to DLB can go ahead of relevant application writes like updates to buffers
+ * being sent with event
+ */
+#define DLB2_BYPASS_FENCE_ON_PP 0  /* 1 == Bypass fence, 0 == do not bypass */
+
 /*
  * Resources exposed to eventdev. Some values overridden at runtime using
  * values returned by the DLB kernel driver.
@@ -1985,21 +1995,15 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	sw_credit_quanta = dlb2->sw_credit_quanta;
 	hw_credit_quanta = dlb2->hw_credit_quanta;
 
+	ev_port->qm_port.is_producer = false;
 	ev_port->qm_port.is_directed = port_conf->event_port_cfg &
 		RTE_EVENT_PORT_CFG_SINGLE_LINK;
 
-	/*
-	 * Validate credit config before creating port
-	 */
-
-	/* Default for worker ports */
-	sw_credit_quanta = dlb2->sw_credit_quanta;
-	hw_credit_quanta = dlb2->hw_credit_quanta;
-
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
 		/* Producer type ports. Mostly enqueue */
 		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
 		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+		ev_port->qm_port.is_producer = true;
 	}
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
 		/* Consumer type ports. Mostly dequeue */
@@ -2009,6 +2013,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->credit_update_quanta = sw_credit_quanta;
 	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
 
+	/*
+	 * Validate credit config before creating port
+	 */
+
 	if (port_conf->enqueue_depth > sw_credit_quanta ||
 	    port_conf->enqueue_depth > hw_credit_quanta) {
 		DLB2_LOG_ERR("Invalid port config. Enqueue depth %d must be <= credit quanta %d and batch size %d\n",
@@ -3073,7 +3081,12 @@ __dlb2_event_enqueue_burst(void *event_port,
 		dlb2_event_build_hcws(qm_port, &events[i], j - pop_offs,
 				      sched_types, queue_ids);
 
+#if DLB2_BYPASS_FENCE_ON_PP == 1
+		/* Bypass fence instruction for producer ports */
+		dlb2_hw_do_enqueue(qm_port, i == 0 && !qm_port->is_producer, port_data);
+#else
 		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
+#endif
 
 		/* Don't include the token pop QE in the enqueue count */
 		i += j - pop_offs;
-- 
2.25.1


* [PATCH v4 3/3] event/dlb2: optimize credit allocations
  2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-27  1:42     ` [PATCH v4 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
@ 2022-09-27  1:42     ` Abdullah Sevincer
  2022-09-28 14:45     ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Jerin Jacob
                       ` (6 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-27  1:42 UTC (permalink / raw)
  To: dev
  Cc: jerinj, rashmi.shetty, pravin.pathak, mike.ximing.chen,
	timothy.mcdaniel, shivani.doneria, tirthendu.sarkar,
	Abdullah Sevincer

This commit implements the changes required to use the suggested
port type hint feature. Each port uses a different credit quanta
based on the port type specified in the port configuration flags.

The quanta for each port type are defined in dlb2_priv.h.
Producer and consumer ports need larger quanta values to reduce the number
of credit calls they make. Workers can use small quanta as they mostly
work out of locally cached credits and don't request/return credits often.

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 4dd1d55ddc..164ebbcfe2 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -1965,8 +1965,8 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 {
 	struct dlb2_eventdev *dlb2;
 	struct dlb2_eventdev_port *ev_port;
-	int ret;
 	uint32_t hw_credit_quanta, sw_credit_quanta;
+	int ret;
 
 	if (dev == NULL || port_conf == NULL) {
 		DLB2_LOG_ERR("Null parameter\n");
@@ -2067,6 +2067,24 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->inflight_credits = 0;
 	ev_port->dlb2 = dlb2; /* reverse link */
 
+	/* Default for worker ports */
+	sw_credit_quanta = dlb2->sw_credit_quanta;
+	hw_credit_quanta = dlb2->hw_credit_quanta;
+
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
+		/* Producer type ports. Mostly enqueue */
+		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+	}
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
+		/* Consumer type ports. Mostly dequeue */
+		sw_credit_quanta = DLB2_SW_CREDIT_C_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_C_BATCH_SZ;
+	}
+	ev_port->credit_update_quanta = sw_credit_quanta;
+	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
+
+
 	/* Tear down pre-existing port->queue links */
 	if (dlb2->run_state == DLB2_RUN_STATE_STOPPED)
 		dlb2_port_link_teardown(dlb2, &dlb2->ev_ports[ev_port_id]);
-- 
2.25.1



* Re: [PATCH v4 1/3] event/dlb2: add producer port probing optimization
  2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-27  1:42     ` [PATCH v4 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
  2022-09-27  1:42     ` [PATCH v4 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
@ 2022-09-28 14:45     ` Jerin Jacob
  2022-09-28 19:11     ` [PATCH v5 " Abdullah Sevincer
                       ` (5 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Jerin Jacob @ 2022-09-28 14:45 UTC (permalink / raw)
  To: Abdullah Sevincer
  Cc: dev, jerinj, rashmi.shetty, pravin.pathak, mike.ximing.chen,
	timothy.mcdaniel, shivani.doneria, tirthendu.sarkar

On Tue, Sep 27, 2022 at 7:12 AM Abdullah Sevincer
<abdullah.sevincer@intel.com> wrote:
>
> For best performance, applications running on certain cores should use
> the DLB device locally available on the same tile along with other
> resources. To allocate optimal resources, probing is done for each
> producer port (PP) for a given CPU and the best performing ports are
> allocated to producers. The cpu used for probing is either the first
> core of producer coremask (if present) or the second core of EAL
> coremask. This will be extended later to probe for all CPUs in the
> producer coremask or EAL coremask.
>
> Producer coremask can be passed along with the BDF of the DLB devices.
> "-a xx:y.z,producer_coremask=<core_mask>"
>
> Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
> rte_event_port_setup() for producer ports for optimal port allocation.
>
> +#define DLB2_PRODUCER_COREMASK "producer_coremask"

Documentation patch is missing. Please add in 1/3.

> +#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"

Please squash https://patches.dpdk.org/project/dpdk/patch/20220927001835.1394994-1-abdullah.sevincer@intel.com/
patch 1/3 here.


Rest looks good. We should merge the next version.


* [PATCH v5 1/3] event/dlb2: add producer port probing optimization
  2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
                       ` (2 preceding siblings ...)
  2022-09-28 14:45     ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Jerin Jacob
@ 2022-09-28 19:11     ` Abdullah Sevincer
  2022-09-28 19:11     ` [PATCH v5 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
                       ` (4 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-28 19:11 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, probing is done for each
producer port (PP) for a given CPU and the best performing ports are
allocated to producers. The CPU used for probing is either the first
core of the producer coremask (if present) or the second core of the EAL
coremask. This will be extended later to probe for all CPUs in the
producer coremask or EAL coremask.

The producer coremask can be passed along with the BDF of the DLB devices:
"-a xx:y.z,producer_coremask=<core_mask>"

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.
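
The choice of probing CPU can be sketched as follows. This is a
simplified model, not the driver code: it assumes both coremasks fit in
64 bits with bit N standing for core N, and pick_probe_core() is a
hypothetical helper. It also mirrors the fallback in
dlb2_resource_probe() where the first EAL core is used when fewer than
two cores are enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Count the cores set in a bitmask. */
static int count_cores(uint64_t mask)
{
	int n = 0;

	for (; mask != 0; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Hypothetical helper: return the core to probe from, or -1 if both
 * masks are empty. Bit N of a mask stands for core N.
 */
static int pick_probe_core(uint64_t producer_mask, uint64_t eal_mask)
{
	uint64_t mask = producer_mask ? producer_mask : eal_mask;
	int want = 1; /* take the first set core by default */
	int seen = 0;
	int core;

	/* Without a producer mask, prefer the second EAL core if any. */
	if (producer_mask == 0 && count_cores(eal_mask) >= 2)
		want = 2;

	for (core = 0; core < 64; core++) {
		if ((mask & (UINT64_C(1) << core)) && ++seen == want)
			return core;
	}

	assert(mask == 0); /* only an empty mask can fall through */
	return -1;
}
```

For example, with producer_coremask=0xc the sketch picks core 2, and
with no producer coremask and an EAL coremask of 0xf it picks core 1.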

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).

Hence, DLB uses an initial allocation of port IDs to maximize the
average distance between an ID and its immediate neighbors. The
initial port allocation option can be passed through the devarg
"default_port_allocation=y(or Y)".

When events are dropped by workers or consumers that use LDB ports,
completions are sent, which are just ENQs and may impact latency.
To address this, probing is done for LDB ports as well. Probing is
done on ports per class of service ('cos'). When the default cos is used,
ports are allocated from the best ports of the best cos; otherwise they
are allocated from the best ports of the specified cos.
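
The ranking step can be sketched with qsort(). This is a simplified
model with hypothetical names (pp_sample, rank_ports, best_cos), and the
cycle counts would come from the timed enqueue loop rather than being
supplied by hand. Unlike the comparator in the patch, this one avoids
integer subtraction so it cannot overflow on extreme values.

```c
#include <assert.h>
#include <stdlib.h>

struct pp_sample {
	int pp;     /* producer port ID */
	int cycles; /* cycles measured for the probe enqueues */
};

/* Order samples by ascending cycle count (fewer cycles is better). */
static int cmp_cycles(const void *a, const void *b)
{
	const struct pp_sample *x = a;
	const struct pp_sample *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/* Sort one group of ports (one cos, or all DIR ports) best first. */
static void rank_ports(struct pp_sample *samples, size_t n)
{
	assert(samples != NULL);
	qsort(samples, n, sizeof(*samples), cmp_cycles);
}

/* Return the index of the cos whose ports took the fewest total cycles. */
static size_t best_cos(const struct pp_sample *samples, size_t n,
		       size_t ports_per_cos)
{
	size_t cos, best = 0;
	long best_total = -1;

	assert(samples != NULL && ports_per_cos > 0 && n % ports_per_cos == 0);

	for (cos = 0; cos < n / ports_per_cos; cos++) {
		long total = 0;
		size_t i;

		for (i = 0; i < ports_per_cos; i++)
			total += samples[cos * ports_per_cos + i].cycles;
		if (best_total < 0 || total < best_total) {
			best_total = total;
			best = cos;
		}
	}
	return best;
}
```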

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 doc/guides/eventdevs/dlb2.rst              |  36 +++
 drivers/event/dlb2/dlb2.c                  |  72 +++++-
 drivers/event/dlb2/dlb2_priv.h             |   7 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
 drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 10 files changed, 413 insertions(+), 28 deletions(-)

diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
index 5b21f13b68..f5bf5757c6 100644
--- a/doc/guides/eventdevs/dlb2.rst
+++ b/doc/guides/eventdevs/dlb2.rst
@@ -414,3 +414,39 @@ Note that the weight may not exceed the maximum CQ depth.
        --allow ea:00.0,cq_weight=all:<weight>
        --allow ea:00.0,cq_weight=qidA-qidB:<weight>
        --allow ea:00.0,cq_weight=qid:<weight>
+
+Producer Coremask
+~~~~~~~~~~~~~~~~~
+
+For best performance, applications running on certain cores should use
+the DLB device locally available on the same tile along with other
+resources. To allocate optimal resources, probing is done for each
+producer port (PP) for a given CPU and the best performing ports are
+allocated to producers. The CPU used for probing is either the first
+core of the producer coremask (if present) or the second core of the EAL
+coremask. This will be extended later to probe for all CPUs in the
+producer coremask or EAL coremask. Producer coremask can be passed
+along with the BDF of the DLB devices.
+
+    .. code-block:: console
+
+       -a xx:y.z,producer_coremask=<core_mask>
+
+Default LDB Port Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For optimal load balancing, ports that map to one or more QIDs in common
+should not be in numerical sequence. The port->QID mapping is application
+dependent, but the driver interleaves port IDs as much as possible to
+reduce the likelihood of sequential ports mapping to the same QID(s).
+
+Hence, DLB uses an initial allocation of Port IDs to maximize the
+average distance between an ID and its immediate neighbors (i.e. the
+distance from 1 to 0 and to 2, the distance from 2 to 1 and to 3, etc.).
+The initial port allocation option can be passed through a devarg. If y
+(or Y) is given, the initial port allocation will be used; otherwise it
+won't be used.
+
+    .. code-block:: console
+
+       --allow ea:00.0,default_port_allocation=<y/Y>
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 759578378f..6a9db4b642 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
 	return 0;
 }
 
+static int
+set_default_ldb_port_allocation(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	bool *default_ldb_port_allocation = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	if ((*value == 'y') || (*value == 'Y'))
+		*default_ldb_port_allocation = true;
+	else
+		*default_ldb_port_allocation = false;
+
+	return 0;
+}
+
 static int
 set_qid_depth_thresh(const char *key __rte_unused,
 		     const char *value,
@@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
+					     DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
+						 set_default_ldb_port_allocation,
+						 &dlb2_args->default_ldb_port_allocation);
+			if (ret != 0) {
+				DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
+					     name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..9ef5bcb901 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,8 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
+#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +388,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +672,8 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
+	bool default_ldb_port_allocation;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..280a8e51b1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
 {
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
 	struct dlb2_list_entry *list;
 	unsigned int i;
 	int ret;
@@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
-		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		if (ldb_port_default == true)
+			port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		else
+			port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports first sort with in a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..71bd6148f1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -23,7 +23,20 @@
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
+
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
 
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..717aa4fc08 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto init_driver_state_fail;
 
-	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
+	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
 	if (ret)
 		goto resource_init_fail;
 
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 5aa51b1616..4c64d72e9c 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index 71ac141b66..3d15250e11 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
-- 
2.25.1



* [PATCH v5 2/3] event/dlb2: add fence bypass option for producer ports
  2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
                       ` (3 preceding siblings ...)
  2022-09-28 19:11     ` [PATCH v5 " Abdullah Sevincer
@ 2022-09-28 19:11     ` Abdullah Sevincer
  2022-09-28 19:19     ` [PATCH v6 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
                       ` (3 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-28 19:11 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

If a producer thread is only acting as a bridge between NIC and DLB, then
performance can be greatly improved by bypassing the fence instruction.
The DLB enqueue API calls a memory fence once per enqueue burst. If the
producer thread is just reading from the NIC and sending to DLB without
updating the read buffers or buffer headers, OR the producer is not
writing to data structures with dependencies on the enqueue write order,
then fencing can be safely disabled.
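
The fence decision made in the enqueue loop can be sketched as a small
predicate. This is an illustration only: needs_fence() is a hypothetical
helper, and BYPASS_FENCE_ON_PP is set to 1 here to show the effect,
whereas the patch ships DLB2_BYPASS_FENCE_ON_PP as 0. The sketch assumes
the second argument of dlb2_hw_do_enqueue() is the fence request, as the
hunk in __dlb2_event_enqueue_burst() suggests.

```c
#include <assert.h>
#include <stdbool.h>

/* Set to 1 here for illustration; the patch default is 0 (no bypass). */
#ifndef BYPASS_FENCE_ON_PP
#define BYPASS_FENCE_ON_PP 1
#endif

/*
 * Return true if a fence should be issued before enqueuing the batch
 * that starts at offset batch_idx within the burst. A fence is only
 * ever requested for the first batch of a burst.
 */
static bool needs_fence(int batch_idx, bool is_producer)
{
	assert(batch_idx >= 0);
#if BYPASS_FENCE_ON_PP == 1
	/* Producer ports skip the fence entirely. */
	return batch_idx == 0 && !is_producer;
#else
	(void)is_producer;
	return batch_idx == 0;
#endif
}
```

With the bypass enabled, nothing orders earlier application stores
against the enqueue write on producer ports, so this is only safe under
the conditions listed above.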

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 6a9db4b642..4dd1d55ddc 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -35,6 +35,16 @@
 #include "dlb2_iface.h"
 #include "dlb2_inline_fns.h"
 
+/*
+ * Bypass memory fencing instructions when port is of Producer type.
+ * This should be enabled very carefully, with the understanding that the
+ * producer is not doing any writes which need fencing. The movdir64b
+ * instruction used to enqueue events to DLB is weakly ordered, so a movdir64b
+ * write to DLB can go ahead of relevant application writes such as updates to
+ * buffers being sent with the event.
+ */
+#define DLB2_BYPASS_FENCE_ON_PP 0  /* 1 == Bypass fence, 0 == do not bypass */
+
 /*
  * Resources exposed to eventdev. Some values overridden at runtime using
  * values returned by the DLB kernel driver.
@@ -1985,21 +1995,15 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	sw_credit_quanta = dlb2->sw_credit_quanta;
 	hw_credit_quanta = dlb2->hw_credit_quanta;
 
+	ev_port->qm_port.is_producer = false;
 	ev_port->qm_port.is_directed = port_conf->event_port_cfg &
 		RTE_EVENT_PORT_CFG_SINGLE_LINK;
 
-	/*
-	 * Validate credit config before creating port
-	 */
-
-	/* Default for worker ports */
-	sw_credit_quanta = dlb2->sw_credit_quanta;
-	hw_credit_quanta = dlb2->hw_credit_quanta;
-
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
 		/* Producer type ports. Mostly enqueue */
 		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
 		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+		ev_port->qm_port.is_producer = true;
 	}
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
 		/* Consumer type ports. Mostly dequeue */
@@ -2009,6 +2013,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->credit_update_quanta = sw_credit_quanta;
 	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
 
+	/*
+	 * Validate credit config before creating port
+	 */
+
 	if (port_conf->enqueue_depth > sw_credit_quanta ||
 	    port_conf->enqueue_depth > hw_credit_quanta) {
 		DLB2_LOG_ERR("Invalid port config. Enqueue depth %d must be <= credit quanta %d and batch size %d\n",
@@ -3073,7 +3081,12 @@ __dlb2_event_enqueue_burst(void *event_port,
 		dlb2_event_build_hcws(qm_port, &events[i], j - pop_offs,
 				      sched_types, queue_ids);
 
+#if DLB2_BYPASS_FENCE_ON_PP == 1
+		/* Bypass fence instruction for producer ports */
+		dlb2_hw_do_enqueue(qm_port, i == 0 && !qm_port->is_producer, port_data);
+#else
 		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
+#endif
 
 		/* Don't include the token pop QE in the enqueue count */
 		i += j - pop_offs;
-- 
2.25.1



* [PATCH v6 1/3] event/dlb2: add producer port probing optimization
  2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
                       ` (4 preceding siblings ...)
  2022-09-28 19:11     ` [PATCH v5 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
@ 2022-09-28 19:19     ` Abdullah Sevincer
  2022-09-28 19:19       ` [PATCH v6 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
  2022-09-28 19:19       ` [PATCH v6 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  2022-09-28 20:28     ` [PATCH v7 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
                       ` (2 subsequent siblings)
  8 siblings, 2 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-28 19:19 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, probing is done for each
producer port (PP) for a given CPU and the best performing ports are
allocated to producers. The CPU used for probing is either the first
core of the producer coremask (if present) or the second core of the EAL
coremask. This will be extended later to probe for all CPUs in the
producer coremask or EAL coremask.

The producer coremask can be passed along with the BDF of the DLB device:
"-a xx:y.z,producer_coremask=<core_mask>"

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).

Hence, DLB uses an initial allocation of port IDs that maximizes the
average distance between an ID and its immediate neighbors. This
initial port allocation is selected through the devarg
"default_port_allocation=y" (or "Y").

When events are dropped by workers or consumers that use LDB ports,
completions are sent, which are just ENQs and may impact latency.
To address this, probing is done for LDB ports as well. Probing is
done on ports per class of service (cos). When the default cos is used,
ports are allocated from the best ports of the best cos; otherwise they
are allocated from the best ports of the specified cos.

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
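Note: the ranking step of the probing described above (time each
producer port, then hand out the cheapest ports first) can be sketched
in isolation. The struct and function names here are hypothetical and
the cycle counts are supplied by the caller; in the patch they come
from timing MOVDIR64B enqueues with rte_get_tsc_cycles(). The sketch
uses an overflow-safe comparison instead of a plain subtraction.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-port probe result: port id and measured cycles. */
struct pp_sample {
	int pp;
	int cycles;
};

/* Order samples by ascending cycle count without risking int overflow. */
static int
cmp_cycles(const void *a, const void *b)
{
	const struct pp_sample *x = a;
	const struct pp_sample *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/* Sort the samples and write the resulting port order into alloc[]. */
static void
rank_ports(struct pp_sample *samples, size_t n, int *alloc)
{
	size_t i;

	assert(samples != NULL && alloc != NULL);

	qsort(samples, n, sizeof(*samples), cmp_cycles);
	for (i = 0; i < n; i++)
		alloc[i] = samples[i].pp;
}
```

After the call, alloc[0] holds the port that needed the fewest cycles,
which is the port a producer would be given first.
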
 doc/guides/eventdevs/dlb2.rst              |  36 +++
 drivers/event/dlb2/dlb2.c                  |  72 +++++-
 drivers/event/dlb2/dlb2_priv.h             |   7 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
 drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 10 files changed, 413 insertions(+), 28 deletions(-)
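
Note: the two rows of init_ldb_port_allocation that are visible as
context in the dlb2_resource.c hunk (0, 7, 14, 5, ... and 16, 23, 30,
21, ...) match a stride-7 walk inside each group of 16 port IDs. The
sketch below reproduces those two rows; whether the remaining rows of
the table follow the same rule is an assumption, and the names GROUP,
STRIDE and interleaved_port are made up for the illustration.

```c
#include <assert.h>

#define GROUP  16	/* ports per group in the rows visible in the diff */
#define STRIDE 7	/* step between consecutive entries, modulo GROUP */

/* Map a sequential index to an interleaved port ID within its group. */
static int
interleaved_port(int i)
{
	int base;

	assert(i >= 0);

	base = (i / GROUP) * GROUP;
	return base + (STRIDE * (i % GROUP)) % GROUP;
}
```

Because 7 and 16 are coprime, each group of 16 indices maps to 16
distinct port IDs, and consecutive indices land 7 ports apart (modulo
16) instead of next to each other.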

diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
index 5b21f13b68..f5bf5757c6 100644
--- a/doc/guides/eventdevs/dlb2.rst
+++ b/doc/guides/eventdevs/dlb2.rst
@@ -414,3 +414,39 @@ Note that the weight may not exceed the maximum CQ depth.
        --allow ea:00.0,cq_weight=all:<weight>
        --allow ea:00.0,cq_weight=qidA-qidB:<weight>
        --allow ea:00.0,cq_weight=qid:<weight>
+
+Producer Coremask
+~~~~~~~~~~~~~~~~~
+
+For best performance, applications running on certain cores should use
+the DLB device locally available on the same tile along with other
+resources. To allocate optimal resources, probing is done for each
+producer port (PP) for a given CPU and the best performing ports are
+allocated to producers. The CPU used for probing is either the first
+core of the producer coremask (if present) or the second core of the EAL
+coremask. This will be extended later to probe for all CPUs in the
+producer coremask or EAL coremask. The producer coremask can be passed
+along with the BDF of the DLB device.
+
+    .. code-block:: console
+
+       -a xx:y.z,producer_coremask=<core_mask>
+
+Default LDB Port Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For optimal load balancing, ports that map to one or more QIDs in common
+should not be in numerical sequence. The port->QID mapping is application
+dependent, but the driver interleaves port IDs as much as possible to
+reduce the likelihood of sequential ports mapping to the same QID(s).
+
+Hence, DLB uses an initial allocation of Port IDs to maximize the
+average distance between an ID and its immediate neighbors (i.e. the
+distance from 1 to 0 and to 2, the distance from 2 to 1 and to 3, etc.).
+The initial port allocation option can be passed through a devarg. If y
+(or Y) is given, the initial port allocation is used; otherwise it is
+not used.
+
+    .. code-block:: console
+
+       --allow ea:00.0,default_port_allocation=<y/Y>
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 759578378f..6a9db4b642 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
 	return 0;
 }
 
+static int
+set_default_ldb_port_allocation(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	bool *default_ldb_port_allocation = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	if ((*value == 'y') || (*value == 'Y'))
+		*default_ldb_port_allocation = true;
+	else
+		*default_ldb_port_allocation = false;
+
+	return 0;
+}
+
 static int
 set_qid_depth_thresh(const char *key __rte_unused,
 		     const char *value,
@@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
+					     DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
+						 set_default_ldb_port_allocation,
+						 &dlb2_args->default_ldb_port_allocation);
+			if (ret != 0) {
+				DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
+					     name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..9ef5bcb901 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,8 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
+#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +388,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +672,8 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
+	bool default_ldb_port_allocation;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..280a8e51b1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
 {
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
 	struct dlb2_list_entry *list;
 	unsigned int i;
 	int ret;
@@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
-		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		if (ldb_port_default == true)
+			port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		else
+			port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports first sort within a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..71bd6148f1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -23,7 +23,20 @@
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
+
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
 
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..717aa4fc08 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto init_driver_state_fail;
 
-	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
+	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
 	if (ret)
 		goto resource_init_fail;
 
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 5aa51b1616..4c64d72e9c 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index 71ac141b66..3d15250e11 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
-- 
2.25.1



* [PATCH v6 2/3] event/dlb2: add fence bypass option for producer ports
  2022-09-28 19:19     ` [PATCH v6 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-28 19:19       ` Abdullah Sevincer
  2022-09-28 19:19       ` [PATCH v6 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-28 19:19 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

If a producer thread only acts as a bridge between the NIC and DLB,
performance can be greatly improved by bypassing the fence instruction.
The DLB enqueue API issues a memory fence once per enqueue burst. If the
producer thread just reads from the NIC and sends to DLB without updating
the read buffers or buffer headers, or the producer does not write
to data structures with dependencies on the enqueue write order, then
fencing can be safely disabled.

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 6a9db4b642..4dd1d55ddc 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -35,6 +35,16 @@
 #include "dlb2_iface.h"
 #include "dlb2_inline_fns.h"
 
+/*
+ * Bypass memory fencing instructions when the port is of producer type.
+ * Enable this only when it is certain that the producer performs no writes
+ * that need fencing. The MOVDIR64B instruction used to enqueue events to DLB
+ * is weakly ordered, so a MOVDIR64B write to DLB can be observed ahead of
+ * earlier application writes, such as updates to the buffers being sent
+ * with the event.
+ */
+#define DLB2_BYPASS_FENCE_ON_PP 0  /* 1 == Bypass fence, 0 == do not bypass */
+
 /*
  * Resources exposed to eventdev. Some values overridden at runtime using
  * values returned by the DLB kernel driver.
@@ -1985,21 +1995,15 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	sw_credit_quanta = dlb2->sw_credit_quanta;
 	hw_credit_quanta = dlb2->hw_credit_quanta;
 
+	ev_port->qm_port.is_producer = false;
 	ev_port->qm_port.is_directed = port_conf->event_port_cfg &
 		RTE_EVENT_PORT_CFG_SINGLE_LINK;
 
-	/*
-	 * Validate credit config before creating port
-	 */
-
-	/* Default for worker ports */
-	sw_credit_quanta = dlb2->sw_credit_quanta;
-	hw_credit_quanta = dlb2->hw_credit_quanta;
-
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
 		/* Producer type ports. Mostly enqueue */
 		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
 		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+		ev_port->qm_port.is_producer = true;
 	}
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
 		/* Consumer type ports. Mostly dequeue */
@@ -2009,6 +2013,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->credit_update_quanta = sw_credit_quanta;
 	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
 
+	/*
+	 * Validate credit config before creating port
+	 */
+
 	if (port_conf->enqueue_depth > sw_credit_quanta ||
 	    port_conf->enqueue_depth > hw_credit_quanta) {
 		DLB2_LOG_ERR("Invalid port config. Enqueue depth %d must be <= credit quanta %d and batch size %d\n",
@@ -3073,7 +3081,12 @@ __dlb2_event_enqueue_burst(void *event_port,
 		dlb2_event_build_hcws(qm_port, &events[i], j - pop_offs,
 				      sched_types, queue_ids);
 
+#if DLB2_BYPASS_FENCE_ON_PP == 1
+		/* Bypass fence instruction for producer ports */
+		dlb2_hw_do_enqueue(qm_port, i == 0 && !qm_port->is_producer, port_data);
+#else
 		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
+#endif
 
 		/* Don't include the token pop QE in the enqueue count */
 		i += j - pop_offs;
-- 
2.25.1



* [PATCH v6 3/3] event/dlb2: optimize credit allocations
  2022-09-28 19:19     ` [PATCH v6 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-28 19:19       ` [PATCH v6 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
@ 2022-09-28 19:19       ` Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-28 19:19 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

This commit implements the changes required for using the suggested
port type hint feature. Each port uses a different credit quanta
based on the port type specified through the port configuration flags.

Each port type has separate quanta defined in dlb2_priv.h.
Producer and consumer ports need larger quanta values to reduce the number
of credit calls they make. Workers can use small quanta, as they mostly
work out of locally cached credits and don't request/return credits often.
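
A minimal standalone sketch of that selection is shown below. The flag
bits, the quanta numbers and the names pick_quanta() and struct quanta are
all made up for this example; the real flags are the
RTE_EVENT_PORT_CFG_HINT_* bits and the real values are the DLB2_SW_CREDIT_*
defines in dlb2_priv.h, which are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in flag bits; illustrative only, not the eventdev definitions. */
#define HINT_PRODUCER (1u << 0)
#define HINT_CONSUMER (1u << 1)

struct quanta {
	uint32_t sw;	/* software credit quanta */
	uint32_t hw;	/* hardware credit quanta (batch size) */
};

/* Made-up numbers: workers small, producers and consumers larger. */
static const struct quanta worker_q = { 32, 32 };
static const struct quanta producer_q = { 256, 256 };
static const struct quanta consumer_q = { 128, 128 };

/* Hypothetical helper: choose credit quanta from the port hint flags. */
static struct quanta
pick_quanta(uint32_t port_cfg)
{
	struct quanta q = worker_q;	/* default for worker ports */

	if (port_cfg & HINT_PRODUCER)
		q = producer_q;
	if (port_cfg & HINT_CONSUMER)
		q = consumer_q;	/* checked last, so it wins if both are set */

	assert(q.sw > 0 && q.hw > 0);
	return q;
}
```

The two independent if statements mirror the order used in the patch, so a
port that sets both hints ends up with the consumer values.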

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 4dd1d55ddc..164ebbcfe2 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -1965,8 +1965,8 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 {
 	struct dlb2_eventdev *dlb2;
 	struct dlb2_eventdev_port *ev_port;
-	int ret;
 	uint32_t hw_credit_quanta, sw_credit_quanta;
+	int ret;
 
 	if (dev == NULL || port_conf == NULL) {
 		DLB2_LOG_ERR("Null parameter\n");
@@ -2067,6 +2067,24 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->inflight_credits = 0;
 	ev_port->dlb2 = dlb2; /* reverse link */
 
+	/* Default for worker ports */
+	sw_credit_quanta = dlb2->sw_credit_quanta;
+	hw_credit_quanta = dlb2->hw_credit_quanta;
+
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
+		/* Producer type ports. Mostly enqueue */
+		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+	}
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
+		/* Consumer type ports. Mostly dequeue */
+		sw_credit_quanta = DLB2_SW_CREDIT_C_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_C_BATCH_SZ;
+	}
+	ev_port->credit_update_quanta = sw_credit_quanta;
+	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
+
+
 	/* Tear down pre-existing port->queue links */
 	if (dlb2->run_state == DLB2_RUN_STATE_STOPPED)
 		dlb2_port_link_teardown(dlb2, &dlb2->ev_ports[ev_port_id]);
-- 
2.25.1



* [PATCH v7 1/3] event/dlb2: add producer port probing optimization
  2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
                       ` (5 preceding siblings ...)
  2022-09-28 19:19     ` [PATCH v6 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-28 20:28     ` Abdullah Sevincer
  2022-09-28 20:28       ` [PATCH v7 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
  2022-09-28 20:28       ` [PATCH v7 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  2022-09-29  1:32     ` [PATCH v8 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-29  3:46     ` [PATCH v9 " Abdullah Sevincer
  8 siblings, 2 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-28 20:28 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, probing is done for each
producer port (PP) for a given CPU and the best performing ports are
allocated to producers. The CPU used for probing is either the first
core of the producer coremask (if present) or the second core of the EAL
coremask. This will be extended later to probe for all CPUs in the
producer coremask or EAL coremask.
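
To make "first core of the producer coremask" concrete, here is a small
standalone sketch that returns the lowest core ID set in a hexadecimal
mask string. first_core_in_mask() is a hypothetical helper written for
this note; it is not rte_eal_parse_coremask() and, unlike a real coremask
parser, it assumes the mask fits in 64 bits.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return the lowest core ID set in a hexadecimal
 * coremask string, or -1 if the string is invalid or has no bit set.
 * Assumes the mask fits in 64 bits; real coremasks can be wider.
 */
static int
first_core_in_mask(const char *mask)
{
	char *end = NULL;
	unsigned long long bits;
	int core;

	if (mask == NULL || *mask == '\0')
		return -1;

	errno = 0;
	bits = strtoull(mask, &end, 16);	/* base 16 accepts an optional "0x" */
	if (errno != 0 || *end != '\0' || bits == 0)
		return -1;

	/* Shift right until the lowest set bit reaches position 0. */
	for (core = 0; (bits & 1ULL) == 0; core++)
		bits >>= 1;

	assert(core < 64);
	return core;
}
```

For example, a mask of 0x0c has bits 2 and 3 set, so core 2 would be the
one used for probing under this reading.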

The producer coremask can be passed along with the BDF of the DLB device:
"-a xx:y.z,producer_coremask=<core_mask>"

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).
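
One way to picture the interleaving: the two rows of the
init_ldb_port_allocation table visible in dlb2_resource.c (0, 7, 14, 5, ...
and 16, 23, 30, 21, ...) can be reproduced by stepping through each group
of 16 port IDs with a stride of 7. That stride is inferred from those two
rows and is not stated anywhere in the driver, so treat the sketch below
as an observation; interleaved_port_id() is a hypothetical name.

```c
#include <assert.h>

#define GROUP_SIZE 16	/* port IDs per group in the rows shown in the table */
#define STRIDE 7	/* inferred from the table; coprime with GROUP_SIZE */

/* Hypothetical helper: interleaved port ID for allocation slot `slot`. */
static int
interleaved_port_id(int slot)
{
	int group_base;
	int offset;

	assert(slot >= 0);

	/* Each group of 16 slots maps onto the same 16 IDs, reordered. */
	group_base = (slot / GROUP_SIZE) * GROUP_SIZE;
	offset = (slot % GROUP_SIZE) * STRIDE % GROUP_SIZE;

	return group_base + offset;
}
```

Because 7 and 16 share no common factor, every ID in a group is visited
exactly once, and consecutive slots always land 7 IDs apart (modulo 16).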

Hence, DLB uses an initial allocation of Port IDs to maximize the
average distance between an ID and its immediate neighbors. The
initial port allocation option can be passed through the devarg
"default_port_allocation=y(or Y)".

When events are dropped by workers or consumers that use LDB ports,
completions are sent, which are just ENQs and may impact latency.
To address this, probing is done for LDB ports as well. Probing is
done on ports per 'cos'. When the default cos is used, ports are
allocated from the best ports of the best 'cos'; otherwise, from the
best ports of the specified cos.
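
The "best ports" ordering boils down to sorting per-port timing samples in
ascending order of measured cycles. The sketch below shows that step in
isolation; struct pp_sample and rank_ports() are hypothetical names, and
the comparator uses comparisons rather than subtraction, which is a choice
made for this example to sidestep integer overflow.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record: one timing sample per producer port. */
struct pp_sample {
	int pp;		/* producer port ID */
	int cycles;	/* cycles measured for the probe enqueues */
};

/* Ascending by cycles; the comparison form cannot overflow. */
static int
pp_sample_cmp(const void *a, const void *b)
{
	const struct pp_sample *x = a;
	const struct pp_sample *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/* Order samples so the port with the fewest cycles comes first. */
static void
rank_ports(struct pp_sample *samples, size_t n)
{
	if (n == 0)
		return;

	assert(samples != NULL);
	qsort(samples, n, sizeof(*samples), pp_sample_cmp);
}
```

For LDB ports the same sort would be applied twice: once within each cos,
and once across the per-cos totals when the default cos is requested.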

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 doc/guides/eventdevs/dlb2.rst              |  36 +++
 drivers/event/dlb2/dlb2.c                  |  72 +++++-
 drivers/event/dlb2/dlb2_priv.h             |   7 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
 drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 10 files changed, 413 insertions(+), 28 deletions(-)

diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
index 5b21f13b68..f5bf5757c6 100644
--- a/doc/guides/eventdevs/dlb2.rst
+++ b/doc/guides/eventdevs/dlb2.rst
@@ -414,3 +414,39 @@ Note that the weight may not exceed the maximum CQ depth.
        --allow ea:00.0,cq_weight=all:<weight>
        --allow ea:00.0,cq_weight=qidA-qidB:<weight>
        --allow ea:00.0,cq_weight=qid:<weight>
+
+Producer Coremask
+~~~~~~~~~~~~~~~~~
+
+For best performance, applications running on certain cores should use
+the DLB device locally available on the same tile along with other
+resources. To allocate optimal resources, probing is done for each
+producer port (PP) for a given CPU and the best performing ports are
+allocated to producers. The cpu used for probing is either the first
+core of producer coremask (if present) or the second core of EAL
+coremask. This will be extended later to probe for all CPUs in the
+producer coremask or EAL coremask. Producer coremask can be passed
+along with the BDF of the DLB devices.
+
+    .. code-block:: console
+
+       -a xx:y.z,producer_coremask=<core_mask>
+
+Default LDB Port Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For optimal load balancing, ports that map to one or more QIDs in common
+should not be in numerical sequence. The port->QID mapping is application
+dependent, but the driver interleaves port IDs as much as possible to
+reduce the likelihood of sequential ports mapping to the same QID(s).
+
+Hence, DLB uses an initial allocation of Port IDs to maximize the
+average distance between an ID and its immediate neighbors (i.e. the
+distance from 1 to 0 and to 2, the distance from 2 to 1 and to 3, etc.).
+The initial port allocation option can be passed through a devarg. If y
+(or Y) is given, the initial port allocation is used; otherwise it is
+not used.
+
+    .. code-block:: console
+
+       --allow ea:00.0,default_port_allocation=<y/Y>
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 759578378f..6a9db4b642 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
 	return 0;
 }
 
+static int
+set_default_ldb_port_allocation(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	bool *default_ldb_port_allocation = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	if ((*value == 'y') || (*value == 'Y'))
+		*default_ldb_port_allocation = true;
+	else
+		*default_ldb_port_allocation = false;
+
+	return 0;
+}
+
 static int
 set_qid_depth_thresh(const char *key __rte_unused,
 		     const char *value,
@@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
+					     DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
+						 set_default_ldb_port_allocation,
+						 &dlb2_args->default_ldb_port_allocation);
+			if (ret != 0) {
+				DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
+					     name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..9ef5bcb901 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,8 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
+#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +388,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +672,8 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
+	bool default_ldb_port_allocation;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..280a8e51b1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
 {
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
 	struct dlb2_list_entry *list;
 	unsigned int i;
 	int ret;
@@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
-		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		if (ldb_port_default == true)
+			port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		else
+			port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports, first sort within a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..71bd6148f1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -23,7 +23,20 @@
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
+
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
 
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..717aa4fc08 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto init_driver_state_fail;
 
-	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
+	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
 	if (ret)
 		goto resource_init_fail;
 
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 5aa51b1616..4c64d72e9c 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index 71ac141b66..3d15250e11 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
-- 
2.25.1



* [PATCH v7 2/3] event/dlb2: add fence bypass option for producer ports
  2022-09-28 20:28     ` [PATCH v7 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-28 20:28       ` Abdullah Sevincer
  2022-09-28 20:28       ` [PATCH v7 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-28 20:28 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

If a producer thread only acts as a bridge between the NIC and DLB, then
performance can be greatly improved by bypassing the fence instruction.
The DLB enqueue API issues a memory fence once per enqueue burst. If the
producer thread just reads from the NIC and sends to DLB without updating
the read buffers or buffer headers, or the producer does not write
to data structures with dependencies on the enqueue write order, then
fencing can be safely disabled.
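
The fence decision described above can be sketched in isolation as follows.
This is only an illustration, not the driver code: BYPASS_FENCE_ON_PP,
struct port_sketch and need_fence() are hypothetical names that mirror the
compile-time switch and the is_producer flag introduced by this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical switch mirroring DLB2_BYPASS_FENCE_ON_PP (1 == bypass). */
#define BYPASS_FENCE_ON_PP 1

/* Hypothetical stand-in for the per-port state kept by the driver. */
struct port_sketch {
	bool is_producer;
};

/*
 * Return true when a fence should precede the enqueue. 'offset' is the
 * index of the first event of the current batch within the burst, so a
 * fence is requested at most once per burst (when offset == 0).
 */
static bool
need_fence(const struct port_sketch *port, int offset)
{
	assert(offset >= 0);

#if BYPASS_FENCE_ON_PP == 1
	/* Producer ports skip the fence entirely when the bypass is built in. */
	return offset == 0 && !port->is_producer;
#else
	(void)port;
	return offset == 0;
#endif
}
```

With the switch set to 1, a producer port never requests a fence, while
worker ports still fence once at the start of each burst. With the switch
set to 0, every port keeps the once-per-burst fence.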

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 6a9db4b642..4dd1d55ddc 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -35,6 +35,16 @@
 #include "dlb2_iface.h"
 #include "dlb2_inline_fns.h"
 
+/*
+ * Bypass memory fencing instructions when port is of Producer type.
+ * This should be enabled very carefully with understanding that producer
+ * is not doing any writes which need fencing. The movdir64 instruction used to
+ * enqueue events to DLB is a weakly-ordered instruction and movdir64 write
+ * to DLB can go ahead of relevant application writes like updates to buffers
+ * being sent with event
+ */
+#define DLB2_BYPASS_FENCE_ON_PP 0  /* 1 == Bypass fence, 0 == do not bypass */
+
 /*
  * Resources exposed to eventdev. Some values overridden at runtime using
  * values returned by the DLB kernel driver.
@@ -1985,21 +1995,15 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	sw_credit_quanta = dlb2->sw_credit_quanta;
 	hw_credit_quanta = dlb2->hw_credit_quanta;
 
+	ev_port->qm_port.is_producer = false;
 	ev_port->qm_port.is_directed = port_conf->event_port_cfg &
 		RTE_EVENT_PORT_CFG_SINGLE_LINK;
 
-	/*
-	 * Validate credit config before creating port
-	 */
-
-	/* Default for worker ports */
-	sw_credit_quanta = dlb2->sw_credit_quanta;
-	hw_credit_quanta = dlb2->hw_credit_quanta;
-
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
 		/* Producer type ports. Mostly enqueue */
 		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
 		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+		ev_port->qm_port.is_producer = true;
 	}
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
 		/* Consumer type ports. Mostly dequeue */
@@ -2009,6 +2013,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->credit_update_quanta = sw_credit_quanta;
 	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
 
+	/*
+	 * Validate credit config before creating port
+	 */
+
 	if (port_conf->enqueue_depth > sw_credit_quanta ||
 	    port_conf->enqueue_depth > hw_credit_quanta) {
 		DLB2_LOG_ERR("Invalid port config. Enqueue depth %d must be <= credit quanta %d and batch size %d\n",
@@ -3073,7 +3081,12 @@ __dlb2_event_enqueue_burst(void *event_port,
 		dlb2_event_build_hcws(qm_port, &events[i], j - pop_offs,
 				      sched_types, queue_ids);
 
+#if DLB2_BYPASS_FENCE_ON_PP == 1
+		/* Bypass fence instruction for producer ports */
+		dlb2_hw_do_enqueue(qm_port, i == 0 && !qm_port->is_producer, port_data);
+#else
 		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
+#endif
 
 		/* Don't include the token pop QE in the enqueue count */
 		i += j - pop_offs;
-- 
2.25.1



* [PATCH v7 3/3] event/dlb2: optimize credit allocations
  2022-09-28 20:28     ` [PATCH v7 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-28 20:28       ` [PATCH v7 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
@ 2022-09-28 20:28       ` Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-28 20:28 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

This commit implements the changes required for using the suggested
port type hint feature. Each port uses a different credit quanta
based on the port type specified using port configuration flags.

Each port type has a separate quanta defined in dlb2_priv.h.
Producer and consumer ports need a larger quanta value to reduce the
number of credit calls they make. Workers can use a small quanta, as they
mostly work out of locally cached credits and don't request/return
credits often.
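
As a rough sketch of the selection described above (not the driver code),
the quanta can be chosen from the port hint flags as shown below. The flag
bits, the numeric quanta values and the names select_quanta() and
enqueue_depth_ok() are assumptions made for illustration; the real flags
are RTE_EVENT_PORT_CFG_HINT_PRODUCER/CONSUMER and the real defaults are
defined in dlb2_priv.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits standing in for RTE_EVENT_PORT_CFG_HINT_*. */
#define HINT_PRODUCER (1u << 0)
#define HINT_CONSUMER (1u << 1)

/* Illustrative values only; the real defaults live in dlb2_priv.h. */
#define P_SW_QUANTA 256u
#define P_HW_BATCH  256u
#define C_SW_QUANTA 128u
#define C_HW_BATCH  128u

struct quanta {
	uint32_t sw; /* software credit quanta */
	uint32_t hw; /* hardware credit quanta */
};

/* Start from the worker default, then override for hinted port types. */
static void
select_quanta(uint32_t port_cfg, struct quanta worker_default,
	      struct quanta *out)
{
	assert(out != NULL);

	*out = worker_default;

	if (port_cfg & HINT_PRODUCER) {
		/* Producer type ports. Mostly enqueue */
		out->sw = P_SW_QUANTA;
		out->hw = P_HW_BATCH;
	}
	if (port_cfg & HINT_CONSUMER) {
		/* Consumer type ports. Mostly dequeue */
		out->sw = C_SW_QUANTA;
		out->hw = C_HW_BATCH;
	}
}

/* The enqueue depth must fit within both quanta values. */
static int
enqueue_depth_ok(uint32_t depth, const struct quanta *q)
{
	return depth <= q->sw && depth <= q->hw;
}
```

Because the consumer check comes second, a port that sets both hints ends
up with the consumer values, which matches the order of the checks in the
diff below.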

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 4dd1d55ddc..164ebbcfe2 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -1965,8 +1965,8 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 {
 	struct dlb2_eventdev *dlb2;
 	struct dlb2_eventdev_port *ev_port;
-	int ret;
 	uint32_t hw_credit_quanta, sw_credit_quanta;
+	int ret;
 
 	if (dev == NULL || port_conf == NULL) {
 		DLB2_LOG_ERR("Null parameter\n");
@@ -2067,6 +2067,24 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->inflight_credits = 0;
 	ev_port->dlb2 = dlb2; /* reverse link */
 
+	/* Default for worker ports */
+	sw_credit_quanta = dlb2->sw_credit_quanta;
+	hw_credit_quanta = dlb2->hw_credit_quanta;
+
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
+		/* Producer type ports. Mostly enqueue */
+		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+	}
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
+		/* Consumer type ports. Mostly dequeue */
+		sw_credit_quanta = DLB2_SW_CREDIT_C_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_C_BATCH_SZ;
+	}
+	ev_port->credit_update_quanta = sw_credit_quanta;
+	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
+
+
 	/* Tear down pre-existing port->queue links */
 	if (dlb2->run_state == DLB2_RUN_STATE_STOPPED)
 		dlb2_port_link_teardown(dlb2, &dlb2->ev_ports[ev_port_id]);
-- 
2.25.1



* [PATCH v8 1/3] event/dlb2: add producer port probing optimization
  2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
                       ` (6 preceding siblings ...)
  2022-09-28 20:28     ` [PATCH v7 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-29  1:32     ` Abdullah Sevincer
  2022-09-29  1:32       ` [PATCH v8 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
                         ` (2 more replies)
  2022-09-29  3:46     ` [PATCH v9 " Abdullah Sevincer
  8 siblings, 3 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29  1:32 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, probing is done for each
producer port (PP) for a given CPU and the best performing ports are
allocated to producers. The CPU used for probing is either the first
core of the producer coremask (if present) or the second core of the EAL
coremask. This will be extended later to probe for all CPUs in the
producer coremask or EAL coremask.

The producer coremask can be passed along with the BDF of the DLB devices:
"-a xx:y.z,producer_coremask=<core_mask>"

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).

Hence, DLB uses an initial allocation of Port IDs to maximize the
average distance between an ID and its immediate neighbors. The
initial port allocation option can be passed through the devarg
"default_port_allocation=y(or Y)".

When events are dropped by workers or consumers that use LDB ports,
completions are sent, which are just ENQs and may impact latency.
To address this, probing is done for LDB ports as well. Probing is
done on ports per 'cos'. When the default cos is used, ports are
allocated from the best ports of the best 'cos'; otherwise they are
allocated from the best ports of the specified cos.
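
The ranking step behind "best performing ports" can be sketched on its own
as follows. This is a simplified illustration under stated assumptions:
struct probe_result, probe_cmp() and rank_ports() are hypothetical names,
the cycle counts are supplied by the caller instead of being measured, and
the comparator avoids subtraction so it cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-port probe record (the patch uses dlb2_pp_thread_data). */
struct probe_result {
	int pp;     /* producer port ID */
	int cycles; /* cycles measured for the probe enqueues */
};

/* Order by ascending cycle count without subtracting the two values. */
static int
probe_cmp(const void *a, const void *b)
{
	const struct probe_result *x = a;
	const struct probe_result *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/* Sort the results in place and emit the port IDs, fastest port first. */
static void
rank_ports(struct probe_result *res, size_t n, int *allocation)
{
	size_t i;

	assert(res != NULL && allocation != NULL);

	qsort(res, n, sizeof(*res), probe_cmp);
	for (i = 0; i < n; i++)
		allocation[i] = res[i].pp;
}
```

For LDB ports the patch applies this kind of sort within each cos first and
then orders the cos groups by their total cycles; for DIR ports a single
sort across all ports is enough.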

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 doc/guides/eventdevs/dlb2.rst              |  36 +++
 drivers/event/dlb2/dlb2.c                  |  72 +++++-
 drivers/event/dlb2/dlb2_priv.h             |   7 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
 drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 10 files changed, 413 insertions(+), 28 deletions(-)

diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
index 5b21f13b68..f5bf5757c6 100644
--- a/doc/guides/eventdevs/dlb2.rst
+++ b/doc/guides/eventdevs/dlb2.rst
@@ -414,3 +414,39 @@ Note that the weight may not exceed the maximum CQ depth.
        --allow ea:00.0,cq_weight=all:<weight>
        --allow ea:00.0,cq_weight=qidA-qidB:<weight>
        --allow ea:00.0,cq_weight=qid:<weight>
+
+Producer Coremask
+~~~~~~~~~~~~~~~~~
+
+For best performance, applications running on certain cores should use
+the DLB device locally available on the same tile along with other
+resources. To allocate optimal resources, probing is done for each
+producer port (PP) for a given CPU and the best performing ports are
+allocated to producers. The cpu used for probing is either the first
+core of producer coremask (if present) or the second core of EAL
+coremask. This will be extended later to probe for all CPUs in the
+producer coremask or EAL coremask. Producer coremask can be passed
+along with the BDF of the DLB devices.
+
+    .. code-block:: console
+
+       -a xx:y.z,producer_coremask=<core_mask>
+
+Default LDB Port Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For optimal load balancing, ports that map to one or more QIDs in common
+should not be in numerical sequence. The port->QID mapping is application
+dependent, but the driver interleaves port IDs as much as possible to
+reduce the likelihood of sequential ports mapping to the same QID(s).
+
+Hence, DLB uses an initial allocation of Port IDs to maximize the
+average distance between an ID and its immediate neighbors (i.e. the
+distance from 1 to 0 and to 2, the distance from 2 to 1 and to 3, etc.).
+The initial port allocation option can be passed through a devarg. If y
+(or Y) is given, the initial port allocation is used; otherwise it is
+not used.
+
+    .. code-block:: console
+
+       --allow ea:00.0,default_port_allocation=<y/Y>
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 759578378f..6a9db4b642 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
 	return 0;
 }
 
+static int
+set_default_ldb_port_allocation(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	bool *default_ldb_port_allocation = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	if ((*value == 'y') || (*value == 'Y'))
+		*default_ldb_port_allocation = true;
+	else
+		*default_ldb_port_allocation = false;
+
+	return 0;
+}
+
 static int
 set_qid_depth_thresh(const char *key __rte_unused,
 		     const char *value,
@@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
+					     DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
+						 set_default_ldb_port_allocation,
+						 &dlb2_args->default_ldb_port_allocation);
+			if (ret != 0) {
+				DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
+					     name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..9ef5bcb901 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,8 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
+#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +388,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +672,8 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
+	bool default_ldb_port_allocation;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..280a8e51b1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
 {
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
 	struct dlb2_list_entry *list;
 	unsigned int i;
 	int ret;
@@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
-		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		if (ldb_port_default == true)
+			port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		else
+			port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports first sort with in a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..71bd6148f1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -23,7 +23,20 @@
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
+
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
 
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..717aa4fc08 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto init_driver_state_fail;
 
-	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
+	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
 	if (ret)
 		goto resource_init_fail;
 
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 5aa51b1616..4c64d72e9c 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index 71ac141b66..3d15250e11 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v8 2/3] event/dlb2: add fence bypass option for producer ports
  2022-09-29  1:32     ` [PATCH v8 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-29  1:32       ` Abdullah Sevincer
  2022-09-29  1:32       ` [PATCH v8 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  2022-09-29  2:48       ` [PATCH v8 1/3] event/dlb2: add producer port probing optimization Sevincer, Abdullah
  2 siblings, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29  1:32 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

If the producer thread is only acting as a bridge between the NIC and DLB,
then performance can be greatly improved by bypassing the fence instruction.
The DLB enqueue API issues a memory fence once per enqueue burst. If the
producer thread is just reading from the NIC and sending to DLB without
updating the read buffers or buffer headers, or the producer is not writing
to data structures with dependencies on the enqueue write order, then
fencing can be safely disabled.
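
As a rough illustration of the rule above (not the driver code itself), the
fence decision can be modeled as a small pure function. The name
needs_fence() and the runtime bypass_enabled parameter are assumptions made
for this sketch; the patch uses the compile-time switch
DLB2_BYPASS_FENCE_ON_PP together with the is_producer port flag.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the fence decision in the enqueue burst loop.
 * burst_index:    index of the first event handled by this enqueue call
 * is_producer:    port was set up with the producer hint
 * bypass_enabled: stands in for DLB2_BYPASS_FENCE_ON_PP == 1
 */
static bool needs_fence(int burst_index, bool is_producer, bool bypass_enabled)
{
	/* The fence is only ever issued once, at the start of a burst. */
	bool first_in_burst = (burst_index == 0);

	if (bypass_enabled)
		return first_in_burst && !is_producer;

	return first_in_burst;
}
```

With the bypass disabled, every port fences once per burst. With it enabled,
only non-producer ports do, which is why the option is safe only when the
producer makes no writes that must be visible before the enqueue.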

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 6a9db4b642..4dd1d55ddc 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -35,6 +35,16 @@
 #include "dlb2_iface.h"
 #include "dlb2_inline_fns.h"
 
+/*
+ * Bypass memory fencing instructions when port is of Producer type.
+ * This should be enabled very carefully with understanding that producer
+ * is not doing any writes which need fencing. The movdir64 instruction used to
+ * enqueue events to DLB is a weakly-ordered instruction and movdir64 write
+ * to DLB can go ahead of relevant application writes like updates to buffers
+ * being sent with event
+ */
+#define DLB2_BYPASS_FENCE_ON_PP 0  /* 1 == Bypass fence, 0 == do not bypass */
+
 /*
  * Resources exposed to eventdev. Some values overridden at runtime using
  * values returned by the DLB kernel driver.
@@ -1985,21 +1995,15 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	sw_credit_quanta = dlb2->sw_credit_quanta;
 	hw_credit_quanta = dlb2->hw_credit_quanta;
 
+	ev_port->qm_port.is_producer = false;
 	ev_port->qm_port.is_directed = port_conf->event_port_cfg &
 		RTE_EVENT_PORT_CFG_SINGLE_LINK;
 
-	/*
-	 * Validate credit config before creating port
-	 */
-
-	/* Default for worker ports */
-	sw_credit_quanta = dlb2->sw_credit_quanta;
-	hw_credit_quanta = dlb2->hw_credit_quanta;
-
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
 		/* Producer type ports. Mostly enqueue */
 		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
 		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+		ev_port->qm_port.is_producer = true;
 	}
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
 		/* Consumer type ports. Mostly dequeue */
@@ -2009,6 +2013,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->credit_update_quanta = sw_credit_quanta;
 	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
 
+	/*
+	 * Validate credit config before creating port
+	 */
+
 	if (port_conf->enqueue_depth > sw_credit_quanta ||
 	    port_conf->enqueue_depth > hw_credit_quanta) {
 		DLB2_LOG_ERR("Invalid port config. Enqueue depth %d must be <= credit quanta %d and batch size %d\n",
@@ -3073,7 +3081,12 @@ __dlb2_event_enqueue_burst(void *event_port,
 		dlb2_event_build_hcws(qm_port, &events[i], j - pop_offs,
 				      sched_types, queue_ids);
 
+#if DLB2_BYPASS_FENCE_ON_PP == 1
+		/* Bypass fence instruction for producer ports */
+		dlb2_hw_do_enqueue(qm_port, i == 0 && !qm_port->is_producer, port_data);
+#else
 		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
+#endif
 
 		/* Don't include the token pop QE in the enqueue count */
 		i += j - pop_offs;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v8 3/3] event/dlb2: optimize credit allocations
  2022-09-29  1:32     ` [PATCH v8 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-29  1:32       ` [PATCH v8 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
@ 2022-09-29  1:32       ` Abdullah Sevincer
  2022-09-29  2:48       ` [PATCH v8 1/3] event/dlb2: add producer port probing optimization Sevincer, Abdullah
  2 siblings, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29  1:32 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

This commit implements the changes required to use the suggested
port type hint feature. Each port uses a different credit quanta
based on the port type specified through the port configuration flags.

Each port type has a separate quanta defined in dlb2_priv.h.
Producer and consumer ports need larger quanta values to reduce the number
of credit calls they make. Workers can use a small quanta as they mostly
work out of locally cached credits and don't request/return credits often.
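
A minimal sketch of this selection follows. The flag bits, the struct, the
function names and all numeric quanta values are hypothetical placeholders
chosen for illustration; the real values live in dlb2_priv.h and the real
flags are RTE_EVENT_PORT_CFG_HINT_PRODUCER and
RTE_EVENT_PORT_CFG_HINT_CONSUMER.

```c
#include <stdint.h>

/* Hypothetical flag bits standing in for the eventdev port hints. */
#define HINT_PRODUCER (1u << 0)
#define HINT_CONSUMER (1u << 1)

struct quanta {
	uint32_t sw; /* software credit update quanta */
	uint32_t hw; /* hardware credit batch size */
};

/* Pick quanta by port hint; ports without a hint keep the worker default. */
static struct quanta pick_quanta(uint32_t port_cfg, struct quanta worker)
{
	const struct quanta producer = { 256, 256 }; /* placeholder values */
	const struct quanta consumer = { 128, 128 }; /* placeholder values */
	struct quanta q = worker;

	if (port_cfg & HINT_PRODUCER)
		q = producer;
	if (port_cfg & HINT_CONSUMER)
		q = consumer; /* checked last, so it wins if both hints are set */

	return q;
}

/* Enqueue depth must not exceed either quanta, as in the port setup check. */
static int enqueue_depth_ok(uint32_t depth, struct quanta q)
{
	return depth <= q.sw && depth <= q.hw;
}
```

Because the consumer check runs after the producer check, a port that sets
both hints ends up with the consumer values, which mirrors the order of the
two if statements in the patch.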

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 4dd1d55ddc..164ebbcfe2 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -1965,8 +1965,8 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 {
 	struct dlb2_eventdev *dlb2;
 	struct dlb2_eventdev_port *ev_port;
-	int ret;
 	uint32_t hw_credit_quanta, sw_credit_quanta;
+	int ret;
 
 	if (dev == NULL || port_conf == NULL) {
 		DLB2_LOG_ERR("Null parameter\n");
@@ -2067,6 +2067,24 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->inflight_credits = 0;
 	ev_port->dlb2 = dlb2; /* reverse link */
 
+	/* Default for worker ports */
+	sw_credit_quanta = dlb2->sw_credit_quanta;
+	hw_credit_quanta = dlb2->hw_credit_quanta;
+
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
+		/* Producer type ports. Mostly enqueue */
+		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+	}
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
+		/* Consumer type ports. Mostly dequeue */
+		sw_credit_quanta = DLB2_SW_CREDIT_C_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_C_BATCH_SZ;
+	}
+	ev_port->credit_update_quanta = sw_credit_quanta;
+	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
+
+
 	/* Tear down pre-existing port->queue links */
 	if (dlb2->run_state == DLB2_RUN_STATE_STOPPED)
 		dlb2_port_link_teardown(dlb2, &dlb2->ev_ports[ev_port_id]);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH v8 1/3] event/dlb2: add producer port probing optimization
  2022-09-29  1:32     ` [PATCH v8 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-29  1:32       ` [PATCH v8 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
  2022-09-29  1:32       ` [PATCH v8 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
@ 2022-09-29  2:48       ` Sevincer, Abdullah
  2 siblings, 0 replies; 37+ messages in thread
From: Sevincer, Abdullah @ 2022-09-29  2:48 UTC (permalink / raw)
  To: dev; +Cc: jerinj

Not sure why these are failing now on the RHEL platform; the failure seems unrelated to the commit. I will check and resubmit a new version.

-----Original Message-----
From: Sevincer, Abdullah <abdullah.sevincer@intel.com> 
Sent: Wednesday, September 28, 2022 6:33 PM
To: dev@dpdk.org
Cc: jerinj@marvell.com; Sevincer, Abdullah <abdullah.sevincer@intel.com>
Subject: [PATCH v8 1/3] event/dlb2: add producer port probing optimization

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, probing is done for each
producer port (PP) for a given CPU and the best performing ports are
allocated to producers. The CPU used for probing is either the first
core of the producer coremask (if present) or the second core of the EAL
coremask. This will be extended later to probe for all CPUs in the
producer coremask or EAL coremask.

The producer coremask can be passed along with the BDF of the DLB device:
"-a xx:y.z,producer_coremask=<core_mask>"

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).

Hence, DLB uses an initial allocation of port IDs to maximize the average
distance between an ID and its immediate neighbors. The initial port
allocation option can be passed through the devarg
"default_port_allocation=y" (or Y).
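
The interleaving idea can be illustrated with a stride permutation. The
first two rows of the init_ldb_port_allocation table quoted further down in
this patch (0, 7, 14, 5, ... and 16, 23, 30, 21, ...) are consistent with
stepping by 7 modulo 16 inside each group of 16 ports. That reading is an
inference from those two rows, not something the patch states, and the
function name and constants below are hypothetical.

```c
#include <stdint.h>

#define GROUP_SIZE 16u /* ports per group, inferred from the table rows */
#define STRIDE 7u      /* coprime with 16, so every ID appears exactly once */

/* Fill out[0..num_ports-1] with port IDs interleaved inside each group. */
static void interleave_ports(uint8_t *out, unsigned int num_ports)
{
	for (unsigned int i = 0; i < num_ports; i++) {
		unsigned int base = (i / GROUP_SIZE) * GROUP_SIZE;
		unsigned int slot = ((i % GROUP_SIZE) * STRIDE) % GROUP_SIZE;

		out[i] = (uint8_t)(base + slot);
	}
}
```

Neighboring positions in the output are 7 IDs apart (modulo the group size),
so consecutive allocations are unlikely to hand out numerically adjacent
ports.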

When events are dropped by workers or consumers that use LDB ports,
completions are sent; these are just ENQs and may impact latency.
To address this, probing is done for LDB ports as well. Probing is done on
ports per 'cos'. When the default cos is used, ports are allocated from the
best ports of the best 'cos'; otherwise they are allocated from the best
ports of the specified cos.
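
The two-level ranking described here can be sketched as follows: sort the
ports inside each cos by measured cycles, then sort the cos groups by their
total cycles. The sizes, struct and function names are hypothetical and much
smaller than the real device; this is only a model of the ordering, not the
driver's probing code.

```c
#include <stdlib.h>

#define NUM_COS 2       /* hypothetical; kept small for illustration */
#define PORTS_PER_COS 2 /* hypothetical */

struct probe {
	int pp;     /* port ID, or first index of a cos group */
	int cycles; /* measured cycles, or total for a cos group */
};

/* Ascending by cycles, written to avoid overflow from subtraction. */
static int by_cycles(const void *a, const void *b)
{
	const struct probe *x = a, *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/* Write port IDs into order[], best cos first and best port first. */
static void rank_ports(struct probe *p, int *order)
{
	struct probe cos[NUM_COS];

	for (int c = 0; c < NUM_COS; c++) {
		struct probe *grp = &p[c * PORTS_PER_COS];

		cos[c].pp = c * PORTS_PER_COS;
		cos[c].cycles = 0;
		for (int i = 0; i < PORTS_PER_COS; i++)
			cos[c].cycles += grp[i].cycles;

		/* First level: best ports within this cos. */
		qsort(grp, PORTS_PER_COS, sizeof(*grp), by_cycles);
	}

	/* Second level: best cos by total cycles. */
	qsort(cos, NUM_COS, sizeof(cos[0]), by_cycles);

	for (int i = 0; i < NUM_COS * PORTS_PER_COS; i++)
		order[i] = p[cos[i / PORTS_PER_COS].pp + i % PORTS_PER_COS].pp;
}
```

When a specific cos is requested instead of the default, the second sort
would be skipped so that ports stay grouped under their own cos.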

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 doc/guides/eventdevs/dlb2.rst              |  36 +++
 drivers/event/dlb2/dlb2.c                  |  72 +++++-
 drivers/event/dlb2/dlb2_priv.h             |   7 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
 drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 10 files changed, 413 insertions(+), 28 deletions(-)

diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
index 5b21f13b68..f5bf5757c6 100644
--- a/doc/guides/eventdevs/dlb2.rst
+++ b/doc/guides/eventdevs/dlb2.rst
@@ -414,3 +414,39 @@ Note that the weight may not exceed the maximum CQ depth.
        --allow ea:00.0,cq_weight=all:<weight>
        --allow ea:00.0,cq_weight=qidA-qidB:<weight>
        --allow ea:00.0,cq_weight=qid:<weight>
+
+Producer Coremask
+~~~~~~~~~~~~~~~~~
+
+For best performance, applications running on certain cores should use 
+the DLB device locally available on the same tile along with other 
+resources. To allocate optimal resources, probing is done for each 
+producer port (PP) for a given CPU and the best performing ports are 
+allocated to producers. The cpu used for probing is either the first 
+core of producer coremask (if present) or the second core of EAL 
+coremask. This will be extended later to probe for all CPUs in the 
+producer coremask or EAL coremask. Producer coremask can be passed 
+along with the BDF of the DLB devices.
+
+    .. code-block:: console
+
+       -a xx:y.z,producer_coremask=<core_mask>
+
+Default LDB Port Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For optimal load balancing ports that map to one or more QIDs in common 
+should not be in numerical sequence. The port->QID mapping is 
+application dependent, but the driver interleaves port IDs as much as 
+possible to reduce the likelihood of sequential ports mapping to the same QID(s).
+
+Hence, DLB uses an initial allocation of port IDs to maximize the
+average distance between an ID and its immediate neighbors (i.e. the
+distance from 1 to 0 and to 2, the distance from 2 to 1 and to 3, etc.).
+The initial port allocation option can be passed through a devarg. If y
+(or Y) is given, the initial port allocation will be used; otherwise it
+won't be used.
+
+    .. code-block:: console
+
+       --allow ea:00.0,default_port_allocation=<y/Y>
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 759578378f..6a9db4b642 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
 	return 0;
 }
 
+static int
+set_default_ldb_port_allocation(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	bool *default_ldb_port_allocation = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	if ((*value == 'y') || (*value == 'Y'))
+		*default_ldb_port_allocation = true;
+	else
+		*default_ldb_port_allocation = false;
+
+	return 0;
+}
+
 static int
 set_qid_depth_thresh(const char *key __rte_unused,
 		     const char *value,
@@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
+					     DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
+						 set_default_ldb_port_allocation,
+						 &dlb2_args->default_ldb_port_allocation);
+			if (ret != 0) {
+				DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
+					     name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..9ef5bcb901 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,8 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
+#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +388,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +672,8 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
+	bool default_ldb_port_allocation;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..280a8e51b1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
 {
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
 	struct dlb2_list_entry *list;
 	unsigned int i;
 	int ret;
@@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
-		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		if (ldb_port_default == true)
+			port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		else
+			port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports first sort with in a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..71bd6148f1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -23,7 +23,20 @@
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
+
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
 
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..717aa4fc08 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto init_driver_state_fail;
 
-	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
+	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
 	if (ret)
 		goto resource_init_fail;
 
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 5aa51b1616..4c64d72e9c 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index 71ac141b66..3d15250e11 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
--
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v9 1/3] event/dlb2: add producer port probing optimization
  2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
                       ` (7 preceding siblings ...)
  2022-09-29  1:32     ` [PATCH v8 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-29  3:46     ` Abdullah Sevincer
  2022-09-29  3:46       ` [PATCH v9 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
  2022-09-29  3:46       ` [PATCH v9 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  8 siblings, 2 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29  3:46 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, each producer port (PP) is
probed from a given CPU and the best performing ports are allocated to
producers. The CPU used for probing is either the first core of the
producer coremask (if present) or the second core of the EAL coremask.
This will be extended later to probe for all CPUs in the producer
coremask or EAL coremask.
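
As a rough illustration of how a hexadecimal coremask maps to the probe
core, the sketch below returns the lowest-numbered core set in the mask.
It is a standalone approximation for masks that fit in an unsigned long
long; first_core_in_mask() is a hypothetical helper and is not the EAL
coremask parser that the patch depends on.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return the lowest core index set in a hexadecimal
 * coremask string (for example "0x30" selects cores 4 and 5, so the result
 * is 4). Returns -1 if the string is empty, malformed, or an all-zero mask.
 */
int first_core_in_mask(const char *mask_str)
{
	char *end = NULL;
	unsigned long long mask;
	int core = 0;

	if (mask_str == NULL || *mask_str == '\0')
		return -1;

	errno = 0;
	mask = strtoull(mask_str, &end, 16); /* base 16 accepts a "0x" prefix */
	if (errno != 0 || end == mask_str || *end != '\0' || mask == 0)
		return -1;

	/* The loop terminates because at least one bit is set. */
	assert(mask != 0);
	while (!(mask & 1ULL)) {
		mask >>= 1;
		core++;
	}

	return core;
}
```

With this sketch, a mask of "0x30" would select core 4 as the probe core,
while an empty or zero mask is rejected.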

Producer coremask can be passed along with the BDF of the DLB devices.
"-a xx:y.z,producer_coremask=<core_mask>"

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).
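
One way to picture the interleaving: the first two rows of the
init_ldb_port_allocation table in the patch (0, 7, 14, 5, ... and
16, 23, 30, 21, ...) are consistent with stepping through each group of
16 IDs with a stride of 7. The sketch below reproduces that pattern. The
stride and group size are inferred from those rows, not stated by the
author, and interleave_port_ids() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

#define PORTS_PER_GROUP 16 /* assumed group size (one table row) */
#define PORT_STRIDE 7      /* assumed stride; coprime with 16 */

/*
 * Fill alloc[0..num_ports-1] with an interleaved port order. Within each
 * group of 16 IDs, slot i holds (7 * i) mod 16 plus the group base, so
 * neighboring slots never hold neighboring port IDs.
 */
void interleave_port_ids(unsigned char *alloc, unsigned int num_ports)
{
	unsigned int i;

	assert(alloc != NULL);
	assert(num_ports % PORTS_PER_GROUP == 0);

	for (i = 0; i < num_ports; i++) {
		unsigned int slot = i % PORTS_PER_GROUP;
		unsigned int base = i - slot;

		alloc[i] = (unsigned char)(base +
					   (slot * PORT_STRIDE) % PORTS_PER_GROUP);
	}
}
```

Because 7 and 16 are coprime, every ID in a group appears exactly once.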

Hence, DLB uses an initial allocation of port IDs that maximizes the
average distance between an ID and its immediate neighbors. This
initial port allocation can be selected through the devarg
"default_port_allocation=y(or Y)".

When events are dropped by workers or consumers that use LDB ports,
completions are sent, which are just ENQs and may impact latency.
To address this, probing is done for LDB ports as well. Probing is
done on ports per class of service ('cos'). When the default cos is
used, ports are allocated from the best ports of the best 'cos';
otherwise they are allocated from the best ports of the specified cos.
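
The ranking step can be sketched as a sort over per-port cycle counts.
This is a minimal standalone illustration: struct pp_sample and
rank_ports() are hypothetical names, and the comparator compares values
rather than subtracting them, which is a deliberate difference from the
patch's comparator so that large cycle counts cannot overflow.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-port measurement record. */
struct pp_sample {
	int pp;               /* producer port ID */
	unsigned long cycles; /* cycles measured for the probe enqueues */
};

/* Ascending order by cycle count, without arithmetic that could overflow. */
static int pp_sample_cmp(const void *a, const void *b)
{
	const struct pp_sample *x = a;
	const struct pp_sample *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/* Sort the samples in place so that the fastest ports come first. */
void rank_ports(struct pp_sample *samples, size_t n)
{
	if (n == 0)
		return;

	assert(samples != NULL);
	qsort(samples, n, sizeof(*samples), pp_sample_cmp);
}
```

After sorting, the head of the array holds the ports that completed the
probe enqueues in the fewest cycles.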

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 doc/guides/eventdevs/dlb2.rst              |  36 +++
 drivers/event/dlb2/dlb2.c                  |  72 +++++-
 drivers/event/dlb2/dlb2_priv.h             |   7 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
 drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 10 files changed, 413 insertions(+), 28 deletions(-)

diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
index 5b21f13b68..f5bf5757c6 100644
--- a/doc/guides/eventdevs/dlb2.rst
+++ b/doc/guides/eventdevs/dlb2.rst
@@ -414,3 +414,39 @@ Note that the weight may not exceed the maximum CQ depth.
        --allow ea:00.0,cq_weight=all:<weight>
        --allow ea:00.0,cq_weight=qidA-qidB:<weight>
        --allow ea:00.0,cq_weight=qid:<weight>
+
+Producer Coremask
+~~~~~~~~~~~~~~~~~
+
+For best performance, applications running on certain cores should use
+the DLB device locally available on the same tile along with other
+resources. To allocate optimal resources, probing is done for each
+producer port (PP) for a given CPU and the best performing ports are
+allocated to producers. The CPU used for probing is either the first
+core of the producer coremask (if present) or the second core of the EAL
+coremask. This will be extended later to probe for all CPUs in the
+producer coremask or EAL coremask. The producer coremask can be passed
+along with the BDF of the DLB devices.
+
+    .. code-block:: console
+
+       -a xx:y.z,producer_coremask=<core_mask>
+
+Default LDB Port Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For optimal load balancing ports that map to one or more QIDs in common
+should not be in numerical sequence. The port->QID mapping is application
+dependent, but the driver interleaves port IDs as much as possible to
+reduce the likelihood of sequential ports mapping to the same QID(s).
+
+Hence, DLB uses an initial allocation of Port IDs to maximize the
+average distance between an ID and its immediate neighbors (i.e. the
+distance from 1 to 0 and to 2, the distance from 2 to 1 and to 3, etc.).
+The initial port allocation option can be passed through a devarg. If y
+(or Y) is given, the initial port allocation is used; otherwise it is
+not used.
+
+    .. code-block:: console
+
+       --allow ea:00.0,default_port_allocation=<y/Y>
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 759578378f..6a9db4b642 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
 	return 0;
 }
 
+static int
+set_default_ldb_port_allocation(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	bool *default_ldb_port_allocation = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	if ((*value == 'y') || (*value == 'Y'))
+		*default_ldb_port_allocation = true;
+	else
+		*default_ldb_port_allocation = false;
+
+	return 0;
+}
+
 static int
 set_qid_depth_thresh(const char *key __rte_unused,
 		     const char *value,
@@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
+					     DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
+						 set_default_ldb_port_allocation,
+						 &dlb2_args->default_ldb_port_allocation);
+			if (ret != 0) {
+				DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
+					     name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..9ef5bcb901 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,8 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
+#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +388,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +672,8 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
+	bool default_ldb_port_allocation;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..280a8e51b1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
 {
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
 	struct dlb2_list_entry *list;
 	unsigned int i;
 	int ret;
@@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
-		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		if (ldb_port_default == true)
+			port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		else
+			port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports first sort with in a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..71bd6148f1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -23,7 +23,20 @@
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
+
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
 
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..717aa4fc08 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto init_driver_state_fail;
 
-	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
+	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
 	if (ret)
 		goto resource_init_fail;
 
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 5aa51b1616..4c64d72e9c 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index 71ac141b66..3d15250e11 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v9 2/3] event/dlb2: add fence bypass option for producer ports
  2022-09-29  3:46     ` [PATCH v9 " Abdullah Sevincer
@ 2022-09-29  3:46       ` Abdullah Sevincer
  2022-09-29  3:46       ` [PATCH v9 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29  3:46 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

If a producer thread is only acting as a bridge between the NIC and DLB,
performance can be greatly improved by bypassing the fence instruction.
The DLB enqueue API issues a memory fence once per enqueue burst. If the
producer thread is just reading from the NIC and sending to DLB without
updating the read buffers or buffer headers, or the producer is not
writing to data structures with dependencies on the enqueue write order,
then fencing can be safely disabled.
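
The resulting decision can be summarized in a small truth function. This
is only a sketch: need_fence() is a hypothetical helper, and it assumes
that the second argument of dlb2_hw_do_enqueue() controls whether a fence
is issued, which the description above implies but the diff does not show.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the condition in the patch: a fence is
 * requested only for the first enqueue of a burst, and it is skipped for
 * producer ports when the bypass is compiled in.
 */
bool need_fence(bool first_in_burst, bool is_producer, bool bypass_enabled)
{
	if (!first_in_burst)
		return false;

	if (bypass_enabled && is_producer)
		return false;

	return true;
}
```

With the bypass disabled, the result depends only on first_in_burst,
matching the existing behaviour.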

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 6a9db4b642..4dd1d55ddc 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -35,6 +35,16 @@
 #include "dlb2_iface.h"
 #include "dlb2_inline_fns.h"
 
+/*
+ * Bypass memory fencing instructions when port is of Producer type.
+ * This should be enabled very carefully with understanding that producer
+ * is not doing any writes which need fencing. The movdir64 instruction used to
+ * enqueue events to DLB is a weakly-ordered instruction and movdir64 write
+ * to DLB can go ahead of relevant application writes like updates to buffers
+ * being sent with event
+ */
+#define DLB2_BYPASS_FENCE_ON_PP 0  /* 1 == Bypass fence, 0 == do not bypass */
+
 /*
  * Resources exposed to eventdev. Some values overridden at runtime using
  * values returned by the DLB kernel driver.
@@ -1985,21 +1995,15 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	sw_credit_quanta = dlb2->sw_credit_quanta;
 	hw_credit_quanta = dlb2->hw_credit_quanta;
 
+	ev_port->qm_port.is_producer = false;
 	ev_port->qm_port.is_directed = port_conf->event_port_cfg &
 		RTE_EVENT_PORT_CFG_SINGLE_LINK;
 
-	/*
-	 * Validate credit config before creating port
-	 */
-
-	/* Default for worker ports */
-	sw_credit_quanta = dlb2->sw_credit_quanta;
-	hw_credit_quanta = dlb2->hw_credit_quanta;
-
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
 		/* Producer type ports. Mostly enqueue */
 		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
 		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+		ev_port->qm_port.is_producer = true;
 	}
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
 		/* Consumer type ports. Mostly dequeue */
@@ -2009,6 +2013,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->credit_update_quanta = sw_credit_quanta;
 	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
 
+	/*
+	 * Validate credit config before creating port
+	 */
+
 	if (port_conf->enqueue_depth > sw_credit_quanta ||
 	    port_conf->enqueue_depth > hw_credit_quanta) {
 		DLB2_LOG_ERR("Invalid port config. Enqueue depth %d must be <= credit quanta %d and batch size %d\n",
@@ -3073,7 +3081,12 @@ __dlb2_event_enqueue_burst(void *event_port,
 		dlb2_event_build_hcws(qm_port, &events[i], j - pop_offs,
 				      sched_types, queue_ids);
 
+#if DLB2_BYPASS_FENCE_ON_PP == 1
+		/* Bypass fence instruction for producer ports */
+		dlb2_hw_do_enqueue(qm_port, i == 0 && !qm_port->is_producer, port_data);
+#else
 		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
+#endif
 
 		/* Don't include the token pop QE in the enqueue count */
 		i += j - pop_offs;
-- 
2.25.1



* [PATCH v9 3/3] event/dlb2: optimize credit allocations
  2022-09-29  3:46     ` [PATCH v9 " Abdullah Sevincer
  2022-09-29  3:46       ` [PATCH v9 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
@ 2022-09-29  3:46       ` Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29  3:46 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

This commit implements the changes required for using the suggested
port type hint feature. Each port uses a different credit quanta
based on the port type specified using port configuration flags.

Each port type has a separate quanta defined in dlb2_priv.h.
Producer and consumer ports need a larger quanta value to reduce the number
of credit calls they make. Workers can use small quanta as they mostly
work out of locally cached credits and don't request/return credits often.
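
The selection described above can be sketched as below. This is a
standalone illustration and not the driver code: the HINT_* bits and the
quanta values are placeholders chosen for the example (the real flags come
from rte_eventdev.h and the real defaults from dlb2_priv.h), and
pick_sw_credit_quanta() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder flag bits, standing in for RTE_EVENT_PORT_CFG_HINT_*. */
#define HINT_PRODUCER (1u << 0)
#define HINT_CONSUMER (1u << 1)

/* Placeholder quanta values; the real defaults are in dlb2_priv.h. */
#define PRODUCER_QUANTA 256u
#define CONSUMER_QUANTA 128u

/*
 * Return the software credit quanta for a port. Ports without a hint
 * (workers) keep the device default. The consumer check runs last, so a
 * port carrying both hints ends up with the consumer value, which mirrors
 * the order of the two if-blocks in the patch.
 */
static uint32_t
pick_sw_credit_quanta(uint32_t port_cfg, uint32_t dev_default)
{
	uint32_t quanta = dev_default;

	assert(dev_default != 0); /* precondition of this sketch only */

	if (port_cfg & HINT_PRODUCER)
		quanta = PRODUCER_QUANTA;
	if (port_cfg & HINT_CONSUMER)
		quanta = CONSUMER_QUANTA;

	return quanta;
}
```

A worker port would call pick_sw_credit_quanta(0, default) and get the
default back unchanged.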

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 4dd1d55ddc..164ebbcfe2 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -1965,8 +1965,8 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 {
 	struct dlb2_eventdev *dlb2;
 	struct dlb2_eventdev_port *ev_port;
-	int ret;
 	uint32_t hw_credit_quanta, sw_credit_quanta;
+	int ret;
 
 	if (dev == NULL || port_conf == NULL) {
 		DLB2_LOG_ERR("Null parameter\n");
@@ -2067,6 +2067,24 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->inflight_credits = 0;
 	ev_port->dlb2 = dlb2; /* reverse link */
 
+	/* Default for worker ports */
+	sw_credit_quanta = dlb2->sw_credit_quanta;
+	hw_credit_quanta = dlb2->hw_credit_quanta;
+
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
+		/* Producer type ports. Mostly enqueue */
+		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+	}
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
+		/* Consumer type ports. Mostly dequeue */
+		sw_credit_quanta = DLB2_SW_CREDIT_C_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_C_BATCH_SZ;
+	}
+	ev_port->credit_update_quanta = sw_credit_quanta;
+	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
+
+
 	/* Tear down pre-existing port->queue links */
 	if (dlb2->run_state == DLB2_RUN_STATE_STOPPED)
 		dlb2_port_link_teardown(dlb2, &dlb2->ev_ports[ev_port_id]);
-- 
2.25.1



* [PATCH v10 1/3] event/dlb2: add producer port probing optimization
  2022-08-20  0:59 ` [PATCH 1/3] event/dlb2: add producer port probing optimization Timothy McDaniel
                     ` (2 preceding siblings ...)
  2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-29  5:03   ` Abdullah Sevincer
  2022-09-29  5:03     ` [PATCH v10 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
  2022-09-29  5:03     ` [PATCH v10 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  2022-09-29 15:26   ` [PATCH v11 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-29 23:58   ` [PATCH v12 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  5 siblings, 2 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29  5:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, probing is done for each
producer port (PP) for a given CPU and the best performing ports are
allocated to producers. The cpu used for probing is either the first
core of producer coremask (if present) or the second core of EAL
coremask. This will be extended later to probe for all CPUs in the
producer coremask or EAL coremask.
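
A minimal sketch of the probe-core choice, assuming the coremasks have
already been expanded into per-core boolean arrays. pick_probe_core() and
its parameters are hypothetical names for this example. One simplification
compared with the patch: if a producer mask is given but matches no enabled
core, this sketch falls back to the EAL rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * eal[i]  - true if core i is enabled in the EAL coremask
 * prod[i] - true if core i is in the producer coremask (prod may be NULL)
 * Returns the core to probe from, or -1 if no core is enabled.
 */
static int
pick_probe_core(const bool *eal, const bool *prod, int n)
{
	int i, enabled = 0, seen = 0;

	assert(eal != NULL && n > 0);

	for (i = 0; i < n; i++)
		enabled += eal[i] ? 1 : 0;

	/* Prefer the first enabled core of the producer coremask. */
	if (prod != NULL) {
		for (i = 0; i < n; i++)
			if (eal[i] && prod[i])
				return i;
	}

	/*
	 * Otherwise use the second enabled EAL core, or the only enabled
	 * core when fewer than two are available.
	 */
	for (i = 0; i < n; i++) {
		if (!eal[i])
			continue;
		if (++seen == 2 || enabled < 2)
			return i;
	}

	return -1;
}
```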

Producer coremask can be passed along with the BDF of the DLB devices.
"-a xx:y.z,producer_coremask=<core_mask>"

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).

Hence, DLB uses an initial allocation of port IDs to maximize the
average distance between an ID and its immediate neighbors. The
initial port allocation option can be passed through the devarg
"default_port_allocation=y(or Y)".

When events are dropped by workers or consumers that use LDB ports,
completions are sent which are just ENQs and may impact the latency.
To address this, probing is done for LDB ports as well. Probing is
done on ports per 'cos'. When default cos is used, ports will be
allocated from best ports from the best 'cos', else from best ports of
the specific cos.
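
The two-level ordering for the default cos case can be sketched as follows:
sort the ports inside each cos by measured cycles, then order the cos groups
by their total cycles. This is a simplified standalone version of what
dlb2_get_pp_allocation() does in this patch. struct pp_sample, MAX_COS and
rank_ports() are names made up for the example, and the measurements are
passed in rather than collected with MOVDIR64B.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_COS 8 /* limit of this sketch only */

struct pp_sample {
	int pp;     /* port ID, or first sample index when used for a cos */
	int cycles; /* measured cycles (summed when used for a cos) */
};

static int
by_cycles(const void *a, const void *b)
{
	const struct pp_sample *x = a;
	const struct pp_sample *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/*
 * samples[] holds num_ports measurements grouped by cos. On return,
 * order[] lists port IDs with the cheapest cos first and the cheapest
 * port first within each cos.
 */
static void
rank_ports(struct pp_sample *samples, int num_ports, int num_cos, int *order)
{
	struct pp_sample cos_total[MAX_COS];
	int per_cos, c, i;

	assert(num_cos > 0 && num_cos <= MAX_COS);
	assert(num_ports % num_cos == 0);
	per_cos = num_ports / num_cos;

	for (c = 0; c < num_cos; c++) {
		cos_total[c].pp = c * per_cos;
		cos_total[c].cycles = 0;
		for (i = 0; i < per_cos; i++)
			cos_total[c].cycles += samples[c * per_cos + i].cycles;
		/* Best ports first within this cos. */
		qsort(&samples[c * per_cos], per_cos, sizeof(*samples),
		      by_cycles);
	}

	/* Best cos first, based on the total cycles of its ports. */
	qsort(cos_total, num_cos, sizeof(cos_total[0]), by_cycles);

	for (i = 0; i < num_ports; i++)
		order[i] = samples[cos_total[i / per_cos].pp + i % per_cos].pp;
}
```

When a specific cos is requested, the patch skips the second sort, so the
cos groups stay in their original order.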

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 doc/guides/eventdevs/dlb2.rst              |  36 +++
 drivers/event/dlb2/dlb2.c                  |  72 +++++-
 drivers/event/dlb2/dlb2_priv.h             |   7 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
 drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 10 files changed, 413 insertions(+), 28 deletions(-)

diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
index 5b21f13b68..f5bf5757c6 100644
--- a/doc/guides/eventdevs/dlb2.rst
+++ b/doc/guides/eventdevs/dlb2.rst
@@ -414,3 +414,39 @@ Note that the weight may not exceed the maximum CQ depth.
        --allow ea:00.0,cq_weight=all:<weight>
        --allow ea:00.0,cq_weight=qidA-qidB:<weight>
        --allow ea:00.0,cq_weight=qid:<weight>
+
+Producer Coremask
+~~~~~~~~~~~~~~~~~
+
+For best performance, applications running on certain cores should use
+the DLB device locally available on the same tile along with other
+resources. To allocate optimal resources, probing is done for each
+producer port (PP) for a given CPU and the best performing ports are
+allocated to producers. The cpu used for probing is either the first
+core of producer coremask (if present) or the second core of EAL
+coremask. This will be extended later to probe for all CPUs in the
+producer coremask or EAL coremask. Producer coremask can be passed
+along with the BDF of the DLB devices.
+
+    .. code-block:: console
+
+       -a xx:y.z,producer_coremask=<core_mask>
+
+Default LDB Port Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For optimal load balancing ports that map to one or more QIDs in common
+should not be in numerical sequence. The port->QID mapping is application
+dependent, but the driver interleaves port IDs as much as possible to
+reduce the likelihood of sequential ports mapping to the same QID(s).
+
+Hence, DLB uses an initial allocation of Port IDs to maximize the
+average distance between an ID and its immediate neighbors (i.e. the
+distance from 1 to 0 and to 2, the distance from 2 to 1 and to 3, etc.).
+Initial port allocation option can be passed through devarg. If y (or Y)
+initial port allocation will be used, otherwise initial port allocation
+won't be used.
+
+    .. code-block:: console
+
+       --allow ea:00.0,default_port_allocation=<y/Y>
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 759578378f..6a9db4b642 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
 	return 0;
 }
 
+static int
+set_default_ldb_port_allocation(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	bool *default_ldb_port_allocation = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	if ((*value == 'y') || (*value == 'Y'))
+		*default_ldb_port_allocation = true;
+	else
+		*default_ldb_port_allocation = false;
+
+	return 0;
+}
+
 static int
 set_qid_depth_thresh(const char *key __rte_unused,
 		     const char *value,
@@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
+					     DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
+						 set_default_ldb_port_allocation,
+						 &dlb2_args->default_ldb_port_allocation);
+			if (ret != 0) {
+				DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
+					     name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..9ef5bcb901 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,8 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
+#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +388,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +672,8 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
+	bool default_ldb_port_allocation;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..280a8e51b1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
 {
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
 	struct dlb2_list_entry *list;
 	unsigned int i;
 	int ret;
@@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
-		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		if (ldb_port_default == true)
+			port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		else
+			port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports first sort within a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..71bd6148f1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -23,7 +23,20 @@
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
+
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
 
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..717aa4fc08 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto init_driver_state_fail;
 
-	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
+	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
 	if (ret)
 		goto resource_init_fail;
 
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 5aa51b1616..4c64d72e9c 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index 71ac141b66..3d15250e11 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
-- 
2.25.1



* [PATCH v10 2/3] event/dlb2: add fence bypass option for producer ports
  2022-09-29  5:03   ` [PATCH v10 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-29  5:03     ` Abdullah Sevincer
  2022-09-29  5:03     ` [PATCH v10 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29  5:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

If the producer thread is only acting as a bridge between NIC and DLB, then
performance can be greatly improved by bypassing the fence instruction.
The DLB enqueue API calls a memory fence once per enqueue burst. If the producer
thread is just reading from NIC and sending to DLB without updating
the read buffers or buffer headers OR producer is not writing
to data structures with dependencies on the enqueue write order, then
fencing can be safely disabled.
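
The resulting fence decision can be summarized with the sketch below. It
mirrors the "i == 0 && !qm_port->is_producer" expression from this patch,
with the compile-time DLB2_BYPASS_FENCE_ON_PP switch modeled as a runtime
parameter so that both settings can be exercised. enqueue_needs_fence() is
a hypothetical helper name, not part of the driver.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a fence is issued before this enqueue.
 * batch_index  - index of the first event of this enqueue within the
 *                burst (0 for the first enqueue of the burst)
 * is_producer  - port was set up with the producer hint
 * bypass_on_pp - models DLB2_BYPASS_FENCE_ON_PP == 1
 */
static bool
enqueue_needs_fence(int batch_index, bool is_producer, bool bypass_on_pp)
{
	assert(batch_index >= 0);

	/* With the bypass enabled, producer ports never fence. */
	if (bypass_on_pp && is_producer)
		return false;

	/* Otherwise fence once per burst, on the first enqueue. */
	return batch_index == 0;
}
```

With the bypass left at its default of 0, producer ports behave like every
other port and fence once per burst.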

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 6a9db4b642..4dd1d55ddc 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -35,6 +35,16 @@
 #include "dlb2_iface.h"
 #include "dlb2_inline_fns.h"
 
+/*
+ * Bypass memory fencing instructions when port is of Producer type.
+ * This should be enabled very carefully with understanding that producer
+ * is not doing any writes which need fencing. The movdir64b instruction used to
+ * enqueue events to DLB is a weakly-ordered instruction and movdir64b write
+ * to DLB can go ahead of relevant application writes like updates to buffers
+ * being sent with event.
+ */
+#define DLB2_BYPASS_FENCE_ON_PP 0  /* 1 == Bypass fence, 0 == do not bypass */
+
 /*
  * Resources exposed to eventdev. Some values overridden at runtime using
  * values returned by the DLB kernel driver.
@@ -1985,21 +1995,15 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	sw_credit_quanta = dlb2->sw_credit_quanta;
 	hw_credit_quanta = dlb2->hw_credit_quanta;
 
+	ev_port->qm_port.is_producer = false;
 	ev_port->qm_port.is_directed = port_conf->event_port_cfg &
 		RTE_EVENT_PORT_CFG_SINGLE_LINK;
 
-	/*
-	 * Validate credit config before creating port
-	 */
-
-	/* Default for worker ports */
-	sw_credit_quanta = dlb2->sw_credit_quanta;
-	hw_credit_quanta = dlb2->hw_credit_quanta;
-
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
 		/* Producer type ports. Mostly enqueue */
 		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
 		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+		ev_port->qm_port.is_producer = true;
 	}
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
 		/* Consumer type ports. Mostly dequeue */
@@ -2009,6 +2013,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->credit_update_quanta = sw_credit_quanta;
 	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
 
+	/*
+	 * Validate credit config before creating port
+	 */
+
 	if (port_conf->enqueue_depth > sw_credit_quanta ||
 	    port_conf->enqueue_depth > hw_credit_quanta) {
 		DLB2_LOG_ERR("Invalid port config. Enqueue depth %d must be <= credit quanta %d and batch size %d\n",
@@ -3073,7 +3081,12 @@ __dlb2_event_enqueue_burst(void *event_port,
 		dlb2_event_build_hcws(qm_port, &events[i], j - pop_offs,
 				      sched_types, queue_ids);
 
+#if DLB2_BYPASS_FENCE_ON_PP == 1
+		/* Bypass fence instruction for producer ports */
+		dlb2_hw_do_enqueue(qm_port, i == 0 && !qm_port->is_producer, port_data);
+#else
 		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
+#endif
 
 		/* Don't include the token pop QE in the enqueue count */
 		i += j - pop_offs;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v10 3/3] event/dlb2: optimize credit allocations
  2022-09-29  5:03   ` [PATCH v10 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-29  5:03     ` [PATCH v10 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
@ 2022-09-29  5:03     ` Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29  5:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

This commit implements the changes required for using the suggested
port type hint feature. Each port uses a different credit quanta
based on the port type specified using port configuration flags.

Each port type has a separate quanta defined in dlb2_priv.h.
Producer and consumer ports need a larger quanta value to reduce the
number of credit calls they make. Workers can use a small quanta, as they
mostly work out of locally cached credits and don't request or return
credits often.
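
The selection described above can be sketched in isolation as follows.
This is a minimal sketch, not the driver code: the flag names, the struct
and the numeric quanta are hypothetical placeholders (the real values live
in dlb2_priv.h and are not shown in this thread). The order of the two
checks mirrors the patch, so a port carrying both hints ends up with the
consumer values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits standing in for the eventdev port hints. */
#define HINT_PRODUCER (1u << 0)
#define HINT_CONSUMER (1u << 1)

/* Software and hardware credit quanta for one port. */
struct quanta {
	uint32_t sw;
	uint32_t hw;
};

/* Start from the worker default, then override by port type hint. */
static struct quanta pick_quanta(uint32_t port_cfg, struct quanta worker_default)
{
	struct quanta q = worker_default;

	if (port_cfg & HINT_PRODUCER)
		q = (struct quanta){ .sw = 256, .hw = 256 }; /* placeholder values */
	if (port_cfg & HINT_CONSUMER)
		q = (struct quanta){ .sw = 256, .hw = 32 };  /* placeholder values */

	return q;
}

/* The enqueue depth must fit within both quanta, as validated in port setup. */
static bool enqueue_depth_ok(uint32_t enqueue_depth, struct quanta q)
{
	return enqueue_depth <= q.sw && enqueue_depth <= q.hw;
}
```

Keeping the validation after the selection matters: the depth check has to
see the quanta that the port will actually use.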

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 4dd1d55ddc..164ebbcfe2 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -1965,8 +1965,8 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 {
 	struct dlb2_eventdev *dlb2;
 	struct dlb2_eventdev_port *ev_port;
-	int ret;
 	uint32_t hw_credit_quanta, sw_credit_quanta;
+	int ret;
 
 	if (dev == NULL || port_conf == NULL) {
 		DLB2_LOG_ERR("Null parameter\n");
@@ -2067,6 +2067,24 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->inflight_credits = 0;
 	ev_port->dlb2 = dlb2; /* reverse link */
 
+	/* Default for worker ports */
+	sw_credit_quanta = dlb2->sw_credit_quanta;
+	hw_credit_quanta = dlb2->hw_credit_quanta;
+
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
+		/* Producer type ports. Mostly enqueue */
+		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+	}
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
+		/* Consumer type ports. Mostly dequeue */
+		sw_credit_quanta = DLB2_SW_CREDIT_C_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_C_BATCH_SZ;
+	}
+	ev_port->credit_update_quanta = sw_credit_quanta;
+	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
+
+
 	/* Tear down pre-existing port->queue links */
 	if (dlb2->run_state == DLB2_RUN_STATE_STOPPED)
 		dlb2_port_link_teardown(dlb2, &dlb2->ev_ports[ev_port_id]);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v11 1/3] event/dlb2: add producer port probing optimization
  2022-08-20  0:59 ` [PATCH 1/3] event/dlb2: add producer port probing optimization Timothy McDaniel
                     ` (3 preceding siblings ...)
  2022-09-29  5:03   ` [PATCH v10 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-29 15:26   ` Abdullah Sevincer
  2022-09-29 15:26     ` [PATCH v11 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
  2022-09-29 15:26     ` [PATCH v11 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  2022-09-29 23:58   ` [PATCH v12 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  5 siblings, 2 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29 15:26 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, probing is done for each
producer port (PP) for a given CPU and the best performing ports are
allocated to producers. The CPU used for probing is either the first
core of the producer coremask (if present) or the second core of the EAL
coremask. This will be extended later to probe for all CPUs in the
producer coremask or EAL coremask.
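
A simplified sketch of the two steps described above (choosing the probe
CPU, then ordering ports by measured cost) is given below. The struct and
function names are hypothetical, and the cycle counts would come from timed
enqueues in the real driver; here they are plain inputs.

```c
#include <stddef.h>
#include <stdlib.h>

/* One probe result: producer port id and the cycles its enqueues took. */
struct pp_sample {
	int pp;
	int cycles;
};

/* First producer core if a producer coremask was given, otherwise the
 * second EAL core, falling back to the first when only one is enabled. */
static int pick_probe_cpu(const int *prod_cores, size_t n_prod,
			  const int *eal_cores, size_t n_eal)
{
	if (n_prod > 0)
		return prod_cores[0];
	if (n_eal >= 2)
		return eal_cores[1];
	return n_eal == 1 ? eal_cores[0] : -1;
}

/* Ascending by cycles; written without subtraction to avoid overflow. */
static int cmp_cycles(const void *a, const void *b)
{
	const struct pp_sample *x = a;
	const struct pp_sample *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/* Sort the samples and record port ids from fastest to slowest. */
static void rank_ports(struct pp_sample *samples, size_t n, int *alloc)
{
	size_t i;

	qsort(samples, n, sizeof(*samples), cmp_cycles);
	for (i = 0; i < n; i++)
		alloc[i] = samples[i].pp;
}
```

The resulting order is what an allocator would walk when handing out
ports, so the cheapest ports are consumed first.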

The producer coremask can be passed along with the BDF of the DLB device:
"-a xx:y.z,producer_coremask=<core_mask>"

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).

Hence, DLB uses an initial allocation of port IDs to maximize the
average distance between an ID and its immediate neighbors. The
initial port allocation option can be selected through the devarg
"default_port_allocation=y" (or Y).
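
To make the interleaving concrete: the two rows of
init_ldb_port_allocation[] visible in the dlb2_resource.c context of this
patch (0, 7, 14, 5, ... and 16, 23, 30, 21, ...) happen to match a stride
of 7 modulo 16 within each group of 16 ports. The sketch below reproduces
those two rows under that assumption. It is an observation about the
visible values, not the driver's stated method, and the function name is
hypothetical.

```c
/* Group size and stride inferred from the visible table rows (assumption). */
#define GROUP_SIZE 16
#define STRIDE 7

/* Maps allocation slot i to a port id so that neighboring slots are
 * seven ids apart (modulo the group size) instead of adjacent. */
static int interleaved_port_id(int i)
{
	int group = i / GROUP_SIZE;
	int slot = i % GROUP_SIZE;

	return group * GROUP_SIZE + (STRIDE * slot) % GROUP_SIZE;
}
```

Because 7 and 16 are coprime, every id in a group is visited exactly once.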

When events are dropped by workers or consumers that use LDB ports,
completions are sent, which are just ENQs and may impact latency.
To address this, probing is done for LDB ports as well. Probing is
done on ports per class of service ('cos'). When the default cos is
used, ports are allocated from the best ports of the best cos; otherwise
they are allocated from the best ports of the specified cos.

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 doc/guides/eventdevs/dlb2.rst              |  36 +++
 drivers/event/dlb2/dlb2.c                  |  72 +++++-
 drivers/event/dlb2/dlb2_priv.h             |   7 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
 drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 10 files changed, 413 insertions(+), 28 deletions(-)

diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
index 5b21f13b68..f5bf5757c6 100644
--- a/doc/guides/eventdevs/dlb2.rst
+++ b/doc/guides/eventdevs/dlb2.rst
@@ -414,3 +414,39 @@ Note that the weight may not exceed the maximum CQ depth.
        --allow ea:00.0,cq_weight=all:<weight>
        --allow ea:00.0,cq_weight=qidA-qidB:<weight>
        --allow ea:00.0,cq_weight=qid:<weight>
+
+Producer Coremask
+~~~~~~~~~~~~~~~~~
+
+For best performance, applications running on certain cores should use
+the DLB device locally available on the same tile along with other
+resources. To allocate optimal resources, probing is done for each
+producer port (PP) for a given CPU and the best performing ports are
+allocated to producers. The cpu used for probing is either the first
+core of producer coremask (if present) or the second core of EAL
+coremask. This will be extended later to probe for all CPUs in the
+producer coremask or EAL coremask. Producer coremask can be passed
+along with the BDF of the DLB devices.
+
+    .. code-block:: console
+
+       -a xx:y.z,producer_coremask=<core_mask>
+
+Default LDB Port Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For optimal load balancing, ports that map to one or more QIDs in common
+should not be in numerical sequence. The port->QID mapping is application
+dependent, but the driver interleaves port IDs as much as possible to
+reduce the likelihood of sequential ports mapping to the same QID(s).
+
+Hence, DLB uses an initial allocation of Port IDs to maximize the
+average distance between an ID and its immediate neighbors (i.e. the
+distance from 1 to 0 and to 2, the distance from 2 to 1 and to 3, etc.).
+The initial port allocation option can be passed through a devarg. If y
+(or Y) is given, the initial port allocation will be used; otherwise it
+won't be used.
+
+    .. code-block:: console
+
+       --allow ea:00.0,default_port_allocation=<y/Y>
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 759578378f..6a9db4b642 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
 	return 0;
 }
 
+static int
+set_default_ldb_port_allocation(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	bool *default_ldb_port_allocation = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	if ((*value == 'y') || (*value == 'Y'))
+		*default_ldb_port_allocation = true;
+	else
+		*default_ldb_port_allocation = false;
+
+	return 0;
+}
+
 static int
 set_qid_depth_thresh(const char *key __rte_unused,
 		     const char *value,
@@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
+					     DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
+						 set_default_ldb_port_allocation,
+						 &dlb2_args->default_ldb_port_allocation);
+			if (ret != 0) {
+				DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
+					     name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..9ef5bcb901 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,8 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
+#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +388,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +672,8 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
+	bool default_ldb_port_allocation;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..280a8e51b1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
 {
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
 	struct dlb2_list_entry *list;
 	unsigned int i;
 	int ret;
@@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
-		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		if (ldb_port_default == true)
+			port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		else
+			port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports first sort with in a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..71bd6148f1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -23,7 +23,20 @@
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
+
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
 
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..717aa4fc08 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto init_driver_state_fail;
 
-	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
+	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
 	if (ret)
 		goto resource_init_fail;
 
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 5aa51b1616..4c64d72e9c 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index 71ac141b66..3d15250e11 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
-- 
2.25.1



* [PATCH v11 2/3] event/dlb2: add fence bypass option for producer ports
  2022-09-29 15:26   ` [PATCH v11 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-29 15:26     ` Abdullah Sevincer
  2022-09-29 15:26     ` [PATCH v11 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29 15:26 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

If the producer thread only acts as a bridge between the NIC and DLB,
performance can be greatly improved by bypassing the fence instruction.
The DLB enqueue API issues a memory fence once per enqueue burst. If the
producer thread just reads from the NIC and sends to DLB without updating
the read buffers or buffer headers, or if the producer does not write
to data structures with dependencies on the enqueue write order, then
fencing can be safely disabled.
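
A minimal sketch of the fencing decision described above. The helper
name needs_fence() and the BYPASS_FENCE_ON_PP macro are hypothetical
stand-ins for illustration; the driver itself passes the equivalent
expression directly to dlb2_hw_do_enqueue(). The switch is set to 1 here
only to exercise the bypass path, whereas the patch ships with it
disabled (0).

```c
#include <stdbool.h>

/* Illustrative compile-time switch: 1 == bypass fence, 0 == do not bypass. */
#ifndef BYPASS_FENCE_ON_PP
#define BYPASS_FENCE_ON_PP 1
#endif

/*
 * Hypothetical helper (not part of the driver): decide whether the enqueue
 * at offset 'i' within a burst should be preceded by a memory fence.
 */
static inline bool
needs_fence(int i, bool is_producer)
{
#if BYPASS_FENCE_ON_PP == 1
	/* Fence only at the start of a burst, and never for producer ports. */
	return i == 0 && !is_producer;
#else
	/* Default behaviour: fence once per burst for every port type. */
	(void)is_producer;
	return i == 0;
#endif
}
```

With the switch enabled, a producer port never fences, while every other
port still fences once at the start of each burst.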

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 6a9db4b642..4dd1d55ddc 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -35,6 +35,16 @@
 #include "dlb2_iface.h"
 #include "dlb2_inline_fns.h"
 
+/*
+ * Bypass memory fencing instructions when port is of Producer type.
+ * This should be enabled very carefully with understanding that producer
+ * is not doing any writes which need fencing. The movdir64 instruction used to
+ * enqueue events to DLB is a weakly-ordered instruction and movdir64 write
+ * to DLB can go ahead of relevant application writes like updates to buffers
+ * being sent with event
+ */
+#define DLB2_BYPASS_FENCE_ON_PP 0  /* 1 == Bypass fence, 0 == do not bypass */
+
 /*
  * Resources exposed to eventdev. Some values overridden at runtime using
  * values returned by the DLB kernel driver.
@@ -1985,21 +1995,15 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	sw_credit_quanta = dlb2->sw_credit_quanta;
 	hw_credit_quanta = dlb2->hw_credit_quanta;
 
+	ev_port->qm_port.is_producer = false;
 	ev_port->qm_port.is_directed = port_conf->event_port_cfg &
 		RTE_EVENT_PORT_CFG_SINGLE_LINK;
 
-	/*
-	 * Validate credit config before creating port
-	 */
-
-	/* Default for worker ports */
-	sw_credit_quanta = dlb2->sw_credit_quanta;
-	hw_credit_quanta = dlb2->hw_credit_quanta;
-
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
 		/* Producer type ports. Mostly enqueue */
 		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
 		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+		ev_port->qm_port.is_producer = true;
 	}
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
 		/* Consumer type ports. Mostly dequeue */
@@ -2009,6 +2013,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->credit_update_quanta = sw_credit_quanta;
 	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
 
+	/*
+	 * Validate credit config before creating port
+	 */
+
 	if (port_conf->enqueue_depth > sw_credit_quanta ||
 	    port_conf->enqueue_depth > hw_credit_quanta) {
 		DLB2_LOG_ERR("Invalid port config. Enqueue depth %d must be <= credit quanta %d and batch size %d\n",
@@ -3073,7 +3081,12 @@ __dlb2_event_enqueue_burst(void *event_port,
 		dlb2_event_build_hcws(qm_port, &events[i], j - pop_offs,
 				      sched_types, queue_ids);
 
+#if DLB2_BYPASS_FENCE_ON_PP == 1
+		/* Bypass fence instruction for producer ports */
+		dlb2_hw_do_enqueue(qm_port, i == 0 && !qm_port->is_producer, port_data);
+#else
 		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
+#endif
 
 		/* Don't include the token pop QE in the enqueue count */
 		i += j - pop_offs;
-- 
2.25.1



* [PATCH v11 3/3] event/dlb2: optimize credit allocations
  2022-09-29 15:26   ` [PATCH v11 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-29 15:26     ` [PATCH v11 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
@ 2022-09-29 15:26     ` Abdullah Sevincer
  1 sibling, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29 15:26 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

This commit implements the changes required for using the suggested
port type hint feature. Each port uses a different credit quanta
based on the port type specified using port configuration flags.

Each port type has a separate quanta defined in dlb2_priv.h.
Producer and consumer ports need a larger quanta value to reduce the number
of credit calls they make. Workers can use a small quanta as they mostly
work out of locally cached credits and don't request/return credits often.
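
A minimal sketch of the quanta selection described above. The flag bits,
struct, and numeric values below are hypothetical placeholders, not the
values defined in dlb2_priv.h; only the order of the checks (worker
default, then producer hint, then consumer hint) follows the patch.

```c
#include <stdint.h>

/* Hypothetical hint bits standing in for the eventdev port config flags. */
#define HINT_PRODUCER (1u << 0)
#define HINT_CONSUMER (1u << 1)

struct quanta {
	uint32_t sw; /* software credit quanta */
	uint32_t hw; /* hardware credit quanta (batch size) */
};

/* Pick credit quanta for a port based on its type hint. */
static struct quanta
select_quanta(uint32_t port_cfg, struct quanta worker_default)
{
	/* Assumed example values; the real ones live in dlb2_priv.h. */
	const struct quanta producer = { 256, 128 };
	const struct quanta consumer = { 128, 64 };
	struct quanta q = worker_default;

	if (port_cfg & HINT_PRODUCER)
		q = producer; /* mostly enqueue */
	if (port_cfg & HINT_CONSUMER)
		q = consumer; /* mostly dequeue; wins if both hints are set */

	return q;
}
```

A port configured without either hint keeps the worker defaults passed in
by the caller.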

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 4dd1d55ddc..164ebbcfe2 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -1965,8 +1965,8 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 {
 	struct dlb2_eventdev *dlb2;
 	struct dlb2_eventdev_port *ev_port;
-	int ret;
 	uint32_t hw_credit_quanta, sw_credit_quanta;
+	int ret;
 
 	if (dev == NULL || port_conf == NULL) {
 		DLB2_LOG_ERR("Null parameter\n");
@@ -2067,6 +2067,24 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->inflight_credits = 0;
 	ev_port->dlb2 = dlb2; /* reverse link */
 
+	/* Default for worker ports */
+	sw_credit_quanta = dlb2->sw_credit_quanta;
+	hw_credit_quanta = dlb2->hw_credit_quanta;
+
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
+		/* Producer type ports. Mostly enqueue */
+		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+	}
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
+		/* Consumer type ports. Mostly dequeue */
+		sw_credit_quanta = DLB2_SW_CREDIT_C_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_C_BATCH_SZ;
+	}
+	ev_port->credit_update_quanta = sw_credit_quanta;
+	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
+
+
 	/* Tear down pre-existing port->queue links */
 	if (dlb2->run_state == DLB2_RUN_STATE_STOPPED)
 		dlb2_port_link_teardown(dlb2, &dlb2->ev_ports[ev_port_id]);
-- 
2.25.1



* [PATCH v12 1/3] event/dlb2: add producer port probing optimization
  2022-08-20  0:59 ` [PATCH 1/3] event/dlb2: add producer port probing optimization Timothy McDaniel
                     ` (4 preceding siblings ...)
  2022-09-29 15:26   ` [PATCH v11 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-29 23:58   ` Abdullah Sevincer
  2022-09-29 23:58     ` [PATCH v12 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
                       ` (2 more replies)
  5 siblings, 3 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29 23:58 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, probing is done for each
producer port (PP) for a given CPU and the best performing ports are
allocated to producers. The CPU used for probing is either the first
core of the producer coremask (if present) or the second core of the EAL
coremask. This will be extended later to probe for all CPUs in the
producer coremask or EAL coremask.

The producer coremask can be passed along with the BDF of the DLB devices:
"-a xx:y.z,producer_coremask=<core_mask>"

Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
rte_event_port_setup() for producer ports for optimal port allocation.

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).

Hence, DLB uses an initial allocation of Port IDs to maximize the
average distance between an ID and its immediate neighbors. The
initial port allocation option can be passed through the devarg
"default_port_allocation=y(or Y)".
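
A minimal sketch of the interleaving idea. The first two rows of the
init_ldb_port_allocation table visible in this patch (0, 7, 14, 5, ...
and 16, 23, 30, 21, ...) are consistent with stepping by 7 modulo 16
within each group of 16 ports. The function name and constants below are
hypothetical, and it is an assumption that the rest of the driver's
table follows the same rule.

```c
#include <stdint.h>

#define GROUP_SIZE 16u /* ports per group (assumed from the visible rows) */
#define STRIDE 7u      /* step between consecutive entries within a group */

/* Map a sequential index to an interleaved port ID. */
static uint8_t
interleaved_port_id(unsigned int i)
{
	unsigned int base = (i / GROUP_SIZE) * GROUP_SIZE;
	unsigned int offset = ((i % GROUP_SIZE) * STRIDE) % GROUP_SIZE;

	return (uint8_t)(base + offset);
}
```

Because 7 and 16 are coprime, each group of 16 indices maps onto each
port ID in that group exactly once, so numerically adjacent port IDs end
up far apart in allocation order.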

When events are dropped by workers or consumers that use LDB ports,
completions are sent which are just ENQs and may impact the latency.
To address this, probing is done for LDB ports as well. Probing is
done on ports per 'cos'. When the default cos is used, ports are
allocated from the best ports of the best 'cos'; otherwise they are
allocated from the best ports of the specific cos.
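
A minimal sketch of the ranking step: each port gets a measured cycle
count and the ports are sorted so that the cheapest comes first. The
struct and function names are hypothetical. The comparator here uses a
three-way comparison rather than the subtraction used in the patch, which
is a substitution made in this sketch to avoid integer overflow on
extreme values.

```c
#include <stdlib.h>

struct pp_sample {
	int pp;     /* producer port ID */
	int cycles; /* cycles measured while probing this port */
};

/* Order samples by ascending cycle count. */
static int
cmp_cycles(const void *a, const void *b)
{
	const struct pp_sample *x = a;
	const struct pp_sample *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/* Sort the samples and write the resulting port order into alloc[]. */
static void
rank_ports(struct pp_sample *samples, size_t n, int *alloc)
{
	size_t k;

	qsort(samples, n, sizeof(*samples), cmp_cycles);
	for (k = 0; k < n; k++)
		alloc[k] = samples[k].pp;
}
```

For LDB ports the patch applies this kind of sort within each cos first
and then, when the default cos is requested, sorts the cos groups by
their total cycles.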

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 doc/guides/eventdevs/dlb2.rst              |  36 +++
 drivers/event/dlb2/dlb2.c                  |  72 +++++-
 drivers/event/dlb2/dlb2_priv.h             |   7 +
 drivers/event/dlb2/dlb2_user.h             |   1 +
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
 drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
 drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
 drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
 drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
 drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
 10 files changed, 413 insertions(+), 28 deletions(-)

diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
index 5b21f13b68..f5bf5757c6 100644
--- a/doc/guides/eventdevs/dlb2.rst
+++ b/doc/guides/eventdevs/dlb2.rst
@@ -414,3 +414,39 @@ Note that the weight may not exceed the maximum CQ depth.
        --allow ea:00.0,cq_weight=all:<weight>
        --allow ea:00.0,cq_weight=qidA-qidB:<weight>
        --allow ea:00.0,cq_weight=qid:<weight>
+
+Producer Coremask
+~~~~~~~~~~~~~~~~~
+
+For best performance, applications running on certain cores should use
+the DLB device locally available on the same tile along with other
+resources. To allocate optimal resources, probing is done for each
+producer port (PP) for a given CPU and the best performing ports are
+allocated to producers. The CPU used for probing is either the first
+core of the producer coremask (if present) or the second core of the EAL
+coremask. This will be extended later to probe for all CPUs in the
+producer coremask or EAL coremask. The producer coremask can be passed
+along with the BDF of the DLB devices.
+
+    .. code-block:: console
+
+       -a xx:y.z,producer_coremask=<core_mask>
+
+Default LDB Port Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For optimal load balancing, ports that map to one or more QIDs in common
+should not be in numerical sequence. The port->QID mapping is application
+dependent, but the driver interleaves port IDs as much as possible to
+reduce the likelihood of sequential ports mapping to the same QID(s).
+
+Hence, DLB uses an initial allocation of Port IDs to maximize the
+average distance between an ID and its immediate neighbors (i.e. the
+distance from 1 to 0 and to 2, the distance from 2 to 1 and to 3, etc.).
+The initial port allocation option can be passed through a devarg. If the
+value is y (or Y), the initial port allocation is used; otherwise it
+is not used.
+
+    .. code-block:: console
+
+       --allow ea:00.0,default_port_allocation=<y/Y>
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 759578378f..6a9db4b642 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
 	return 0;
 }
 
+static int
+set_producer_coremask(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	const char **mask_str = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	*mask_str = value;
+
+	return 0;
+}
+
 static int
 set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
 {
@@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
 	return 0;
 }
 
+static int
+set_default_ldb_port_allocation(const char *key __rte_unused,
+		      const char *value,
+		      void *opaque)
+{
+	bool *default_ldb_port_allocation = opaque;
+
+	if (value == NULL || opaque == NULL) {
+		DLB2_LOG_ERR("NULL pointer\n");
+		return -EINVAL;
+	}
+
+	if ((*value == 'y') || (*value == 'Y'))
+		*default_ldb_port_allocation = true;
+	else
+		*default_ldb_port_allocation = false;
+
+	return 0;
+}
+
 static int
 set_qid_depth_thresh(const char *key __rte_unused,
 		     const char *value,
@@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
 	} else
 		credit_high_watermark = enqueue_depth;
 
+	if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
+		cfg.is_producer = 1;
+
 	/* Per QM values */
 
 	ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
@@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	}
 	ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
 
+	/* Save off port config for reconfig */
+	ev_port->conf = *port_conf;
+
+
 	/*
 	 * Create port
 	 */
@@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 		}
 	}
 
-	/* Save off port config for reconfig */
-	ev_port->conf = *port_conf;
-
 	ev_port->id = ev_port_id;
 	ev_port->enq_configured = true;
 	ev_port->setup_done = true;
@@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
 					     DLB2_CQ_WEIGHT,
 					     DLB2_PORT_COS,
 					     DLB2_COS_BW,
+					     DLB2_PRODUCER_COREMASK,
+					     DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
 					     NULL };
 
 	if (params != NULL && params[0] != '\0') {
@@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
 			}
 
 
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_PRODUCER_COREMASK,
+						 set_producer_coremask,
+						 &dlb2_args->producer_coremask);
+			if (ret != 0) {
+				DLB2_LOG_ERR(
+					"%s: Error parsing producer coremask",
+					name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
+			ret = rte_kvargs_process(kvlist,
+						 DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
+						 set_default_ldb_port_allocation,
+						 &dlb2_args->default_ldb_port_allocation);
+			if (ret != 0) {
+				DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
+					     name);
+				rte_kvargs_free(kvlist);
+				return ret;
+			}
+
 			rte_kvargs_free(kvlist);
 		}
 	}
diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
index db431f7d8b..9ef5bcb901 100644
--- a/drivers/event/dlb2/dlb2_priv.h
+++ b/drivers/event/dlb2/dlb2_priv.h
@@ -51,6 +51,8 @@
 #define DLB2_CQ_WEIGHT "cq_weight"
 #define DLB2_PORT_COS "port_cos"
 #define DLB2_COS_BW "cos_bw"
+#define DLB2_PRODUCER_COREMASK "producer_coremask"
+#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
 
 /* Begin HW related defines and structs */
 
@@ -386,6 +388,7 @@ struct dlb2_port {
 	uint16_t hw_credit_quanta;
 	bool use_avx512;
 	uint32_t cq_weight;
+	bool is_producer; /* True if port is of type producer */
 };
 
 /* Per-process per-port mmio and memory pointers */
@@ -669,6 +672,8 @@ struct dlb2_devargs {
 	struct dlb2_cq_weight cq_weight;
 	struct dlb2_port_cos port_cos;
 	struct dlb2_cos_bw cos_bw;
+	const char *producer_coremask;
+	bool default_ldb_port_allocation;
 };
 
 /* End Eventdev related defines and structs */
@@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
 			   uint8_t *sched_type,
 			   uint8_t *queue_id);
 
+/* Extern functions */
+extern int rte_eal_parse_coremask(const char *coremask, int *cores);
 
 /* Extern globals */
 extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
index 901e2e0c66..28c6aaaf43 100644
--- a/drivers/event/dlb2/dlb2_user.h
+++ b/drivers/event/dlb2/dlb2_user.h
@@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
 	__u16 cq_depth;
 	__u16 cq_depth_threshold;
 	__s32 queue_id;
+	__u8 is_producer;
 };
 
 /*
diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
index 9511521e67..87996ef621 100644
--- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
+++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
@@ -249,6 +249,7 @@ struct dlb2_hw_domain {
 	struct dlb2_list_head avail_ldb_queues;
 	struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
 	struct dlb2_list_head avail_dir_pq_pairs;
+	struct dlb2_list_head rsvd_dir_pq_pairs;
 	u32 total_hist_list_entries;
 	u32 avail_hist_list_entries;
 	u32 hist_list_entry_base;
@@ -347,6 +348,10 @@ struct dlb2_hw {
 	struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
 	struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
 	u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
+	int prod_core_list[RTE_MAX_LCORE];
+	u8 num_prod_cores;
+	int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
+	int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
 
 	/* Virtualization */
 	int virt_mode;
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
index 0731416a43..280a8e51b1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.c
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
@@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
 	dlb2_list_init_head(&domain->used_dir_pq_pairs);
 	dlb2_list_init_head(&domain->avail_ldb_queues);
 	dlb2_list_init_head(&domain->avail_dir_pq_pairs);
+	dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
 
 	for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
 		dlb2_list_init_head(&domain->used_ldb_ports[i]);
@@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
 {
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
 	struct dlb2_list_entry *list;
 	unsigned int i;
 	int ret;
@@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 	 * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
 	 * 3, etc.).
 	 */
+
 	const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
 		0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
 		16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
@@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 		int cos_id = i >> DLB2_NUM_COS_DOMAINS;
 		struct dlb2_ldb_port *port;
 
-		port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		if (ldb_port_default == true)
+			port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
+		else
+			port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
 
 		dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
 			      &port->func_list);
@@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
 
 	hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
 	for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
-		list = &hw->rsrcs.dir_pq_pairs[i].func_list;
+		int index = hw->dir_pp_allocations[i];
+		list = &hw->rsrcs.dir_pq_pairs[index].func_list;
 
 		dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
 	}
@@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 				 u32 num_ports,
 				 struct dlb2_cmd_response *resp)
 {
+	int num_res = hw->num_prod_cores;
 	unsigned int i;
 
 	if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
@@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
 			return -EFAULT;
 		}
 
+		if (num_res) {
+			dlb2_list_add(&domain->rsvd_dir_pq_pairs,
+				      &port->domain_list);
+			num_res--;
+		} else {
+			dlb2_list_add(&domain->avail_dir_pq_pairs,
+			&port->domain_list);
+		}
+
 		dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
 
 		port->domain_id = domain->id;
 		port->owned = true;
-
-		dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
 	}
 
 	rsrcs->num_avail_dir_pq_pairs -= num_ports;
@@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
 	return 0;
 }
 
+static int
+dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
+	void __iomem *pp_addr;
+	cpu_set_t cpuset;
+	int i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	pp_addr = os_map_producer_port(hw, port, is_ldb);
+
+	/* Point hcw to a 64B-aligned location */
+	hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
+	      ~DLB2_HCW_ALIGN_MASK);
+
+	/*
+	 * Program the first HCW for a completion and token return and
+	 * the other HCWs as NOOPS
+	 */
+
+	memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
+	hcw->qe_comp = 1;
+	hcw->cq_token = 1;
+	hcw->lock_id = 1;
+
+	cycle_start = rte_get_tsc_cycles();
+	for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
+		dlb2_movdir64b(pp_addr, hcw);
+
+	cycle_end = rte_get_tsc_cycles();
+
+	os_unmap_producer_port(hw, pp_addr);
+	return (int)(cycle_end - cycle_start);
+}
+
+static void *
+dlb2_pp_profile_func(void *data)
+{
+	struct dlb2_pp_thread_data *thread_data = data;
+	int cycles;
+
+	cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
+	thread_data->cpu, thread_data->is_ldb);
+
+	thread_data->cycles = cycles;
+
+	return NULL;
+}
+
+static int dlb2_pp_cycle_comp(const void *a, const void *b)
+{
+	const struct dlb2_pp_thread_data *x = a;
+	const struct dlb2_pp_thread_data *y = b;
+
+	return x->cycles - y->cycles;
+}
+
+
+/* Probe producer ports from different CPU cores */
+static void
+dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
+{
+	struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
+	int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
+	bool is_ldb = (port_type == DLB2_LDB_PORT);
+	int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
+	DLB2_MAX_NUM_DIR_PORTS(ver);
+	struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
+	int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
+					 hw->dir_pp_allocations;
+	int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
+	struct dlb2_pp_thread_data cos_cycles[num_sort];
+	int num_ports_per_sort = num_ports / num_sort;
+	pthread_t pthread;
+
+	dlb2_dev->enqueue_four = dlb2_movdir64b;
+
+	DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
+		      is_ldb ? "LDB" : "DIR", cpu);
+
+	memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
+	for (i = 0; i < num_ports; i++) {
+		int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
+
+		dlb2_thread_data[i].is_ldb = is_ldb;
+		dlb2_thread_data[i].pp = i;
+		dlb2_thread_data[i].cycles = 0;
+		dlb2_thread_data[i].hw = hw;
+		dlb2_thread_data[i].cpu = cpu;
+
+		err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
+				     &dlb2_thread_data[i]);
+		if (err) {
+			DLB2_LOG_ERR(": thread creation failed! err=%d", err);
+			return;
+		}
+
+		err = pthread_join(pthread, NULL);
+		if (err) {
+			DLB2_LOG_ERR(": thread join failed! err=%d", err);
+			return;
+		}
+		cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
+
+		if ((i + 1) % num_ports_per_sort == 0) {
+			int index = cos * num_ports_per_sort;
+
+			cos_cycles[cos].pp = index;
+			/*
+			 * For LDB ports first sort within a cos. Later sort
+			 * the best cos based on total cycles for the cos.
+			 * For DIR ports, there is a single sort across all
+			 * ports.
+			 */
+			qsort(&dlb2_thread_data[index], num_ports_per_sort,
+			      sizeof(struct dlb2_pp_thread_data),
+			      dlb2_pp_cycle_comp);
+		}
+	}
+
+	/*
+	 * Re-arrange best ports by cos if default cos is used.
+	 */
+	if (is_ldb && cos_id == DLB2_COS_DEFAULT)
+		qsort(cos_cycles, num_sort,
+		      sizeof(struct dlb2_pp_thread_data),
+		      dlb2_pp_cycle_comp);
+
+	for (i = 0; i < num_ports; i++) {
+		int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
+		int index = i % num_ports_per_sort;
+
+		port_allocations[i] = dlb2_thread_data[start + index].pp;
+		DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
+			     dlb2_thread_data[start + index].cycles);
+	}
+}
+
+int
+dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
+{
+	const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
+	const char *mask = NULL;
+	int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
+	int i, cos_id = DLB2_COS_DEFAULT;
+
+	if (args) {
+		mask = (const char *)args->producer_coremask;
+		cos_id = args->cos_id;
+	}
+
+	if (mask && rte_eal_parse_coremask(mask, cores)) {
+		DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
+		return -1;
+	}
+
+	hw->num_prod_cores = 0;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		if (rte_lcore_is_enabled(i)) {
+			if (mask) {
+				/*
+				 * Populate the producer cores from parsed
+				 * coremask
+				 */
+				if (cores[i] != -1) {
+					hw->prod_core_list[cores[i]] = i;
+					hw->num_prod_cores++;
+				}
+			} else if ((++cnt == DLB2_EAL_PROBE_CORE ||
+			   rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
+				/*
+				 * If no producer coremask is provided, use the
+				 * second EAL core to probe
+				 */
+				cpu = i;
+				break;
+			}
+		}
+	}
+	/* Use the first core in producer coremask to probe */
+	if (hw->num_prod_cores)
+		cpu = hw->prod_core_list[0];
+
+	dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
+	dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
+
+	return 0;
+}
+
 static int
 dlb2_domain_attach_resources(struct dlb2_hw *hw,
 			     struct dlb2_function_resources *rsrcs,
@@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
 		return -EINVAL;
 	}
 
+	DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
+
 	/* Check cache-line alignment */
 	if ((cq_dma_base & 0x3F) != 0) {
 		resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
@@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
 		/*
 		 * If the port's queue is not configured, validate that a free
 		 * port-queue pair is available.
+		 * First try the 'res' list if the port is producer OR if
+		 * 'avail' list is empty else fall back to 'avail' list
 		 */
-		pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
-					typeof(*pq));
+		if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
+		    (args->is_producer ||
+		     dlb2_list_empty(&domain->avail_dir_pq_pairs)))
+			pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
+						typeof(*pq));
+		else
+			pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
+						typeof(*pq));
+
 		if (!pq) {
 			resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
 			return -EINVAL;
 		}
+		DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
+			      pq->id.phys_id, args->is_producer);
+
 	}
 
 	/* Check cache-line alignment */
@@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
 		return ret;
 
 	/*
-	 * Configuration succeeded, so move the resource from the 'avail' to
-	 * the 'used' list (if it's not already there).
+	 * Configuration succeeded, so move the resource from the 'avail' or
+	 * 'res' to the 'used' list (if it's not already there).
 	 */
 	if (args->queue_id == -1) {
-		dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
+		struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
+		struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
+
+		if ((args->is_producer && !dlb2_list_empty(res)) ||
+		     dlb2_list_empty(avail))
+			dlb2_list_del(res, &port->domain_list);
+		else
+			dlb2_list_del(avail, &port->domain_list);
 
 		dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
 	}
diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
index a7e6c90888..71bd6148f1 100644
--- a/drivers/event/dlb2/pf/base/dlb2_resource.h
+++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
@@ -23,7 +23,20 @@
  * Return:
  * Returns 0 upon success, <0 otherwise.
  */
-int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
+int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
+
+/**
+ * dlb2_resource_probe() - probe hw resources
+ * @hw: pointer to struct dlb2_hw.
+ *
+ * This function probes hw resources for best port allocation to producer
+ * cores.
+ *
+ * Return:
+ * Returns 0 upon success, <0 otherwise.
+ */
+int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
+
 
 /**
  * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
index b6ec85b479..717aa4fc08 100644
--- a/drivers/event/dlb2/pf/dlb2_main.c
+++ b/drivers/event/dlb2/pf/dlb2_main.c
@@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
 }
 
 struct dlb2_dev *
-dlb2_probe(struct rte_pci_device *pdev)
+dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
 {
 	struct dlb2_dev *dlb2_dev;
 	int ret = 0;
@@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto wait_for_device_ready_fail;
 
+	ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
+	if (ret)
+		goto resource_probe_fail;
+
 	ret = dlb2_pf_reset(dlb2_dev);
 	if (ret)
 		goto dlb2_reset_fail;
@@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 	if (ret)
 		goto init_driver_state_fail;
 
-	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
+	ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
 	if (ret)
 		goto resource_init_fail;
 
@@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
 init_driver_state_fail:
 dlb2_reset_fail:
 pci_mmap_bad_addr:
+resource_probe_fail:
 wait_for_device_ready_fail:
 	rte_free(dlb2_dev);
 dlb2_dev_malloc_fail:
diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
index 5aa51b1616..4c64d72e9c 100644
--- a/drivers/event/dlb2/pf/dlb2_main.h
+++ b/drivers/event/dlb2/pf/dlb2_main.h
@@ -15,7 +15,11 @@
 #include "base/dlb2_hw_types.h"
 #include "../dlb2_user.h"
 
-#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
+#define DLB2_EAL_PROBE_CORE 2
+#define DLB2_NUM_PROBE_ENQS 1000
+#define DLB2_HCW_MEM_SIZE 8
+#define DLB2_HCW_64B_OFF 4
+#define DLB2_HCW_ALIGN_MASK 0x3F
 
 struct dlb2_dev;
 
@@ -31,15 +35,30 @@ struct dlb2_dev {
 	/* struct list_head list; */
 	struct device *dlb2_device;
 	bool domain_reset_failed;
+	/* The enqueue_four function enqueues four HCWs (one cache-line worth)
+	 * to the HQM, using whichever mechanism is supported by the platform
+	 * on which this driver is running.
+	 */
+	void (*enqueue_four)(void *qe4, void *pp_addr);
 	/* The resource mutex serializes access to driver data structures and
 	 * hardware registers.
 	 */
 	rte_spinlock_t resource_mutex;
 	bool worker_launched;
 	u8 revision;
+	u8 version;
+};
+
+struct dlb2_pp_thread_data {
+	struct dlb2_hw *hw;
+	int pp;
+	int cpu;
+	bool is_ldb;
+	int cycles;
 };
 
-struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
+struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
+
 
 int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
 int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
index 71ac141b66..3d15250e11 100644
--- a/drivers/event/dlb2/pf/dlb2_pf.c
+++ b/drivers/event/dlb2/pf/dlb2_pf.c
@@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	struct dlb2_devargs dlb2_args = {
 		.socket_id = rte_socket_id(),
 		.max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
+		.producer_coremask = NULL,
 		.num_dir_credits_override = -1,
 		.qid_depth_thresholds = { {0} },
 		.poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
@@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 	};
 	struct dlb2_eventdev *dlb2;
 	int q;
+	const void *probe_args = NULL;
 
 	DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
 		     eventdev->data->dev_id, eventdev->data->socket_id);
@@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 		dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
 		dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
 
-		/* Probe the DLB2 PF layer */
-		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
-
-		if (dlb2->qm_instance.pf_dev == NULL) {
-			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
-				     rte_errno);
-			ret = -rte_errno;
-			goto dlb2_probe_failed;
-		}
-
 		/* Were we invoked with runtime parameters? */
 		if (pci_dev->device.devargs) {
 			ret = dlb2_parse_params(pci_dev->device.devargs->args,
@@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
 					     ret, rte_errno);
 				goto dlb2_probe_failed;
 			}
+			probe_args = &dlb2_args;
+		}
+
+		/* Probe the DLB2 PF layer */
+		dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
+
+		if (dlb2->qm_instance.pf_dev == NULL) {
+			DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
+				     rte_errno);
+			ret = -rte_errno;
+			goto dlb2_probe_failed;
 		}
 
 		ret = dlb2_primary_eventdev_probe(eventdev,
-- 
2.25.1


* [PATCH v12 2/3] event/dlb2: add fence bypass option for producer ports
  2022-09-29 23:58   ` [PATCH v12 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
@ 2022-09-29 23:58     ` Abdullah Sevincer
  2022-09-29 23:59     ` [PATCH v12 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
  2022-09-30  8:28     ` [PATCH v12 1/3] event/dlb2: add producer port probing optimization Jerin Jacob
  2 siblings, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29 23:58 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

If a producer thread only acts as a bridge between the NIC and DLB, then
performance can be greatly improved by bypassing the fence instruction.
The DLB enqueue API issues a memory fence once per enqueue burst. If the
producer thread just reads from the NIC and sends to DLB without updating
the read buffers or buffer headers, OR the producer does not write
to data structures with dependencies on the enqueue write order, then
fencing can be safely disabled.
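
The fence decision described above can be sketched in isolation as follows.
This is a minimal illustration, not the driver code: the names
SKETCH_BYPASS_FENCE_ON_PP, sketch_fence_decision() and sketch_needs_fence()
are hypothetical, and the only behaviour taken from the patch is that the
fence is requested for the first enqueue of a burst unless the bypass is
compiled in and the port is a producer port.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's compile-time switch (default off). */
#ifndef SKETCH_BYPASS_FENCE_ON_PP
#define SKETCH_BYPASS_FENCE_ON_PP 0
#endif

/* Hypothetical, reduced port state; the real port struct has many fields. */
struct sketch_port {
	bool is_producer;
};

/*
 * Pure decision helper: a fence is wanted only before the first enqueue
 * of a burst (offset 0), and never for producer ports when the bypass
 * is enabled.
 */
static bool
sketch_fence_decision(bool bypass_enabled, bool is_producer,
		      size_t burst_offset)
{
	if (bypass_enabled && is_producer)
		return false;

	return burst_offset == 0;
}

/* Wrapper that applies the compile-time setting to a given port. */
static bool
sketch_needs_fence(const struct sketch_port *port, size_t burst_offset)
{
	assert(port != NULL);

	return sketch_fence_decision(SKETCH_BYPASS_FENCE_ON_PP == 1,
				     port->is_producer, burst_offset);
}
```

With the switch left at 0, a producer port still gets one fence per burst,
which is the safe default for applications that modify buffers before
enqueueing them.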

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 6a9db4b642..4dd1d55ddc 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -35,6 +35,16 @@
 #include "dlb2_iface.h"
 #include "dlb2_inline_fns.h"
 
+/*
+ * Bypass memory fencing instructions when port is of Producer type.
+ * This should be enabled very carefully with understanding that producer
+ * is not doing any writes which need fencing. The movdir64 instruction used to
+ * enqueue events to DLB is a weakly-ordered instruction and movdir64 write
+ * to DLB can go ahead of relevant application writes like updates to buffers
+ * being sent with event
+ */
+#define DLB2_BYPASS_FENCE_ON_PP 0  /* 1 == Bypass fence, 0 == do not bypass */
+
 /*
  * Resources exposed to eventdev. Some values overridden at runtime using
  * values returned by the DLB kernel driver.
@@ -1985,21 +1995,15 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	sw_credit_quanta = dlb2->sw_credit_quanta;
 	hw_credit_quanta = dlb2->hw_credit_quanta;
 
+	ev_port->qm_port.is_producer = false;
 	ev_port->qm_port.is_directed = port_conf->event_port_cfg &
 		RTE_EVENT_PORT_CFG_SINGLE_LINK;
 
-	/*
-	 * Validate credit config before creating port
-	 */
-
-	/* Default for worker ports */
-	sw_credit_quanta = dlb2->sw_credit_quanta;
-	hw_credit_quanta = dlb2->hw_credit_quanta;
-
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
 		/* Producer type ports. Mostly enqueue */
 		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
 		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+		ev_port->qm_port.is_producer = true;
 	}
 	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
 		/* Consumer type ports. Mostly dequeue */
@@ -2009,6 +2013,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->credit_update_quanta = sw_credit_quanta;
 	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
 
+	/*
+	 * Validate credit config before creating port
+	 */
+
 	if (port_conf->enqueue_depth > sw_credit_quanta ||
 	    port_conf->enqueue_depth > hw_credit_quanta) {
 		DLB2_LOG_ERR("Invalid port config. Enqueue depth %d must be <= credit quanta %d and batch size %d\n",
@@ -3073,7 +3081,12 @@ __dlb2_event_enqueue_burst(void *event_port,
 		dlb2_event_build_hcws(qm_port, &events[i], j - pop_offs,
 				      sched_types, queue_ids);
 
+#if DLB2_BYPASS_FENCE_ON_PP == 1
+		/* Bypass fence instruction for producer ports */
+		dlb2_hw_do_enqueue(qm_port, i == 0 && !qm_port->is_producer, port_data);
+#else
 		dlb2_hw_do_enqueue(qm_port, i == 0, port_data);
+#endif
 
 		/* Don't include the token pop QE in the enqueue count */
 		i += j - pop_offs;
-- 
2.25.1


* [PATCH v12 3/3] event/dlb2: optimize credit allocations
  2022-09-29 23:58   ` [PATCH v12 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-29 23:58     ` [PATCH v12 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
@ 2022-09-29 23:59     ` Abdullah Sevincer
  2022-09-30  8:28     ` [PATCH v12 1/3] event/dlb2: add producer port probing optimization Jerin Jacob
  2 siblings, 0 replies; 37+ messages in thread
From: Abdullah Sevincer @ 2022-09-29 23:59 UTC (permalink / raw)
  To: dev; +Cc: jerinj, Abdullah Sevincer

This commit implements the changes required for using the suggested
port type hint feature. Each port uses a different credit quanta
based on the port type specified in the port configuration flags.

Each port type has separate quanta defined in dlb2_priv.h.
Producer and consumer ports need larger quanta values to reduce the number
of credit calls they make. Workers can use small quanta as they mostly
work out of locally cached credits and don't request/return credits often.
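
The selection described above can be sketched as a small standalone
function. This is an illustration under stated assumptions: the flag bits,
struct and numeric quanta values below are hypothetical placeholders (the
real defaults are defined in dlb2_priv.h), and only the selection order and
the enqueue depth check follow the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits standing in for the eventdev port hint flags. */
#define SKETCH_HINT_PRODUCER (1u << 0)
#define SKETCH_HINT_CONSUMER (1u << 1)

struct sketch_quanta {
	uint32_t sw; /* software credit quanta */
	uint32_t hw; /* hardware credit batch size */
};

/* Illustrative values only; not the values used by the driver. */
static const struct sketch_quanta sketch_producer_q = { 256, 256 };
static const struct sketch_quanta sketch_consumer_q = { 128, 128 };

/*
 * Start from the worker default and override it when a hint is set.
 * As in the diff, the consumer check runs last, so it wins if both
 * hints are present.
 */
static struct sketch_quanta
sketch_pick_quanta(uint32_t port_cfg, struct sketch_quanta worker_default)
{
	struct sketch_quanta q = worker_default;

	if (port_cfg & SKETCH_HINT_PRODUCER)
		q = sketch_producer_q;
	if (port_cfg & SKETCH_HINT_CONSUMER)
		q = sketch_consumer_q;

	return q;
}

/* The enqueue depth must not exceed either quanta value. */
static bool
sketch_depth_ok(uint32_t enqueue_depth, struct sketch_quanta q)
{
	return enqueue_depth <= q.sw && enqueue_depth <= q.hw;
}
```

A port set up without either hint keeps the device-wide worker default, so
existing applications see no change in credit behaviour.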

Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>
---
 drivers/event/dlb2/dlb2.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 4dd1d55ddc..164ebbcfe2 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -1965,8 +1965,8 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 {
 	struct dlb2_eventdev *dlb2;
 	struct dlb2_eventdev_port *ev_port;
-	int ret;
 	uint32_t hw_credit_quanta, sw_credit_quanta;
+	int ret;
 
 	if (dev == NULL || port_conf == NULL) {
 		DLB2_LOG_ERR("Null parameter\n");
@@ -2067,6 +2067,24 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
 	ev_port->inflight_credits = 0;
 	ev_port->dlb2 = dlb2; /* reverse link */
 
+	/* Default for worker ports */
+	sw_credit_quanta = dlb2->sw_credit_quanta;
+	hw_credit_quanta = dlb2->hw_credit_quanta;
+
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER) {
+		/* Producer type ports. Mostly enqueue */
+		sw_credit_quanta = DLB2_SW_CREDIT_P_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_P_BATCH_SZ;
+	}
+	if (port_conf->event_port_cfg & RTE_EVENT_PORT_CFG_HINT_CONSUMER) {
+		/* Consumer type ports. Mostly dequeue */
+		sw_credit_quanta = DLB2_SW_CREDIT_C_QUANTA_DEFAULT;
+		hw_credit_quanta = DLB2_SW_CREDIT_C_BATCH_SZ;
+	}
+	ev_port->credit_update_quanta = sw_credit_quanta;
+	ev_port->qm_port.hw_credit_quanta = hw_credit_quanta;
+
+
 	/* Tear down pre-existing port->queue links */
 	if (dlb2->run_state == DLB2_RUN_STATE_STOPPED)
 		dlb2_port_link_teardown(dlb2, &dlb2->ev_ports[ev_port_id]);
-- 
2.25.1


* Re: [PATCH v12 1/3] event/dlb2: add producer port probing optimization
  2022-09-29 23:58   ` [PATCH v12 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
  2022-09-29 23:58     ` [PATCH v12 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
  2022-09-29 23:59     ` [PATCH v12 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
@ 2022-09-30  8:28     ` Jerin Jacob
  2 siblings, 0 replies; 37+ messages in thread
From: Jerin Jacob @ 2022-09-30  8:28 UTC (permalink / raw)
  To: Abdullah Sevincer; +Cc: dev, jerinj

On Fri, Sep 30, 2022 at 5:29 AM Abdullah Sevincer
<abdullah.sevincer@intel.com> wrote:
>
> For best performance, applications running on certain cores should use
> the DLB device locally available on the same tile along with other
> resources. To allocate optimal resources, probing is done for each
> producer port (PP) for a given CPU and the best performing ports are
> allocated to producers. The CPU used for probing is either the first
> core of the producer coremask (if present) or the second core of the EAL
> coremask. This will be extended later to probe for all CPUs in the
> producer coremask or EAL coremask.
>
> Producer coremask can be passed along with the BDF of the DLB devices.
> "-a xx:y.z,producer_coremask=<core_mask>"
>
> Applications also need to pass RTE_EVENT_PORT_CFG_HINT_PRODUCER during
> rte_event_port_setup() for producer ports for optimal port allocation.
>
> For optimal load balancing, ports that map to one or more QIDs in common
> should not be in numerical sequence. The port->QID mapping is application
> dependent, but the driver interleaves port IDs as much as possible to
> reduce the likelihood of sequential ports mapping to the same QID(s).
>
> Hence, DLB uses an initial allocation of port IDs to maximize the
> average distance between an ID and its immediate neighbors. The
> initial port allocation option can be passed through the devarg
> "default_port_allocation=y" (or "Y").
>
> When events are dropped by workers or consumers that use LDB ports,
> completions are sent which are just ENQs and may impact the latency.
> To address this, probing is done for LDB ports as well. Probing is
> done on ports per 'cos'. When the default cos is used, ports are
> allocated from the best ports of the best 'cos'; otherwise they are
> allocated from the best ports of the specified cos.
>
> Signed-off-by: Abdullah Sevincer <abdullah.sevincer@intel.com>

Changed subject as " event/dlb2: optimize producer port probing"

Series applied to dpdk-next-net-eventdev/for-main. Thanks
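
For readers following the thread, the ranking step of the probing described
in the quoted commit message can be sketched on its own. This is a reduced
illustration, not the driver implementation: the struct and function names
are hypothetical, the cycle counts are supplied by the caller instead of
being measured, and only the single sort used for DIR ports is shown (the
quoted diff additionally sorts LDB ports within each cos and then ranks the
cos groups).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-port probe result; not the driver's struct. */
struct sketch_probe {
	int pp;          /* producer port ID */
	uint64_t cycles; /* measured cost of the probe enqueues */
};

/* Order by ascending cycle count; comparing avoids subtraction overflow. */
static int
sketch_cycle_cmp(const void *a, const void *b)
{
	const struct sketch_probe *x = a;
	const struct sketch_probe *y = b;

	return (x->cycles > y->cycles) - (x->cycles < y->cycles);
}

/*
 * Fill alloc[] with port IDs, cheapest port first. cycles[i] is the
 * caller-supplied measurement for port i; scratch must hold n entries.
 */
static void
sketch_rank_ports(const uint64_t *cycles, size_t n,
		  struct sketch_probe *scratch, int *alloc)
{
	size_t i;

	assert(cycles != NULL && scratch != NULL && alloc != NULL);

	for (i = 0; i < n; i++) {
		scratch[i].pp = (int)i;
		scratch[i].cycles = cycles[i];
	}

	qsort(scratch, n, sizeof(*scratch), sketch_cycle_cmp);

	for (i = 0; i < n; i++)
		alloc[i] = scratch[i].pp;
}
```

The resulting alloc[] plays the role of the per-type allocation arrays in
the patch: resources are later handed out in that order, so the ports that
were cheapest to reach from the probing core are used first.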

> ---
>  doc/guides/eventdevs/dlb2.rst              |  36 +++
>  drivers/event/dlb2/dlb2.c                  |  72 +++++-
>  drivers/event/dlb2/dlb2_priv.h             |   7 +
>  drivers/event/dlb2/dlb2_user.h             |   1 +
>  drivers/event/dlb2/pf/base/dlb2_hw_types.h |   5 +
>  drivers/event/dlb2/pf/base/dlb2_resource.c | 250 ++++++++++++++++++++-
>  drivers/event/dlb2/pf/base/dlb2_resource.h |  15 +-
>  drivers/event/dlb2/pf/dlb2_main.c          |   9 +-
>  drivers/event/dlb2/pf/dlb2_main.h          |  23 +-
>  drivers/event/dlb2/pf/dlb2_pf.c            |  23 +-
>  10 files changed, 413 insertions(+), 28 deletions(-)
>
> diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
> index 5b21f13b68..f5bf5757c6 100644
> --- a/doc/guides/eventdevs/dlb2.rst
> +++ b/doc/guides/eventdevs/dlb2.rst
> @@ -414,3 +414,39 @@ Note that the weight may not exceed the maximum CQ depth.
>         --allow ea:00.0,cq_weight=all:<weight>
>         --allow ea:00.0,cq_weight=qidA-qidB:<weight>
>         --allow ea:00.0,cq_weight=qid:<weight>
> +
> +Producer Coremask
> +~~~~~~~~~~~~~~~~~
> +
> +For best performance, applications running on certain cores should use
> +the DLB device locally available on the same tile along with other
> +resources. To allocate optimal resources, probing is done for each
> +producer port (PP) for a given CPU and the best performing ports are
> +allocated to producers. The cpu used for probing is either the first
> +core of producer coremask (if present) or the second core of EAL
> +coremask. This will be extended later to probe for all CPUs in the
> +producer coremask or EAL coremask. Producer coremask can be passed
> +along with the BDF of the DLB devices.
> +
> +    .. code-block:: console
> +
> +       -a xx:y.z,producer_coremask=<core_mask>
> +
> +Default LDB Port Allocation
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +For optimal load balancing, ports that map to one or more QIDs in common
> +should not be in numerical sequence. The port->QID mapping is application
> +dependent, but the driver interleaves port IDs as much as possible to
> +reduce the likelihood of sequential ports mapping to the same QID(s).
> +
> +Hence, DLB uses an initial allocation of port IDs to maximize the
> +average distance between an ID and its immediate neighbors (i.e. the
> +distance from 1 to 0 and to 2, the distance from 2 to 1 and to 3, etc.).
> +The initial port allocation option can be passed through a devarg. If y
> +(or Y) is given, the initial port allocation is used; otherwise it is
> +not used.
> +
> +    .. code-block:: console
> +
> +       --allow ea:00.0,default_port_allocation=<y/Y>
> diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
> index 759578378f..6a9db4b642 100644
> --- a/drivers/event/dlb2/dlb2.c
> +++ b/drivers/event/dlb2/dlb2.c
> @@ -293,6 +293,23 @@ dlb2_string_to_int(int *result, const char *str)
>         return 0;
>  }
>
> +static int
> +set_producer_coremask(const char *key __rte_unused,
> +                     const char *value,
> +                     void *opaque)
> +{
> +       const char **mask_str = opaque;
> +
> +       if (value == NULL || opaque == NULL) {
> +               DLB2_LOG_ERR("NULL pointer\n");
> +               return -EINVAL;
> +       }
> +
> +       *mask_str = value;
> +
> +       return 0;
> +}
> +
>  static int
>  set_numa_node(const char *key __rte_unused, const char *value, void *opaque)
>  {
> @@ -617,6 +634,26 @@ set_vector_opts_enab(const char *key __rte_unused,
>         return 0;
>  }
>
> +static int
> +set_default_ldb_port_allocation(const char *key __rte_unused,
> +                     const char *value,
> +                     void *opaque)
> +{
> +       bool *default_ldb_port_allocation = opaque;
> +
> +       if (value == NULL || opaque == NULL) {
> +               DLB2_LOG_ERR("NULL pointer\n");
> +               return -EINVAL;
> +       }
> +
> +       if ((*value == 'y') || (*value == 'Y'))
> +               *default_ldb_port_allocation = true;
> +       else
> +               *default_ldb_port_allocation = false;
> +
> +       return 0;
> +}
> +
>  static int
>  set_qid_depth_thresh(const char *key __rte_unused,
>                      const char *value,
> @@ -1785,6 +1822,9 @@ dlb2_hw_create_dir_port(struct dlb2_eventdev *dlb2,
>         } else
>                 credit_high_watermark = enqueue_depth;
>
> +       if (ev_port->conf.event_port_cfg & RTE_EVENT_PORT_CFG_HINT_PRODUCER)
> +               cfg.is_producer = 1;
> +
>         /* Per QM values */
>
>         ret = dlb2_iface_dir_port_create(handle, &cfg,  dlb2->poll_mode);
> @@ -1979,6 +2019,10 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
>         }
>         ev_port->enq_retries = port_conf->enqueue_depth / sw_credit_quanta;
>
> +       /* Save off port config for reconfig */
> +       ev_port->conf = *port_conf;
> +
> +
>         /*
>          * Create port
>          */
> @@ -2005,9 +2049,6 @@ dlb2_eventdev_port_setup(struct rte_eventdev *dev,
>                 }
>         }
>
> -       /* Save off port config for reconfig */
> -       ev_port->conf = *port_conf;
> -
>         ev_port->id = ev_port_id;
>         ev_port->enq_configured = true;
>         ev_port->setup_done = true;
> @@ -4700,6 +4741,8 @@ dlb2_parse_params(const char *params,
>                                              DLB2_CQ_WEIGHT,
>                                              DLB2_PORT_COS,
>                                              DLB2_COS_BW,
> +                                            DLB2_PRODUCER_COREMASK,
> +                                            DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
>                                              NULL };
>
>         if (params != NULL && params[0] != '\0') {
> @@ -4881,6 +4924,29 @@ dlb2_parse_params(const char *params,
>                         }
>
>
> +                       ret = rte_kvargs_process(kvlist,
> +                                                DLB2_PRODUCER_COREMASK,
> +                                                set_producer_coremask,
> +                                                &dlb2_args->producer_coremask);
> +                       if (ret != 0) {
> +                               DLB2_LOG_ERR(
> +                                       "%s: Error parsing producer coremask",
> +                                       name);
> +                               rte_kvargs_free(kvlist);
> +                               return ret;
> +                       }
> +
> +                       ret = rte_kvargs_process(kvlist,
> +                                                DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG,
> +                                                set_default_ldb_port_allocation,
> +                                                &dlb2_args->default_ldb_port_allocation);
> +                       if (ret != 0) {
> +                               DLB2_LOG_ERR("%s: Error parsing ldb default port allocation arg",
> +                                            name);
> +                               rte_kvargs_free(kvlist);
> +                               return ret;
> +                       }
> +
>                         rte_kvargs_free(kvlist);
>                 }
>         }
> diff --git a/drivers/event/dlb2/dlb2_priv.h b/drivers/event/dlb2/dlb2_priv.h
> index db431f7d8b..9ef5bcb901 100644
> --- a/drivers/event/dlb2/dlb2_priv.h
> +++ b/drivers/event/dlb2/dlb2_priv.h
> @@ -51,6 +51,8 @@
>  #define DLB2_CQ_WEIGHT "cq_weight"
>  #define DLB2_PORT_COS "port_cos"
>  #define DLB2_COS_BW "cos_bw"
> +#define DLB2_PRODUCER_COREMASK "producer_coremask"
> +#define DLB2_DEFAULT_LDB_PORT_ALLOCATION_ARG "default_port_allocation"
>
>  /* Begin HW related defines and structs */
>
> @@ -386,6 +388,7 @@ struct dlb2_port {
>         uint16_t hw_credit_quanta;
>         bool use_avx512;
>         uint32_t cq_weight;
> +       bool is_producer; /* True if port is of type producer */
>  };
>
>  /* Per-process per-port mmio and memory pointers */
> @@ -669,6 +672,8 @@ struct dlb2_devargs {
>         struct dlb2_cq_weight cq_weight;
>         struct dlb2_port_cos port_cos;
>         struct dlb2_cos_bw cos_bw;
> +       const char *producer_coremask;
> +       bool default_ldb_port_allocation;
>  };
>
>  /* End Eventdev related defines and structs */
> @@ -722,6 +727,8 @@ void dlb2_event_build_hcws(struct dlb2_port *qm_port,
>                            uint8_t *sched_type,
>                            uint8_t *queue_id);
>
> +/* Extern functions */
> +extern int rte_eal_parse_coremask(const char *coremask, int *cores);
>
>  /* Extern globals */
>  extern struct process_local_port_data dlb2_port[][DLB2_NUM_PORT_TYPES];
> diff --git a/drivers/event/dlb2/dlb2_user.h b/drivers/event/dlb2/dlb2_user.h
> index 901e2e0c66..28c6aaaf43 100644
> --- a/drivers/event/dlb2/dlb2_user.h
> +++ b/drivers/event/dlb2/dlb2_user.h
> @@ -498,6 +498,7 @@ struct dlb2_create_dir_port_args {
>         __u16 cq_depth;
>         __u16 cq_depth_threshold;
>         __s32 queue_id;
> +       __u8 is_producer;
>  };
>
>  /*
> diff --git a/drivers/event/dlb2/pf/base/dlb2_hw_types.h b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
> index 9511521e67..87996ef621 100644
> --- a/drivers/event/dlb2/pf/base/dlb2_hw_types.h
> +++ b/drivers/event/dlb2/pf/base/dlb2_hw_types.h
> @@ -249,6 +249,7 @@ struct dlb2_hw_domain {
>         struct dlb2_list_head avail_ldb_queues;
>         struct dlb2_list_head avail_ldb_ports[DLB2_NUM_COS_DOMAINS];
>         struct dlb2_list_head avail_dir_pq_pairs;
> +       struct dlb2_list_head rsvd_dir_pq_pairs;
>         u32 total_hist_list_entries;
>         u32 avail_hist_list_entries;
>         u32 hist_list_entry_base;
> @@ -347,6 +348,10 @@ struct dlb2_hw {
>         struct dlb2_function_resources vdev[DLB2_MAX_NUM_VDEVS];
>         struct dlb2_hw_domain domains[DLB2_MAX_NUM_DOMAINS];
>         u8 cos_reservation[DLB2_NUM_COS_DOMAINS];
> +       int prod_core_list[RTE_MAX_LCORE];
> +       u8 num_prod_cores;
> +       int dir_pp_allocations[DLB2_MAX_NUM_DIR_PORTS_V2_5];
> +       int ldb_pp_allocations[DLB2_MAX_NUM_LDB_PORTS];
>
>         /* Virtualization */
>         int virt_mode;
> diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.c b/drivers/event/dlb2/pf/base/dlb2_resource.c
> index 0731416a43..280a8e51b1 100644
> --- a/drivers/event/dlb2/pf/base/dlb2_resource.c
> +++ b/drivers/event/dlb2/pf/base/dlb2_resource.c
> @@ -51,6 +51,7 @@ static void dlb2_init_domain_rsrc_lists(struct dlb2_hw_domain *domain)
>         dlb2_list_init_head(&domain->used_dir_pq_pairs);
>         dlb2_list_init_head(&domain->avail_ldb_queues);
>         dlb2_list_init_head(&domain->avail_dir_pq_pairs);
> +       dlb2_list_init_head(&domain->rsvd_dir_pq_pairs);
>
>         for (i = 0; i < DLB2_NUM_COS_DOMAINS; i++)
>                 dlb2_list_init_head(&domain->used_ldb_ports[i]);
> @@ -106,8 +107,10 @@ void dlb2_resource_free(struct dlb2_hw *hw)
>   * Return:
>   * Returns 0 upon success, <0 otherwise.
>   */
> -int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
> +int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args)
>  {
> +       const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
> +       bool ldb_port_default = args ? args->default_ldb_port_allocation : false;
>         struct dlb2_list_entry *list;
>         unsigned int i;
>         int ret;
> @@ -122,6 +125,7 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
>          * the distance from 1 to 0 and to 2, the distance from 2 to 1 and to
>          * 3, etc.).
>          */
> +
>         const u8 init_ldb_port_allocation[DLB2_MAX_NUM_LDB_PORTS] = {
>                 0,  7,  14,  5, 12,  3, 10,  1,  8, 15,  6, 13,  4, 11,  2,  9,
>                 16, 23, 30, 21, 28, 19, 26, 17, 24, 31, 22, 29, 20, 27, 18, 25,
> @@ -164,7 +168,10 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
>                 int cos_id = i >> DLB2_NUM_COS_DOMAINS;
>                 struct dlb2_ldb_port *port;
>
> -               port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
> +               if (ldb_port_default == true)
> +                       port = &hw->rsrcs.ldb_ports[init_ldb_port_allocation[i]];
> +               else
> +                       port = &hw->rsrcs.ldb_ports[hw->ldb_pp_allocations[i]];
>
>                 dlb2_list_add(&hw->pf.avail_ldb_ports[cos_id],
>                               &port->func_list);
> @@ -172,7 +179,8 @@ int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver)
>
>         hw->pf.num_avail_dir_pq_pairs = DLB2_MAX_NUM_DIR_PORTS(hw->ver);
>         for (i = 0; i < hw->pf.num_avail_dir_pq_pairs; i++) {
> -               list = &hw->rsrcs.dir_pq_pairs[i].func_list;
> +               int index = hw->dir_pp_allocations[i];
> +               list = &hw->rsrcs.dir_pq_pairs[index].func_list;
>
>                 dlb2_list_add(&hw->pf.avail_dir_pq_pairs, list);
>         }
> @@ -592,6 +600,7 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
>                                  u32 num_ports,
>                                  struct dlb2_cmd_response *resp)
>  {
> +       int num_res = hw->num_prod_cores;
>         unsigned int i;
>
>         if (rsrcs->num_avail_dir_pq_pairs < num_ports) {
> @@ -611,12 +620,19 @@ static int dlb2_attach_dir_ports(struct dlb2_hw *hw,
>                         return -EFAULT;
>                 }
>
> +               if (num_res) {
> +                       dlb2_list_add(&domain->rsvd_dir_pq_pairs,
> +                                     &port->domain_list);
> +                       num_res--;
> +               } else {
> +                       dlb2_list_add(&domain->avail_dir_pq_pairs,
> +                       &port->domain_list);
> +               }
> +
>                 dlb2_list_del(&rsrcs->avail_dir_pq_pairs, &port->func_list);
>
>                 port->domain_id = domain->id;
>                 port->owned = true;
> -
> -               dlb2_list_add(&domain->avail_dir_pq_pairs, &port->domain_list);
>         }
>
>         rsrcs->num_avail_dir_pq_pairs -= num_ports;
> @@ -739,6 +755,199 @@ static int dlb2_attach_ldb_queues(struct dlb2_hw *hw,
>         return 0;
>  }
>
> +static int
> +dlb2_pp_profile(struct dlb2_hw *hw, int port, int cpu, bool is_ldb)
> +{
> +       u64 cycle_start = 0ULL, cycle_end = 0ULL;
> +       struct dlb2_hcw hcw_mem[DLB2_HCW_MEM_SIZE], *hcw;
> +       void __iomem *pp_addr;
> +       cpu_set_t cpuset;
> +       int i;
> +
> +       CPU_ZERO(&cpuset);
> +       CPU_SET(cpu, &cpuset);
> +       sched_setaffinity(0, sizeof(cpuset), &cpuset);
> +
> +       pp_addr = os_map_producer_port(hw, port, is_ldb);
> +
> +       /* Point hcw to a 64B-aligned location */
> +       hcw = (struct dlb2_hcw *)((uintptr_t)&hcw_mem[DLB2_HCW_64B_OFF] &
> +             ~DLB2_HCW_ALIGN_MASK);
> +
> +       /*
> +        * Program the first HCW for a completion and token return and
> +        * the other HCWs as NOOPS
> +        */
> +
> +       memset(hcw, 0, (DLB2_HCW_MEM_SIZE - DLB2_HCW_64B_OFF) * sizeof(*hcw));
> +       hcw->qe_comp = 1;
> +       hcw->cq_token = 1;
> +       hcw->lock_id = 1;
> +
> +       cycle_start = rte_get_tsc_cycles();
> +       for (i = 0; i < DLB2_NUM_PROBE_ENQS; i++)
> +               dlb2_movdir64b(pp_addr, hcw);
> +
> +       cycle_end = rte_get_tsc_cycles();
> +
> +       os_unmap_producer_port(hw, pp_addr);
> +       return (int)(cycle_end - cycle_start);
> +}
> +
> +static void *
> +dlb2_pp_profile_func(void *data)
> +{
> +       struct dlb2_pp_thread_data *thread_data = data;
> +       int cycles;
> +
> +       cycles = dlb2_pp_profile(thread_data->hw, thread_data->pp,
> +       thread_data->cpu, thread_data->is_ldb);
> +
> +       thread_data->cycles = cycles;
> +
> +       return NULL;
> +}
> +
> +static int dlb2_pp_cycle_comp(const void *a, const void *b)
> +{
> +       const struct dlb2_pp_thread_data *x = a;
> +       const struct dlb2_pp_thread_data *y = b;
> +
> +       return x->cycles - y->cycles;
> +}
> +
> +
> +/* Probe producer ports from different CPU cores */
> +static void
> +dlb2_get_pp_allocation(struct dlb2_hw *hw, int cpu, int port_type, int cos_id)
> +{
> +       struct dlb2_dev *dlb2_dev = container_of(hw, struct dlb2_dev, hw);
> +       int i, err, ver = DLB2_HW_DEVICE_FROM_PCI_ID(dlb2_dev->pdev);
> +       bool is_ldb = (port_type == DLB2_LDB_PORT);
> +       int num_ports = is_ldb ? DLB2_MAX_NUM_LDB_PORTS :
> +       DLB2_MAX_NUM_DIR_PORTS(ver);
> +       struct dlb2_pp_thread_data dlb2_thread_data[num_ports];
> +       int *port_allocations = is_ldb ? hw->ldb_pp_allocations :
> +                                        hw->dir_pp_allocations;
> +       int num_sort = is_ldb ? DLB2_NUM_COS_DOMAINS : 1;
> +       struct dlb2_pp_thread_data cos_cycles[num_sort];
> +       int num_ports_per_sort = num_ports / num_sort;
> +       pthread_t pthread;
> +
> +       dlb2_dev->enqueue_four = dlb2_movdir64b;
> +
> +       DLB2_LOG_INFO(" for %s: cpu core used in pp profiling: %d\n",
> +                     is_ldb ? "LDB" : "DIR", cpu);
> +
> +       memset(cos_cycles, 0, num_sort * sizeof(struct dlb2_pp_thread_data));
> +       for (i = 0; i < num_ports; i++) {
> +               int cos = is_ldb ? (i >> DLB2_NUM_COS_DOMAINS) : 0;
> +
> +               dlb2_thread_data[i].is_ldb = is_ldb;
> +               dlb2_thread_data[i].pp = i;
> +               dlb2_thread_data[i].cycles = 0;
> +               dlb2_thread_data[i].hw = hw;
> +               dlb2_thread_data[i].cpu = cpu;
> +
> +               err = pthread_create(&pthread, NULL, &dlb2_pp_profile_func,
> +                                    &dlb2_thread_data[i]);
> +               if (err) {
> +                       DLB2_LOG_ERR(": thread creation failed! err=%d", err);
> +                       return;
> +               }
> +
> +               err = pthread_join(pthread, NULL);
> +               if (err) {
> +                       DLB2_LOG_ERR(": thread join failed! err=%d", err);
> +                       return;
> +               }
> +               cos_cycles[cos].cycles += dlb2_thread_data[i].cycles;
> +
> +               if ((i + 1) % num_ports_per_sort == 0) {
> +                       int index = cos * num_ports_per_sort;
> +
> +                       cos_cycles[cos].pp = index;
> +                       /*
> +                        * For LDB ports first sort with in a cos. Later sort
> +                        * the best cos based on total cycles for the cos.
> +                        * For DIR ports, there is a single sort across all
> +                        * ports.
> +                        */
> +                       qsort(&dlb2_thread_data[index], num_ports_per_sort,
> +                             sizeof(struct dlb2_pp_thread_data),
> +                             dlb2_pp_cycle_comp);
> +               }
> +       }
> +
> +       /*
> +        * Re-arrange best ports by cos if default cos is used.
> +        */
> +       if (is_ldb && cos_id == DLB2_COS_DEFAULT)
> +               qsort(cos_cycles, num_sort,
> +                     sizeof(struct dlb2_pp_thread_data),
> +                     dlb2_pp_cycle_comp);
> +
> +       for (i = 0; i < num_ports; i++) {
> +               int start = is_ldb ? cos_cycles[i / num_ports_per_sort].pp : 0;
> +               int index = i % num_ports_per_sort;
> +
> +               port_allocations[i] = dlb2_thread_data[start + index].pp;
> +               DLB2_LOG_INFO(": pp %d cycles %d", port_allocations[i],
> +                            dlb2_thread_data[start + index].cycles);
> +       }
> +}
> +
> +int
> +dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args)
> +{
> +       const struct dlb2_devargs *args = (const struct dlb2_devargs *)probe_args;
> +       const char *mask = NULL;
> +       int cpu = 0, cnt = 0, cores[RTE_MAX_LCORE];
> +       int i, cos_id = DLB2_COS_DEFAULT;
> +
> +       if (args) {
> +               mask = (const char *)args->producer_coremask;
> +               cos_id = args->cos_id;
> +       }
> +
> +       if (mask && rte_eal_parse_coremask(mask, cores)) {
> +               DLB2_LOG_ERR(": Invalid producer coremask=%s", mask);
> +               return -1;
> +       }
> +
> +       hw->num_prod_cores = 0;
> +       for (i = 0; i < RTE_MAX_LCORE; i++) {
> +               if (rte_lcore_is_enabled(i)) {
> +                       if (mask) {
> +                               /*
> +                                * Populate the producer cores from parsed
> +                                * coremask
> +                                */
> +                               if (cores[i] != -1) {
> +                                       hw->prod_core_list[cores[i]] = i;
> +                                       hw->num_prod_cores++;
> +                               }
> +                       } else if ((++cnt == DLB2_EAL_PROBE_CORE ||
> +                          rte_lcore_count() < DLB2_EAL_PROBE_CORE)) {
> +                               /*
> +                                * If no producer coremask is provided, use the
> +                                * second EAL core to probe
> +                                */
> +                               cpu = i;
> +                               break;
> +                       }
> +               }
> +       }
> +       /* Use the first core in producer coremask to probe */
> +       if (hw->num_prod_cores)
> +               cpu = hw->prod_core_list[0];
> +
> +       dlb2_get_pp_allocation(hw, cpu, DLB2_LDB_PORT, cos_id);
> +       dlb2_get_pp_allocation(hw, cpu, DLB2_DIR_PORT, DLB2_COS_DEFAULT);
> +
> +       return 0;
> +}
> +
>  static int
>  dlb2_domain_attach_resources(struct dlb2_hw *hw,
>                              struct dlb2_function_resources *rsrcs,
> @@ -4359,6 +4568,8 @@ dlb2_verify_create_ldb_port_args(struct dlb2_hw *hw,
>                 return -EINVAL;
>         }
>
> +       DLB2_LOG_INFO(": LDB: cos=%d port:%d\n", id, port->id.phys_id);
> +
>         /* Check cache-line alignment */
>         if ((cq_dma_base & 0x3F) != 0) {
>                 resp->status = DLB2_ST_INVALID_CQ_VIRT_ADDR;
> @@ -4568,13 +4779,25 @@ dlb2_verify_create_dir_port_args(struct dlb2_hw *hw,
>                 /*
>                  * If the port's queue is not configured, validate that a free
>                  * port-queue pair is available.
> +                * First try the 'rsvd' list if the port is a producer or if
> +                * the 'avail' list is empty; otherwise use the 'avail' list.
>                  */
> -               pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
> -                                       typeof(*pq));
> +               if (!dlb2_list_empty(&domain->rsvd_dir_pq_pairs) &&
> +                   (args->is_producer ||
> +                    dlb2_list_empty(&domain->avail_dir_pq_pairs)))
> +                       pq = DLB2_DOM_LIST_HEAD(domain->rsvd_dir_pq_pairs,
> +                                               typeof(*pq));
> +               else
> +                       pq = DLB2_DOM_LIST_HEAD(domain->avail_dir_pq_pairs,
> +                                               typeof(*pq));
> +
>                 if (!pq) {
>                         resp->status = DLB2_ST_DIR_PORTS_UNAVAILABLE;
>                         return -EINVAL;
>                 }
> +               DLB2_LOG_INFO(": DIR: port:%d is_producer=%d\n",
> +                             pq->id.phys_id, args->is_producer);
> +
>         }
>
>         /* Check cache-line alignment */
> @@ -4875,11 +5098,18 @@ int dlb2_hw_create_dir_port(struct dlb2_hw *hw,
>                 return ret;
>
>         /*
> -        * Configuration succeeded, so move the resource from the 'avail' to
> -        * the 'used' list (if it's not already there).
> +        * Configuration succeeded, so move the resource from the 'avail' or
> +        * 'res' to the 'used' list (if it's not already there).
>          */
>         if (args->queue_id == -1) {
> -               dlb2_list_del(&domain->avail_dir_pq_pairs, &port->domain_list);
> +               struct dlb2_list_head *res = &domain->rsvd_dir_pq_pairs;
> +               struct dlb2_list_head *avail = &domain->avail_dir_pq_pairs;
> +
> +               if ((args->is_producer && !dlb2_list_empty(res)) ||
> +                    dlb2_list_empty(avail))
> +                       dlb2_list_del(res, &port->domain_list);
> +               else
> +                       dlb2_list_del(avail, &port->domain_list);
>
>                 dlb2_list_add(&domain->used_dir_pq_pairs, &port->domain_list);
>         }
> diff --git a/drivers/event/dlb2/pf/base/dlb2_resource.h b/drivers/event/dlb2/pf/base/dlb2_resource.h
> index a7e6c90888..71bd6148f1 100644
> --- a/drivers/event/dlb2/pf/base/dlb2_resource.h
> +++ b/drivers/event/dlb2/pf/base/dlb2_resource.h
> @@ -23,7 +23,20 @@
>   * Return:
>   * Returns 0 upon success, <0 otherwise.
>   */
> -int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver);
> +int dlb2_resource_init(struct dlb2_hw *hw, enum dlb2_hw_ver ver, const void *probe_args);
> +
> +/**
> + * dlb2_resource_probe() - probe hw resources
> + * @hw: pointer to struct dlb2_hw.
> + *
> + * This function probes hw resources for best port allocation to producer
> + * cores.
> + *
> + * Return:
> + * Returns 0 upon success, <0 otherwise.
> + */
> +int dlb2_resource_probe(struct dlb2_hw *hw, const void *probe_args);
> +
>
>  /**
>   * dlb2_clr_pmcsr_disable() - power on bulk of DLB 2.0 logic
> diff --git a/drivers/event/dlb2/pf/dlb2_main.c b/drivers/event/dlb2/pf/dlb2_main.c
> index b6ec85b479..717aa4fc08 100644
> --- a/drivers/event/dlb2/pf/dlb2_main.c
> +++ b/drivers/event/dlb2/pf/dlb2_main.c
> @@ -147,7 +147,7 @@ static int dlb2_pf_wait_for_device_ready(struct dlb2_dev *dlb2_dev,
>  }
>
>  struct dlb2_dev *
> -dlb2_probe(struct rte_pci_device *pdev)
> +dlb2_probe(struct rte_pci_device *pdev, const void *probe_args)
>  {
>         struct dlb2_dev *dlb2_dev;
>         int ret = 0;
> @@ -208,6 +208,10 @@ dlb2_probe(struct rte_pci_device *pdev)
>         if (ret)
>                 goto wait_for_device_ready_fail;
>
> +       ret = dlb2_resource_probe(&dlb2_dev->hw, probe_args);
> +       if (ret)
> +               goto resource_probe_fail;
> +
>         ret = dlb2_pf_reset(dlb2_dev);
>         if (ret)
>                 goto dlb2_reset_fail;
> @@ -216,7 +220,7 @@ dlb2_probe(struct rte_pci_device *pdev)
>         if (ret)
>                 goto init_driver_state_fail;
>
> -       ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version);
> +       ret = dlb2_resource_init(&dlb2_dev->hw, dlb_version, probe_args);
>         if (ret)
>                 goto resource_init_fail;
>
> @@ -227,6 +231,7 @@ dlb2_probe(struct rte_pci_device *pdev)
>  init_driver_state_fail:
>  dlb2_reset_fail:
>  pci_mmap_bad_addr:
> +resource_probe_fail:
>  wait_for_device_ready_fail:
>         rte_free(dlb2_dev);
>  dlb2_dev_malloc_fail:
> diff --git a/drivers/event/dlb2/pf/dlb2_main.h b/drivers/event/dlb2/pf/dlb2_main.h
> index 5aa51b1616..4c64d72e9c 100644
> --- a/drivers/event/dlb2/pf/dlb2_main.h
> +++ b/drivers/event/dlb2/pf/dlb2_main.h
> @@ -15,7 +15,11 @@
>  #include "base/dlb2_hw_types.h"
>  #include "../dlb2_user.h"
>
> -#define DLB2_DEFAULT_UNREGISTER_TIMEOUT_S 5
> +#define DLB2_EAL_PROBE_CORE 2
> +#define DLB2_NUM_PROBE_ENQS 1000
> +#define DLB2_HCW_MEM_SIZE 8
> +#define DLB2_HCW_64B_OFF 4
> +#define DLB2_HCW_ALIGN_MASK 0x3F
>
>  struct dlb2_dev;
>
> @@ -31,15 +35,30 @@ struct dlb2_dev {
>         /* struct list_head list; */
>         struct device *dlb2_device;
>         bool domain_reset_failed;
> +       /* The enqueue_four function enqueues four HCWs (one cache-line worth)
> +        * to the HQM, using whichever mechanism is supported by the platform
> +        * on which this driver is running.
> +        */
> +       void (*enqueue_four)(void *qe4, void *pp_addr);
>         /* The resource mutex serializes access to driver data structures and
>          * hardware registers.
>          */
>         rte_spinlock_t resource_mutex;
>         bool worker_launched;
>         u8 revision;
> +       u8 version;
> +};
> +
> +struct dlb2_pp_thread_data {
> +       struct dlb2_hw *hw;
> +       int pp;
> +       int cpu;
> +       bool is_ldb;
> +       int cycles;
>  };
>
> -struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev);
> +struct dlb2_dev *dlb2_probe(struct rte_pci_device *pdev, const void *probe_args);
> +
>
>  int dlb2_pf_reset(struct dlb2_dev *dlb2_dev);
>  int dlb2_pf_create_sched_domain(struct dlb2_hw *hw,
> diff --git a/drivers/event/dlb2/pf/dlb2_pf.c b/drivers/event/dlb2/pf/dlb2_pf.c
> index 71ac141b66..3d15250e11 100644
> --- a/drivers/event/dlb2/pf/dlb2_pf.c
> +++ b/drivers/event/dlb2/pf/dlb2_pf.c
> @@ -702,6 +702,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
>         struct dlb2_devargs dlb2_args = {
>                 .socket_id = rte_socket_id(),
>                 .max_num_events = DLB2_MAX_NUM_LDB_CREDITS,
> +               .producer_coremask = NULL,
>                 .num_dir_credits_override = -1,
>                 .qid_depth_thresholds = { {0} },
>                 .poll_interval = DLB2_POLL_INTERVAL_DEFAULT,
> @@ -713,6 +714,7 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
>         };
>         struct dlb2_eventdev *dlb2;
>         int q;
> +       const void *probe_args = NULL;
>
>         DLB2_LOG_DBG("Enter with dev_id=%d socket_id=%d",
>                      eventdev->data->dev_id, eventdev->data->socket_id);
> @@ -728,16 +730,6 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
>                 dlb2 = dlb2_pmd_priv(eventdev); /* rte_zmalloc_socket mem */
>                 dlb2->version = DLB2_HW_DEVICE_FROM_PCI_ID(pci_dev);
>
> -               /* Probe the DLB2 PF layer */
> -               dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev);
> -
> -               if (dlb2->qm_instance.pf_dev == NULL) {
> -                       DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
> -                                    rte_errno);
> -                       ret = -rte_errno;
> -                       goto dlb2_probe_failed;
> -               }
> -
>                 /* Were we invoked with runtime parameters? */
>                 if (pci_dev->device.devargs) {
>                         ret = dlb2_parse_params(pci_dev->device.devargs->args,
> @@ -749,6 +741,17 @@ dlb2_eventdev_pci_init(struct rte_eventdev *eventdev)
>                                              ret, rte_errno);
>                                 goto dlb2_probe_failed;
>                         }
> +                       probe_args = &dlb2_args;
> +               }
> +
> +               /* Probe the DLB2 PF layer */
> +               dlb2->qm_instance.pf_dev = dlb2_probe(pci_dev, probe_args);
> +
> +               if (dlb2->qm_instance.pf_dev == NULL) {
> +                       DLB2_LOG_ERR("DLB2 PF Probe failed with error %d\n",
> +                                    rte_errno);
> +                       ret = -rte_errno;
> +                       goto dlb2_probe_failed;
>                 }
>
>                 ret = dlb2_primary_eventdev_probe(eventdev,
> --
> 2.25.1
>
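
For readers following the probe-core selection in dlb2_resource_probe()
quoted above, here is a minimal standalone sketch of that decision. It is
not the driver code: MAX_LCORE, EAL_PROBE_CORE, and pick_probe_core() are
hypothetical stand-ins, the enabled[] array replaces rte_lcore_is_enabled(),
and cores[] is assumed to hold each lcore's position in the parsed producer
coremask (or -1), as the quoted loop appears to expect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LCORE 8       /* hypothetical stand-in for RTE_MAX_LCORE */
#define EAL_PROBE_CORE 2  /* mirrors DLB2_EAL_PROBE_CORE in the patch */

/*
 * enabled[i]: stand-in for rte_lcore_is_enabled(i).
 * cores[i]:   position of lcore i in the producer coremask, or -1.
 *             Pass NULL when no coremask was supplied.
 * Returns the lcore that would be used for probing.
 */
static int pick_probe_core(const bool *enabled, const int *cores)
{
	int prod_core_list[MAX_LCORE] = {0};
	int num_prod_cores = 0, lcore_count = 0, cnt = 0, cpu = 0;
	int i;

	assert(enabled != NULL);

	/* Stand-in for rte_lcore_count(). */
	for (i = 0; i < MAX_LCORE; i++)
		lcore_count += enabled[i] ? 1 : 0;

	for (i = 0; i < MAX_LCORE; i++) {
		if (!enabled[i])
			continue;
		if (cores != NULL) {
			/* Record producer cores in coremask order. */
			if (cores[i] != -1) {
				prod_core_list[cores[i]] = i;
				num_prod_cores++;
			}
		} else if (++cnt == EAL_PROBE_CORE ||
			   lcore_count < EAL_PROBE_CORE) {
			/* No coremask: second enabled lcore, or the
			 * first one if fewer than two are enabled.
			 */
			cpu = i;
			break;
		}
	}

	/* A coremask, when given, takes precedence. */
	if (num_prod_cores)
		cpu = prod_core_list[0];

	return cpu;
}
```

With lcores 1, 3, and 4 enabled and no coremask, the sketch picks lcore 3.
With a coremask that places lcore 4 first, it picks lcore 4.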

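The directed-port hunks quoted above choose between the 'rsvd' and 'avail'
pq-pair lists in two places, and the two conditions are written differently.
The sketch below models only the list-empty flags and is_producer, with
hypothetical names (pq_list, pick_for_verify, pick_for_delete), to check
that both conditions pick the same list whenever validation would succeed.
It assumes the delete path is reached only after validation found a pair.

```c
#include <assert.h>
#include <stdbool.h>

enum pq_list { PQ_NONE, PQ_RSVD, PQ_AVAIL };

/* Condition from the dlb2_verify_create_dir_port_args() hunk. */
static enum pq_list pick_for_verify(bool rsvd_empty, bool avail_empty,
				    bool is_producer)
{
	if (!rsvd_empty && (is_producer || avail_empty))
		return PQ_RSVD;
	/* An empty 'avail' list yields no pair, so validation fails. */
	return avail_empty ? PQ_NONE : PQ_AVAIL;
}

/* Condition from the dlb2_hw_create_dir_port() hunk. */
static enum pq_list pick_for_delete(bool rsvd_empty, bool avail_empty,
				    bool is_producer)
{
	/* Assumed precondition: validation already found a pair. */
	assert(!(rsvd_empty && avail_empty));

	if ((is_producer && !rsvd_empty) || avail_empty)
		return PQ_RSVD;
	return PQ_AVAIL;
}

/* Returns true if both sites agree for every state that passes validation. */
static bool selections_agree(void)
{
	for (int s = 0; s < 8; s++) {
		bool rsvd_empty = (s & 1) != 0;
		bool avail_empty = (s & 2) != 0;
		bool is_producer = (s & 4) != 0;
		enum pq_list v = pick_for_verify(rsvd_empty, avail_empty,
						 is_producer);

		if (v == PQ_NONE)
			continue; /* validation would have returned an error */
		if (v != pick_for_delete(rsvd_empty, avail_empty, is_producer))
			return false;
	}
	return true;
}
```

Under that assumption the two conditions are equivalent; they differ only
when both lists are empty, which the validation step rejects first.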

end of thread, newest message: 2022-09-30  8:28 UTC

Thread overview: 37+ messages
2022-08-20  0:59 [PATCH 0/3] DLB2 Performance Optimizations Timothy McDaniel
2022-08-20  0:59 ` [PATCH 1/3] event/dlb2: add producer port probing optimization Timothy McDaniel
2022-09-03 13:16   ` Jerin Jacob
2022-09-26 22:55   ` [PATCH v3 " Abdullah Sevincer
2022-09-26 22:55     ` [PATCH v3 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
2022-09-26 22:55     ` [PATCH v3 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
2022-09-27  1:42   ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
2022-09-27  1:42     ` [PATCH v4 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
2022-09-27  1:42     ` [PATCH v4 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
2022-09-28 14:45     ` [PATCH v4 1/3] event/dlb2: add producer port probing optimization Jerin Jacob
2022-09-28 19:11     ` [PATCH v5 " Abdullah Sevincer
2022-09-28 19:11     ` [PATCH v5 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
2022-09-28 19:19     ` [PATCH v6 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
2022-09-28 19:19       ` [PATCH v6 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
2022-09-28 19:19       ` [PATCH v6 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
2022-09-28 20:28     ` [PATCH v7 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
2022-09-28 20:28       ` [PATCH v7 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
2022-09-28 20:28       ` [PATCH v7 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
2022-09-29  1:32     ` [PATCH v8 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
2022-09-29  1:32       ` [PATCH v8 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
2022-09-29  1:32       ` [PATCH v8 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
2022-09-29  2:48       ` [PATCH v8 1/3] event/dlb2: add producer port probing optimization Sevincer, Abdullah
2022-09-29  3:46     ` [PATCH v9 " Abdullah Sevincer
2022-09-29  3:46       ` [PATCH v9 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
2022-09-29  3:46       ` [PATCH v9 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
2022-09-29  5:03   ` [PATCH v10 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
2022-09-29  5:03     ` [PATCH v10 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
2022-09-29  5:03     ` [PATCH v10 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
2022-09-29 15:26   ` [PATCH v11 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
2022-09-29 15:26     ` [PATCH v11 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
2022-09-29 15:26     ` [PATCH v11 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
2022-09-29 23:58   ` [PATCH v12 1/3] event/dlb2: add producer port probing optimization Abdullah Sevincer
2022-09-29 23:58     ` [PATCH v12 2/3] event/dlb2: add fence bypass option for producer ports Abdullah Sevincer
2022-09-29 23:59     ` [PATCH v12 3/3] event/dlb2: optimize credit allocations Abdullah Sevincer
2022-09-30  8:28     ` [PATCH v12 1/3] event/dlb2: add producer port probing optimization Jerin Jacob
2022-08-20  0:59 ` [PATCH 2/3] event/dlb2: add fence bypass option for producer ports Timothy McDaniel
2022-08-20  0:59 ` [PATCH 3/3] event/dlb2: optimize credit allocations Timothy McDaniel
