From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id B1BB0A052A; Thu, 26 Nov 2020 12:16:47 +0100 (CET)
Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 07F40C99C; Thu, 26 Nov 2020 12:16:09 +0100 (CET)
Received: from hqnvemgate25.nvidia.com (hqnvemgate25.nvidia.com [216.228.121.64]) by dpdk.org (Postfix) with ESMTP id 3E1E1C96C for ; Thu, 26 Nov 2020 12:16:06 +0100 (CET)
Received: from hqmail.nvidia.com (Not Verified[216.228.121.13]) by hqnvemgate25.nvidia.com (using TLS: TLSv1.2, AES256-SHA) id ; Thu, 26 Nov 2020 03:16:03 -0800
Received: from nvidia.com (10.124.1.5) by HQMAIL107.nvidia.com (172.20.187.13) with Microsoft SMTP Server (TLS) id 15.0.1473.3; Thu, 26 Nov 2020 11:16:02 +0000
From: Wisam Jaddo
To: , , ,
CC:
Date: Thu, 26 Nov 2020 13:15:41 +0200
Message-ID: <20201126111543.16928-3-wisamm@nvidia.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20201126111543.16928-1-wisamm@nvidia.com>
References: <20201126111543.16928-1-wisamm@nvidia.com>
MIME-Version: 1.0
Content-Type: text/plain
X-Originating-IP: [10.124.1.5]
X-ClientProxiedBy: HQMAIL101.nvidia.com (172.20.187.10) To HQMAIL107.nvidia.com (172.20.187.13)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nvidia.com; s=n1; t=1606389363; bh=wW00TgXxNUV4L/5wvAttlT2ZcsNA+b3dHMydlgqBprs=; h=From:To:CC:Subject:Date:Message-ID:X-Mailer:In-Reply-To: References:MIME-Version:Content-Type:X-Originating-IP: X-ClientProxiedBy; b=MfOVPKlZo1SlG3lps0T5j+MbLsoM/uJD3fG4o545k/qxVWJ6SFy6h5utBaAMPX4Ys X5KiJ9iiDS/gIDKdSnXYS9gu3wOfdliCDRA6ZMbV956ci2PYB7jAHfAoSHdAsrwfXV R3HgXTu3WanZgS4EFHakZu32kxVARO15lARLIFU+WJFycRunM/eoahwByTqpjne/Gb H0g+rZSs1Ev2tL2zN7wfW3DkKteZ+5VfZBhOhYbR/ZOUw1I03U8VJQQXRzW+2JM4hW 0tqzweExctjvaBs8U4e4xcVVFrA0d974DZhzNtyWSf+SAQ/1i16UMZDiIYO6OJ1K5F jEDOcdgNQxiaw==
Subject: [dpdk-dev] [PATCH 2/4] app/flow-perf: add multiple cores insertion and deletion
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dev-bounces@dpdk.org
Sender: "dev"

One way to increase the insertion/deletion rate is multi-threaded insertion/deletion, so the flow-perf application needs support for testing and measuring those rates. The app now takes the number of cores to use from a command-line option, distributes the rte_flow rules evenly between those cores, and starts inserting/deleting in parallel. Each worker reports its own results, and at the end the MAIN worker reports the total results for all cores, calculated as RULES_COUNT divided by the maximum time used among all cores.

This also changes the memory measurement: with multiple cores inserting at the same time, the previous single-core accounting is no longer valid, so memory usage is now sampled before and after allocation on every core. At the end, the minimum pre-allocation value and the maximum post-allocation value across all cores are taken; the difference between them is the total memory consumed by the rte_flow rules from all cores, from which the size of a single rte_flow in bytes is reported per port.
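As a sketch (illustrative only, not part of the patch; the helper and its parameters are hypothetical), the MAIN worker's aggregation boils down to:

    #include <inttypes.h>
    #include <stdio.h>

    /* Combine the per-core samples into the reported totals:
     * max time -> latency, average time -> throughput,
     * min/max memory -> per-rule footprint. */
    static void
    report_totals(const double time_used[], const int64_t pre_alloc[],
                  const int64_t post_alloc[], uint32_t cores,
                  uint32_t rules_count)
    {
            double max_time = time_used[0];
            double sum_time = 0;
            int64_t min_pre = pre_alloc[0];
            int64_t max_post = post_alloc[0];
            uint32_t i;

            for (i = 0; i < cores; i++) {
                    sum_time += time_used[i];
                    if (time_used[i] > max_time)
                            max_time = time_used[i];
                    if (pre_alloc[i] < min_pre)
                            min_pre = pre_alloc[i];
                    if (post_alloc[i] > max_post)
                            max_post = post_alloc[i];
            }
            /* Latency: all rules over the slowest core's time. */
            printf("rate (latency): %f K Rules/Sec\n",
                   rules_count / max_time / 1000.0);
            /* Throughput: all rules over the average per-core time. */
            printf("rate (throughput): %f K Rules/Sec\n",
                   rules_count / (sum_time / cores) / 1000.0);
            /* Memory: spread between min "before" and max "after",
             * divided over all rules. */
            printf("rte_flow size: %" PRId64 " Bytes\n",
                   (max_post - min_pre) / rules_count);
    }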
How to use this feature: --cores=N, where 1 <= N <= RTE_MAX_LCORE.

Signed-off-by: Wisam Jaddo
Reviewed-by: Alexander Kozyrev
Reviewed-by: Suanming Mou
---
 app/test-flow-perf/actions_gen.c | 175 ++++++++++----------
 app/test-flow-perf/actions_gen.h | 2 +-
 app/test-flow-perf/config.h | 1 +
 app/test-flow-perf/flow_gen.c | 5 +-
 app/test-flow-perf/flow_gen.h | 1 +
 app/test-flow-perf/items_gen.c | 103 ++++++------
 app/test-flow-perf/items_gen.h | 2 +-
 app/test-flow-perf/main.c | 266 +++++++++++++++++++++++++------
 doc/guides/tools/flow-perf.rst | 14 +-
 9 files changed, 372 insertions(+), 197 deletions(-)

diff --git a/app/test-flow-perf/actions_gen.c b/app/test-flow-perf/actions_gen.c index ac525f6fdb..1364407056 100644 --- a/app/test-flow-perf/actions_gen.c +++ b/app/test-flow-perf/actions_gen.c @@ -29,6 +29,7 @@ struct additional_para { uint32_t counter; uint64_t encap_data; uint64_t decap_data; + uint8_t core_idx; }; /* Storage for struct rte_flow_action_raw_encap including external data. */ @@ -58,16 +59,16 @@ add_mark(struct rte_flow_action *actions, uint8_t actions_counter, struct additional_para para) { - static struct rte_flow_action_mark mark_action; + static struct rte_flow_action_mark mark_actions[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t counter = para.counter; do { /* Random values from 1 to 256 */ - mark_action.id = (counter % 255) + 1; + mark_actions[para.core_idx].id = (counter % 255) + 1; } while (0); actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_MARK; - actions[actions_counter].conf = &mark_action; + actions[actions_counter].conf = &mark_actions[para.core_idx]; } static void @@ -75,14 +76,14 @@ add_queue(struct rte_flow_action *actions, uint8_t actions_counter, struct additional_para para) { - static struct rte_flow_action_queue queue_action; + static struct rte_flow_action_queue queue_actions[RTE_MAX_LCORE] __rte_cache_aligned; do { - queue_action.index = para.queue; + queue_actions[para.core_idx].index = para.queue; } while (0); actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_QUEUE; - actions[actions_counter].conf = &queue_action; + actions[actions_counter].conf = &queue_actions[para.core_idx]; } static void @@ -105,39 +106,36 @@ add_rss(struct rte_flow_action *actions, uint8_t actions_counter, struct additional_para para) { - static struct rte_flow_action_rss *rss_action; - static struct action_rss_data *rss_data; + static struct action_rss_data *rss_data[RTE_MAX_LCORE] __rte_cache_aligned; uint16_t queue; - if (rss_data == NULL) - rss_data = rte_malloc("rss_data", + if (rss_data[para.core_idx] == NULL) + rss_data[para.core_idx] = rte_malloc("rss_data", sizeof(struct action_rss_data), 0); - if (rss_data == NULL) + if (rss_data[para.core_idx] == NULL) rte_exit(EXIT_FAILURE, "No Memory available!"); - *rss_data = (struct action_rss_data){ + *rss_data[para.core_idx] = (struct action_rss_data){ .conf = (struct rte_flow_action_rss){ .func = RTE_ETH_HASH_FUNCTION_DEFAULT, .level = 0, .types = GET_RSS_HF(), - .key_len = sizeof(rss_data->key), + .key_len = sizeof(rss_data[para.core_idx]->key), .queue_num = para.queues_number, - .key = rss_data->key, - .queue = rss_data->queue, + .key = rss_data[para.core_idx]->key, + .queue = rss_data[para.core_idx]->queue, }, .key = { 1 }, .queue = { 0 }, }; for (queue = 0; queue < para.queues_number; queue++) - rss_data->queue[queue] = para.queues[queue]; - - rss_action = &rss_data->conf; + rss_data[para.core_idx]->queue[queue] = para.queues[queue]; actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_RSS; -
actions[actions_counter].conf = rss_action; + actions[actions_counter].conf = &rss_data[para.core_idx]->conf; } static void @@ -212,7 +210,7 @@ add_set_src_mac(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_set_mac set_mac; + static struct rte_flow_action_set_mac set_macs[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t mac = para.counter; uint16_t i; @@ -222,12 +220,12 @@ add_set_src_mac(struct rte_flow_action *actions, /* Mac address to be set is random each time */ for (i = 0; i < RTE_ETHER_ADDR_LEN; i++) { - set_mac.mac_addr[i] = mac & 0xff; + set_macs[para.core_idx].mac_addr[i] = mac & 0xff; mac = mac >> 8; } actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_SET_MAC_SRC; - actions[actions_counter].conf = &set_mac; + actions[actions_counter].conf = &set_macs[para.core_idx]; } static void @@ -235,7 +233,7 @@ add_set_dst_mac(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_set_mac set_mac; + static struct rte_flow_action_set_mac set_macs[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t mac = para.counter; uint16_t i; @@ -245,12 +243,12 @@ add_set_dst_mac(struct rte_flow_action *actions, /* Mac address to be set is random each time */ for (i = 0; i < RTE_ETHER_ADDR_LEN; i++) { - set_mac.mac_addr[i] = mac & 0xff; + set_macs[para.core_idx].mac_addr[i] = mac & 0xff; mac = mac >> 8; } actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_SET_MAC_DST; - actions[actions_counter].conf = &set_mac; + actions[actions_counter].conf = &set_macs[para.core_idx]; } static void @@ -258,7 +256,7 @@ add_set_src_ipv4(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_set_ipv4 set_ipv4; + static struct rte_flow_action_set_ipv4 set_ipv4[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t ip = para.counter; /* Fixed value */ @@ -266,10 +264,10 @@ add_set_src_ipv4(struct rte_flow_action *actions, ip = 1; /* IPv4 value to be set is random each time */ - set_ipv4.ipv4_addr = RTE_BE32(ip + 1); + set_ipv4[para.core_idx].ipv4_addr = RTE_BE32(ip + 1); actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC; - actions[actions_counter].conf = &set_ipv4; + actions[actions_counter].conf = &set_ipv4[para.core_idx]; } static void @@ -277,7 +275,7 @@ add_set_dst_ipv4(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_set_ipv4 set_ipv4; + static struct rte_flow_action_set_ipv4 set_ipv4[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t ip = para.counter; /* Fixed value */ @@ -285,10 +283,10 @@ add_set_dst_ipv4(struct rte_flow_action *actions, ip = 1; /* IPv4 value to be set is random each time */ - set_ipv4.ipv4_addr = RTE_BE32(ip + 1); + set_ipv4[para.core_idx].ipv4_addr = RTE_BE32(ip + 1); actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_SET_IPV4_DST; - actions[actions_counter].conf = &set_ipv4; + actions[actions_counter].conf = &set_ipv4[para.core_idx]; } static void @@ -296,7 +294,7 @@ add_set_src_ipv6(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_set_ipv6 set_ipv6; + static struct rte_flow_action_set_ipv6 set_ipv6[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t ipv6 = para.counter; uint8_t i; @@ -306,12 +304,12 @@ add_set_src_ipv6(struct rte_flow_action *actions, /* IPv6 value to set is random each time */ for 
(i = 0; i < 16; i++) { - set_ipv6.ipv6_addr[i] = ipv6 & 0xff; + set_ipv6[para.core_idx].ipv6_addr[i] = ipv6 & 0xff; ipv6 = ipv6 >> 8; } actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_SET_IPV6_SRC; - actions[actions_counter].conf = &set_ipv6; + actions[actions_counter].conf = &set_ipv6[para.core_idx]; } static void @@ -319,7 +317,7 @@ add_set_dst_ipv6(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_set_ipv6 set_ipv6; + static struct rte_flow_action_set_ipv6 set_ipv6[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t ipv6 = para.counter; uint8_t i; @@ -329,12 +327,12 @@ add_set_dst_ipv6(struct rte_flow_action *actions, /* IPv6 value to set is random each time */ for (i = 0; i < 16; i++) { - set_ipv6.ipv6_addr[i] = ipv6 & 0xff; + set_ipv6[para.core_idx].ipv6_addr[i] = ipv6 & 0xff; ipv6 = ipv6 >> 8; } actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_SET_IPV6_DST; - actions[actions_counter].conf = &set_ipv6; + actions[actions_counter].conf = &set_ipv6[para.core_idx]; } static void @@ -342,7 +340,7 @@ add_set_src_tp(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_set_tp set_tp; + static struct rte_flow_action_set_tp set_tp[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t tp = para.counter; /* Fixed value */ @@ -352,10 +350,10 @@ add_set_src_tp(struct rte_flow_action *actions, /* TP src port is random each time */ tp = tp % 0xffff; - set_tp.port = RTE_BE16(tp & 0xffff); + set_tp[para.core_idx].port = RTE_BE16(tp & 0xffff); actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_SET_TP_SRC; - actions[actions_counter].conf = &set_tp; + actions[actions_counter].conf = &set_tp[para.core_idx]; } static void @@ -363,7 +361,7 @@ add_set_dst_tp(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_set_tp set_tp; + static struct rte_flow_action_set_tp set_tp[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t tp = para.counter; /* Fixed value */ @@ -374,10 +372,10 @@ add_set_dst_tp(struct rte_flow_action *actions, if (tp > 0xffff) tp = tp >> 16; - set_tp.port = RTE_BE16(tp & 0xffff); + set_tp[para.core_idx].port = RTE_BE16(tp & 0xffff); actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_SET_TP_DST; - actions[actions_counter].conf = &set_tp; + actions[actions_counter].conf = &set_tp[para.core_idx]; } static void @@ -385,17 +383,17 @@ add_inc_tcp_ack(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static rte_be32_t value; + static rte_be32_t value[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t ack_value = para.counter; /* Fixed value */ if (FIXED_VALUES) ack_value = 1; - value = RTE_BE32(ack_value); + value[para.core_idx] = RTE_BE32(ack_value); actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_INC_TCP_ACK; - actions[actions_counter].conf = &value; + actions[actions_counter].conf = &value[para.core_idx]; } static void @@ -403,17 +401,17 @@ add_dec_tcp_ack(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static rte_be32_t value; + static rte_be32_t value[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t ack_value = para.counter; /* Fixed value */ if (FIXED_VALUES) ack_value = 1; - value = RTE_BE32(ack_value); + value[para.core_idx] = RTE_BE32(ack_value); actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_DEC_TCP_ACK; - actions[actions_counter].conf = &value; + 
actions[actions_counter].conf = &value[para.core_idx]; } static void @@ -421,17 +419,17 @@ add_inc_tcp_seq(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static rte_be32_t value; + static rte_be32_t value[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t seq_value = para.counter; /* Fixed value */ if (FIXED_VALUES) seq_value = 1; - value = RTE_BE32(seq_value); + value[para.core_idx] = RTE_BE32(seq_value); actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_INC_TCP_SEQ; - actions[actions_counter].conf = &value; + actions[actions_counter].conf = &value[para.core_idx]; } static void @@ -439,17 +437,17 @@ add_dec_tcp_seq(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static rte_be32_t value; + static rte_be32_t value[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t seq_value = para.counter; /* Fixed value */ if (FIXED_VALUES) seq_value = 1; - value = RTE_BE32(seq_value); + value[para.core_idx] = RTE_BE32(seq_value); actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_DEC_TCP_SEQ; - actions[actions_counter].conf = &value; + actions[actions_counter].conf = &value[para.core_idx]; } static void @@ -457,7 +455,7 @@ add_set_ttl(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_set_ttl set_ttl; + static struct rte_flow_action_set_ttl set_ttl[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t ttl_value = para.counter; /* Fixed value */ @@ -467,10 +465,10 @@ add_set_ttl(struct rte_flow_action *actions, /* Set ttl to random value each time */ ttl_value = ttl_value % 0xff; - set_ttl.ttl_value = ttl_value; + set_ttl[para.core_idx].ttl_value = ttl_value; actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_SET_TTL; - actions[actions_counter].conf = &set_ttl; + actions[actions_counter].conf = &set_ttl[para.core_idx]; } static void @@ -486,7 +484,7 @@ add_set_ipv4_dscp(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_set_dscp set_dscp; + static struct rte_flow_action_set_dscp set_dscp[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t dscp_value = para.counter; /* Fixed value */ @@ -496,10 +494,10 @@ add_set_ipv4_dscp(struct rte_flow_action *actions, /* Set dscp to random value each time */ dscp_value = dscp_value % 0xff; - set_dscp.dscp = dscp_value; + set_dscp[para.core_idx].dscp = dscp_value; actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP; - actions[actions_counter].conf = &set_dscp; + actions[actions_counter].conf = &set_dscp[para.core_idx]; } static void @@ -507,7 +505,7 @@ add_set_ipv6_dscp(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_set_dscp set_dscp; + static struct rte_flow_action_set_dscp set_dscp[RTE_MAX_LCORE] __rte_cache_aligned; uint32_t dscp_value = para.counter; /* Fixed value */ @@ -517,10 +515,10 @@ add_set_ipv6_dscp(struct rte_flow_action *actions, /* Set dscp to random value each time */ dscp_value = dscp_value % 0xff; - set_dscp.dscp = dscp_value; + set_dscp[para.core_idx].dscp = dscp_value; actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP; - actions[actions_counter].conf = &set_dscp; + actions[actions_counter].conf = &set_dscp[para.core_idx]; } static void @@ -774,36 +772,36 @@ add_raw_encap(struct rte_flow_action *actions, uint8_t actions_counter, struct additional_para para) { - static struct 
action_raw_encap_data *action_encap_data; + static struct action_raw_encap_data *action_encap_data[RTE_MAX_LCORE] __rte_cache_aligned; uint64_t encap_data = para.encap_data; uint8_t *header; uint8_t i; /* Avoid double allocation. */ - if (action_encap_data == NULL) - action_encap_data = rte_malloc("encap_data", + if (action_encap_data[para.core_idx] == NULL) + action_encap_data[para.core_idx] = rte_malloc("encap_data", sizeof(struct action_raw_encap_data), 0); /* Check if allocation failed. */ - if (action_encap_data == NULL) + if (action_encap_data[para.core_idx] == NULL) rte_exit(EXIT_FAILURE, "No Memory available!"); - *action_encap_data = (struct action_raw_encap_data) { + *action_encap_data[para.core_idx] = (struct action_raw_encap_data) { .conf = (struct rte_flow_action_raw_encap) { - .data = action_encap_data->data, + .data = action_encap_data[para.core_idx]->data, }, .data = {}, }; - header = action_encap_data->data; + header = action_encap_data[para.core_idx]->data; for (i = 0; i < RTE_DIM(headers); i++) headers[i].funct(&header, encap_data, para); - action_encap_data->conf.size = header - - action_encap_data->data; + action_encap_data[para.core_idx]->conf.size = header - + action_encap_data[para.core_idx]->data; actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_RAW_ENCAP; - actions[actions_counter].conf = &action_encap_data->conf; + actions[actions_counter].conf = &action_encap_data[para.core_idx]->conf; } static void @@ -811,36 +809,36 @@ add_raw_decap(struct rte_flow_action *actions, uint8_t actions_counter, struct additional_para para) { - static struct action_raw_decap_data *action_decap_data; + static struct action_raw_decap_data *action_decap_data[RTE_MAX_LCORE] __rte_cache_aligned; uint64_t decap_data = para.decap_data; uint8_t *header; uint8_t i; /* Avoid double allocation. */ - if (action_decap_data == NULL) - action_decap_data = rte_malloc("decap_data", + if (action_decap_data[para.core_idx] == NULL) + action_decap_data[para.core_idx] = rte_malloc("decap_data", sizeof(struct action_raw_decap_data), 0); /* Check if allocation failed. 
*/ - if (action_decap_data == NULL) + if (action_decap_data[para.core_idx] == NULL) rte_exit(EXIT_FAILURE, "No Memory available!"); - *action_decap_data = (struct action_raw_decap_data) { + *action_decap_data[para.core_idx] = (struct action_raw_decap_data) { .conf = (struct rte_flow_action_raw_decap) { - .data = action_decap_data->data, + .data = action_decap_data[para.core_idx]->data, }, .data = {}, }; - header = action_decap_data->data; + header = action_decap_data[para.core_idx]->data; for (i = 0; i < RTE_DIM(headers); i++) headers[i].funct(&header, decap_data, para); - action_decap_data->conf.size = header - - action_decap_data->data; + action_decap_data[para.core_idx]->conf.size = header - + action_decap_data[para.core_idx]->data; actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_RAW_DECAP; - actions[actions_counter].conf = &action_decap_data->conf; + actions[actions_counter].conf = &action_decap_data[para.core_idx]->conf; } static void @@ -848,7 +846,7 @@ add_vxlan_encap(struct rte_flow_action *actions, uint8_t actions_counter, __rte_unused struct additional_para para) { - static struct rte_flow_action_vxlan_encap vxlan_encap; + static struct rte_flow_action_vxlan_encap vxlan_encap[RTE_MAX_LCORE] __rte_cache_aligned; static struct rte_flow_item items[5]; static struct rte_flow_item_eth item_eth; static struct rte_flow_item_ipv4 item_ipv4; @@ -885,10 +883,10 @@ add_vxlan_encap(struct rte_flow_action *actions, items[4].type = RTE_FLOW_ITEM_TYPE_END; - vxlan_encap.definition = items; + vxlan_encap[para.core_idx].definition = items; actions[actions_counter].type = RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP; - actions[actions_counter].conf = &vxlan_encap; + actions[actions_counter].conf = &vxlan_encap[para.core_idx]; } static void @@ -902,7 +900,7 @@ add_vxlan_decap(struct rte_flow_action *actions, void fill_actions(struct rte_flow_action *actions, uint64_t *flow_actions, uint32_t counter, uint16_t next_table, uint16_t hairpinq, - uint64_t encap_data, uint64_t decap_data) + uint64_t encap_data, uint64_t decap_data, uint8_t core_idx) { struct additional_para additional_para_data; uint8_t actions_counter = 0; @@ -924,6 +922,7 @@ fill_actions(struct rte_flow_action *actions, uint64_t *flow_actions, .counter = counter, .encap_data = encap_data, .decap_data = decap_data, + .core_idx = core_idx, }; if (hairpinq != 0) { diff --git a/app/test-flow-perf/actions_gen.h b/app/test-flow-perf/actions_gen.h index 85e3176b09..77353cfe09 100644 --- a/app/test-flow-perf/actions_gen.h +++ b/app/test-flow-perf/actions_gen.h @@ -19,6 +19,6 @@ void fill_actions(struct rte_flow_action *actions, uint64_t *flow_actions, uint32_t counter, uint16_t next_table, uint16_t hairpinq, - uint64_t encap_data, uint64_t decap_data); + uint64_t encap_data, uint64_t decap_data, uint8_t core_idx); #endif /* FLOW_PERF_ACTION_GEN */ diff --git a/app/test-flow-perf/config.h b/app/test-flow-perf/config.h index 8f42bc589c..94e83c9abc 100644 --- a/app/test-flow-perf/config.h +++ b/app/test-flow-perf/config.h @@ -15,6 +15,7 @@ #define MBUF_CACHE_SIZE 512 #define NR_RXD 256 #define NR_TXD 256 +#define MAX_PORTS 64 /* This is used for encap/decap & header modify actions. * When it's 1: it means all actions have fixed values. 
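Every hunk in actions_gen.c above (and in items_gen.c below) applies the same conversion: the static scratch object that used to be shared by all callers becomes a per-lcore, cache-aligned array indexed by core_idx, so workers inserting in parallel never touch the same slot. Distilled into one minimal sketch (illustrative only; it mirrors add_mark from this patch and assumes DPDK's RTE_MAX_LCORE, __rte_cache_aligned, and rte_flow definitions):

    #include <rte_common.h>
    #include <rte_lcore.h>
    #include <rte_flow.h>

    static void
    add_mark_per_core(struct rte_flow_action *actions, uint8_t idx,
                      uint8_t core_idx, uint32_t counter)
    {
            /* One slot per lcore; the cache alignment keeps two workers
             * from false-sharing adjacent slots. */
            static struct rte_flow_action_mark marks[RTE_MAX_LCORE]
                    __rte_cache_aligned;

            marks[core_idx].id = (counter % 255) + 1; /* ids 1..255 */
            actions[idx].type = RTE_FLOW_ACTION_TYPE_MARK;
            actions[idx].conf = &marks[core_idx];
    }

Because the arrays are static, the pointers stored in actions[].conf stay valid after the generator returns, which is why the patch indexes per-core arrays instead of allocating per call.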
diff --git a/app/test-flow-perf/flow_gen.c b/app/test-flow-perf/flow_gen.c index a979b3856d..df4af16de8 100644 --- a/app/test-flow-perf/flow_gen.c +++ b/app/test-flow-perf/flow_gen.c @@ -45,6 +45,7 @@ generate_flow(uint16_t port_id, uint16_t hairpinq, uint64_t encap_data, uint64_t decap_data, + uint8_t core_idx, struct rte_flow_error *error) { struct rte_flow_attr attr; @@ -60,9 +61,9 @@ generate_flow(uint16_t port_id, fill_actions(actions, flow_actions, outer_ip_src, next_table, hairpinq, - encap_data, decap_data); + encap_data, decap_data, core_idx); - fill_items(items, flow_items, outer_ip_src); + fill_items(items, flow_items, outer_ip_src, core_idx); flow = rte_flow_create(port_id, &attr, items, actions, error); return flow; diff --git a/app/test-flow-perf/flow_gen.h b/app/test-flow-perf/flow_gen.h index 3d13737d65..f1d0999af1 100644 --- a/app/test-flow-perf/flow_gen.h +++ b/app/test-flow-perf/flow_gen.h @@ -34,6 +34,7 @@ generate_flow(uint16_t port_id, uint16_t hairpinq, uint64_t encap_data, uint64_t decap_data, + uint8_t core_idx, struct rte_flow_error *error); #endif /* FLOW_PERF_FLOW_GEN */ diff --git a/app/test-flow-perf/items_gen.c b/app/test-flow-perf/items_gen.c index 2b1ab41467..0950023608 100644 --- a/app/test-flow-perf/items_gen.c +++ b/app/test-flow-perf/items_gen.c @@ -15,6 +15,7 @@ /* Storage for additional parameters for items */ struct additional_para { rte_be32_t src_ip; + uint8_t core_idx; }; static void @@ -58,18 +59,19 @@ static void add_ipv4(struct rte_flow_item *items, uint8_t items_counter, struct additional_para para) { - static struct rte_flow_item_ipv4 ipv4_spec; - static struct rte_flow_item_ipv4 ipv4_mask; + static struct rte_flow_item_ipv4 ipv4_specs[RTE_MAX_LCORE] __rte_cache_aligned; + static struct rte_flow_item_ipv4 ipv4_masks[RTE_MAX_LCORE] __rte_cache_aligned; + uint8_t ti = para.core_idx; - memset(&ipv4_spec, 0, sizeof(struct rte_flow_item_ipv4)); - memset(&ipv4_mask, 0, sizeof(struct rte_flow_item_ipv4)); + memset(&ipv4_specs[ti], 0, sizeof(struct rte_flow_item_ipv4)); + memset(&ipv4_masks[ti], 0, sizeof(struct rte_flow_item_ipv4)); - ipv4_spec.hdr.src_addr = RTE_BE32(para.src_ip); - ipv4_mask.hdr.src_addr = RTE_BE32(0xffffffff); + ipv4_specs[ti].hdr.src_addr = RTE_BE32(para.src_ip); + ipv4_masks[ti].hdr.src_addr = RTE_BE32(0xffffffff); items[items_counter].type = RTE_FLOW_ITEM_TYPE_IPV4; - items[items_counter].spec = &ipv4_spec; - items[items_counter].mask = &ipv4_mask; + items[items_counter].spec = &ipv4_specs[ti]; + items[items_counter].mask = &ipv4_masks[ti]; } @@ -77,23 +79,24 @@ static void add_ipv6(struct rte_flow_item *items, uint8_t items_counter, struct additional_para para) { - static struct rte_flow_item_ipv6 ipv6_spec; - static struct rte_flow_item_ipv6 ipv6_mask; + static struct rte_flow_item_ipv6 ipv6_specs[RTE_MAX_LCORE] __rte_cache_aligned; + static struct rte_flow_item_ipv6 ipv6_masks[RTE_MAX_LCORE] __rte_cache_aligned; + uint8_t ti = para.core_idx; - memset(&ipv6_spec, 0, sizeof(struct rte_flow_item_ipv6)); - memset(&ipv6_mask, 0, sizeof(struct rte_flow_item_ipv6)); + memset(&ipv6_specs[ti], 0, sizeof(struct rte_flow_item_ipv6)); + memset(&ipv6_masks[ti], 0, sizeof(struct rte_flow_item_ipv6)); /** Set ipv6 src **/ - memset(&ipv6_spec.hdr.src_addr, para.src_ip, - sizeof(ipv6_spec.hdr.src_addr) / 2); + memset(&ipv6_specs[ti].hdr.src_addr, para.src_ip, + sizeof(ipv6_specs->hdr.src_addr) / 2); /** Full mask **/ - memset(&ipv6_mask.hdr.src_addr, 0xff, - sizeof(ipv6_spec.hdr.src_addr)); + memset(&ipv6_masks[ti].hdr.src_addr, 0xff, + 
sizeof(ipv6_specs->hdr.src_addr)); items[items_counter].type = RTE_FLOW_ITEM_TYPE_IPV6; - items[items_counter].spec = &ipv6_spec; - items[items_counter].mask = &ipv6_mask; + items[items_counter].spec = &ipv6_specs[ti]; + items[items_counter].mask = &ipv6_masks[ti]; } static void @@ -131,31 +134,31 @@ add_udp(struct rte_flow_item *items, static void add_vxlan(struct rte_flow_item *items, uint8_t items_counter, - __rte_unused struct additional_para para) + struct additional_para para) { - static struct rte_flow_item_vxlan vxlan_spec; - static struct rte_flow_item_vxlan vxlan_mask; - + static struct rte_flow_item_vxlan vxlan_specs[RTE_MAX_LCORE] __rte_cache_aligned; + static struct rte_flow_item_vxlan vxlan_masks[RTE_MAX_LCORE] __rte_cache_aligned; + uint8_t ti = para.core_idx; uint32_t vni_value; uint8_t i; vni_value = VNI_VALUE; - memset(&vxlan_spec, 0, sizeof(struct rte_flow_item_vxlan)); - memset(&vxlan_mask, 0, sizeof(struct rte_flow_item_vxlan)); + memset(&vxlan_specs[ti], 0, sizeof(struct rte_flow_item_vxlan)); + memset(&vxlan_masks[ti], 0, sizeof(struct rte_flow_item_vxlan)); /* Set standard vxlan vni */ for (i = 0; i < 3; i++) { - vxlan_spec.vni[2 - i] = vni_value >> (i * 8); - vxlan_mask.vni[2 - i] = 0xff; + vxlan_specs[ti].vni[2 - i] = vni_value >> (i * 8); + vxlan_masks[ti].vni[2 - i] = 0xff; } /* Standard vxlan flags */ - vxlan_spec.flags = 0x8; + vxlan_specs[ti].flags = 0x8; items[items_counter].type = RTE_FLOW_ITEM_TYPE_VXLAN; - items[items_counter].spec = &vxlan_spec; - items[items_counter].mask = &vxlan_mask; + items[items_counter].spec = &vxlan_specs[ti]; + items[items_counter].mask = &vxlan_masks[ti]; } static void @@ -163,29 +166,29 @@ add_vxlan_gpe(struct rte_flow_item *items, uint8_t items_counter, __rte_unused struct additional_para para) { - static struct rte_flow_item_vxlan_gpe vxlan_gpe_spec; - static struct rte_flow_item_vxlan_gpe vxlan_gpe_mask; - + static struct rte_flow_item_vxlan_gpe vxlan_gpe_specs[RTE_MAX_LCORE] __rte_cache_aligned; + static struct rte_flow_item_vxlan_gpe vxlan_gpe_masks[RTE_MAX_LCORE] __rte_cache_aligned; + uint8_t ti = para.core_idx; uint32_t vni_value; uint8_t i; vni_value = VNI_VALUE; - memset(&vxlan_gpe_spec, 0, sizeof(struct rte_flow_item_vxlan_gpe)); - memset(&vxlan_gpe_mask, 0, sizeof(struct rte_flow_item_vxlan_gpe)); + memset(&vxlan_gpe_specs[ti], 0, sizeof(struct rte_flow_item_vxlan_gpe)); + memset(&vxlan_gpe_masks[ti], 0, sizeof(struct rte_flow_item_vxlan_gpe)); /* Set vxlan-gpe vni */ for (i = 0; i < 3; i++) { - vxlan_gpe_spec.vni[2 - i] = vni_value >> (i * 8); - vxlan_gpe_mask.vni[2 - i] = 0xff; + vxlan_gpe_specs[ti].vni[2 - i] = vni_value >> (i * 8); + vxlan_gpe_masks[ti].vni[2 - i] = 0xff; } /* vxlan-gpe flags */ - vxlan_gpe_spec.flags = 0x0c; + vxlan_gpe_specs[ti].flags = 0x0c; items[items_counter].type = RTE_FLOW_ITEM_TYPE_VXLAN_GPE; - items[items_counter].spec = &vxlan_gpe_spec; - items[items_counter].mask = &vxlan_gpe_mask; + items[items_counter].spec = &vxlan_gpe_specs[ti]; + items[items_counter].mask = &vxlan_gpe_masks[ti]; } static void @@ -216,25 +219,25 @@ add_geneve(struct rte_flow_item *items, uint8_t items_counter, __rte_unused struct additional_para para) { - static struct rte_flow_item_geneve geneve_spec; - static struct rte_flow_item_geneve geneve_mask; - + static struct rte_flow_item_geneve geneve_specs[RTE_MAX_LCORE] __rte_cache_aligned; + static struct rte_flow_item_geneve geneve_masks[RTE_MAX_LCORE] __rte_cache_aligned; + uint8_t ti = para.core_idx; uint32_t vni_value; uint8_t i; vni_value = VNI_VALUE; - 
memset(&geneve_spec, 0, sizeof(struct rte_flow_item_geneve)); - memset(&geneve_mask, 0, sizeof(struct rte_flow_item_geneve)); + memset(&geneve_specs[ti], 0, sizeof(struct rte_flow_item_geneve)); + memset(&geneve_masks[ti], 0, sizeof(struct rte_flow_item_geneve)); for (i = 0; i < 3; i++) { - geneve_spec.vni[2 - i] = vni_value >> (i * 8); - geneve_mask.vni[2 - i] = 0xff; + geneve_specs[ti].vni[2 - i] = vni_value >> (i * 8); + geneve_masks[ti].vni[2 - i] = 0xff; } items[items_counter].type = RTE_FLOW_ITEM_TYPE_GENEVE; - items[items_counter].spec = &geneve_spec; - items[items_counter].mask = &geneve_mask; + items[items_counter].spec = &geneve_specs[ti]; + items[items_counter].mask = &geneve_masks[ti]; } static void @@ -344,12 +347,14 @@ add_icmpv6(struct rte_flow_item *items, void fill_items(struct rte_flow_item *items, - uint64_t *flow_items, uint32_t outer_ip_src) + uint64_t *flow_items, uint32_t outer_ip_src, + uint8_t core_idx) { uint8_t items_counter = 0; uint8_t i, j; struct additional_para additional_para_data = { .src_ip = outer_ip_src, + .core_idx = core_idx, }; /* Support outer items up to tunnel layer only. */ diff --git a/app/test-flow-perf/items_gen.h b/app/test-flow-perf/items_gen.h index d68958e4d3..f4b0e9a981 100644 --- a/app/test-flow-perf/items_gen.h +++ b/app/test-flow-perf/items_gen.h @@ -13,6 +13,6 @@ #include "config.h" void fill_items(struct rte_flow_item *items, uint64_t *flow_items, - uint32_t outer_ip_src); + uint32_t outer_ip_src, uint8_t core_idx); #endif /* FLOW_PERF_ITEMS_GEN */ diff --git a/app/test-flow-perf/main.c b/app/test-flow-perf/main.c index 5ec9a15c61..663b2e9bae 100644 --- a/app/test-flow-perf/main.c +++ b/app/test-flow-perf/main.c @@ -72,7 +72,6 @@ static uint32_t nb_lcores; #define LCORE_MODE_PKT 1 #define LCORE_MODE_STATS 2 #define MAX_STREAMS 64 -#define MAX_LCORES 64 struct stream { int tx_port; @@ -92,7 +91,20 @@ struct lcore_info { struct rte_mbuf *pkts[MAX_PKT_BURST]; } __rte_cache_aligned; -static struct lcore_info lcore_infos[MAX_LCORES]; +static struct lcore_info lcore_infos[RTE_MAX_LCORE]; + +struct multi_cores_pool { + uint32_t cores_count; + uint32_t rules_count; + double cpu_time_used_insertion[MAX_PORTS][RTE_MAX_LCORE]; + double cpu_time_used_deletion[MAX_PORTS][RTE_MAX_LCORE]; + int64_t last_alloc[RTE_MAX_LCORE]; + int64_t current_alloc[RTE_MAX_LCORE]; +} __rte_cache_aligned; + +static struct multi_cores_pool mc_pool = { + .cores_count = 1, +}; static void usage(char *progname) @@ -118,6 +130,8 @@ usage(char *progname) printf(" --transfer: set transfer attribute in flows\n"); printf(" --group=N: set group for all flows," " default is %d\n", DEFAULT_GROUP); + printf(" --cores=N: to set the number of needed " + "cores to insert rte_flow rules, default is 1\n"); printf("To set flow items:\n"); printf(" --ether: add ether layer in flow items\n"); @@ -537,6 +551,7 @@ args_parse(int argc, char **argv) { "dump-socket-mem", 0, 0, 0 }, { "enable-fwd", 0, 0, 0 }, { "portmask", 1, 0, 0 }, + { "cores", 1, 0, 0 }, /* Attributes */ { "ingress", 0, 0, 0 }, { "egress", 0, 0, 0 }, @@ -750,6 +765,21 @@ args_parse(int argc, char **argv) rte_exit(EXIT_FAILURE, "Invalid fwd port mask\n"); ports_mask = pm; } + if (strcmp(lgopts[opt_idx].name, "cores") == 0) { + n = atoi(optarg); + if ((int) rte_lcore_count() <= n) { + printf("\nError: you need %d cores to run on multi-cores\n" + "Existing cores are: %d\n", n, rte_lcore_count()); + rte_exit(EXIT_FAILURE, " "); + } + if (n <= RTE_MAX_LCORE && n > 0) + mc_pool.cores_count = n; + else { + printf("Error: cores 
count must be > 0 " "and <= %d\n", RTE_MAX_LCORE); + rte_exit(EXIT_FAILURE, " "); + } + } break; default: fprintf(stderr, "Invalid option: %s\n", argv[optind]); @@ -845,7 +875,7 @@ print_rules_batches(double *cpu_time_per_batch) } static inline void -destroy_flows(int port_id, struct rte_flow **flows_list) +destroy_flows(int port_id, uint8_t core_id, struct rte_flow **flows_list) { struct rte_flow_error error; clock_t start_batch, end_batch; @@ -855,12 +885,12 @@ destroy_flows(int port_id, struct rte_flow **flows_list) double delta; uint32_t i; int rules_batch_idx; + int rules_count_per_core; - /* Deletion Rate */ - printf("\nRules Deletion on port = %d\n", port_id); + rules_count_per_core = rules_count / mc_pool.cores_count; start_batch = clock(); - for (i = 0; i < rules_count; i++) { + for (i = 0; i < (uint32_t) rules_count_per_core; i++) { if (flows_list[i] == 0) break; @@ -891,15 +921,17 @@ destroy_flows(int port_id, struct rte_flow **flows_list) print_rules_batches(cpu_time_per_batch); /* Deletion rate for all rules */ - deletion_rate = ((double) (rules_count / cpu_time_used) / 1000); - printf(":: Total rules deletion rate -> %f K Rule/Sec\n", - deletion_rate); - printf(":: The time for deleting %d in rules %f seconds\n", - rules_count, cpu_time_used); + deletion_rate = ((double) (rules_count_per_core / cpu_time_used) / 1000); + printf(":: Port %d :: Core %d :: Rules deletion rate -> %f K Rule/Sec\n", + port_id, core_id, deletion_rate); + printf(":: Port %d :: Core %d :: The time for deleting %d rules is %f seconds\n", + port_id, core_id, rules_count_per_core, cpu_time_used); + + mc_pool.cpu_time_used_deletion[port_id][core_id] = cpu_time_used; } static struct rte_flow ** -insert_flows(int port_id) +insert_flows(int port_id, uint8_t core_id) { struct rte_flow **flows_list; struct rte_flow_error error; @@ -909,32 +941,42 @@ insert_flows(int port_id) double cpu_time_per_batch[MAX_BATCHES_COUNT] = { 0 }; double delta; uint32_t flow_index; - uint32_t counter; + uint32_t counter, start_counter = 0, end_counter; uint64_t global_items[MAX_ITEMS_NUM] = { 0 }; uint64_t global_actions[MAX_ACTIONS_NUM] = { 0 }; int rules_batch_idx; + int rules_count_per_core; + + rules_count_per_core = rules_count / mc_pool.cores_count; + + /* Set boundaries of rules for each core. */ + if (core_id) + start_counter = core_id * rules_count_per_core; + end_counter = (core_id + 1) * rules_count_per_core; global_items[0] = FLOW_ITEM_MASK(RTE_FLOW_ITEM_TYPE_ETH); global_actions[0] = FLOW_ITEM_MASK(RTE_FLOW_ACTION_TYPE_JUMP); flows_list = rte_zmalloc("flows_list", - (sizeof(struct rte_flow *) * rules_count) + 1, 0); + (sizeof(struct rte_flow *) * rules_count_per_core) + 1, 0); if (flows_list == NULL) rte_exit(EXIT_FAILURE, "No Memory available!"); cpu_time_used = 0; flow_index = 0; - if (flow_group > 0) { + if (flow_group > 0 && core_id == 0) { /* * Create global rule to jump into flow_group, * this way the app will avoid the default rules. * + * This rule will be created only once.
+ * * Global rule: * group 0 eth / end actions jump group */ flow = generate_flow(port_id, 0, flow_attrs, global_items, global_actions, - flow_group, 0, 0, 0, 0, &error); + flow_group, 0, 0, 0, 0, core_id, &error); if (flow == NULL) { print_flow_error(error); @@ -943,19 +985,17 @@ insert_flows(int port_id) flows_list[flow_index++] = flow; } - /* Insertion Rate */ - printf("Rules insertion on port = %d\n", port_id); start_batch = clock(); - for (counter = 0; counter < rules_count; counter++) { + for (counter = start_counter; counter < end_counter; counter++) { flow = generate_flow(port_id, flow_group, flow_attrs, flow_items, flow_actions, JUMP_ACTION_TABLE, counter, hairpin_queues_num, encap_data, decap_data, - &error); + core_id, &error); if (force_quit) - counter = rules_count; + counter = end_counter; if (!flow) { print_flow_error(error); @@ -984,23 +1024,25 @@ insert_flows(int port_id) if (dump_iterations) print_rules_batches(cpu_time_per_batch); - /* Insertion rate for all rules */ - insertion_rate = ((double) (rules_count / cpu_time_used) / 1000); - printf(":: Total flow insertion rate -> %f K Rule/Sec\n", - insertion_rate); - printf(":: The time for creating %d in flows %f seconds\n", - rules_count, cpu_time_used); + printf(":: Port %d :: Core %d boundaries :: start @[%d] - end @[%d]\n", + port_id, core_id, start_counter, end_counter - 1); + + /* Insertion rate for all rules in one core */ + insertion_rate = ((double) (rules_count_per_core / cpu_time_used) / 1000); + printf(":: Port %d :: Core %d :: Rules insertion rate -> %f K Rule/Sec\n", + port_id, core_id, insertion_rate); + printf(":: Port %d :: Core %d :: The time for creating %d rules is %f seconds\n", + port_id, core_id, rules_count_per_core, cpu_time_used); + mc_pool.cpu_time_used_insertion[port_id][core_id] = cpu_time_used; return flows_list; } -static inline void -flows_handler(void) +static void +flows_handler(uint8_t core_id) { struct rte_flow **flows_list; uint16_t nr_ports; - int64_t alloc, last_alloc; - int flow_size_in_bytes; int port_id; nr_ports = rte_eth_dev_count_avail(); @@ -1016,21 +1058,148 @@ flows_handler(void) continue; /* Insertion part. */ - last_alloc = (int64_t)dump_socket_mem(stdout); - flows_list = insert_flows(port_id); - alloc = (int64_t)dump_socket_mem(stdout); + mc_pool.last_alloc[core_id] = (int64_t)dump_socket_mem(stdout); + flows_list = insert_flows(port_id, core_id); + if (flows_list == NULL) + rte_exit(EXIT_FAILURE, "Error: Insertion Failed!\n"); + mc_pool.current_alloc[core_id] = (int64_t)dump_socket_mem(stdout); /* Deletion part. */ if (delete_flag) - destroy_flows(port_id, flows_list); + destroy_flows(port_id, core_id, flows_list); + } +} + +static int +run_rte_flow_handler_cores(void *data __rte_unused) +{ + uint16_t port; + /* Latency: total count of rte_flow rules divided + * by the maximum time used among all + * threads. + * + * Throughput: total count of rte_flow rules divided + * by the average time consumed across all + * threads. + */ + double insertion_latency_time; + double insertion_throughput_time; + double deletion_latency_time; + double deletion_throughput_time; + double insertion_latency, insertion_throughput; + double deletion_latency, deletion_throughput; + int64_t last_alloc, current_alloc; + int flow_size_in_bytes; + int lcore_counter = 0; + int lcore_id = rte_lcore_id(); + int i; + + RTE_LCORE_FOREACH(i) { + /* If this core is not needed, return.
*/ + if (lcore_id == i) { + printf(":: lcore %d mapped with index %d\n", lcore_id, lcore_counter); + if (lcore_counter >= (int) mc_pool.cores_count) + return 0; + break; + } + lcore_counter++; + } + lcore_id = lcore_counter; + + if (lcore_id >= (int) mc_pool.cores_count) + return 0; + + mc_pool.rules_count = rules_count; - /* Report rte_flow size in huge pages. */ - if (last_alloc) { - flow_size_in_bytes = (alloc - last_alloc) / rules_count; - printf("\n:: rte_flow size in DPDK layer: %d Bytes", - flow_size_in_bytes); + flows_handler(lcore_id); + + /* Only the main core prints the total results. */ + if (lcore_id != 0) + return 0; + + /* Make sure all cores finished the insertion/deletion process. */ + rte_eal_mp_wait_lcore(); + + /* Start from the first thread's insertion/deletion times, + * then compare with every other thread; if a thread used + * more time than the saved value, replace it. + * + * Thus in the end we have the maximum time used for + * insertion/deletion by a single thread. + * + * As for memory consumption, take the minimum of all threads' + * pre-allocation values and the maximum of all threads' + * post-allocation values. + */ + RTE_ETH_FOREACH_DEV(port) { + last_alloc = mc_pool.last_alloc[0]; + current_alloc = mc_pool.current_alloc[0]; + + insertion_latency_time = mc_pool.cpu_time_used_insertion[port][0]; + deletion_latency_time = mc_pool.cpu_time_used_deletion[port][0]; + insertion_throughput_time = mc_pool.cpu_time_used_insertion[port][0]; + deletion_throughput_time = mc_pool.cpu_time_used_deletion[port][0]; + i = mc_pool.cores_count; + while (i-- > 1) { + insertion_throughput_time += mc_pool.cpu_time_used_insertion[port][i]; + deletion_throughput_time += mc_pool.cpu_time_used_deletion[port][i]; + if (insertion_latency_time < mc_pool.cpu_time_used_insertion[port][i]) + insertion_latency_time = mc_pool.cpu_time_used_insertion[port][i]; + if (deletion_latency_time < mc_pool.cpu_time_used_deletion[port][i]) + deletion_latency_time = mc_pool.cpu_time_used_deletion[port][i]; + if (last_alloc > mc_pool.last_alloc[i]) + last_alloc = mc_pool.last_alloc[i]; + if (current_alloc < mc_pool.current_alloc[i]) + current_alloc = mc_pool.current_alloc[i]; } + + flow_size_in_bytes = (current_alloc - last_alloc) / mc_pool.rules_count; + + insertion_latency = ((double) (mc_pool.rules_count / insertion_latency_time) / 1000); + deletion_latency = ((double) (mc_pool.rules_count / deletion_latency_time) / 1000); + + insertion_throughput_time /= mc_pool.cores_count; + deletion_throughput_time /= mc_pool.cores_count; + insertion_throughput = ((double) (mc_pool.rules_count / insertion_throughput_time) / 1000); + deletion_throughput = ((double) (mc_pool.rules_count / deletion_throughput_time) / 1000); + + /* Latency stats */ + printf("\n:: [Latency | Insertion] All Cores :: Port %d :: ", port); + printf("Total flows insertion rate -> %f K Rules/Sec\n", + insertion_latency); + printf(":: [Latency | Insertion] All Cores :: Port %d :: ", port); + printf("The time for creating %d rules is %f seconds\n", + mc_pool.rules_count, insertion_latency_time); + + /* Throughput stats */ + printf(":: [Throughput | Insertion] All Cores :: Port %d :: ", port); + printf("Total flows insertion rate -> %f K Rules/Sec\n", + insertion_throughput); + printf(":: [Throughput | Insertion] All Cores :: Port %d :: ", port); + printf("The average time for creating %d rules is %f seconds\n", + mc_pool.rules_count, insertion_throughput_time); + + if (delete_flag) { + /* Latency stats */ + printf(":: [Latency | Deletion] All Cores :: Port %d :: Total flows "
+ "deletion rate -> %f K Rules/Sec\n", + port, deletion_latency); + printf(":: [Latency | Deletion] All Cores :: Port %d :: ", port); + printf("The time for deleting %d rules is %f seconds\n", + mc_pool.rules_count, deletion_latency_time); + + /* Throughput stats */ + printf(":: [Throughput | Deletion] All Cores :: Port %d :: Total flows " + "deletion rate -> %f K Rules/Sec\n", port, deletion_throughput); + printf(":: [Throughput | Deletion] All Cores :: Port %d :: ", port); + printf("The average time for deleting %d rules is %f seconds\n", + mc_pool.rules_count, deletion_throughput_time); + } + printf("\n:: Port %d :: rte_flow size in DPDK layer: %d Bytes\n", + port, flow_size_in_bytes); } + + return 0; } static void @@ -1107,12 +1276,12 @@ packet_per_second_stats(void) int i; old = rte_zmalloc("old", - sizeof(struct lcore_info) * MAX_LCORES, 0); + sizeof(struct lcore_info) * RTE_MAX_LCORE, 0); if (old == NULL) rte_exit(EXIT_FAILURE, "No Memory available!"); memcpy(old, lcore_infos, - sizeof(struct lcore_info) * MAX_LCORES); + sizeof(struct lcore_info) * RTE_MAX_LCORE); while (!force_quit) { uint64_t total_tx_pkts = 0; @@ -1135,7 +1304,7 @@ packet_per_second_stats(void) printf("%6s %16s %16s %16s\n", "------", "----------------", "----------------", "----------------"); nr_lines = 3; - for (i = 0; i < MAX_LCORES; i++) { + for (i = 0; i < RTE_MAX_LCORE; i++) { li = &lcore_infos[i]; oli = &old[i]; if (li->mode != LCORE_MODE_PKT) @@ -1166,7 +1335,7 @@ packet_per_second_stats(void) } memcpy(old, lcore_infos, - sizeof(struct lcore_info) * MAX_LCORES); + sizeof(struct lcore_info) * RTE_MAX_LCORE); } } @@ -1227,7 +1396,7 @@ init_lcore_info(void) * This means that this stream is not used, or not set * yet. */ - for (i = 0; i < MAX_LCORES; i++) + for (i = 0; i < RTE_MAX_LCORE; i++) for (j = 0; j < MAX_STREAMS; j++) { lcore_infos[i].streams[j].tx_port = -1; lcore_infos[i].streams[j].rx_port = -1; @@ -1289,7 +1458,7 @@ init_lcore_info(void) /* Print all streams */ printf(":: Stream -> core id[N]: (rx_port, rx_queue)->(tx_port, tx_queue)\n"); - for (i = 0; i < MAX_LCORES; i++) + for (i = 0; i < RTE_MAX_LCORE; i++) for (j = 0; j < MAX_STREAMS; j++) { /* No streams for this core */ if (lcore_infos[i].streams[j].tx_port == -1) @@ -1470,7 +1639,10 @@ main(int argc, char **argv) if (nb_lcores <= 1) rte_exit(EXIT_FAILURE, "This app needs at least two cores\n"); - flows_handler(); + + printf(":: Flows Count per port: %d\n\n", rules_count); + + rte_eal_mp_remote_launch(run_rte_flow_handler_cores, NULL, CALL_MAIN); if (enable_fwd) { init_lcore_info(); diff --git a/doc/guides/tools/flow-perf.rst b/doc/guides/tools/flow-perf.rst index 634009ccee..40d157e8cb 100644 --- a/doc/guides/tools/flow-perf.rst +++ b/doc/guides/tools/flow-perf.rst @@ -25,15 +25,8 @@ computes an average time across all windows. The application also provides the ability to measure rte flow deletion rate, in addition to memory consumption before and after the flow rules' creation. -The app supports single and multi core performance measurements. - - -Known Limitations ------------------ - -The current version has limitations which can be removed in future: - -* Single core insertion only. +The app supports single and multiple core performance measurements, and +support multiple cores insertion/deletion as well. Compiling the Application @@ -103,6 +96,9 @@ The command line options are: * ``--portmask=N`` hexadecimal bitmask of ports to be used. +* ``--cores=N`` + Set the number of needed cores to insert/delete rte_flow rules. 
+ The default cores count is 1. Attributes: -- 2.21.0
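For reference, a hypothetical invocation that exercises the new option (the EAL core list, rule count, items, and actions below are examples only, not values mandated by this patch):

    dpdk-test-flow-perf -l 0-7 -n 4 -- --ingress --ether --ipv4 --queue --rules-count=1000000 --cores=4

Per the check added in args_parse() above, the cores count must stay below the number of available EAL lcores (eight in this example) and within 1..RTE_MAX_LCORE.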