* [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
From: Neil Horman @ 2014-08-21 20:15 UTC (permalink / raw)
To: dev
Make the ACL library build/work on the 'default' architecture:
- make rte_acl_classify_scalar() truly scalar
(make sure it doesn't use SSE4 intrinsics through resolve_priority()).
- Provide two versions of the rte_acl_classify code path:
rte_acl_classify_sse() - can be built and used only on systems with SSE4.2
and above; returns -ENOTSUP on older architectures.
rte_acl_classify_scalar() - a slower version, but one that can be built and
used on all systems.
- keep the common code shared between these two code paths.
v2 changes:
run-time selection of the most appropriate code path for the given ISA.
By default the highest supported one is selected.
The user can still override that selection by manually assigning a new value
to the global function pointer rte_acl_default_classify.
rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
points to (a rough sketch of this dispatch scheme follows below).
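For context, here is a minimal, self-contained sketch of this kind of
run-time dispatch through a global function pointer. All names below
(classify_fn_t, select_classify, default_classify, ...) are hypothetical
and are not the exact symbols introduced by this patch:

/*
 * Illustrative sketch only: run-time selection of a classify
 * implementation through a global function pointer.
 */
#include <stdint.h>
#include <stdio.h>

typedef int (*classify_fn_t)(const uint8_t **data, uint32_t num);

static int
classify_scalar(const uint8_t **data, uint32_t num)
{
	(void)data;
	printf("scalar path, %u packet(s)\n", (unsigned)num);
	return 0;
}

static int
classify_sse(const uint8_t **data, uint32_t num)
{
	(void)data;
	printf("sse path, %u packet(s)\n", (unsigned)num);
	return 0;
}

/* default: the highest supported code path, chosen at init time */
static classify_fn_t default_classify = classify_sse;

/* user-visible override, e.g. to force the scalar path for testing */
static void
select_classify(int want_scalar)
{
	default_classify = want_scalar ? classify_scalar : classify_sse;
}

int
main(void)
{
	const uint8_t pkt[4] = {0};
	const uint8_t *data[1] = {pkt};

	select_classify(1);	/* force the scalar path */
	return default_classify(data, 1);
}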
v3 changes:
Updated the classify pointer to be a function so as to better preserve ABI
(see the second sketch below).
Removed the macro definitions for the match-check functions and made them
static inline.
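A sketch of why the v3 change helps ABI stability: the exported entry point
is an ordinary function with a fixed symbol and signature, while the pointer
doing the dispatch stays internal to the library, so applications never link
against the pointer itself. Names below (my_acl_classify, internal_classify)
are again hypothetical:

/*
 * Illustrative sketch only: ABI-stable wrapper around an internal
 * dispatch pointer.
 */
#include <stdint.h>

typedef int (*classify_fn_t)(const uint8_t **data, uint32_t num);

static int
classify_scalar(const uint8_t **data, uint32_t num)
{
	(void)data;
	(void)num;
	return 0;
}

/* internal dispatch pointer: not part of the exported ABI */
static classify_fn_t internal_classify = classify_scalar;

/* exported, ABI-stable wrapper that applications call */
int
my_acl_classify(const uint8_t **data, uint32_t num)
{
	return internal_classify(data, num);
}

int
main(void)
{
	const uint8_t pkt[4] = {0};
	const uint8_t *data[1] = {pkt};

	return my_acl_classify(data, 1);
}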
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
app/test-acl/main.c | 13 +-
app/test/test_acl.c | 12 +-
lib/librte_acl/Makefile | 5 +-
lib/librte_acl/acl_bld.c | 5 +-
lib/librte_acl/acl_match_check.h | 83 ++++
lib/librte_acl/acl_run.c | 944 ---------------------------------------
lib/librte_acl/acl_run.h | 220 +++++++++
lib/librte_acl/acl_run_scalar.c | 198 ++++++++
lib/librte_acl/acl_run_sse.c | 627 ++++++++++++++++++++++++++
lib/librte_acl/rte_acl.c | 46 ++
lib/librte_acl/rte_acl.h | 26 +-
11 files changed, 1216 insertions(+), 963 deletions(-)
create mode 100644 lib/librte_acl/acl_match_check.h
delete mode 100644 lib/librte_acl/acl_run.c
create mode 100644 lib/librte_acl/acl_run.h
create mode 100644 lib/librte_acl/acl_run_scalar.c
create mode 100644 lib/librte_acl/acl_run_sse.c
diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d654409..a77f47d 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -787,6 +787,10 @@ acx_init(void)
/* perform build. */
ret = rte_acl_build(config.acx, &cfg);
+ /* setup default rte_acl_classify */
+ if (config.scalar)
+ rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+
dump_verbose(DUMP_NONE, stdout,
"rte_acl_build(%u) finished with %d\n",
config.bld_categories, ret);
@@ -815,13 +819,8 @@ search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
v += config.trace_sz;
}
- if (scalar != 0)
- ret = rte_acl_classify_scalar(config.acx, data,
- results, n, categories);
-
- else
- ret = rte_acl_classify(config.acx, data,
- results, n, categories);
+ ret = rte_acl_classify(config.acx, data, results,
+ n, categories);
if (ret != 0)
rte_exit(ret, "classify for ipv%c_5tuples returns %d\n",
diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 869f6d3..2fcef6e 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -148,7 +148,8 @@ test_classify_run(struct rte_acl_ctx *acx)
}
/* make a quick check for scalar */
- ret = rte_acl_classify_scalar(acx, data, results,
+ rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+ ret = rte_acl_classify(acx, data, results,
RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
if (ret != 0) {
printf("Line %i: SSE classify failed!\n", __LINE__);
@@ -362,7 +363,8 @@ test_invalid_layout(void)
}
/* classify tuples (scalar) */
- ret = rte_acl_classify_scalar(acx, data, results,
+ rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+ ret = rte_acl_classify(acx, data, results,
RTE_DIM(results), 1);
if (ret != 0) {
printf("Line %i: Scalar classify failed!\n", __LINE__);
@@ -850,7 +852,8 @@ test_invalid_parameters(void)
/* scalar classify test */
/* cover zero categories in classify (should not fail) */
- result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 0);
+ rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+ result = rte_acl_classify(acx, NULL, NULL, 0, 0);
if (result != 0) {
printf("Line %i: Scalar classify with zero categories "
"failed!\n", __LINE__);
@@ -859,7 +862,8 @@ test_invalid_parameters(void)
}
/* cover invalid but positive categories in classify */
- result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 3);
+ rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+ result = rte_acl_classify(acx, NULL, NULL, 0, 3);
if (result == 0) {
printf("Line %i: Scalar classify with 3 categories "
"should have failed!\n", __LINE__);
diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 4fe4593..65e566d 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -43,7 +43,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
SRCS-$(CONFIG_RTE_LIBRTE_ACL) += rte_acl.c
SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
-SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
+
+CFLAGS_acl_run_sse.o += -msse4.1
# install this header file
SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index 873447b..09d58ea 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -31,7 +31,6 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
-#include <nmmintrin.h>
#include <rte_acl.h>
#include "tb_mem.h"
#include "acl.h"
@@ -1480,8 +1479,8 @@ acl_calc_wildness(struct rte_acl_build_rule *head,
switch (rule->config->defs[n].type) {
case RTE_ACL_FIELD_TYPE_BITMASK:
- wild = (size -
- _mm_popcnt_u32(fld->mask_range.u8)) /
+ wild = (size - __builtin_popcount(
+ fld->mask_range.u8)) /
size;
break;
diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
new file mode 100644
index 0000000..4dc1982
--- /dev/null
+++ b/lib/librte_acl/acl_match_check.h
@@ -0,0 +1,83 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _ACL_MATCH_CHECK_H_
+#define _ACL_MATCH_CHECK_H_
+
+/*
+ * Detect matches. If a match node transition is found, then this trie
+ * traversal is complete and fill the slot with the next trie
+ * to be processed.
+ */
+static inline uint64_t
+acl_match_check(uint64_t transition, int slot,
+ const struct rte_acl_ctx *ctx, struct parms *parms,
+ struct acl_flow_data *flows, void (*resolve_priority)(
+ uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+ struct parms *parms, const struct rte_acl_match_results *p,
+ uint32_t categories))
+{
+ const struct rte_acl_match_results *p;
+
+ p = (const struct rte_acl_match_results *)
+ (flows->trans + ctx->match_index);
+
+ if (transition & RTE_ACL_NODE_MATCH) {
+
+ /* Remove flags from index and decrement active traversals */
+ transition &= RTE_ACL_NODE_INDEX;
+ flows->started--;
+
+ /* Resolve priorities for this trie and running results */
+ if (flows->categories == 1)
+ resolve_single_priority(transition, slot, ctx,
+ parms, p);
+ else
+ resolve_priority(transition, slot, ctx, parms,
+ p, flows->categories);
+
+ /* Count down completed tries for this search request */
+ parms[slot].cmplt->count--;
+
+ /* Fill the slot with the next trie or idle trie */
+ transition = acl_start_next_trie(flows, parms, slot, ctx);
+
+ } else if (transition == ctx->idle) {
+ /* reset indirection table for idle slots */
+ parms[slot].data_index = idle;
+ }
+
+ return transition;
+}
+
+#endif
diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
deleted file mode 100644
index e3d9fc1..0000000
--- a/lib/librte_acl/acl_run.c
+++ /dev/null
@@ -1,944 +0,0 @@
-/*-
- * BSD LICENSE
- *
- * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in
- * the documentation and/or other materials provided with the
- * distribution.
- * * Neither the name of Intel Corporation nor the names of its
- * contributors may be used to endorse or promote products derived
- * from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
- * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
- * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <rte_acl.h>
-#include "acl_vect.h"
-#include "acl.h"
-
-#define MAX_SEARCHES_SSE8 8
-#define MAX_SEARCHES_SSE4 4
-#define MAX_SEARCHES_SSE2 2
-#define MAX_SEARCHES_SCALAR 2
-
-#define GET_NEXT_4BYTES(prm, idx) \
- (*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
-
-
-#define RTE_ACL_NODE_INDEX ((uint32_t)~RTE_ACL_NODE_TYPE)
-
-#define SCALAR_QRANGE_MULT 0x01010101
-#define SCALAR_QRANGE_MASK 0x7f7f7f7f
-#define SCALAR_QRANGE_MIN 0x80808080
-
-enum {
- SHUFFLE32_SLOT1 = 0xe5,
- SHUFFLE32_SLOT2 = 0xe6,
- SHUFFLE32_SLOT3 = 0xe7,
- SHUFFLE32_SWAP64 = 0x4e,
-};
-
-/*
- * Structure to manage N parallel trie traversals.
- * The runtime trie traversal routines can process 8, 4, or 2 tries
- * in parallel. Each packet may require multiple trie traversals (up to 4).
- * This structure is used to fill the slots (0 to n-1) for parallel processing
- * with the trie traversals needed for each packet.
- */
-struct acl_flow_data {
- uint32_t num_packets;
- /* number of packets processed */
- uint32_t started;
- /* number of trie traversals in progress */
- uint32_t trie;
- /* current trie index (0 to N-1) */
- uint32_t cmplt_size;
- uint32_t total_packets;
- uint32_t categories;
- /* number of result categories per packet. */
- /* maximum number of packets to process */
- const uint64_t *trans;
- const uint8_t **data;
- uint32_t *results;
- struct completion *last_cmplt;
- struct completion *cmplt_array;
-};
-
-/*
- * Structure to maintain running results for
- * a single packet (up to 4 tries).
- */
-struct completion {
- uint32_t *results; /* running results. */
- int32_t priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
- uint32_t count; /* num of remaining tries */
- /* true for allocated struct */
-} __attribute__((aligned(XMM_SIZE)));
-
-/*
- * One parms structure for each slot in the search engine.
- */
-struct parms {
- const uint8_t *data;
- /* input data for this packet */
- const uint32_t *data_index;
- /* data indirection for this trie */
- struct completion *cmplt;
- /* completion data for this packet */
-};
-
-/*
- * Define an global idle node for unused engine slots
- */
-static const uint32_t idle[UINT8_MAX + 1];
-
-static const rte_xmm_t mm_type_quad_range = {
- .u32 = {
- RTE_ACL_NODE_QRANGE,
- RTE_ACL_NODE_QRANGE,
- RTE_ACL_NODE_QRANGE,
- RTE_ACL_NODE_QRANGE,
- },
-};
-
-static const rte_xmm_t mm_type_quad_range64 = {
- .u32 = {
- RTE_ACL_NODE_QRANGE,
- RTE_ACL_NODE_QRANGE,
- 0,
- 0,
- },
-};
-
-static const rte_xmm_t mm_shuffle_input = {
- .u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
-};
-
-static const rte_xmm_t mm_shuffle_input64 = {
- .u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
-};
-
-static const rte_xmm_t mm_ones_16 = {
- .u16 = {1, 1, 1, 1, 1, 1, 1, 1},
-};
-
-static const rte_xmm_t mm_bytes = {
- .u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
-};
-
-static const rte_xmm_t mm_bytes64 = {
- .u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
-};
-
-static const rte_xmm_t mm_match_mask = {
- .u32 = {
- RTE_ACL_NODE_MATCH,
- RTE_ACL_NODE_MATCH,
- RTE_ACL_NODE_MATCH,
- RTE_ACL_NODE_MATCH,
- },
-};
-
-static const rte_xmm_t mm_match_mask64 = {
- .u32 = {
- RTE_ACL_NODE_MATCH,
- 0,
- RTE_ACL_NODE_MATCH,
- 0,
- },
-};
-
-static const rte_xmm_t mm_index_mask = {
- .u32 = {
- RTE_ACL_NODE_INDEX,
- RTE_ACL_NODE_INDEX,
- RTE_ACL_NODE_INDEX,
- RTE_ACL_NODE_INDEX,
- },
-};
-
-static const rte_xmm_t mm_index_mask64 = {
- .u32 = {
- RTE_ACL_NODE_INDEX,
- RTE_ACL_NODE_INDEX,
- 0,
- 0,
- },
-};
-
-/*
- * Allocate a completion structure to manage the tries for a packet.
- */
-static inline struct completion *
-alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
- uint32_t *results)
-{
- uint32_t n;
-
- for (n = 0; n < size; n++) {
-
- if (p[n].count == 0) {
-
- /* mark as allocated and set number of tries. */
- p[n].count = tries;
- p[n].results = results;
- return &(p[n]);
- }
- }
-
- /* should never get here */
- return NULL;
-}
-
-/*
- * Resolve priority for a single result trie.
- */
-static inline void
-resolve_single_priority(uint64_t transition, int n,
- const struct rte_acl_ctx *ctx, struct parms *parms,
- const struct rte_acl_match_results *p)
-{
- if (parms[n].cmplt->count == ctx->num_tries ||
- parms[n].cmplt->priority[0] <=
- p[transition].priority[0]) {
-
- parms[n].cmplt->priority[0] = p[transition].priority[0];
- parms[n].cmplt->results[0] = p[transition].results[0];
- }
-
- parms[n].cmplt->count--;
-}
-
-/*
- * Resolve priority for multiple results. This consists comparing
- * the priority of the current traversal with the running set of
- * results for the packet. For each result, keep a running array of
- * the result (rule number) and its priority for each category.
- */
-static inline void
-resolve_priority(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
- struct parms *parms, const struct rte_acl_match_results *p,
- uint32_t categories)
-{
- uint32_t x;
- xmm_t results, priority, results1, priority1, selector;
- xmm_t *saved_results, *saved_priority;
-
- for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
-
- saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
- saved_priority =
- (xmm_t *)(&parms[n].cmplt->priority[x]);
-
- /* get results and priorities for completed trie */
- results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
- priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
-
- /* if this is not the first completed trie */
- if (parms[n].cmplt->count != ctx->num_tries) {
-
- /* get running best results and their priorities */
- results1 = MM_LOADU(saved_results);
- priority1 = MM_LOADU(saved_priority);
-
- /* select results that are highest priority */
- selector = MM_CMPGT32(priority1, priority);
- results = MM_BLENDV8(results, results1, selector);
- priority = MM_BLENDV8(priority, priority1, selector);
- }
-
- /* save running best results and their priorities */
- MM_STOREU(saved_results, results);
- MM_STOREU(saved_priority, priority);
- }
-
- /* Count down completed tries for this search request */
- parms[n].cmplt->count--;
-}
-
-/*
- * Routine to fill a slot in the parallel trie traversal array (parms) from
- * the list of packets (flows).
- */
-static inline uint64_t
-acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
- const struct rte_acl_ctx *ctx)
-{
- uint64_t transition;
-
- /* if there are any more packets to process */
- if (flows->num_packets < flows->total_packets) {
- parms[n].data = flows->data[flows->num_packets];
- parms[n].data_index = ctx->trie[flows->trie].data_index;
-
- /* if this is the first trie for this packet */
- if (flows->trie == 0) {
- flows->last_cmplt = alloc_completion(flows->cmplt_array,
- flows->cmplt_size, ctx->num_tries,
- flows->results +
- flows->num_packets * flows->categories);
- }
-
- /* set completion parameters and starting index for this slot */
- parms[n].cmplt = flows->last_cmplt;
- transition =
- flows->trans[parms[n].data[*parms[n].data_index++] +
- ctx->trie[flows->trie].root_index];
-
- /*
- * if this is the last trie for this packet,
- * then setup next packet.
- */
- flows->trie++;
- if (flows->trie >= ctx->num_tries) {
- flows->trie = 0;
- flows->num_packets++;
- }
-
- /* keep track of number of active trie traversals */
- flows->started++;
-
- /* no more tries to process, set slot to an idle position */
- } else {
- transition = ctx->idle;
- parms[n].data = (const uint8_t *)idle;
- parms[n].data_index = idle;
- }
- return transition;
-}
-
-/*
- * Detect matches. If a match node transition is found, then this trie
- * traversal is complete and fill the slot with the next trie
- * to be processed.
- */
-static inline uint64_t
-acl_match_check_transition(uint64_t transition, int slot,
- const struct rte_acl_ctx *ctx, struct parms *parms,
- struct acl_flow_data *flows)
-{
- const struct rte_acl_match_results *p;
-
- p = (const struct rte_acl_match_results *)
- (flows->trans + ctx->match_index);
-
- if (transition & RTE_ACL_NODE_MATCH) {
-
- /* Remove flags from index and decrement active traversals */
- transition &= RTE_ACL_NODE_INDEX;
- flows->started--;
-
- /* Resolve priorities for this trie and running results */
- if (flows->categories == 1)
- resolve_single_priority(transition, slot, ctx,
- parms, p);
- else
- resolve_priority(transition, slot, ctx, parms, p,
- flows->categories);
-
- /* Fill the slot with the next trie or idle trie */
- transition = acl_start_next_trie(flows, parms, slot, ctx);
-
- } else if (transition == ctx->idle) {
- /* reset indirection table for idle slots */
- parms[slot].data_index = idle;
- }
-
- return transition;
-}
-
-/*
- * Extract transitions from an XMM register and check for any matches
- */
-static void
-acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
- struct parms *parms, struct acl_flow_data *flows)
-{
- uint64_t transition1, transition2;
-
- /* extract transition from low 64 bits. */
- transition1 = MM_CVT64(*indicies);
-
- /* extract transition from high 64 bits. */
- *indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
- transition2 = MM_CVT64(*indicies);
-
- transition1 = acl_match_check_transition(transition1, slot, ctx,
- parms, flows);
- transition2 = acl_match_check_transition(transition2, slot + 1, ctx,
- parms, flows);
-
- /* update indicies with new transitions. */
- *indicies = MM_SET64(transition2, transition1);
-}
-
-/*
- * Check for a match in 2 transitions (contained in SSE register)
- */
-static inline void
-acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
- struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
-{
- xmm_t temp;
-
- temp = MM_AND(match_mask, *indicies);
- while (!MM_TESTZ(temp, temp)) {
- acl_process_matches(indicies, slot, ctx, parms, flows);
- temp = MM_AND(match_mask, *indicies);
- }
-}
-
-/*
- * Check for any match in 4 transitions (contained in 2 SSE registers)
- */
-static inline void
-acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
- struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
- xmm_t match_mask)
-{
- xmm_t temp;
-
- /* put low 32 bits of each transition into one register */
- temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
- 0x88);
- /* test for match node */
- temp = MM_AND(match_mask, temp);
-
- while (!MM_TESTZ(temp, temp)) {
- acl_process_matches(indicies1, slot, ctx, parms, flows);
- acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
-
- temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
- (__m128)*indicies2,
- 0x88);
- temp = MM_AND(match_mask, temp);
- }
-}
-
-/*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes don't move.
- */
-static inline xmm_t
-acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
- xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
- xmm_t *indicies1, xmm_t *indicies2)
-{
- xmm_t addr, node_types, temp;
-
- /*
- * Note that no transition is done for a match
- * node and therefore a stream freezes when
- * it reaches a match.
- */
-
- /* Shuffle low 32 into temp and high 32 into indicies2 */
- temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
- 0x88);
- *indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
- (__m128)*indicies2, 0xdd);
-
- /* Calc node type and node addr */
- node_types = MM_ANDNOT(index_mask, temp);
- addr = MM_AND(index_mask, temp);
-
- /*
- * Calc addr for DFAs - addr = dfa_index + input_byte
- */
-
- /* mask for DFA type (0) nodes */
- temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
-
- /* add input byte to DFA position */
- temp = MM_AND(temp, bytes);
- temp = MM_AND(temp, next_input);
- addr = MM_ADD32(addr, temp);
-
- /*
- * Calc addr for Range nodes -> range_index + range(input)
- */
- node_types = MM_CMPEQ32(node_types, type_quad_range);
-
- /*
- * Calculate number of range boundaries that are less than the
- * input value. Range boundaries for each node are in signed 8 bit,
- * ordered from -128 to 127 in the indicies2 register.
- * This is effectively a popcnt of bytes that are greater than the
- * input byte.
- */
-
- /* shuffle input byte to all 4 positions of 32 bit value */
- temp = MM_SHUFFLE8(next_input, shuffle_input);
-
- /* check ranges */
- temp = MM_CMPGT8(temp, *indicies2);
-
- /* convert -1 to 1 (bytes greater than input byte */
- temp = MM_SIGN8(temp, temp);
-
- /* horizontal add pairs of bytes into words */
- temp = MM_MADD8(temp, temp);
-
- /* horizontal add pairs of words into dwords */
- temp = MM_MADD16(temp, ones_16);
-
- /* mask to range type nodes */
- temp = MM_AND(temp, node_types);
-
- /* add index into node position */
- return MM_ADD32(addr, temp);
-}
-
-/*
- * Process 4 transitions (in 2 SIMD registers) in parallel
- */
-static inline xmm_t
-transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
- xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
- const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
-{
- xmm_t addr;
- uint64_t trans0, trans2;
-
- /* Calculate the address (array index) for all 4 transitions. */
-
- addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
- bytes, type_quad_range, indicies1, indicies2);
-
- /* Gather 64 bit transitions and pack back into 2 registers. */
-
- trans0 = trans[MM_CVT32(addr)];
-
- /* get slot 2 */
-
- /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
- addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
- trans2 = trans[MM_CVT32(addr)];
-
- /* get slot 1 */
-
- /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
- addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
- *indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
-
- /* get slot 3 */
-
- /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
- addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
- *indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
-
- return MM_SRL32(next_input, 8);
-}
-
-static inline void
-acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
- uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
- uint32_t data_num, uint32_t categories, const uint64_t *trans)
-{
- flows->num_packets = 0;
- flows->started = 0;
- flows->trie = 0;
- flows->last_cmplt = NULL;
- flows->cmplt_array = cmplt;
- flows->total_packets = data_num;
- flows->categories = categories;
- flows->cmplt_size = cmplt_size;
- flows->data = data;
- flows->results = results;
- flows->trans = trans;
-}
-
-/*
- * Execute trie traversal with 8 traversals in parallel
- */
-static inline void
-search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
- uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
- int n;
- struct acl_flow_data flows;
- uint64_t index_array[MAX_SEARCHES_SSE8];
- struct completion cmplt[MAX_SEARCHES_SSE8];
- struct parms parms[MAX_SEARCHES_SSE8];
- xmm_t input0, input1;
- xmm_t indicies1, indicies2, indicies3, indicies4;
-
- acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
- total_packets, categories, ctx->trans_table);
-
- for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
- cmplt[n].count = 0;
- index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
- }
-
- /*
- * indicies1 contains index_array[0,1]
- * indicies2 contains index_array[2,3]
- * indicies3 contains index_array[4,5]
- * indicies4 contains index_array[6,7]
- */
-
- indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
- indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
- indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
- indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
-
- /* Check for any matches. */
- acl_match_check_x4(0, ctx, parms, &flows,
- &indicies1, &indicies2, mm_match_mask.m);
- acl_match_check_x4(4, ctx, parms, &flows,
- &indicies3, &indicies4, mm_match_mask.m);
-
- while (flows.started > 0) {
-
- /* Gather 4 bytes of input data for each stream. */
- input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
- 0);
- input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
- 0);
-
- input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
- input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
-
- input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
- input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
-
- input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
- input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
-
- /* Process the 4 bytes of input on each stream. */
-
- input0 = transition4(mm_index_mask.m, input0,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies1, &indicies2);
-
- input1 = transition4(mm_index_mask.m, input1,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies3, &indicies4);
-
- input0 = transition4(mm_index_mask.m, input0,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies1, &indicies2);
-
- input1 = transition4(mm_index_mask.m, input1,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies3, &indicies4);
-
- input0 = transition4(mm_index_mask.m, input0,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies1, &indicies2);
-
- input1 = transition4(mm_index_mask.m, input1,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies3, &indicies4);
-
- input0 = transition4(mm_index_mask.m, input0,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies1, &indicies2);
-
- input1 = transition4(mm_index_mask.m, input1,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies3, &indicies4);
-
- /* Check for any matches. */
- acl_match_check_x4(0, ctx, parms, &flows,
- &indicies1, &indicies2, mm_match_mask.m);
- acl_match_check_x4(4, ctx, parms, &flows,
- &indicies3, &indicies4, mm_match_mask.m);
- }
-}
-
-/*
- * Execute trie traversal with 4 traversals in parallel
- */
-static inline void
-search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
- uint32_t *results, int total_packets, uint32_t categories)
-{
- int n;
- struct acl_flow_data flows;
- uint64_t index_array[MAX_SEARCHES_SSE4];
- struct completion cmplt[MAX_SEARCHES_SSE4];
- struct parms parms[MAX_SEARCHES_SSE4];
- xmm_t input, indicies1, indicies2;
-
- acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
- total_packets, categories, ctx->trans_table);
-
- for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
- cmplt[n].count = 0;
- index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
- }
-
- indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
- indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
- /* Check for any matches. */
- acl_match_check_x4(0, ctx, parms, &flows,
- &indicies1, &indicies2, mm_match_mask.m);
-
- while (flows.started > 0) {
-
- /* Gather 4 bytes of input data for each stream. */
- input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
- input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
- input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
- input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
-
- /* Process the 4 bytes of input on each stream. */
- input = transition4(mm_index_mask.m, input,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies1, &indicies2);
-
- input = transition4(mm_index_mask.m, input,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies1, &indicies2);
-
- input = transition4(mm_index_mask.m, input,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies1, &indicies2);
-
- input = transition4(mm_index_mask.m, input,
- mm_shuffle_input.m, mm_ones_16.m,
- mm_bytes.m, mm_type_quad_range.m,
- flows.trans, &indicies1, &indicies2);
-
- /* Check for any matches. */
- acl_match_check_x4(0, ctx, parms, &flows,
- &indicies1, &indicies2, mm_match_mask.m);
- }
-}
-
-static inline xmm_t
-transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
- xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
- const uint64_t *trans, xmm_t *indicies1)
-{
- uint64_t t;
- xmm_t addr, indicies2;
-
- indicies2 = MM_XOR(ones_16, ones_16);
-
- addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
- bytes, type_quad_range, indicies1, &indicies2);
-
- /* Gather 64 bit transitions and pack 2 per register. */
-
- t = trans[MM_CVT32(addr)];
-
- /* get slot 1 */
- addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
- *indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
-
- return MM_SRL32(next_input, 8);
-}
-
-/*
- * Execute trie traversal with 2 traversals in parallel.
- */
-static inline void
-search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
- uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
- int n;
- struct acl_flow_data flows;
- uint64_t index_array[MAX_SEARCHES_SSE2];
- struct completion cmplt[MAX_SEARCHES_SSE2];
- struct parms parms[MAX_SEARCHES_SSE2];
- xmm_t input, indicies;
-
- acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
- total_packets, categories, ctx->trans_table);
-
- for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
- cmplt[n].count = 0;
- index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
- }
-
- indicies = MM_LOADU((xmm_t *) &index_array[0]);
-
- /* Check for any matches. */
- acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
-
- while (flows.started > 0) {
-
- /* Gather 4 bytes of input data for each stream. */
- input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
- input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-
- /* Process the 4 bytes of input on each stream. */
-
- input = transition2(mm_index_mask64.m, input,
- mm_shuffle_input64.m, mm_ones_16.m,
- mm_bytes64.m, mm_type_quad_range64.m,
- flows.trans, &indicies);
-
- input = transition2(mm_index_mask64.m, input,
- mm_shuffle_input64.m, mm_ones_16.m,
- mm_bytes64.m, mm_type_quad_range64.m,
- flows.trans, &indicies);
-
- input = transition2(mm_index_mask64.m, input,
- mm_shuffle_input64.m, mm_ones_16.m,
- mm_bytes64.m, mm_type_quad_range64.m,
- flows.trans, &indicies);
-
- input = transition2(mm_index_mask64.m, input,
- mm_shuffle_input64.m, mm_ones_16.m,
- mm_bytes64.m, mm_type_quad_range64.m,
- flows.trans, &indicies);
-
- /* Check for any matches. */
- acl_match_check_x2(0, ctx, parms, &flows, &indicies,
- mm_match_mask64.m);
- }
-}
-
-/*
- * When processing the transition, rather than using if/else
- * construct, the offset is calculated for DFA and QRANGE and
- * then conditionally added to the address based on node type.
- * This is done to avoid branch mis-predictions. Since the
- * offset is rather simple calculation it is more efficient
- * to do the calculation and do a condition move rather than
- * a conditional branch to determine which calculation to do.
- */
-static inline uint32_t
-scan_forward(uint32_t input, uint32_t max)
-{
- return (input == 0) ? max : rte_bsf32(input);
-}
-
-static inline uint64_t
-scalar_transition(const uint64_t *trans_table, uint64_t transition,
- uint8_t input)
-{
- uint32_t addr, index, ranges, x, a, b, c;
-
- /* break transition into component parts */
- ranges = transition >> (sizeof(index) * CHAR_BIT);
-
- /* calc address for a QRANGE node */
- c = input * SCALAR_QRANGE_MULT;
- a = ranges | SCALAR_QRANGE_MIN;
- index = transition & ~RTE_ACL_NODE_INDEX;
- a -= (c & SCALAR_QRANGE_MASK);
- b = c & SCALAR_QRANGE_MIN;
- addr = transition ^ index;
- a &= SCALAR_QRANGE_MIN;
- a ^= (ranges ^ b) & (a ^ b);
- x = scan_forward(a, 32) >> 3;
- addr += (index == RTE_ACL_NODE_DFA) ? input : x;
-
- /* pickup next transition */
- transition = *(trans_table + addr);
- return transition;
-}
-
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
- uint32_t *results, uint32_t num, uint32_t categories)
-{
- int n;
- uint64_t transition0, transition1;
- uint32_t input0, input1;
- struct acl_flow_data flows;
- uint64_t index_array[MAX_SEARCHES_SCALAR];
- struct completion cmplt[MAX_SEARCHES_SCALAR];
- struct parms parms[MAX_SEARCHES_SCALAR];
-
- if (categories != 1 &&
- ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
- return -EINVAL;
-
- acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
- categories, ctx->trans_table);
-
- for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
- cmplt[n].count = 0;
- index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
- }
-
- transition0 = index_array[0];
- transition1 = index_array[1];
-
- while (flows.started > 0) {
-
- input0 = GET_NEXT_4BYTES(parms, 0);
- input1 = GET_NEXT_4BYTES(parms, 1);
-
- for (n = 0; n < 4; n++) {
- if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
- transition0 = scalar_transition(flows.trans,
- transition0, (uint8_t)input0);
-
- input0 >>= CHAR_BIT;
-
- if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
- transition1 = scalar_transition(flows.trans,
- transition1, (uint8_t)input1);
-
- input1 >>= CHAR_BIT;
-
- }
- if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
- transition0 = acl_match_check_transition(transition0,
- 0, ctx, parms, &flows);
- transition1 = acl_match_check_transition(transition1,
- 1, ctx, parms, &flows);
-
- }
- }
- return 0;
-}
-
-int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
- uint32_t *results, uint32_t num, uint32_t categories)
-{
- if (categories != 1 &&
- ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
- return -EINVAL;
-
- if (likely(num >= MAX_SEARCHES_SSE8))
- search_sse_8(ctx, data, results, num, categories);
- else if (num >= MAX_SEARCHES_SSE4)
- search_sse_4(ctx, data, results, num, categories);
- else
- search_sse_2(ctx, data, results, num, categories);
-
- return 0;
-}
diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
new file mode 100644
index 0000000..c39650e
--- /dev/null
+++ b/lib/librte_acl/acl_run.h
@@ -0,0 +1,220 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _ACL_RUN_H_
+#define _ACL_RUN_H_
+
+#include <rte_acl.h>
+#include "acl_vect.h"
+#include "acl.h"
+
+#define MAX_SEARCHES_SSE8 8
+#define MAX_SEARCHES_SSE4 4
+#define MAX_SEARCHES_SSE2 2
+#define MAX_SEARCHES_SCALAR 2
+
+#define GET_NEXT_4BYTES(prm, idx) \
+ (*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
+
+
+#define RTE_ACL_NODE_INDEX ((uint32_t)~RTE_ACL_NODE_TYPE)
+
+#define SCALAR_QRANGE_MULT 0x01010101
+#define SCALAR_QRANGE_MASK 0x7f7f7f7f
+#define SCALAR_QRANGE_MIN 0x80808080
+
+/*
+ * Structure to manage N parallel trie traversals.
+ * The runtime trie traversal routines can process 8, 4, or 2 tries
+ * in parallel. Each packet may require multiple trie traversals (up to 4).
+ * This structure is used to fill the slots (0 to n-1) for parallel processing
+ * with the trie traversals needed for each packet.
+ */
+struct acl_flow_data {
+ uint32_t num_packets;
+ /* number of packets processed */
+ uint32_t started;
+ /* number of trie traversals in progress */
+ uint32_t trie;
+ /* current trie index (0 to N-1) */
+ uint32_t cmplt_size;
+ uint32_t total_packets;
+ uint32_t categories;
+ /* number of result categories per packet. */
+ /* maximum number of packets to process */
+ const uint64_t *trans;
+ const uint8_t **data;
+ uint32_t *results;
+ struct completion *last_cmplt;
+ struct completion *cmplt_array;
+};
+
+/*
+ * Structure to maintain running results for
+ * a single packet (up to 4 tries).
+ */
+struct completion {
+ uint32_t *results; /* running results. */
+ int32_t priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
+ uint32_t count; /* num of remaining tries */
+ /* true for allocated struct */
+} __attribute__((aligned(XMM_SIZE)));
+
+/*
+ * One parms structure for each slot in the search engine.
+ */
+struct parms {
+ const uint8_t *data;
+ /* input data for this packet */
+ const uint32_t *data_index;
+ /* data indirection for this trie */
+ struct completion *cmplt;
+ /* completion data for this packet */
+};
+
+/*
+ * Define an global idle node for unused engine slots
+ */
+static const uint32_t idle[UINT8_MAX + 1];
+
+/*
+ * Allocate a completion structure to manage the tries for a packet.
+ */
+static inline struct completion *
+alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
+ uint32_t *results)
+{
+ uint32_t n;
+
+ for (n = 0; n < size; n++) {
+
+ if (p[n].count == 0) {
+
+ /* mark as allocated and set number of tries. */
+ p[n].count = tries;
+ p[n].results = results;
+ return &(p[n]);
+ }
+ }
+
+ /* should never get here */
+ return NULL;
+}
+
+/*
+ * Resolve priority for a single result trie.
+ */
+static inline void
+resolve_single_priority(uint64_t transition, int n,
+ const struct rte_acl_ctx *ctx, struct parms *parms,
+ const struct rte_acl_match_results *p)
+{
+ if (parms[n].cmplt->count == ctx->num_tries ||
+ parms[n].cmplt->priority[0] <=
+ p[transition].priority[0]) {
+
+ parms[n].cmplt->priority[0] = p[transition].priority[0];
+ parms[n].cmplt->results[0] = p[transition].results[0];
+ }
+}
+
+/*
+ * Routine to fill a slot in the parallel trie traversal array (parms) from
+ * the list of packets (flows).
+ */
+static inline uint64_t
+acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
+ const struct rte_acl_ctx *ctx)
+{
+ uint64_t transition;
+
+ /* if there are any more packets to process */
+ if (flows->num_packets < flows->total_packets) {
+ parms[n].data = flows->data[flows->num_packets];
+ parms[n].data_index = ctx->trie[flows->trie].data_index;
+
+ /* if this is the first trie for this packet */
+ if (flows->trie == 0) {
+ flows->last_cmplt = alloc_completion(flows->cmplt_array,
+ flows->cmplt_size, ctx->num_tries,
+ flows->results +
+ flows->num_packets * flows->categories);
+ }
+
+ /* set completion parameters and starting index for this slot */
+ parms[n].cmplt = flows->last_cmplt;
+ transition =
+ flows->trans[parms[n].data[*parms[n].data_index++] +
+ ctx->trie[flows->trie].root_index];
+
+ /*
+ * if this is the last trie for this packet,
+ * then setup next packet.
+ */
+ flows->trie++;
+ if (flows->trie >= ctx->num_tries) {
+ flows->trie = 0;
+ flows->num_packets++;
+ }
+
+ /* keep track of number of active trie traversals */
+ flows->started++;
+
+ /* no more tries to process, set slot to an idle position */
+ } else {
+ transition = ctx->idle;
+ parms[n].data = (const uint8_t *)idle;
+ parms[n].data_index = idle;
+ }
+ return transition;
+}
+
+static inline void
+acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
+ uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
+ uint32_t data_num, uint32_t categories, const uint64_t *trans)
+{
+ flows->num_packets = 0;
+ flows->started = 0;
+ flows->trie = 0;
+ flows->last_cmplt = NULL;
+ flows->cmplt_array = cmplt;
+ flows->total_packets = data_num;
+ flows->categories = categories;
+ flows->cmplt_size = cmplt_size;
+ flows->data = data;
+ flows->results = results;
+ flows->trans = trans;
+}
+
+#endif /* _ACL_RUN_H_ */
diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
new file mode 100644
index 0000000..a59ff17
--- /dev/null
+++ b/lib/librte_acl/acl_run_scalar.c
@@ -0,0 +1,198 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+#include "acl_match_check.h"
+
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+ uint32_t *results, uint32_t num, uint32_t categories);
+
+/*
+ * Resolve priority for multiple results (scalar version).
+ * This consists comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_scalar(uint64_t transition, int n,
+ const struct rte_acl_ctx *ctx, struct parms *parms,
+ const struct rte_acl_match_results *p, uint32_t categories)
+{
+ uint32_t i;
+ int32_t *saved_priority;
+ uint32_t *saved_results;
+ const int32_t *priority;
+ const uint32_t *results;
+
+ saved_results = parms[n].cmplt->results;
+ saved_priority = parms[n].cmplt->priority;
+
+ /* results and priorities for completed trie */
+ results = p[transition].results;
+ priority = p[transition].priority;
+
+ /* if this is not the first completed trie */
+ if (parms[n].cmplt->count != ctx->num_tries) {
+ for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+
+ if (saved_priority[i] <= priority[i]) {
+ saved_priority[i] = priority[i];
+ saved_results[i] = results[i];
+ }
+ if (saved_priority[i + 1] <= priority[i + 1]) {
+ saved_priority[i + 1] = priority[i + 1];
+ saved_results[i + 1] = results[i + 1];
+ }
+ if (saved_priority[i + 2] <= priority[i + 2]) {
+ saved_priority[i + 2] = priority[i + 2];
+ saved_results[i + 2] = results[i + 2];
+ }
+ if (saved_priority[i + 3] <= priority[i + 3]) {
+ saved_priority[i + 3] = priority[i + 3];
+ saved_results[i + 3] = results[i + 3];
+ }
+ }
+ } else {
+ for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+ saved_priority[i] = priority[i];
+ saved_priority[i + 1] = priority[i + 1];
+ saved_priority[i + 2] = priority[i + 2];
+ saved_priority[i + 3] = priority[i + 3];
+
+ saved_results[i] = results[i];
+ saved_results[i + 1] = results[i + 1];
+ saved_results[i + 2] = results[i + 2];
+ saved_results[i + 3] = results[i + 3];
+ }
+ }
+}
+
+/*
+ * When processing the transition, rather than using if/else
+ * construct, the offset is calculated for DFA and QRANGE and
+ * then conditionally added to the address based on node type.
+ * This is done to avoid branch mis-predictions. Since the
+ * offset is rather simple calculation it is more efficient
+ * to do the calculation and do a condition move rather than
+ * a conditional branch to determine which calculation to do.
+ */
+static inline uint32_t
+scan_forward(uint32_t input, uint32_t max)
+{
+ return (input == 0) ? max : rte_bsf32(input);
+}
+
+static inline uint64_t
+scalar_transition(const uint64_t *trans_table, uint64_t transition,
+ uint8_t input)
+{
+ uint32_t addr, index, ranges, x, a, b, c;
+
+ /* break transition into component parts */
+ ranges = transition >> (sizeof(index) * CHAR_BIT);
+
+ /* calc address for a QRANGE node */
+ c = input * SCALAR_QRANGE_MULT;
+ a = ranges | SCALAR_QRANGE_MIN;
+ index = transition & ~RTE_ACL_NODE_INDEX;
+ a -= (c & SCALAR_QRANGE_MASK);
+ b = c & SCALAR_QRANGE_MIN;
+ addr = transition ^ index;
+ a &= SCALAR_QRANGE_MIN;
+ a ^= (ranges ^ b) & (a ^ b);
+ x = scan_forward(a, 32) >> 3;
+ addr += (index == RTE_ACL_NODE_DFA) ? input : x;
+
+ /* pickup next transition */
+ transition = *(trans_table + addr);
+ return transition;
+}
+
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+ uint32_t *results, uint32_t num, uint32_t categories)
+{
+ int n;
+ uint64_t transition0, transition1;
+ uint32_t input0, input1;
+ struct acl_flow_data flows;
+ uint64_t index_array[MAX_SEARCHES_SCALAR];
+ struct completion cmplt[MAX_SEARCHES_SCALAR];
+ struct parms parms[MAX_SEARCHES_SCALAR];
+
+ if (categories != 1 &&
+ ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+ return -EINVAL;
+
+ acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
+ categories, ctx->trans_table);
+
+ for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
+ cmplt[n].count = 0;
+ index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+ }
+
+ transition0 = index_array[0];
+ transition1 = index_array[1];
+
+ while (flows.started > 0) {
+
+ input0 = GET_NEXT_4BYTES(parms, 0);
+ input1 = GET_NEXT_4BYTES(parms, 1);
+
+ for (n = 0; n < 4; n++) {
+ if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
+ transition0 = scalar_transition(flows.trans,
+ transition0, (uint8_t)input0);
+
+ input0 >>= CHAR_BIT;
+
+ if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
+ transition1 = scalar_transition(flows.trans,
+ transition1, (uint8_t)input1);
+
+ input1 >>= CHAR_BIT;
+
+ }
+ if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
+ transition0 = acl_match_check(transition0,
+ 0, ctx, parms, &flows, resolve_priority_scalar);
+ transition1 = acl_match_check(transition1,
+ 1, ctx, parms, &flows, resolve_priority_scalar);
+
+ }
+ }
+ return 0;
+}
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
new file mode 100644
index 0000000..3f5c721
--- /dev/null
+++ b/lib/librte_acl/acl_run_sse.c
@@ -0,0 +1,627 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+#include "acl_match_check.h"
+
+enum {
+ SHUFFLE32_SLOT1 = 0xe5,
+ SHUFFLE32_SLOT2 = 0xe6,
+ SHUFFLE32_SLOT3 = 0xe7,
+ SHUFFLE32_SWAP64 = 0x4e,
+};
+
+static const rte_xmm_t mm_type_quad_range = {
+ .u32 = {
+ RTE_ACL_NODE_QRANGE,
+ RTE_ACL_NODE_QRANGE,
+ RTE_ACL_NODE_QRANGE,
+ RTE_ACL_NODE_QRANGE,
+ },
+};
+
+static const rte_xmm_t mm_type_quad_range64 = {
+ .u32 = {
+ RTE_ACL_NODE_QRANGE,
+ RTE_ACL_NODE_QRANGE,
+ 0,
+ 0,
+ },
+};
+
+static const rte_xmm_t mm_shuffle_input = {
+ .u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
+};
+
+static const rte_xmm_t mm_shuffle_input64 = {
+ .u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
+};
+
+static const rte_xmm_t mm_ones_16 = {
+ .u16 = {1, 1, 1, 1, 1, 1, 1, 1},
+};
+
+static const rte_xmm_t mm_bytes = {
+ .u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
+};
+
+static const rte_xmm_t mm_bytes64 = {
+ .u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
+};
+
+static const rte_xmm_t mm_match_mask = {
+ .u32 = {
+ RTE_ACL_NODE_MATCH,
+ RTE_ACL_NODE_MATCH,
+ RTE_ACL_NODE_MATCH,
+ RTE_ACL_NODE_MATCH,
+ },
+};
+
+static const rte_xmm_t mm_match_mask64 = {
+ .u32 = {
+ RTE_ACL_NODE_MATCH,
+ 0,
+ RTE_ACL_NODE_MATCH,
+ 0,
+ },
+};
+
+static const rte_xmm_t mm_index_mask = {
+ .u32 = {
+ RTE_ACL_NODE_INDEX,
+ RTE_ACL_NODE_INDEX,
+ RTE_ACL_NODE_INDEX,
+ RTE_ACL_NODE_INDEX,
+ },
+};
+
+static const rte_xmm_t mm_index_mask64 = {
+ .u32 = {
+ RTE_ACL_NODE_INDEX,
+ RTE_ACL_NODE_INDEX,
+ 0,
+ 0,
+ },
+};
+
+
+/*
+ * Resolve priority for multiple results (sse version).
+ * This consists comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+ struct parms *parms, const struct rte_acl_match_results *p,
+ uint32_t categories)
+{
+ uint32_t x;
+ xmm_t results, priority, results1, priority1, selector;
+ xmm_t *saved_results, *saved_priority;
+
+ for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
+
+ saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
+ saved_priority =
+ (xmm_t *)(&parms[n].cmplt->priority[x]);
+
+ /* get results and priorities for completed trie */
+ results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
+ priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
+
+ /* if this is not the first completed trie */
+ if (parms[n].cmplt->count != ctx->num_tries) {
+
+ /* get running best results and their priorities */
+ results1 = MM_LOADU(saved_results);
+ priority1 = MM_LOADU(saved_priority);
+
+ /* select results that are highest priority */
+ selector = MM_CMPGT32(priority1, priority);
+ results = MM_BLENDV8(results, results1, selector);
+ priority = MM_BLENDV8(priority, priority1, selector);
+ }
+
+ /* save running best results and their priorities */
+ MM_STOREU(saved_results, results);
+ MM_STOREU(saved_priority, priority);
+ }
+}
+
+/*
+ * Extract transitions from an XMM register and check for any matches
+ */
+static void
+acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
+ struct parms *parms, struct acl_flow_data *flows)
+{
+ uint64_t transition1, transition2;
+
+ /* extract transition from low 64 bits. */
+ transition1 = MM_CVT64(*indicies);
+
+ /* extract transition from high 64 bits. */
+ *indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
+ transition2 = MM_CVT64(*indicies);
+
+ transition1 = acl_match_check(transition1, slot, ctx,
+ parms, flows, resolve_priority_sse);
+ transition2 = acl_match_check(transition2, slot + 1, ctx,
+ parms, flows, resolve_priority_sse);
+
+ /* update indicies with new transitions. */
+ *indicies = MM_SET64(transition2, transition1);
+}
+
+/*
+ * Check for a match in 2 transitions (contained in SSE register)
+ */
+static inline void
+acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+ struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
+{
+ xmm_t temp;
+
+ temp = MM_AND(match_mask, *indicies);
+ while (!MM_TESTZ(temp, temp)) {
+ acl_process_matches(indicies, slot, ctx, parms, flows);
+ temp = MM_AND(match_mask, *indicies);
+ }
+}
+
+/*
+ * Check for any match in 4 transitions (contained in 2 SSE registers)
+ */
+static inline void
+acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+ struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
+ xmm_t match_mask)
+{
+ xmm_t temp;
+
+ /* put low 32 bits of each transition into one register */
+ temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+ 0x88);
+ /* test for match node */
+ temp = MM_AND(match_mask, temp);
+
+ while (!MM_TESTZ(temp, temp)) {
+ acl_process_matches(indicies1, slot, ctx, parms, flows);
+ acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
+
+ temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+ (__m128)*indicies2,
+ 0x88);
+ temp = MM_AND(match_mask, temp);
+ }
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes don't move.
+ */
+static inline xmm_t
+acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+ xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+ xmm_t *indicies1, xmm_t *indicies2)
+{
+ xmm_t addr, node_types, temp;
+
+ /*
+ * Note that no transition is done for a match
+ * node and therefore a stream freezes when
+ * it reaches a match.
+ */
+
+ /* Shuffle low 32 into temp and high 32 into indicies2 */
+ temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+ 0x88);
+ *indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+ (__m128)*indicies2, 0xdd);
+
+ /* Calc node type and node addr */
+ node_types = MM_ANDNOT(index_mask, temp);
+ addr = MM_AND(index_mask, temp);
+
+ /*
+ * Calc addr for DFAs - addr = dfa_index + input_byte
+ */
+
+ /* mask for DFA type (0) nodes */
+ temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
+
+ /* add input byte to DFA position */
+ temp = MM_AND(temp, bytes);
+ temp = MM_AND(temp, next_input);
+ addr = MM_ADD32(addr, temp);
+
+ /*
+ * Calc addr for Range nodes -> range_index + range(input)
+ */
+ node_types = MM_CMPEQ32(node_types, type_quad_range);
+
+ /*
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127 in the indicies2 register.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ */
+
+ /* shuffle input byte to all 4 positions of 32 bit value */
+ temp = MM_SHUFFLE8(next_input, shuffle_input);
+
+ /* check ranges */
+ temp = MM_CMPGT8(temp, *indicies2);
+
+ /* convert -1 to 1 (bytes greater than input byte) */
+ temp = MM_SIGN8(temp, temp);
+
+ /* horizontal add pairs of bytes into words */
+ temp = MM_MADD8(temp, temp);
+
+ /* horizontal add pairs of words into dwords */
+ temp = MM_MADD16(temp, ones_16);
+
+ /* mask to range type nodes */
+ temp = MM_AND(temp, node_types);
+
+ /* add index into node position */
+ return MM_ADD32(addr, temp);
+}
+
+/*
+ * Process 4 transitions (in 2 SIMD registers) in parallel
+ */
+static inline xmm_t
+transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+ xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+ const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
+{
+ xmm_t addr;
+ uint64_t trans0, trans2;
+
+ /* Calculate the address (array index) for all 4 transitions. */
+
+ addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+ bytes, type_quad_range, indicies1, indicies2);
+
+ /* Gather 64 bit transitions and pack back into 2 registers. */
+
+ trans0 = trans[MM_CVT32(addr)];
+
+ /* get slot 2 */
+
+ /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
+ addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
+ trans2 = trans[MM_CVT32(addr)];
+
+ /* get slot 1 */
+
+ /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
+ addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+ *indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
+
+ /* get slot 3 */
+
+ /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
+ addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
+ *indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
+
+ return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 8 traversals in parallel
+ */
+static inline int
+search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
+ uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+ int n;
+ struct acl_flow_data flows;
+ uint64_t index_array[MAX_SEARCHES_SSE8];
+ struct completion cmplt[MAX_SEARCHES_SSE8];
+ struct parms parms[MAX_SEARCHES_SSE8];
+ xmm_t input0, input1;
+ xmm_t indicies1, indicies2, indicies3, indicies4;
+
+ acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+ total_packets, categories, ctx->trans_table);
+
+ for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
+ cmplt[n].count = 0;
+ index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+ }
+
+ /*
+ * indicies1 contains index_array[0,1]
+ * indicies2 contains index_array[2,3]
+ * indicies3 contains index_array[4,5]
+ * indicies4 contains index_array[6,7]
+ */
+
+ indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+ indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+ indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
+ indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
+
+ /* Check for any matches. */
+ acl_match_check_x4(0, ctx, parms, &flows,
+ &indicies1, &indicies2, mm_match_mask.m);
+ acl_match_check_x4(4, ctx, parms, &flows,
+ &indicies3, &indicies4, mm_match_mask.m);
+
+ while (flows.started > 0) {
+
+ /* Gather 4 bytes of input data for each stream. */
+ input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
+ 0);
+ input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
+ 0);
+
+ input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
+ input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
+
+ input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
+ input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
+
+ input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
+ input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
+
+ /* Process the 4 bytes of input on each stream. */
+
+ input0 = transition4(mm_index_mask.m, input0,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies1, &indicies2);
+
+ input1 = transition4(mm_index_mask.m, input1,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies3, &indicies4);
+
+ input0 = transition4(mm_index_mask.m, input0,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies1, &indicies2);
+
+ input1 = transition4(mm_index_mask.m, input1,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies3, &indicies4);
+
+ input0 = transition4(mm_index_mask.m, input0,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies1, &indicies2);
+
+ input1 = transition4(mm_index_mask.m, input1,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies3, &indicies4);
+
+ input0 = transition4(mm_index_mask.m, input0,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies1, &indicies2);
+
+ input1 = transition4(mm_index_mask.m, input1,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies3, &indicies4);
+
+ /* Check for any matches. */
+ acl_match_check_x4(0, ctx, parms, &flows,
+ &indicies1, &indicies2, mm_match_mask.m);
+ acl_match_check_x4(4, ctx, parms, &flows,
+ &indicies3, &indicies4, mm_match_mask.m);
+ }
+
+ return 0;
+}
+
+/*
+ * Execute trie traversal with 4 traversals in parallel
+ */
+static inline int
+search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
+ uint32_t *results, int total_packets, uint32_t categories)
+{
+ int n;
+ struct acl_flow_data flows;
+ uint64_t index_array[MAX_SEARCHES_SSE4];
+ struct completion cmplt[MAX_SEARCHES_SSE4];
+ struct parms parms[MAX_SEARCHES_SSE4];
+ xmm_t input, indicies1, indicies2;
+
+ acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+ total_packets, categories, ctx->trans_table);
+
+ for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
+ cmplt[n].count = 0;
+ index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+ }
+
+ indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+ indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+ /* Check for any matches. */
+ acl_match_check_x4(0, ctx, parms, &flows,
+ &indicies1, &indicies2, mm_match_mask.m);
+
+ while (flows.started > 0) {
+
+ /* Gather 4 bytes of input data for each stream. */
+ input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+ input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+ input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
+ input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
+
+ /* Process the 4 bytes of input on each stream. */
+ input = transition4(mm_index_mask.m, input,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies1, &indicies2);
+
+ input = transition4(mm_index_mask.m, input,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies1, &indicies2);
+
+ input = transition4(mm_index_mask.m, input,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies1, &indicies2);
+
+ input = transition4(mm_index_mask.m, input,
+ mm_shuffle_input.m, mm_ones_16.m,
+ mm_bytes.m, mm_type_quad_range.m,
+ flows.trans, &indicies1, &indicies2);
+
+ /* Check for any matches. */
+ acl_match_check_x4(0, ctx, parms, &flows,
+ &indicies1, &indicies2, mm_match_mask.m);
+ }
+
+ return 0;
+}
+
+static inline xmm_t
+transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+ xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+ const uint64_t *trans, xmm_t *indicies1)
+{
+ uint64_t t;
+ xmm_t addr, indicies2;
+
+ indicies2 = MM_XOR(ones_16, ones_16);
+
+ addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+ bytes, type_quad_range, indicies1, &indicies2);
+
+ /* Gather 64 bit transitions and pack 2 per register. */
+
+ t = trans[MM_CVT32(addr)];
+
+ /* get slot 1 */
+ addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+ *indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
+
+ return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 2 traversals in parallel.
+ */
+static inline int
+search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+ uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+ int n;
+ struct acl_flow_data flows;
+ uint64_t index_array[MAX_SEARCHES_SSE2];
+ struct completion cmplt[MAX_SEARCHES_SSE2];
+ struct parms parms[MAX_SEARCHES_SSE2];
+ xmm_t input, indicies;
+
+ acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+ total_packets, categories, ctx->trans_table);
+
+ for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
+ cmplt[n].count = 0;
+ index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+ }
+
+ indicies = MM_LOADU((xmm_t *) &index_array[0]);
+
+ /* Check for any matches. */
+ acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
+
+ while (flows.started > 0) {
+
+ /* Gather 4 bytes of input data for each stream. */
+ input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+ input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+
+ /* Process the 4 bytes of input on each stream. */
+
+ input = transition2(mm_index_mask64.m, input,
+ mm_shuffle_input64.m, mm_ones_16.m,
+ mm_bytes64.m, mm_type_quad_range64.m,
+ flows.trans, &indicies);
+
+ input = transition2(mm_index_mask64.m, input,
+ mm_shuffle_input64.m, mm_ones_16.m,
+ mm_bytes64.m, mm_type_quad_range64.m,
+ flows.trans, &indicies);
+
+ input = transition2(mm_index_mask64.m, input,
+ mm_shuffle_input64.m, mm_ones_16.m,
+ mm_bytes64.m, mm_type_quad_range64.m,
+ flows.trans, &indicies);
+
+ input = transition2(mm_index_mask64.m, input,
+ mm_shuffle_input64.m, mm_ones_16.m,
+ mm_bytes64.m, mm_type_quad_range64.m,
+ flows.trans, &indicies);
+
+ /* Check for any matches. */
+ acl_match_check_x2(0, ctx, parms, &flows, &indicies,
+ mm_match_mask64.m);
+ }
+
+ return 0;
+}
+
+int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+ uint32_t *results, uint32_t num, uint32_t categories)
+{
+ if (categories != 1 &&
+ ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+ return -EINVAL;
+
+ if (likely(num >= MAX_SEARCHES_SSE8))
+ return search_sse_8(ctx, data, results, num, categories);
+ else if (num >= MAX_SEARCHES_SSE4)
+ return search_sse_4(ctx, data, results, num, categories);
+ else
+ return search_sse_2(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 7c288bd..b9173c1 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -38,6 +38,52 @@
TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
+typedef int (*rte_acl_classify_t)
+(const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
+
+extern int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+ uint32_t *results, uint32_t num, uint32_t categories);
+
+/* by default, use the always available scalar code path. */
+rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
+
+void rte_acl_select_classify(enum acl_classify_alg alg)
+{
+
+ switch(alg)
+ {
+ case ACL_CLASSIFY_DEFAULT:
+ case ACL_CLASSIFY_SCALAR:
+ rte_acl_default_classify = rte_acl_classify_scalar;
+ break;
+ case ACL_CLASSIFY_SSE:
+ rte_acl_default_classify = rte_acl_classify_sse;
+ break;
+ }
+
+}
+
+static void __attribute__((constructor))
+rte_acl_init(void)
+{
+ enum acl_classify_alg alg = ACL_CLASSIFY_DEFAULT;
+
+ if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+ alg = ACL_CLASSIFY_SSE;
+
+ rte_acl_select_classify(alg);
+}
+
+inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
+ const uint8_t **data,
+ uint32_t *results, uint32_t num,
+ uint32_t categories)
+{
+ return rte_acl_default_classify(ctx, data, results, num, categories);
+}
+
+
struct rte_acl_ctx *
rte_acl_find_existing(const char *name)
{
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index afc0f69..650b306 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -267,6 +267,9 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
* RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
* If more than one rule is applicable for given input buffer and
* given category, then rule with highest priority will be returned as a match.
+ * Note that this function can be used only on CPUs with SSE4.1 support.
+ * It is up to the caller to make sure it is invoked only on machines
+ * that support the SSE4.1 ISA.
* Note, that it is a caller responsibility to ensure that input parameters
* are valid and point to correct memory locations.
*
@@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
* @return
* zero on successful completion.
* -EINVAL for incorrect arguments.
+ * -ENOTSUP for unsupported platforms.
*/
int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
uint32_t *results, uint32_t num, uint32_t categories);
/**
@@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
* zero on successful completion.
* -EINVAL for incorrect arguments.
*/
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
- uint32_t *results, uint32_t num, uint32_t categories);
+
+enum acl_classify_alg {
+ ACL_CLASSIFY_DEFAULT = 0,
+ ACL_CLASSIFY_SCALAR = 1,
+ ACL_CLASSIFY_SSE = 2,
+};
+
+extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
+ const uint8_t **data,
+ uint32_t *results, uint32_t num,
+ uint32_t categories);
+/**
+ * Set rte_acl_default_classify to the classify function for the given
+ * algorithm (the library constructor selects the highest version
+ * supported by the current CPU).
+ */
+extern void
+rte_acl_select_classify(enum acl_classify_alg alg);
/**
* Dump an ACL context structure to the console.
--
1.9.3
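A minimal caller-side sketch of the API introduced by this patch (assuming a context already built with rte_acl_build(); the helper name and the buffer layout are illustrative only):

#include <rte_acl.h>

/*
 * Sketch only: "acx" is assumed to have been created and built elsewhere,
 * and each entry of "bufs" points to packet fields laid out according to
 * the build-time field definitions.
 */
static int
classify_burst(struct rte_acl_ctx *acx, const uint8_t **bufs,
	uint32_t *results, uint32_t num)
{
	/*
	 * Optionally force the scalar code path (e.g. for debugging);
	 * the library constructor already selected the best supported one.
	 */
	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);

	/* one category per packet in this sketch */
	return rte_acl_classify(acx, bufs, results, num, 1);
}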
^ permalink raw reply [relevance 1%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
2014-08-08 14:30 3% ` Neil Horman
@ 2014-08-11 22:23 0% ` Thomas Monjalon
0 siblings, 0 replies; 40+ results
From: Thomas Monjalon @ 2014-08-11 22:23 UTC (permalink / raw)
To: Ananyev, Konstantin; +Cc: dev
Hi all,
2014-08-08 10:30, Neil Horman:
> On Fri, Aug 08, 2014 at 01:09:34PM +0000, Ananyev, Konstantin wrote:
> > > > Also I think user should have an ability to change default classify code path without modifying/rebuilding acl library.
> > > I agree, but both the methods we are advocating for allow that. Its really just
> > > a question of exposing the mechanism as data or text in the binary. Exposing it
> > > as data comes with implicit ABI constraints that are less prevalanet when done
> > > as code entry points.
> >
> > > > For example: a bug in an optimised code path is discovered, or user may want to implement and use his own version of classify().
>
> > Of course, he probably will report about it and we probably fix it sooner or later.
> > But with such ability he can switch to the safe implementation immediately
> > without touching the library and then wait for the fix.
>
> Thats not how users of a binary pacakge from a distribution operate. If their
> using a binary package they either:
>
> 1) Don't want to rebuild anything themselves, in which case they file the bug,
> and wait for the developers to fix the issue.
>
> or
>
> 2) Have a staff to help them work around the issue, which will be done by
> rebuilding/fixing the library, not the application.
>
> With (2), what I am saying is that, if a 3rd party finds a bug in the classifier
> code within dpdk which is built as a shared library within a distribution, and
> they need it fixed immediately, they have a choice of what to do, they can
> either (a), write a custom classifier function and point the dpdk library to it,
> or (b), just fix the bug in the library directly. Given that, if they can
> accomplish (a), they by all rights can also accompilsh (b), the only decision
> they need to make is one which makes the most sense for them. The answer is
> (b), because thats where the functionality lives. i.e. when the fix occurs
> upstream and a new release gets issued, you can go back to using the library
> maintained version, and you don't have to clean up what has become vestigial
> unused code.
I think it's even simpler: designing the API so that behaviour can be changed
without rebuilding is not sane. So should we expose all functions?
Please try to reduce the API as much as possible.
Thanks
--
Thomas
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
2014-08-08 13:09 3% ` Ananyev, Konstantin
@ 2014-08-08 14:30 3% ` Neil Horman
2014-08-11 22:23 0% ` Thomas Monjalon
0 siblings, 1 reply; 40+ results
From: Neil Horman @ 2014-08-08 14:30 UTC (permalink / raw)
To: Ananyev, Konstantin; +Cc: dev
On Fri, Aug 08, 2014 at 01:09:34PM +0000, Ananyev, Konstantin wrote:
>
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Friday, August 08, 2014 1:25 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> >
> > On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Thursday, August 07, 2014 9:12 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > > >
> > > > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > > > Make ACL library to build/work on 'default' architecture:
> > > > > - make rte_acl_classify_scalar really scalar
> > > > > (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > > > > - Provide two versions of rte_acl_classify code path:
> > > > > rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > > > and upper, return -ENOTSUP on lower arch.
> > > > > rte_acl_classify_scalar() - a slower version, but could be build and used
> > > > > on all systems.
> > > > > - keep common code shared between these two codepaths.
> > > > >
> > > > > v2 chages:
> > > > > run-time selection of most appropriate code-path for given ISA.
> > > > > By default the highest supprted one is selected.
> > > > > User can still override that selection by manually assigning new value to
> > > > > the global function pointer rte_acl_default_classify.
> > > > > rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > > > points to.
> > > > >
> > > > >
> > > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > >
> > > > This is alot better thank you. A few remaining issues.
> > >
> > > My comments inline too.
> > > Thanks
> > > Konstantin
> > >
> > > >
> > > > > ---
> > > > > app/test-acl/main.c | 13 +-
> > > > > lib/librte_acl/Makefile | 5 +-
> > > > > lib/librte_acl/acl_bld.c | 5 +-
> > > > > lib/librte_acl/acl_match_check.def | 92 ++++
> > > > > lib/librte_acl/acl_run.c | 944 -------------------------------------
> > > > > lib/librte_acl/acl_run.h | 220 +++++++++
> > > > > lib/librte_acl/acl_run_scalar.c | 197 ++++++++
> > > > > lib/librte_acl/acl_run_sse.c | 630 +++++++++++++++++++++++++
> > > > > lib/librte_acl/rte_acl.c | 15 +
> > > > > lib/librte_acl/rte_acl.h | 24 +-
> > > > > 10 files changed, 1189 insertions(+), 956 deletions(-)
> > > > > create mode 100644 lib/librte_acl/acl_match_check.def
> > > > > delete mode 100644 lib/librte_acl/acl_run.c
> > > > > create mode 100644 lib/librte_acl/acl_run.h
> > > > > create mode 100644 lib/librte_acl/acl_run_scalar.c
> > > > > create mode 100644 lib/librte_acl/acl_run_sse.c
> > > > >
> > > > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > > > index d654409..45c6fa6 100644
> > > > > --- a/app/test-acl/main.c
> > > > > +++ b/app/test-acl/main.c
> > > > > @@ -787,6 +787,10 @@ acx_init(void)
> > > > > /* perform build. */
> > > > > ret = rte_acl_build(config.acx, &cfg);
> > > > >
> > > > > + /* setup default rte_acl_classify */
> > > > > + if (config.scalar)
> > > > > + rte_acl_default_classify = rte_acl_classify_scalar;
> > > > > +
> > > > Exporting this variable as part of the ABI is a bad idea. If the prototype of
> > > > the function changes you have to update all your applications.
> > >
> > > If the prototype of rte_acl_classify will change, most likely you'll have to update code that uses it anyway.
> > >
> > Why? If you hide this from the application, changes to the internal
> > implementation will also be invisible. When building as a DSO, an application
> > will be able to transition between libraries without the need for a rebuild.
>
> Because rte_acl_classify() is part of the ACL API that users use.
> If we'll add/modify its parameters and/or return value - users would have to change their apps anyway.
>
That's not at all true. With API versioning scripts you can keep several
versions of the same function, with different prototypes, as future needs dictate.
Hiding the internal implementation just makes that easier.
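For example (purely illustrative -- the version node names and the _v17/_v18 suffixes below are made up for this sketch and are not from any patch in this thread), a GNU linker version script plus .symver aliases lets a DSO carry two prototypes under the same exported name:

/* rte_acl_version.map (illustrative) */
DPDK_1.7 {
	global: rte_acl_classify;
	local: *;
};
DPDK_1.8 {
	global: rte_acl_classify;
} DPDK_1.7;

/* rte_acl.c (illustrative) */
int
rte_acl_classify_v18(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return rte_acl_default_classify(ctx, data, results, num, categories);
}
__asm__(".symver rte_acl_classify_v18, rte_acl_classify@@DPDK_1.8");

int
rte_acl_classify_v17(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num)
{
	/* old binaries keep working; assume one category, as before */
	return rte_acl_classify_v18(ctx, data, results, num, 1);
}
__asm__(".symver rte_acl_classify_v17, rte_acl_classify@DPDK_1.7");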
> > > > Make the pointer
> > > > an internal symbol and set it using a get/set routine with an enum to represent
> > > > the path to choose. That will help isolate the ABI from the internal
> > > > implementation.
> > >
> > > That's was my first intention too.
> > > But then I realised that if we'll make it internal, then we'll need to make rte_acl_classify() a proper function
> > > and it will cost us extra call (or jump).
> > Thats true, but I don't see that as a problem. We're not talking about a hot
> > code path here, its a setup function.
>
> I am talking not about rte_acl_select_classify() but about rte_acl_classify() itself (not code path).
> If I'll make rte_acl_default_classify statitc, the rte_acl_classiy() would need to become a real function and it'll be something like that:
>
> ->call rte_acl_acl_classify
> ---> load rte_acl_calssify_default value into the reg
> ---> jmp (*reg)
>
Ah, yes, the actual classification path; you will need an extra call
instruction there. If that's the case, I would say you should make
rte_acl_classify either a macro or a real function depending on whether
you're building a shared library or a static library.
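Something along these lines, perhaps (RTE_BUILD_SHARED_LIB is the existing build-config flag; the rest is just a sketch of the idea, not a concrete proposal):

/* rte_acl.h -- sketch */
#ifdef RTE_BUILD_SHARED_LIB
/* shared build: a real function keeps a stable, versionable entry point */
int rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories);
#else
/* static build: expand in place and avoid the extra call */
#define rte_acl_classify(ctx, data, results, num, categories)	\
	(*rte_acl_default_classify)(ctx, data, results, num, categories)
#endif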
> > Or do you think that an application will
> > be switching between classification functions on every classify operation?
>
> God no.
>
> > > Also I think user should have an ability to change default classify code path without modifying/rebuilding acl library.
> > I agree, but both the methods we are advocating for allow that. Its really just
> > a question of exposing the mechanism as data or text in the binary. Exposing it
> > as data comes with implicit ABI constraints that are less prevalanet when done
> > as code entry points.
>
> > > For example: a bug in an optimised code path is discovered, or user may want to implement and use his own version of classify().
>
> > In the case of a bug in the optimized path, you just fix the bug.
>
> It is not about me. It is about a user who get librte_acl as part of binary distribution.
Yes, those are my users :)
> Of course, he probably will report about it and we probably fix it sooner or later.
> But with such ability he can switch to the safe implementation immediately
> without touching the library and then wait for the fix.
>
That's not how users of a binary package from a distribution operate. If they're
using a binary package they either:
1) Don't want to rebuild anything themselves, in which case they file the bug,
and wait for the developers to fix the issue.
or
2) Have staff to help them work around the issue, which will be done by
rebuilding/fixing the library, not the application.
With (2), what I am saying is that if a 3rd party finds a bug in the classifier
code within dpdk, built as a shared library within a distribution, and they
need it fixed immediately, they have a choice: they can either (a) write a
custom classifier function and point the dpdk library to it, or (b) just fix
the bug in the library directly. Given that, if they can accomplish (a), they
by all rights can also accomplish (b); the only decision they need to make is
which makes the most sense for them. The answer is (b), because that's where
the functionality lives, i.e. when the fix occurs upstream and a new release
gets issued, you can go back to using the library-maintained version and you
don't have to clean up what has become vestigial unused code.
> > If you want
> > to provide your own classification function, thats fine I suppose, but that
> > seems completely outside the scope of what we're trying to do here. Its not
> > adventageous to just throw that in there. If you want to be able to provide
> > your own classifier function, lets at least take some time to make sure that the
> > function prototype is sufficiently capable to accept all the data you might want
> > to pass it in the future, before we go exposing it. Otherwise you'll have to
> > break the ABI in future versions, whcih is something we've been discussing
> > trying to avoid.
>
> rte_acl_classify() it is already exposed (PART of API), same as rte_acl_classify_scalar().
> If in future, we'll change these functions prototypes will break ABI anyway.
>
Well, at the moment that's fine because you don't make any ABI promises anyway;
I've been working to change that so distributions can have greater dpdk
adoption.
> >
> > > > It will also let you prevent things like selecting a run time
> > > > path that is incompatible with the running system
> > >
> > > If the user going to update rte_acl_default_classify he is probably smart enough to know what he is doing.
> > That really seems like poor design to me. I don't see why you wouldn't at least
> > want to warn the developer of an application if they were at run time to assign
> > a default classifier method that was incompatible with a running system. Yes,
> > they're likely smart enough to know what their doing, but smart people make
> > mistakes, and appreciate being told when they're doing so, especially if the
> > method of telling is something a bit more civil than a machine check that
> > might occur well after the application has been initilized.
>
> I have no problem providing rte_acl_check_classify(flags_required, classify_ptr) that would do checking and emit the warning.
> Though as I said above, I'll prefer not to hide rte_acl_default_classify it will cause extra overhead for rte_acl_classify().
>
> >
> > > From other hand - user can hit same problem by simply calling rte_acl_classify_sse() directly.
> > Not if the function is statically declared and not exposed to the application
> > they cant :)
>
> I don't really want to hide rte_acl_classify_sse/rte_acl_classify_scalar().
> Should be available directly I think.
> In future we might introduce new versions for more sophisticated ISAs (rte_acl_classify_avx() or something).
> Users should have an ability to downgrade their classify() function if they like.
What, in your mind, is the reasoning behind being able to do so? What is
advantageous about that? Aside possibly from debugging, that is (for which I
can see a use). But in normal production operation, why would you choose not to
use the sse classifier over the scalar classifier?
^ permalink raw reply [relevance 3%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
2014-08-08 12:25 4% ` Neil Horman
@ 2014-08-08 13:09 3% ` Ananyev, Konstantin
2014-08-08 14:30 3% ` Neil Horman
0 siblings, 1 reply; 40+ results
From: Ananyev, Konstantin @ 2014-08-08 13:09 UTC (permalink / raw)
To: Neil Horman; +Cc: dev
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Friday, August 08, 2014 1:25 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
>
> On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Thursday, August 07, 2014 9:12 PM
> > > To: Ananyev, Konstantin
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > >
> > > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > > Make ACL library to build/work on 'default' architecture:
> > > > - make rte_acl_classify_scalar really scalar
> > > > (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > > > - Provide two versions of rte_acl_classify code path:
> > > > rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > > and upper, return -ENOTSUP on lower arch.
> > > > rte_acl_classify_scalar() - a slower version, but could be build and used
> > > > on all systems.
> > > > - keep common code shared between these two codepaths.
> > > >
> > > > v2 chages:
> > > > run-time selection of most appropriate code-path for given ISA.
> > > > By default the highest supprted one is selected.
> > > > User can still override that selection by manually assigning new value to
> > > > the global function pointer rte_acl_default_classify.
> > > > rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > > points to.
> > > >
> > > >
> > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > >
> > > This is alot better thank you. A few remaining issues.
> >
> > My comments inline too.
> > Thanks
> > Konstantin
> >
> > >
> > > > ---
> > > > app/test-acl/main.c | 13 +-
> > > > lib/librte_acl/Makefile | 5 +-
> > > > lib/librte_acl/acl_bld.c | 5 +-
> > > > lib/librte_acl/acl_match_check.def | 92 ++++
> > > > lib/librte_acl/acl_run.c | 944 -------------------------------------
> > > > lib/librte_acl/acl_run.h | 220 +++++++++
> > > > lib/librte_acl/acl_run_scalar.c | 197 ++++++++
> > > > lib/librte_acl/acl_run_sse.c | 630 +++++++++++++++++++++++++
> > > > lib/librte_acl/rte_acl.c | 15 +
> > > > lib/librte_acl/rte_acl.h | 24 +-
> > > > 10 files changed, 1189 insertions(+), 956 deletions(-)
> > > > create mode 100644 lib/librte_acl/acl_match_check.def
> > > > delete mode 100644 lib/librte_acl/acl_run.c
> > > > create mode 100644 lib/librte_acl/acl_run.h
> > > > create mode 100644 lib/librte_acl/acl_run_scalar.c
> > > > create mode 100644 lib/librte_acl/acl_run_sse.c
> > > >
> > > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > > index d654409..45c6fa6 100644
> > > > --- a/app/test-acl/main.c
> > > > +++ b/app/test-acl/main.c
> > > > @@ -787,6 +787,10 @@ acx_init(void)
> > > > /* perform build. */
> > > > ret = rte_acl_build(config.acx, &cfg);
> > > >
> > > > + /* setup default rte_acl_classify */
> > > > + if (config.scalar)
> > > > + rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +
> > > Exporting this variable as part of the ABI is a bad idea. If the prototype of
> > > the function changes you have to update all your applications.
> >
> > If the prototype of rte_acl_classify will change, most likely you'll have to update code that uses it anyway.
> >
> Why? If you hide this from the application, changes to the internal
> implementation will also be invisible. When building as a DSO, an application
> will be able to transition between libraries without the need for a rebuild.
Because rte_acl_classify() is part of the ACL API that users use.
If we add/modify its parameters and/or return value, users would have to change their apps anyway.
> > > Make the pointer
> > > an internal symbol and set it using a get/set routine with an enum to represent
> > > the path to choose. That will help isolate the ABI from the internal
> > > implementation.
> >
> > That's was my first intention too.
> > But then I realised that if we'll make it internal, then we'll need to make rte_acl_classify() a proper function
> > and it will cost us extra call (or jump).
> Thats true, but I don't see that as a problem. We're not talking about a hot
> code path here, its a setup function.
I am not talking about rte_acl_select_classify() but about rte_acl_classify() itself (not the setup code path).
If I make rte_acl_default_classify static, rte_acl_classify() would need to become a real function and it would be something like this:
-> call rte_acl_classify
---> load the rte_acl_default_classify value into a reg
---> jmp (*reg)
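In C terms that wrapper is simply (and this is essentially the shape the V3 patch adopts):

int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	/* one extra indirect call compared to the macro form */
	return rte_acl_default_classify(ctx, data, results, num, categories);
}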
> Or do you think that an application will
> be switching between classification functions on every classify operation?
God no.
> > Also I think user should have an ability to change default classify code path without modifying/rebuilding acl library.
> I agree, but both the methods we are advocating for allow that. Its really just
> a question of exposing the mechanism as data or text in the binary. Exposing it
> as data comes with implicit ABI constraints that are less prevalanet when done
> as code entry points.
> > For example: a bug in an optimised code path is discovered, or user may want to implement and use his own version of classify().
> In the case of a bug in the optimized path, you just fix the bug.
It is not about me. It is about a user who gets librte_acl as part of a binary distribution.
Of course, he will probably report it and we will probably fix it sooner or later.
But with such an ability he can switch to the safe implementation immediately,
without touching the library, and then wait for the fix.
> If you want
> to provide your own classification function, thats fine I suppose, but that
> seems completely outside the scope of what we're trying to do here. Its not
> adventageous to just throw that in there. If you want to be able to provide
> your own classifier function, lets at least take some time to make sure that the
> function prototype is sufficiently capable to accept all the data you might want
> to pass it in the future, before we go exposing it. Otherwise you'll have to
> break the ABI in future versions, whcih is something we've been discussing
> trying to avoid.
rte_acl_classify() is already exposed (part of the API), same as rte_acl_classify_scalar().
If in the future we change these functions' prototypes, it will break the ABI anyway.
>
> > > It will also let you prevent things like selecting a run time
> > > path that is incompatible with the running system
> >
> > If the user going to update rte_acl_default_classify he is probably smart enough to know what he is doing.
> That really seems like poor design to me. I don't see why you wouldn't at least
> want to warn the developer of an application if they were at run time to assign
> a default classifier method that was incompatible with a running system. Yes,
> they're likely smart enough to know what their doing, but smart people make
> mistakes, and appreciate being told when they're doing so, especially if the
> method of telling is something a bit more civil than a machine check that
> might occur well after the application has been initilized.
I have no problem providing an rte_acl_check_classify(flags_required, classify_ptr) that would do the checking and emit the warning.
Though, as I said above, I'd prefer not to hide rte_acl_default_classify, as that would cause extra overhead for rte_acl_classify().
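A possible shape for that helper -- the name comes from the proposal above, while the parameter types, the log call and the exact check are only illustrative:

#include <errno.h>
#include <rte_log.h>
#include <rte_cpuflags.h>
#include <rte_acl.h>

static inline int
rte_acl_check_classify(enum rte_cpu_flag_t flag_required,
	rte_acl_classify_t classify_ptr)
{
	/* the scalar code path is assumed to run everywhere */
	if (classify_ptr != rte_acl_classify_scalar &&
			!rte_cpu_get_flag_enabled(flag_required)) {
		RTE_LOG(WARNING, ACL,
			"selected classify() is not supported on this CPU\n");
		return -ENOTSUP;
	}
	return 0;
}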
>
> > From other hand - user can hit same problem by simply calling rte_acl_classify_sse() directly.
> Not if the function is statically declared and not exposed to the application
> they cant :)
I don't really want to hide rte_acl_classify_sse()/rte_acl_classify_scalar().
They should be available directly, I think.
In the future we might introduce new versions for more sophisticated ISAs (rte_acl_classify_avx() or something).
Users should have the ability to downgrade their classify() function if they like.
> >
> > > and prevent path switching
> > > during searches, which may produce unexpected results.
> >
> > Not that I am advertising it, but it should be safe to update rte_acl_default_classify during searches:
> > All versions of classify should produce exactly the same result for each input packet and treat acl context as read-only.
> >
> Fair enough.
>
> > >
> > > ><snip>
> > > > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > > > deleted file mode 100644
> > > > index e3d9fc1..0000000
> > > > --- a/lib/librte_acl/acl_run.c
> > > > +++ /dev/null
> > > > @@ -1,944 +0,0 @@
> > > > -/*-
> > > > - * BSD LICENSE
> > > > - *
> > > > - * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > - * All rights reserved.
> > > > - *
> > > > - * Redistribution and use in source and binary forms, with or without
> > > > - * modification, are permitted provided that the following conditions
> > > ><snip>
> > > > +
> > > > +#define __func_resolve_priority__ resolve_priority_scalar
> > > > +#define __func_match_check__ acl_match_check_scalar
> > > > +#include "acl_match_check.def"
> > > > +
> > > I get this lets you make some more code common, but its just unpleasant to trace
> > > through. Looking at the defintion of __func_match_check__ I don't see anything
> > > particularly performance sensitive there. What if instead you simply redefined
> > > __func_match_check__ in a common internal header as acl_match_check (a generic
> > > function), and had it accept priority resolution function as an argument? That
> > > would still give you all the performance enhancements without having to include
> > > c files in the middle of other c files, and would make the code a bit more
> > > parseable.
> >
> > Yes, that way it would look much better.
> > And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
> > Will change as you suggested.
> >
> Thank you
> Neil
>
> > >
> > > > +/*
> > > > + * When processing the transition, rather than using if/else
> > > > + * construct, the offset is calculated for DFA and QRANGE and
> > > > + * then conditionally added to the address based on node type.
> > > > + * This is done to avoid branch mis-predictions. Since the
> > > > + * offset is rather simple calculation it is more efficient
> > > > + * to do the calculation and do a condition move rather than
> > > > + * a conditional branch to determine which calculation to do.
> > > > + */
> > > > +static inline uint32_t
> > > > +scan_forward(uint32_t input, uint32_t max)
> > > > +{
> > > > + return (input == 0) ? max : rte_bsf32(input);
> > > > +}
> > > > + }
> > > > +}
> > > ><snip>
> > > > +
> > > > +#define __func_resolve_priority__ resolve_priority_sse
> > > > +#define __func_match_check__ acl_match_check_sse
> > > > +#include "acl_match_check.def"
> > > > +
> > > Same deal as above.
> > >
> > > > +/*
> > > > + * Extract transitions from an XMM register and check for any matches
> > > > + */
> > > > +static void
> > > > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > > > + struct parms *parms, struct acl_flow_data *flows)
> > > > +{
> > > > + uint64_t transition1, transition2;
> > > > +
> > > > + /* extract transition from low 64 bits. */
> > > > + transition1 = MM_CVT64(*indicies);
> > > > +
> > > > + /* extract transition from high 64 bits. */
> > > > + *indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > > > + transition2 = MM_CVT64(*indicies);
> > > > +
> > > > + transition1 = acl_match_check_sse(transition1, slot, ctx,
> > > > + parms, flows);
> > > > + transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > > > + parms, flows);
> > > > +
> > > > + /* update indicies with new transitions. */
> > > > + *indicies = MM_SET64(transition2, transition1);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Check for a match in 2 transitions (contained in SSE register)
> > > > + */
> > > > +static inline void
> > > > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > > + struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > > > +{
> > > > + xmm_t temp;
> > > > +
> > > > + temp = MM_AND(match_mask, *indicies);
> > > > + while (!MM_TESTZ(temp, temp)) {
> > > > + acl_process_matches(indicies, slot, ctx, parms, flows);
> > > > + temp = MM_AND(match_mask, *indicies);
> > > > + }
> > > > +}
> > > > +
> > > > +/*
> > > > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > > > + */
> > > > +static inline void
> > > > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > > + struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > > > + xmm_t match_mask)
> > > > +{
> > > > + xmm_t temp;
> > > > +
> > > > + /* put low 32 bits of each transition into one register */
> > > > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > > + 0x88);
> > > > + /* test for match node */
> > > > + temp = MM_AND(match_mask, temp);
> > > > +
> > > > + while (!MM_TESTZ(temp, temp)) {
> > > > + acl_process_matches(indicies1, slot, ctx, parms, flows);
> > > > + acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> > > > +
> > > > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > > + (__m128)*indicies2,
> > > > + 0x88);
> > > > + temp = MM_AND(match_mask, temp);
> > > > + }
> > > > +}
> > > > +
> > > > +/*
> > > > + * Calculate the address of the next transition for
> > > > + * all types of nodes. Note that only DFA nodes and range
> > > > + * nodes actually transition to another node. Match
> > > > + * nodes don't move.
> > > > + */
> > > > +static inline xmm_t
> > > > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > > + xmm_t *indicies1, xmm_t *indicies2)
> > > > +{
> > > > + xmm_t addr, node_types, temp;
> > > > +
> > > > + /*
> > > > + * Note that no transition is done for a match
> > > > + * node and therefore a stream freezes when
> > > > + * it reaches a match.
> > > > + */
> > > > +
> > > > + /* Shuffle low 32 into temp and high 32 into indicies2 */
> > > > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > > + 0x88);
> > > > + *indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > > + (__m128)*indicies2, 0xdd);
> > > > +
> > > > + /* Calc node type and node addr */
> > > > + node_types = MM_ANDNOT(index_mask, temp);
> > > > + addr = MM_AND(index_mask, temp);
> > > > +
> > > > + /*
> > > > + * Calc addr for DFAs - addr = dfa_index + input_byte
> > > > + */
> > > > +
> > > > + /* mask for DFA type (0) nodes */
> > > > + temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > > > +
> > > > + /* add input byte to DFA position */
> > > > + temp = MM_AND(temp, bytes);
> > > > + temp = MM_AND(temp, next_input);
> > > > + addr = MM_ADD32(addr, temp);
> > > > +
> > > > + /*
> > > > + * Calc addr for Range nodes -> range_index + range(input)
> > > > + */
> > > > + node_types = MM_CMPEQ32(node_types, type_quad_range);
> > > > +
> > > > + /*
> > > > + * Calculate number of range boundaries that are less than the
> > > > + * input value. Range boundaries for each node are in signed 8 bit,
> > > > + * ordered from -128 to 127 in the indicies2 register.
> > > > + * This is effectively a popcnt of bytes that are greater than the
> > > > + * input byte.
> > > > + */
> > > > +
> > > > + /* shuffle input byte to all 4 positions of 32 bit value */
> > > > + temp = MM_SHUFFLE8(next_input, shuffle_input);
> > > > +
> > > > + /* check ranges */
> > > > + temp = MM_CMPGT8(temp, *indicies2);
> > > > +
> > > > + /* convert -1 to 1 (bytes greater than input byte */
> > > > + temp = MM_SIGN8(temp, temp);
> > > > +
> > > > + /* horizontal add pairs of bytes into words */
> > > > + temp = MM_MADD8(temp, temp);
> > > > +
> > > > + /* horizontal add pairs of words into dwords */
> > > > + temp = MM_MADD16(temp, ones_16);
> > > > +
> > > > + /* mask to range type nodes */
> > > > + temp = MM_AND(temp, node_types);
> > > > +
> > > > + /* add index into node position */
> > > > + return MM_ADD32(addr, temp);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > > > + */
> > > > +static inline xmm_t
> > > > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > > + const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > > > +{
> > > > + xmm_t addr;
> > > > + uint64_t trans0, trans2;
> > > > +
> > > > + /* Calculate the address (array index) for all 4 transitions. */
> > > > +
> > > > + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > > + bytes, type_quad_range, indicies1, indicies2);
> > > > +
> > > > + /* Gather 64 bit transitions and pack back into 2 registers. */
> > > > +
> > > > + trans0 = trans[MM_CVT32(addr)];
> > > > +
> > > > + /* get slot 2 */
> > > > +
> > > > + /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > > > + trans2 = trans[MM_CVT32(addr)];
> > > > +
> > > > + /* get slot 1 */
> > > > +
> > > > + /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > > > +
> > > > + /* get slot 3 */
> > > > +
> > > > + /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > > > + *indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > > > +
> > > > + return MM_SRL32(next_input, 8);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Execute trie traversal with 8 traversals in parallel
> > > > + */
> > > > +static inline int
> > > > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > + uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > > +{
> > > > + int n;
> > > > + struct acl_flow_data flows;
> > > > + uint64_t index_array[MAX_SEARCHES_SSE8];
> > > > + struct completion cmplt[MAX_SEARCHES_SSE8];
> > > > + struct parms parms[MAX_SEARCHES_SSE8];
> > > > + xmm_t input0, input1;
> > > > + xmm_t indicies1, indicies2, indicies3, indicies4;
> > > > +
> > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > + total_packets, categories, ctx->trans_table);
> > > > +
> > > > + for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > > > + cmplt[n].count = 0;
> > > > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > > + }
> > > > +
> > > > + /*
> > > > + * indicies1 contains index_array[0,1]
> > > > + * indicies2 contains index_array[2,3]
> > > > + * indicies3 contains index_array[4,5]
> > > > + * indicies4 contains index_array[6,7]
> > > > + */
> > > > +
> > > > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > > +
> > > > + indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > > > + indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > > > +
> > > > + /* Check for any matches. */
> > > > + acl_match_check_x4(0, ctx, parms, &flows,
> > > > + &indicies1, &indicies2, mm_match_mask.m);
> > > > + acl_match_check_x4(4, ctx, parms, &flows,
> > > > + &indicies3, &indicies4, mm_match_mask.m);
> > > > +
> > > > + while (flows.started > 0) {
> > > > +
> > > > + /* Gather 4 bytes of input data for each stream. */
> > > > + input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> > > > + 0);
> > > > + input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> > > > + 0);
> > > > +
> > > > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> > > > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> > > > +
> > > > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> > > > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> > > > +
> > > > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> > > > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> > > > +
> > > > + /* Process the 4 bytes of input on each stream. */
> > > > +
> > > > + input0 = transition4(mm_index_mask.m, input0,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies1, &indicies2);
> > > > +
> > > > + input1 = transition4(mm_index_mask.m, input1,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies3, &indicies4);
> > > > +
> > > > + input0 = transition4(mm_index_mask.m, input0,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies1, &indicies2);
> > > > +
> > > > + input1 = transition4(mm_index_mask.m, input1,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies3, &indicies4);
> > > > +
> > > > + input0 = transition4(mm_index_mask.m, input0,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies1, &indicies2);
> > > > +
> > > > + input1 = transition4(mm_index_mask.m, input1,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies3, &indicies4);
> > > > +
> > > > + input0 = transition4(mm_index_mask.m, input0,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies1, &indicies2);
> > > > +
> > > > + input1 = transition4(mm_index_mask.m, input1,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies3, &indicies4);
> > > > +
> > > > + /* Check for any matches. */
> > > > + acl_match_check_x4(0, ctx, parms, &flows,
> > > > + &indicies1, &indicies2, mm_match_mask.m);
> > > > + acl_match_check_x4(4, ctx, parms, &flows,
> > > > + &indicies3, &indicies4, mm_match_mask.m);
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Execute trie traversal with 4 traversals in parallel
> > > > + */
> > > > +static inline int
> > > > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > + uint32_t *results, int total_packets, uint32_t categories)
> > > > +{
> > > > + int n;
> > > > + struct acl_flow_data flows;
> > > > + uint64_t index_array[MAX_SEARCHES_SSE4];
> > > > + struct completion cmplt[MAX_SEARCHES_SSE4];
> > > > + struct parms parms[MAX_SEARCHES_SSE4];
> > > > + xmm_t input, indicies1, indicies2;
> > > > +
> > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > + total_packets, categories, ctx->trans_table);
> > > > +
> > > > + for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > > > + cmplt[n].count = 0;
> > > > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > > + }
> > > > +
> > > > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > > +
> > > > + /* Check for any matches. */
> > > > + acl_match_check_x4(0, ctx, parms, &flows,
> > > > + &indicies1, &indicies2, mm_match_mask.m);
> > > > +
> > > > + while (flows.started > 0) {
> > > > +
> > > > + /* Gather 4 bytes of input data for each stream. */
> > > > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > > > +
> > > > + /* Process the 4 bytes of input on each stream. */
> > > > + input = transition4(mm_index_mask.m, input,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies1, &indicies2);
> > > > +
> > > > + input = transition4(mm_index_mask.m, input,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies1, &indicies2);
> > > > +
> > > > + input = transition4(mm_index_mask.m, input,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies1, &indicies2);
> > > > +
> > > > + input = transition4(mm_index_mask.m, input,
> > > > + mm_shuffle_input.m, mm_ones_16.m,
> > > > + mm_bytes.m, mm_type_quad_range.m,
> > > > + flows.trans, &indicies1, &indicies2);
> > > > +
> > > > + /* Check for any matches. */
> > > > + acl_match_check_x4(0, ctx, parms, &flows,
> > > > + &indicies1, &indicies2, mm_match_mask.m);
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static inline xmm_t
> > > > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > > + const uint64_t *trans, xmm_t *indicies1)
> > > > +{
> > > > + uint64_t t;
> > > > + xmm_t addr, indicies2;
> > > > +
> > > > + indicies2 = MM_XOR(ones_16, ones_16);
> > > > +
> > > > + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > > + bytes, type_quad_range, indicies1, &indicies2);
> > > > +
> > > > + /* Gather 64 bit transitions and pack 2 per register. */
> > > > +
> > > > + t = trans[MM_CVT32(addr)];
> > > > +
> > > > + /* get slot 1 */
> > > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > > > +
> > > > + return MM_SRL32(next_input, 8);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Execute trie traversal with 2 traversals in parallel.
> > > > + */
> > > > +static inline int
> > > > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > + uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > > +{
> > > > + int n;
> > > > + struct acl_flow_data flows;
> > > > + uint64_t index_array[MAX_SEARCHES_SSE2];
> > > > + struct completion cmplt[MAX_SEARCHES_SSE2];
> > > > + struct parms parms[MAX_SEARCHES_SSE2];
> > > > + xmm_t input, indicies;
> > > > +
> > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > + total_packets, categories, ctx->trans_table);
> > > > +
> > > > + for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > > > + cmplt[n].count = 0;
> > > > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > > + }
> > > > +
> > > > + indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > > > +
> > > > + /* Check for any matches. */
> > > > + acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> > > > +
> > > > + while (flows.started > 0) {
> > > > +
> > > > + /* Gather 4 bytes of input data for each stream. */
> > > > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > > +
> > > > + /* Process the 4 bytes of input on each stream. */
> > > > +
> > > > + input = transition2(mm_index_mask64.m, input,
> > > > + mm_shuffle_input64.m, mm_ones_16.m,
> > > > + mm_bytes64.m, mm_type_quad_range64.m,
> > > > + flows.trans, &indicies);
> > > > +
> > > > + input = transition2(mm_index_mask64.m, input,
> > > > + mm_shuffle_input64.m, mm_ones_16.m,
> > > > + mm_bytes64.m, mm_type_quad_range64.m,
> > > > + flows.trans, &indicies);
> > > > +
> > > > + input = transition2(mm_index_mask64.m, input,
> > > > + mm_shuffle_input64.m, mm_ones_16.m,
> > > > + mm_bytes64.m, mm_type_quad_range64.m,
> > > > + flows.trans, &indicies);
> > > > +
> > > > + input = transition2(mm_index_mask64.m, input,
> > > > + mm_shuffle_input64.m, mm_ones_16.m,
> > > > + mm_bytes64.m, mm_type_quad_range64.m,
> > > > + flows.trans, &indicies);
> > > > +
> > > > + /* Check for any matches. */
> > > > + acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > > > + mm_match_mask64.m);
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +int
> > > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > + uint32_t *results, uint32_t num, uint32_t categories)
> > > > +{
> > > > + if (categories != 1 &&
> > > > + ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > > > + return -EINVAL;
> > > > +
> > > > + if (likely(num >= MAX_SEARCHES_SSE8))
> > > > + return search_sse_8(ctx, data, results, num, categories);
> > > > + else if (num >= MAX_SEARCHES_SSE4)
> > > > + return search_sse_4(ctx, data, results, num, categories);
> > > > + else
> > > > + return search_sse_2(ctx, data, results, num, categories);
> > > > +}
> > > > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > > > index 7c288bd..0cde07e 100644
> > > > --- a/lib/librte_acl/rte_acl.c
> > > > +++ b/lib/librte_acl/rte_acl.c
> > > > @@ -38,6 +38,21 @@
> > > >
> > > > TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> > > >
> > > > +/* by default, use the always available scalar code path. */
> > > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +
> > > make this static, the outside world shouldn't need to see it.
> >
> > As I said above, I think it more plausible to keep it globally visible.
> >
> > >
> > > > +void __attribute__((constructor(INT16_MAX)))
> > > > +rte_acl_select_classify(void)
> > > Make it static, the outside world doesn't need to call this.
> >
> > See above, would like user to have an ability to call it manually if needed.
> >
> > >
> > > > +{
> > > > + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > > > + /* SSE version requires SSE4.1 */
> > > > + rte_acl_default_classify = rte_acl_classify_sse;
> > > > + } else {
> > > > + /* reset to scalar version. */
> > > > + rte_acl_default_classify = rte_acl_classify_scalar;
> > > Don't need the else clause here, the static initializer has you covered.
> >
> > I think we better keep it like that - in case user calls it manually.
> > We always reset rte_acl_default_classify to the 'best proper' value.
> >
> > > > + }
> > > > +}
> > > > +
> > > > +
> > > > +/**
> > > > + * Invokes default rte_acl_classify function.
> > > > + */
> > > > +extern rte_acl_classify_t rte_acl_default_classify;
> > > > +
> > > Doesn't need to be extern.
> > > > +#define rte_acl_classify(ctx, data, results, num, categories) \
> > > > + (*rte_acl_default_classify)(ctx, data, results, num, categories)
> > > > +
> > > Not sure why you need this either. The rte_acl_classify_t should be enough, no?
> >
> > We preserve existing rte_acl_classify() API, so users don't need to modify their code.
> >
> This would be a great candidate for versioning (Bruce and I have been discussing
> this).
>
> Neil
>
> >
^ permalink raw reply [relevance 3%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
2014-08-08 11:49 0% ` Ananyev, Konstantin
@ 2014-08-08 12:25 4% ` Neil Horman
2014-08-08 13:09 3% ` Ananyev, Konstantin
0 siblings, 1 reply; 40+ results
From: Neil Horman @ 2014-08-08 12:25 UTC (permalink / raw)
To: Ananyev, Konstantin; +Cc: dev
On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Thursday, August 07, 2014 9:12 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> >
> > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > Make ACL library to build/work on 'default' architecture:
> > > - make rte_acl_classify_scalar really scalar
> > > (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > > - Provide two versions of rte_acl_classify code path:
> > > rte_acl_classify_sse() - could be built and used only on systems with sse4.2
> > > and upper, return -ENOTSUP on lower arch.
> > > rte_acl_classify_scalar() - a slower version, but could be built and used
> > > on all systems.
> > > - keep common code shared between these two codepaths.
> > >
> > > v2 changes:
> > > run-time selection of most appropriate code-path for given ISA.
> > > By default the highest supported one is selected.
> > > User can still override that selection by manually assigning new value to
> > > the global function pointer rte_acl_default_classify.
> > > rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > points to.
> > >
> > >
> > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> >
> > This is a lot better, thank you. A few remaining issues.
>
> My comments inline too.
> Thanks
> Konstantin
>
> >
> > > ---
> > > app/test-acl/main.c | 13 +-
> > > lib/librte_acl/Makefile | 5 +-
> > > lib/librte_acl/acl_bld.c | 5 +-
> > > lib/librte_acl/acl_match_check.def | 92 ++++
> > > lib/librte_acl/acl_run.c | 944 -------------------------------------
> > > lib/librte_acl/acl_run.h | 220 +++++++++
> > > lib/librte_acl/acl_run_scalar.c | 197 ++++++++
> > > lib/librte_acl/acl_run_sse.c | 630 +++++++++++++++++++++++++
> > > lib/librte_acl/rte_acl.c | 15 +
> > > lib/librte_acl/rte_acl.h | 24 +-
> > > 10 files changed, 1189 insertions(+), 956 deletions(-)
> > > create mode 100644 lib/librte_acl/acl_match_check.def
> > > delete mode 100644 lib/librte_acl/acl_run.c
> > > create mode 100644 lib/librte_acl/acl_run.h
> > > create mode 100644 lib/librte_acl/acl_run_scalar.c
> > > create mode 100644 lib/librte_acl/acl_run_sse.c
> > >
> > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > index d654409..45c6fa6 100644
> > > --- a/app/test-acl/main.c
> > > +++ b/app/test-acl/main.c
> > > @@ -787,6 +787,10 @@ acx_init(void)
> > > /* perform build. */
> > > ret = rte_acl_build(config.acx, &cfg);
> > >
> > > + /* setup default rte_acl_classify */
> > > + if (config.scalar)
> > > + rte_acl_default_classify = rte_acl_classify_scalar;
> > > +
> > Exporting this variable as part of the ABI is a bad idea. If the prototype of
> > the function changes you have to update all your applications.
>
> If the prototype of rte_acl_classify changes, you'll most likely have to update the code that uses it anyway.
>
Why? If you hide this from the application, changes to the internal
implementation will also be invisible. When building as a DSO, an application
will be able to transition between libraries without the need for a rebuild.
> > Make the pointer
> > an internal symbol and set it using a get/set routine with an enum to represent
> > the path to choose. That will help isolate the ABI from the internal
> > implementation.
>
> That was my first intention too.
> But then I realised that if we make it internal, then we'll need to make rte_acl_classify() a proper function
> and it will cost us an extra call (or jump).
That's true, but I don't see that as a problem. We're not talking about a hot
code path here, it's a setup function. Or do you think that an application will
be switching between classification functions on every classify operation?
> Also I think user should have an ability to change default classify code path without modifying/rebuilding acl library.
I agree, but both the methods we are advocating for allow that. It's really just
a question of exposing the mechanism as data or text in the binary. Exposing it
as data comes with implicit ABI constraints that are less prevalent when done
as code entry points.
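To make the data-versus-code distinction concrete, a minimal sketch of the get/set shape
in question; the enum and rte_acl_set_classify_alg() are hypothetical names, not taken
from either patch:

/* sketch of what could live in rte_acl.c/rte_acl.h; assumes <errno.h>,
 * rte_cpuflags.h and the rte_acl_classify_t typedef from the patch */

enum rte_acl_classify_alg {
        RTE_ACL_CLASSIFY_SCALAR,
        RTE_ACL_CLASSIFY_SSE,
};

/* the pointer stays internal to rte_acl.c and out of the ABI */
static rte_acl_classify_t acl_classify_fn = rte_acl_classify_scalar;

int
rte_acl_set_classify_alg(enum rte_acl_classify_alg alg)
{
        switch (alg) {
        case RTE_ACL_CLASSIFY_SSE:
                /* refuse a path the running CPU cannot execute */
                if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
                        return -ENOTSUP;
                acl_classify_fn = rte_acl_classify_sse;
                return 0;
        case RTE_ACL_CLASSIFY_SCALAR:
                acl_classify_fn = rte_acl_classify_scalar;
                return 0;
        default:
                return -EINVAL;
        }
}

/* rte_acl_classify() stays a real function, so only code is exported */
int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
        uint32_t *results, uint32_t num, uint32_t categories)
{
        return acl_classify_fn(ctx, data, results, num, categories);
}

An application would call rte_acl_set_classify_alg(RTE_ACL_CLASSIFY_SCALAR) instead of
assigning the pointer, and an unsupported selection fails with -ENOTSUP up front instead
of faulting later.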
> For example: a bug in an optimised code path is discovered, or user may want to implement and use his own version of classify().
In the case of a bug in the optimized path, you just fix the bug. If you want
to provide your own classification function, that's fine I suppose, but that
seems completely outside the scope of what we're trying to do here. It's not
advantageous to just throw that in there. If you want to be able to provide
your own classifier function, let's at least take some time to make sure that the
function prototype is sufficiently capable of accepting all the data you might want
to pass it in the future, before we go exposing it. Otherwise you'll have to
break the ABI in future versions, which is something we've been trying to avoid.
> > It will also let you prevent things like selecting a run time
> > path that is incompatible with the running system
>
> If the user is going to update rte_acl_default_classify, he is probably smart enough to know what he is doing.
That really seems like poor design to me. I don't see why you wouldn't at least
want to warn the developer of an application if they were to assign, at run time,
a default classifier method that is incompatible with the running system. Yes,
they're likely smart enough to know what they're doing, but smart people make
mistakes, and appreciate being told when they're making one, especially if the
method of telling is something a bit more civil than a machine check that
might occur well after the application has been initialized.
> On the other hand - the user can hit the same problem by simply calling rte_acl_classify_sse() directly.
Not if the function is statically declared and not exposed to the application
they can't :)
>
> > and prevent path switching
> > during searches, which may produce unexpected results.
>
> Not that I am advertising it, but it should be safe to update rte_acl_default_classify during searches:
> All versions of classify should produce exactly the same result for each input packet and treat acl context as read-only.
>
Fair enough.
> >
> > ><snip>
> > > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > > deleted file mode 100644
> > > index e3d9fc1..0000000
> > > --- a/lib/librte_acl/acl_run.c
> > > +++ /dev/null
> > > @@ -1,944 +0,0 @@
> > > -/*-
> > > - * BSD LICENSE
> > > - *
> > > - * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > - * All rights reserved.
> > > - *
> > > - * Redistribution and use in source and binary forms, with or without
> > > - * modification, are permitted provided that the following conditions
> > ><snip>
> > > +
> > > +#define __func_resolve_priority__ resolve_priority_scalar
> > > +#define __func_match_check__ acl_match_check_scalar
> > > +#include "acl_match_check.def"
> > > +
> > I get this lets you make some more code common, but it's just unpleasant to trace
> > through. Looking at the definition of __func_match_check__ I don't see anything
> > particularly performance sensitive there. What if instead you simply redefined
> > __func_match_check__ in a common internal header as acl_match_check (a generic
> > function), and had it accept a priority resolution function as an argument? That
> > would still give you all the performance enhancements without having to include
> > C files in the middle of other C files, and would make the code a bit more
> > parseable.
>
> Yes, that way it would look much better.
> And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
> Will change as you suggested.
>
Thank you
Neil
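For readers following the thread, the technique being agreed on looks roughly like the
following; a self-contained illustration with stand-in types, not the actual acl_run.h
code:

#include <stdint.h>

struct ctx;                     /* stand-ins for the real ACL structures */
struct flow;

typedef void (*resolve_priority_t)(struct ctx *, struct flow *);

/* one shared match-check body, parameterized by the priority resolver */
static inline uint64_t
match_check(uint64_t transition, struct ctx *ctx, struct flow *flow,
        resolve_priority_t resolve_priority)
{
        /* stand-in for the real logic: detect a match node, then resolve */
        if (transition & 1)
                resolve_priority(ctx, flow);
        return transition;
}

static void
resolve_priority_scalar(struct ctx *ctx, struct flow *flow)
{
        (void)ctx; (void)flow;  /* scalar priority resolution would go here */
}

static void
resolve_priority_sse(struct ctx *ctx, struct flow *flow)
{
        (void)ctx; (void)flow;  /* SSE priority resolution would go here */
}

/*
 * Thin per-ISA wrappers. Because each callback is a compile-time constant,
 * gcc (helped by -findirect-inlining) can inline both the shared body and
 * the resolver, so no indirect call survives on the hot path.
 */
static inline uint64_t
match_check_scalar(uint64_t t, struct ctx *c, struct flow *f)
{
        return match_check(t, c, f, resolve_priority_scalar);
}

static inline uint64_t
match_check_sse(uint64_t t, struct ctx *c, struct flow *f)
{
        return match_check(t, c, f, resolve_priority_sse);
}

Each .c file still ends up with its own fully inlined specialization; the only thing
that changes is that no C file gets textually included in another.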
> >
> > > +/*
> > > + * When processing the transition, rather than using if/else
> > > + * construct, the offset is calculated for DFA and QRANGE and
> > > + * then conditionally added to the address based on node type.
> > > + * This is done to avoid branch mis-predictions. Since the
> > > + * offset is rather simple calculation it is more efficient
> > > + * to do the calculation and do a condition move rather than
> > > + * a conditional branch to determine which calculation to do.
> > > + */
> > > +static inline uint32_t
> > > +scan_forward(uint32_t input, uint32_t max)
> > > +{
> > > + return (input == 0) ? max : rte_bsf32(input);
> > > +}
> > > + }
> > > +}
> > ><snip>
> > > +
> > > +#define __func_resolve_priority__ resolve_priority_sse
> > > +#define __func_match_check__ acl_match_check_sse
> > > +#include "acl_match_check.def"
> > > +
> > Same deal as above.
> >
> > > +/*
> > > + * Extract transitions from an XMM register and check for any matches
> > > + */
> > > +static void
> > > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > > + struct parms *parms, struct acl_flow_data *flows)
> > > +{
> > > + uint64_t transition1, transition2;
> > > +
> > > + /* extract transition from low 64 bits. */
> > > + transition1 = MM_CVT64(*indicies);
> > > +
> > > + /* extract transition from high 64 bits. */
> > > + *indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > > + transition2 = MM_CVT64(*indicies);
> > > +
> > > + transition1 = acl_match_check_sse(transition1, slot, ctx,
> > > + parms, flows);
> > > + transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > > + parms, flows);
> > > +
> > > + /* update indicies with new transitions. */
> > > + *indicies = MM_SET64(transition2, transition1);
> > > +}
> > > +
> > > +/*
> > > + * Check for a match in 2 transitions (contained in SSE register)
> > > + */
> > > +static inline void
> > > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > + struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > > +{
> > > + xmm_t temp;
> > > +
> > > + temp = MM_AND(match_mask, *indicies);
> > > + while (!MM_TESTZ(temp, temp)) {
> > > + acl_process_matches(indicies, slot, ctx, parms, flows);
> > > + temp = MM_AND(match_mask, *indicies);
> > > + }
> > > +}
> > > +
> > > +/*
> > > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > > + */
> > > +static inline void
> > > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > + struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > > + xmm_t match_mask)
> > > +{
> > > + xmm_t temp;
> > > +
> > > + /* put low 32 bits of each transition into one register */
> > > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > + 0x88);
> > > + /* test for match node */
> > > + temp = MM_AND(match_mask, temp);
> > > +
> > > + while (!MM_TESTZ(temp, temp)) {
> > > + acl_process_matches(indicies1, slot, ctx, parms, flows);
> > > + acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> > > +
> > > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > + (__m128)*indicies2,
> > > + 0x88);
> > > + temp = MM_AND(match_mask, temp);
> > > + }
> > > +}
> > > +
> > > +/*
> > > + * Calculate the address of the next transition for
> > > + * all types of nodes. Note that only DFA nodes and range
> > > + * nodes actually transition to another node. Match
> > > + * nodes don't move.
> > > + */
> > > +static inline xmm_t
> > > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > + xmm_t *indicies1, xmm_t *indicies2)
> > > +{
> > > + xmm_t addr, node_types, temp;
> > > +
> > > + /*
> > > + * Note that no transition is done for a match
> > > + * node and therefore a stream freezes when
> > > + * it reaches a match.
> > > + */
> > > +
> > > + /* Shuffle low 32 into temp and high 32 into indicies2 */
> > > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > + 0x88);
> > > + *indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > + (__m128)*indicies2, 0xdd);
> > > +
> > > + /* Calc node type and node addr */
> > > + node_types = MM_ANDNOT(index_mask, temp);
> > > + addr = MM_AND(index_mask, temp);
> > > +
> > > + /*
> > > + * Calc addr for DFAs - addr = dfa_index + input_byte
> > > + */
> > > +
> > > + /* mask for DFA type (0) nodes */
> > > + temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > > +
> > > + /* add input byte to DFA position */
> > > + temp = MM_AND(temp, bytes);
> > > + temp = MM_AND(temp, next_input);
> > > + addr = MM_ADD32(addr, temp);
> > > +
> > > + /*
> > > + * Calc addr for Range nodes -> range_index + range(input)
> > > + */
> > > + node_types = MM_CMPEQ32(node_types, type_quad_range);
> > > +
> > > + /*
> > > + * Calculate number of range boundaries that are less than the
> > > + * input value. Range boundaries for each node are in signed 8 bit,
> > > + * ordered from -128 to 127 in the indicies2 register.
> > > + * This is effectively a popcnt of bytes that are greater than the
> > > + * input byte.
> > > + */
> > > +
> > > + /* shuffle input byte to all 4 positions of 32 bit value */
> > > + temp = MM_SHUFFLE8(next_input, shuffle_input);
> > > +
> > > + /* check ranges */
> > > + temp = MM_CMPGT8(temp, *indicies2);
> > > +
> > > + /* convert -1 to 1 (bytes greater than input byte) */
> > > + temp = MM_SIGN8(temp, temp);
> > > +
> > > + /* horizontal add pairs of bytes into words */
> > > + temp = MM_MADD8(temp, temp);
> > > +
> > > + /* horizontal add pairs of words into dwords */
> > > + temp = MM_MADD16(temp, ones_16);
> > > +
> > > + /* mask to range type nodes */
> > > + temp = MM_AND(temp, node_types);
> > > +
> > > + /* add index into node position */
> > > + return MM_ADD32(addr, temp);
> > > +}
> > > +
> > > +/*
> > > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > > + */
> > > +static inline xmm_t
> > > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > + const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > > +{
> > > + xmm_t addr;
> > > + uint64_t trans0, trans2;
> > > +
> > > + /* Calculate the address (array index) for all 4 transitions. */
> > > +
> > > + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > + bytes, type_quad_range, indicies1, indicies2);
> > > +
> > > + /* Gather 64 bit transitions and pack back into 2 registers. */
> > > +
> > > + trans0 = trans[MM_CVT32(addr)];
> > > +
> > > + /* get slot 2 */
> > > +
> > > + /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > > + trans2 = trans[MM_CVT32(addr)];
> > > +
> > > + /* get slot 1 */
> > > +
> > > + /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > > +
> > > + /* get slot 3 */
> > > +
> > > + /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > > + *indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > > +
> > > + return MM_SRL32(next_input, 8);
> > > +}
> > > +
> > > +/*
> > > + * Execute trie traversal with 8 traversals in parallel
> > > + */
> > > +static inline int
> > > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > + uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > +{
> > > + int n;
> > > + struct acl_flow_data flows;
> > > + uint64_t index_array[MAX_SEARCHES_SSE8];
> > > + struct completion cmplt[MAX_SEARCHES_SSE8];
> > > + struct parms parms[MAX_SEARCHES_SSE8];
> > > + xmm_t input0, input1;
> > > + xmm_t indicies1, indicies2, indicies3, indicies4;
> > > +
> > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > + total_packets, categories, ctx->trans_table);
> > > +
> > > + for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > > + cmplt[n].count = 0;
> > > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > + }
> > > +
> > > + /*
> > > + * indicies1 contains index_array[0,1]
> > > + * indicies2 contains index_array[2,3]
> > > + * indicies3 contains index_array[4,5]
> > > + * indicies4 contains index_array[6,7]
> > > + */
> > > +
> > > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > +
> > > + indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > > + indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > > +
> > > + /* Check for any matches. */
> > > + acl_match_check_x4(0, ctx, parms, &flows,
> > > + &indicies1, &indicies2, mm_match_mask.m);
> > > + acl_match_check_x4(4, ctx, parms, &flows,
> > > + &indicies3, &indicies4, mm_match_mask.m);
> > > +
> > > + while (flows.started > 0) {
> > > +
> > > + /* Gather 4 bytes of input data for each stream. */
> > > + input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> > > + 0);
> > > + input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> > > + 0);
> > > +
> > > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> > > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> > > +
> > > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> > > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> > > +
> > > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> > > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> > > +
> > > + /* Process the 4 bytes of input on each stream. */
> > > +
> > > + input0 = transition4(mm_index_mask.m, input0,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies1, &indicies2);
> > > +
> > > + input1 = transition4(mm_index_mask.m, input1,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies3, &indicies4);
> > > +
> > > + input0 = transition4(mm_index_mask.m, input0,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies1, &indicies2);
> > > +
> > > + input1 = transition4(mm_index_mask.m, input1,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies3, &indicies4);
> > > +
> > > + input0 = transition4(mm_index_mask.m, input0,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies1, &indicies2);
> > > +
> > > + input1 = transition4(mm_index_mask.m, input1,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies3, &indicies4);
> > > +
> > > + input0 = transition4(mm_index_mask.m, input0,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies1, &indicies2);
> > > +
> > > + input1 = transition4(mm_index_mask.m, input1,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies3, &indicies4);
> > > +
> > > + /* Check for any matches. */
> > > + acl_match_check_x4(0, ctx, parms, &flows,
> > > + &indicies1, &indicies2, mm_match_mask.m);
> > > + acl_match_check_x4(4, ctx, parms, &flows,
> > > + &indicies3, &indicies4, mm_match_mask.m);
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * Execute trie traversal with 4 traversals in parallel
> > > + */
> > > +static inline int
> > > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > + uint32_t *results, int total_packets, uint32_t categories)
> > > +{
> > > + int n;
> > > + struct acl_flow_data flows;
> > > + uint64_t index_array[MAX_SEARCHES_SSE4];
> > > + struct completion cmplt[MAX_SEARCHES_SSE4];
> > > + struct parms parms[MAX_SEARCHES_SSE4];
> > > + xmm_t input, indicies1, indicies2;
> > > +
> > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > + total_packets, categories, ctx->trans_table);
> > > +
> > > + for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > > + cmplt[n].count = 0;
> > > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > + }
> > > +
> > > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > +
> > > + /* Check for any matches. */
> > > + acl_match_check_x4(0, ctx, parms, &flows,
> > > + &indicies1, &indicies2, mm_match_mask.m);
> > > +
> > > + while (flows.started > 0) {
> > > +
> > > + /* Gather 4 bytes of input data for each stream. */
> > > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > > +
> > > + /* Process the 4 bytes of input on each stream. */
> > > + input = transition4(mm_index_mask.m, input,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies1, &indicies2);
> > > +
> > > + input = transition4(mm_index_mask.m, input,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies1, &indicies2);
> > > +
> > > + input = transition4(mm_index_mask.m, input,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies1, &indicies2);
> > > +
> > > + input = transition4(mm_index_mask.m, input,
> > > + mm_shuffle_input.m, mm_ones_16.m,
> > > + mm_bytes.m, mm_type_quad_range.m,
> > > + flows.trans, &indicies1, &indicies2);
> > > +
> > > + /* Check for any matches. */
> > > + acl_match_check_x4(0, ctx, parms, &flows,
> > > + &indicies1, &indicies2, mm_match_mask.m);
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static inline xmm_t
> > > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > + const uint64_t *trans, xmm_t *indicies1)
> > > +{
> > > + uint64_t t;
> > > + xmm_t addr, indicies2;
> > > +
> > > + indicies2 = MM_XOR(ones_16, ones_16);
> > > +
> > > + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > + bytes, type_quad_range, indicies1, &indicies2);
> > > +
> > > + /* Gather 64 bit transitions and pack 2 per register. */
> > > +
> > > + t = trans[MM_CVT32(addr)];
> > > +
> > > + /* get slot 1 */
> > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > > +
> > > + return MM_SRL32(next_input, 8);
> > > +}
> > > +
> > > +/*
> > > + * Execute trie traversal with 2 traversals in parallel.
> > > + */
> > > +static inline int
> > > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > + uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > +{
> > > + int n;
> > > + struct acl_flow_data flows;
> > > + uint64_t index_array[MAX_SEARCHES_SSE2];
> > > + struct completion cmplt[MAX_SEARCHES_SSE2];
> > > + struct parms parms[MAX_SEARCHES_SSE2];
> > > + xmm_t input, indicies;
> > > +
> > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > + total_packets, categories, ctx->trans_table);
> > > +
> > > + for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > > + cmplt[n].count = 0;
> > > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > + }
> > > +
> > > + indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > > +
> > > + /* Check for any matches. */
> > > + acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> > > +
> > > + while (flows.started > 0) {
> > > +
> > > + /* Gather 4 bytes of input data for each stream. */
> > > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > +
> > > + /* Process the 4 bytes of input on each stream. */
> > > +
> > > + input = transition2(mm_index_mask64.m, input,
> > > + mm_shuffle_input64.m, mm_ones_16.m,
> > > + mm_bytes64.m, mm_type_quad_range64.m,
> > > + flows.trans, &indicies);
> > > +
> > > + input = transition2(mm_index_mask64.m, input,
> > > + mm_shuffle_input64.m, mm_ones_16.m,
> > > + mm_bytes64.m, mm_type_quad_range64.m,
> > > + flows.trans, &indicies);
> > > +
> > > + input = transition2(mm_index_mask64.m, input,
> > > + mm_shuffle_input64.m, mm_ones_16.m,
> > > + mm_bytes64.m, mm_type_quad_range64.m,
> > > + flows.trans, &indicies);
> > > +
> > > + input = transition2(mm_index_mask64.m, input,
> > > + mm_shuffle_input64.m, mm_ones_16.m,
> > > + mm_bytes64.m, mm_type_quad_range64.m,
> > > + flows.trans, &indicies);
> > > +
> > > + /* Check for any matches. */
> > > + acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > > + mm_match_mask64.m);
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +int
> > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > + uint32_t *results, uint32_t num, uint32_t categories)
> > > +{
> > > + if (categories != 1 &&
> > > + ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > > + return -EINVAL;
> > > +
> > > + if (likely(num >= MAX_SEARCHES_SSE8))
> > > + return search_sse_8(ctx, data, results, num, categories);
> > > + else if (num >= MAX_SEARCHES_SSE4)
> > > + return search_sse_4(ctx, data, results, num, categories);
> > > + else
> > > + return search_sse_2(ctx, data, results, num, categories);
> > > +}
> > > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > > index 7c288bd..0cde07e 100644
> > > --- a/lib/librte_acl/rte_acl.c
> > > +++ b/lib/librte_acl/rte_acl.c
> > > @@ -38,6 +38,21 @@
> > >
> > > TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> > >
> > > +/* by default, use the always available scalar code path. */
> > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > > +
> > make this static, the outside world shouldn't need to see it.
>
> As I said above, I think it more plausible to keep it globally visible.
>
> >
> > > +void __attribute__((constructor(INT16_MAX)))
> > > +rte_acl_select_classify(void)
> > Make it static, the outside world doesn't need to call this.
>
> See above, would like user to have an ability to call it manually if needed.
>
> >
> > > +{
> > > + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > > + /* SSE version requires SSE4.1 */
> > > + rte_acl_default_classify = rte_acl_classify_sse;
> > > + } else {
> > > + /* reset to scalar version. */
> > > + rte_acl_default_classify = rte_acl_classify_scalar;
> > Don't need the else clause here, the static initializer has you covered.
>
> I think we better keep it like that - in case user calls it manually.
> We always reset rte_acl_default_classify to the 'best proper' value.
>
> > > + }
> > > +}
> > > +
> > > +
> > > +/**
> > > + * Invokes default rte_acl_classify function.
> > > + */
> > > +extern rte_acl_classify_t rte_acl_default_classify;
> > > +
> > Doesn't need to be extern.
> > > +#define rte_acl_classify(ctx, data, results, num, categories) \
> > > + (*rte_acl_default_classify)(ctx, data, results, num, categories)
> > > +
> > Not sure why you need this either. The rte_acl_classify_t should be enough, no?
>
> We preserve existing rte_acl_classify() API, so users don't need to modify their code.
>
This would be a great candidate for versioning (Bruce and I have been discussing
this).
Neil
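For reference, plain GNU symbol versioning would look roughly like the sketch below;
the version node names and what each version binds to are illustrative only, not a
proposal for specific semantics:

/* assumes rte_acl.h; LIBACL_1.0/LIBACL_2.0 come from a linker version script */

/* binding kept for applications built against the old library */
int
rte_acl_classify_v10(const struct rte_acl_ctx *ctx, const uint8_t **data,
        uint32_t *results, uint32_t num, uint32_t categories)
{
        /* illustrative only: keep whatever the old behaviour was */
        return rte_acl_classify_scalar(ctx, data, results, num, categories);
}
__asm__(".symver rte_acl_classify_v10, rte_acl_classify@LIBACL_1.0");

/* '@@' marks the default version that newly linked applications pick up */
int
rte_acl_classify_v20(const struct rte_acl_ctx *ctx, const uint8_t **data,
        uint32_t *results, uint32_t num, uint32_t categories)
{
        return rte_acl_default_classify(ctx, data, results, num, categories);
}
__asm__(".symver rte_acl_classify_v20, rte_acl_classify@@LIBACL_2.0");

The LIBACL_* nodes would be declared in a version script passed to the linker with
--version-script.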
>
^ permalink raw reply [relevance 4%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
2014-08-07 20:11 4% ` Neil Horman
2014-08-07 20:58 0% ` Vincent JARDIN
@ 2014-08-08 11:49 0% ` Ananyev, Konstantin
2014-08-08 12:25 4% ` Neil Horman
1 sibling, 1 reply; 40+ results
From: Ananyev, Konstantin @ 2014-08-08 11:49 UTC (permalink / raw)
To: Neil Horman; +Cc: dev
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Thursday, August 07, 2014 9:12 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
>
> On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > Make ACL library to build/work on 'default' architecture:
> > - make rte_acl_classify_scalar really scalar
> > (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > - Provide two versions of rte_acl_classify code path:
> > rte_acl_classify_sse() - could be built and used only on systems with sse4.2
> > and upper, return -ENOTSUP on lower arch.
> > rte_acl_classify_scalar() - a slower version, but could be built and used
> > on all systems.
> > - keep common code shared between these two codepaths.
> >
> > v2 changes:
> > run-time selection of most appropriate code-path for given ISA.
> > By default the highest supported one is selected.
> > User can still override that selection by manually assigning new value to
> > the global function pointer rte_acl_default_classify.
> > rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > points to.
> >
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>
> This is a lot better, thank you. A few remaining issues.
My comments inline too.
Thanks
Konstantin
>
> > ---
> > app/test-acl/main.c | 13 +-
> > lib/librte_acl/Makefile | 5 +-
> > lib/librte_acl/acl_bld.c | 5 +-
> > lib/librte_acl/acl_match_check.def | 92 ++++
> > lib/librte_acl/acl_run.c | 944 -------------------------------------
> > lib/librte_acl/acl_run.h | 220 +++++++++
> > lib/librte_acl/acl_run_scalar.c | 197 ++++++++
> > lib/librte_acl/acl_run_sse.c | 630 +++++++++++++++++++++++++
> > lib/librte_acl/rte_acl.c | 15 +
> > lib/librte_acl/rte_acl.h | 24 +-
> > 10 files changed, 1189 insertions(+), 956 deletions(-)
> > create mode 100644 lib/librte_acl/acl_match_check.def
> > delete mode 100644 lib/librte_acl/acl_run.c
> > create mode 100644 lib/librte_acl/acl_run.h
> > create mode 100644 lib/librte_acl/acl_run_scalar.c
> > create mode 100644 lib/librte_acl/acl_run_sse.c
> >
> > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > index d654409..45c6fa6 100644
> > --- a/app/test-acl/main.c
> > +++ b/app/test-acl/main.c
> > @@ -787,6 +787,10 @@ acx_init(void)
> > /* perform build. */
> > ret = rte_acl_build(config.acx, &cfg);
> >
> > + /* setup default rte_acl_classify */
> > + if (config.scalar)
> > + rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> Exporting this variable as part of the ABI is a bad idea. If the prototype of
> the function changes you have to update all your applications.
If the prototype of rte_acl_classify changes, you'll most likely have to update the code that uses it anyway.
> Make the pointer
> an internal symbol and set it using a get/set routine with an enum to represent
> the path to choose. That will help isolate the ABI from the internal
> implementation.
That was my first intention too.
But then I realised that if we make it internal, then we'll need to make rte_acl_classify() a proper function
and it will cost us an extra call (or jump).
Also I think user should have an ability to change default classify code path without modifying/rebuilding acl library.
For example: a bug in an optimised code path is discovered, or user may want to implement and use his own version of classify().
> It will also let you prevent things like selecting a run time
> path that is incompatible with the running system
If the user is going to update rte_acl_default_classify, he is probably smart enough to know what he is doing.
On the other hand - the user can hit the same problem by simply calling rte_acl_classify_sse() directly.
> and prevent path switching
> during searches, which may produce unexpected results.
Not that I am advertising it, but it should be safe to update rte_acl_default_classify during searches:
All versions of classify should produce exactly the same result for each input packet and treat acl context as read-only.
>
> ><snip>
> > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > deleted file mode 100644
> > index e3d9fc1..0000000
> > --- a/lib/librte_acl/acl_run.c
> > +++ /dev/null
> > @@ -1,944 +0,0 @@
> > -/*-
> > - * BSD LICENSE
> > - *
> > - * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > - * All rights reserved.
> > - *
> > - * Redistribution and use in source and binary forms, with or without
> > - * modification, are permitted provided that the following conditions
> ><snip>
> > +
> > +#define __func_resolve_priority__ resolve_priority_scalar
> > +#define __func_match_check__ acl_match_check_scalar
> > +#include "acl_match_check.def"
> > +
> I get this lets you make some more code common, but it's just unpleasant to trace
> through. Looking at the definition of __func_match_check__ I don't see anything
> particularly performance sensitive there. What if instead you simply redefined
> __func_match_check__ in a common internal header as acl_match_check (a generic
> function), and had it accept a priority resolution function as an argument? That
> would still give you all the performance enhancements without having to include
> C files in the middle of other C files, and would make the code a bit more
> parseable.
Yes, that way it would look much better.
And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
Will change as you suggested.
>
> > +/*
> > + * When processing the transition, rather than using if/else
> > + * construct, the offset is calculated for DFA and QRANGE and
> > + * then conditionally added to the address based on node type.
> > + * This is done to avoid branch mis-predictions. Since the
> > + * offset is rather simple calculation it is more efficient
> > + * to do the calculation and do a condition move rather than
> > + * a conditional branch to determine which calculation to do.
> > + */
> > +static inline uint32_t
> > +scan_forward(uint32_t input, uint32_t max)
> > +{
> > + return (input == 0) ? max : rte_bsf32(input);
> > +}
> > + }
> > +}
> ><snip>
> > +
> > +#define __func_resolve_priority__ resolve_priority_sse
> > +#define __func_match_check__ acl_match_check_sse
> > +#include "acl_match_check.def"
> > +
> Same deal as above.
>
> > +/*
> > + * Extract transitions from an XMM register and check for any matches
> > + */
> > +static void
> > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > + struct parms *parms, struct acl_flow_data *flows)
> > +{
> > + uint64_t transition1, transition2;
> > +
> > + /* extract transition from low 64 bits. */
> > + transition1 = MM_CVT64(*indicies);
> > +
> > + /* extract transition from high 64 bits. */
> > + *indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > + transition2 = MM_CVT64(*indicies);
> > +
> > + transition1 = acl_match_check_sse(transition1, slot, ctx,
> > + parms, flows);
> > + transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > + parms, flows);
> > +
> > + /* update indicies with new transitions. */
> > + *indicies = MM_SET64(transition2, transition1);
> > +}
> > +
> > +/*
> > + * Check for a match in 2 transitions (contained in SSE register)
> > + */
> > +static inline void
> > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > + struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > +{
> > + xmm_t temp;
> > +
> > + temp = MM_AND(match_mask, *indicies);
> > + while (!MM_TESTZ(temp, temp)) {
> > + acl_process_matches(indicies, slot, ctx, parms, flows);
> > + temp = MM_AND(match_mask, *indicies);
> > + }
> > +}
> > +
> > +/*
> > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > + */
> > +static inline void
> > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > + struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > + xmm_t match_mask)
> > +{
> > + xmm_t temp;
> > +
> > + /* put low 32 bits of each transition into one register */
> > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > + 0x88);
> > + /* test for match node */
> > + temp = MM_AND(match_mask, temp);
> > +
> > + while (!MM_TESTZ(temp, temp)) {
> > + acl_process_matches(indicies1, slot, ctx, parms, flows);
> > + acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> > +
> > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > + (__m128)*indicies2,
> > + 0x88);
> > + temp = MM_AND(match_mask, temp);
> > + }
> > +}
> > +
> > +/*
> > + * Calculate the address of the next transition for
> > + * all types of nodes. Note that only DFA nodes and range
> > + * nodes actually transition to another node. Match
> > + * nodes don't move.
> > + */
> > +static inline xmm_t
> > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > + xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > + xmm_t addr, node_types, temp;
> > +
> > + /*
> > + * Note that no transition is done for a match
> > + * node and therefore a stream freezes when
> > + * it reaches a match.
> > + */
> > +
> > + /* Shuffle low 32 into temp and high 32 into indicies2 */
> > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > + 0x88);
> > + *indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > + (__m128)*indicies2, 0xdd);
> > +
> > + /* Calc node type and node addr */
> > + node_types = MM_ANDNOT(index_mask, temp);
> > + addr = MM_AND(index_mask, temp);
> > +
> > + /*
> > + * Calc addr for DFAs - addr = dfa_index + input_byte
> > + */
> > +
> > + /* mask for DFA type (0) nodes */
> > + temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > +
> > + /* add input byte to DFA position */
> > + temp = MM_AND(temp, bytes);
> > + temp = MM_AND(temp, next_input);
> > + addr = MM_ADD32(addr, temp);
> > +
> > + /*
> > + * Calc addr for Range nodes -> range_index + range(input)
> > + */
> > + node_types = MM_CMPEQ32(node_types, type_quad_range);
> > +
> > + /*
> > + * Calculate number of range boundaries that are less than the
> > + * input value. Range boundaries for each node are in signed 8 bit,
> > + * ordered from -128 to 127 in the indicies2 register.
> > + * This is effectively a popcnt of bytes that are greater than the
> > + * input byte.
> > + */
> > +
> > + /* shuffle input byte to all 4 positions of 32 bit value */
> > + temp = MM_SHUFFLE8(next_input, shuffle_input);
> > +
> > + /* check ranges */
> > + temp = MM_CMPGT8(temp, *indicies2);
> > +
> > + /* convert -1 to 1 (bytes greater than input byte) */
> > + temp = MM_SIGN8(temp, temp);
> > +
> > + /* horizontal add pairs of bytes into words */
> > + temp = MM_MADD8(temp, temp);
> > +
> > + /* horizontal add pairs of words into dwords */
> > + temp = MM_MADD16(temp, ones_16);
> > +
> > + /* mask to range type nodes */
> > + temp = MM_AND(temp, node_types);
> > +
> > + /* add index into node position */
> > + return MM_ADD32(addr, temp);
> > +}
> > +
> > +/*
> > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > + */
> > +static inline xmm_t
> > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > + const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > + xmm_t addr;
> > + uint64_t trans0, trans2;
> > +
> > + /* Calculate the address (array index) for all 4 transitions. */
> > +
> > + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > + bytes, type_quad_range, indicies1, indicies2);
> > +
> > + /* Gather 64 bit transitions and pack back into 2 registers. */
> > +
> > + trans0 = trans[MM_CVT32(addr)];
> > +
> > + /* get slot 2 */
> > +
> > + /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > + trans2 = trans[MM_CVT32(addr)];
> > +
> > + /* get slot 1 */
> > +
> > + /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > +
> > + /* get slot 3 */
> > +
> > + /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > + *indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > +
> > + return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 8 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > + uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > + int n;
> > + struct acl_flow_data flows;
> > + uint64_t index_array[MAX_SEARCHES_SSE8];
> > + struct completion cmplt[MAX_SEARCHES_SSE8];
> > + struct parms parms[MAX_SEARCHES_SSE8];
> > + xmm_t input0, input1;
> > + xmm_t indicies1, indicies2, indicies3, indicies4;
> > +
> > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > + total_packets, categories, ctx->trans_table);
> > +
> > + for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > + cmplt[n].count = 0;
> > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > + }
> > +
> > + /*
> > + * indicies1 contains index_array[0,1]
> > + * indicies2 contains index_array[2,3]
> > + * indicies3 contains index_array[4,5]
> > + * indicies4 contains index_array[6,7]
> > + */
> > +
> > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > + indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > + indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x4(0, ctx, parms, &flows,
> > + &indicies1, &indicies2, mm_match_mask.m);
> > + acl_match_check_x4(4, ctx, parms, &flows,
> > + &indicies3, &indicies4, mm_match_mask.m);
> > +
> > + while (flows.started > 0) {
> > +
> > + /* Gather 4 bytes of input data for each stream. */
> > + input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> > + 0);
> > + input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> > + 0);
> > +
> > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> > +
> > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> > +
> > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> > +
> > + /* Process the 4 bytes of input on each stream. */
> > +
> > + input0 = transition4(mm_index_mask.m, input0,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input1 = transition4(mm_index_mask.m, input1,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies3, &indicies4);
> > +
> > + input0 = transition4(mm_index_mask.m, input0,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input1 = transition4(mm_index_mask.m, input1,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies3, &indicies4);
> > +
> > + input0 = transition4(mm_index_mask.m, input0,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input1 = transition4(mm_index_mask.m, input1,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies3, &indicies4);
> > +
> > + input0 = transition4(mm_index_mask.m, input0,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input1 = transition4(mm_index_mask.m, input1,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies3, &indicies4);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x4(0, ctx, parms, &flows,
> > + &indicies1, &indicies2, mm_match_mask.m);
> > + acl_match_check_x4(4, ctx, parms, &flows,
> > + &indicies3, &indicies4, mm_match_mask.m);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 4 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > + uint32_t *results, int total_packets, uint32_t categories)
> > +{
> > + int n;
> > + struct acl_flow_data flows;
> > + uint64_t index_array[MAX_SEARCHES_SSE4];
> > + struct completion cmplt[MAX_SEARCHES_SSE4];
> > + struct parms parms[MAX_SEARCHES_SSE4];
> > + xmm_t input, indicies1, indicies2;
> > +
> > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > + total_packets, categories, ctx->trans_table);
> > +
> > + for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > + cmplt[n].count = 0;
> > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > + }
> > +
> > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x4(0, ctx, parms, &flows,
> > + &indicies1, &indicies2, mm_match_mask.m);
> > +
> > + while (flows.started > 0) {
> > +
> > + /* Gather 4 bytes of input data for each stream. */
> > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > +
> > + /* Process the 4 bytes of input on each stream. */
> > + input = transition4(mm_index_mask.m, input,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input = transition4(mm_index_mask.m, input,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input = transition4(mm_index_mask.m, input,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input = transition4(mm_index_mask.m, input,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x4(0, ctx, parms, &flows,
> > + &indicies1, &indicies2, mm_match_mask.m);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static inline xmm_t
> > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > + const uint64_t *trans, xmm_t *indicies1)
> > +{
> > + uint64_t t;
> > + xmm_t addr, indicies2;
> > +
> > + indicies2 = MM_XOR(ones_16, ones_16);
> > +
> > + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > + bytes, type_quad_range, indicies1, &indicies2);
> > +
> > + /* Gather 64 bit transitions and pack 2 per register. */
> > +
> > + t = trans[MM_CVT32(addr)];
> > +
> > + /* get slot 1 */
> > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > +
> > + return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 2 traversals in parallel.
> > + */
> > +static inline int
> > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > + uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > + int n;
> > + struct acl_flow_data flows;
> > + uint64_t index_array[MAX_SEARCHES_SSE2];
> > + struct completion cmplt[MAX_SEARCHES_SSE2];
> > + struct parms parms[MAX_SEARCHES_SSE2];
> > + xmm_t input, indicies;
> > +
> > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > + total_packets, categories, ctx->trans_table);
> > +
> > + for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > + cmplt[n].count = 0;
> > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > + }
> > +
> > + indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> > +
> > + while (flows.started > 0) {
> > +
> > + /* Gather 4 bytes of input data for each stream. */
> > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > +
> > + /* Process the 4 bytes of input on each stream. */
> > +
> > + input = transition2(mm_index_mask64.m, input,
> > + mm_shuffle_input64.m, mm_ones_16.m,
> > + mm_bytes64.m, mm_type_quad_range64.m,
> > + flows.trans, &indicies);
> > +
> > + input = transition2(mm_index_mask64.m, input,
> > + mm_shuffle_input64.m, mm_ones_16.m,
> > + mm_bytes64.m, mm_type_quad_range64.m,
> > + flows.trans, &indicies);
> > +
> > + input = transition2(mm_index_mask64.m, input,
> > + mm_shuffle_input64.m, mm_ones_16.m,
> > + mm_bytes64.m, mm_type_quad_range64.m,
> > + flows.trans, &indicies);
> > +
> > + input = transition2(mm_index_mask64.m, input,
> > + mm_shuffle_input64.m, mm_ones_16.m,
> > + mm_bytes64.m, mm_type_quad_range64.m,
> > + flows.trans, &indicies);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > + mm_match_mask64.m);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +int
> > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > + uint32_t *results, uint32_t num, uint32_t categories)
> > +{
> > + if (categories != 1 &&
> > + ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > + return -EINVAL;
> > +
> > + if (likely(num >= MAX_SEARCHES_SSE8))
> > + return search_sse_8(ctx, data, results, num, categories);
> > + else if (num >= MAX_SEARCHES_SSE4)
> > + return search_sse_4(ctx, data, results, num, categories);
> > + else
> > + return search_sse_2(ctx, data, results, num, categories);
> > +}
> > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > index 7c288bd..0cde07e 100644
> > --- a/lib/librte_acl/rte_acl.c
> > +++ b/lib/librte_acl/rte_acl.c
> > @@ -38,6 +38,21 @@
> >
> > TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> >
> > +/* by default, use the always available scalar code path. */
> > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> make this static, the outside world shouldn't need to see it.
As I said above, I think it more plausible to keep it globally visible.
>
> > +void __attribute__((constructor(INT16_MAX)))
> > +rte_acl_select_classify(void)
> Make it static, the outside world doesn't need to call this.
See above, would like user to have an ability to call it manually if needed.
>
> > +{
> > + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > + /* SSE version requires SSE4.1 */
> > + rte_acl_default_classify = rte_acl_classify_sse;
> > + } else {
> > + /* reset to scalar version. */
> > + rte_acl_default_classify = rte_acl_classify_scalar;
> Don't need the else clause here, the static initializer has you covered.
I think we better keep it like that - in case user calls it manually.
We always reset rte_acl_default_classify to the 'best proper' value.
> > + }
> > +}
> > +
> > +
> > +/**
> > + * Invokes default rte_acl_classify function.
> > + */
> > +extern rte_acl_classify_t rte_acl_default_classify;
> > +
> Doesn't need to be extern.
> > +#define rte_acl_classify(ctx, data, results, num, categories) \
> > + (*rte_acl_default_classify)(ctx, data, results, num, categories)
> > +
> Not sure why you need this either. The rte_acl_classify_t should be enough, no?
We preserve existing rte_acl_classify() API, so users don't need to modify their code.
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
2014-08-07 20:11 4% ` Neil Horman
@ 2014-08-07 20:58 0% ` Vincent JARDIN
2014-08-08 11:49 0% ` Ananyev, Konstantin
1 sibling, 0 replies; 40+ results
From: Vincent JARDIN @ 2014-08-07 20:58 UTC (permalink / raw)
To: Neil Horman; +Cc: dev
What about using function versioning attributes too:
https://gcc.gnu.org/wiki/FunctionMultiVersioning
?
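The C-friendly counterpart of that attribute is a GNU ifunc resolver, which selects the
implementation once at load time; a rough sketch only, reusing the two classify entry
points from the patch and presuming rte_acl_classify() is no longer a macro:

/* assumes rte_acl.h for the prototypes and the rte_acl_ctx type */

/*
 * Runs once, very early at load time (before constructors and most
 * relocations), so it sticks to compiler builtins for the CPU check.
 */
static void *
rte_acl_classify_resolver(void)
{
        __builtin_cpu_init();
        if (__builtin_cpu_supports("sse4.1"))
                return (void *)rte_acl_classify_sse;
        return (void *)rte_acl_classify_scalar;
}

/* the exported symbol is bound to one of the two implementations above */
int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
        uint32_t *results, uint32_t num, uint32_t categories)
        __attribute__((ifunc("rte_acl_classify_resolver")));

The caveat is that the resolver runs before constructors and most relocations, hence
the compiler builtins rather than a call into the EAL for the CPU-flag check.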
On 7 Aug 2014 at 22:11, "Neil Horman" <nhorman@tuxdriver.com> wrote:
>
> On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > Make ACL library to build/work on 'default' architecture:
> > - make rte_acl_classify_scalar really scalar
> > (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > - Provide two versions of rte_acl_classify code path:
> > rte_acl_classify_sse() - could be built and used only on systems with sse4.2
> > and upper, return -ENOTSUP on lower arch.
> > rte_acl_classify_scalar() - a slower version, but could be built and used
> > on all systems.
> > - keep common code shared between these two codepaths.
> >
> > v2 changes:
> > run-time selection of most appropriate code-path for given ISA.
> > By default the highest supported one is selected.
> > User can still override that selection by manually assigning new value to
> > the global function pointer rte_acl_default_classify.
> > rte_acl_classify() becomes a macro calling whatever
rte_acl_default_classify
> > points to.
> >
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>
> This is a lot better, thank you. A few remaining issues.
>
> > ---
> > app/test-acl/main.c | 13 +-
> > lib/librte_acl/Makefile | 5 +-
> > lib/librte_acl/acl_bld.c | 5 +-
> > lib/librte_acl/acl_match_check.def | 92 ++++
> > lib/librte_acl/acl_run.c | 944
-------------------------------------
> > lib/librte_acl/acl_run.h | 220 +++++++++
> > lib/librte_acl/acl_run_scalar.c | 197 ++++++++
> > lib/librte_acl/acl_run_sse.c | 630 +++++++++++++++++++++++++
> > lib/librte_acl/rte_acl.c | 15 +
> > lib/librte_acl/rte_acl.h | 24 +-
> > 10 files changed, 1189 insertions(+), 956 deletions(-)
> > create mode 100644 lib/librte_acl/acl_match_check.def
> > delete mode 100644 lib/librte_acl/acl_run.c
> > create mode 100644 lib/librte_acl/acl_run.h
> > create mode 100644 lib/librte_acl/acl_run_scalar.c
> > create mode 100644 lib/librte_acl/acl_run_sse.c
> >
> > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > index d654409..45c6fa6 100644
> > --- a/app/test-acl/main.c
> > +++ b/app/test-acl/main.c
> > @@ -787,6 +787,10 @@ acx_init(void)
> > /* perform build. */
> > ret = rte_acl_build(config.acx, &cfg);
> >
> > + /* setup default rte_acl_classify */
> > + if (config.scalar)
> > + rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> Exporting this variable as part of the ABI is a bad idea. If the prototype of
> the function changes you have to update all your applications. Make the pointer
> an internal symbol and set it using a get/set routine with an enum to represent
> the path to choose. That will help isolate the ABI from the internal
> implementation. It will also let you prevent things like selecting a run time
> path that is incompatible with the running system, and prevent path switching
> during searches, which may produce unexpected results.
>
> ><snip>
> > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > deleted file mode 100644
> > index e3d9fc1..0000000
> > --- a/lib/librte_acl/acl_run.c
> > +++ /dev/null
> > @@ -1,944 +0,0 @@
> > -/*-
> > - * BSD LICENSE
> > - *
> > - * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > - * All rights reserved.
> > - *
> > - * Redistribution and use in source and binary forms, with or without
> > - * modification, are permitted provided that the following conditions
> ><snip>
> > +
> > +#define __func_resolve_priority__ resolve_priority_scalar
> > +#define __func_match_check__ acl_match_check_scalar
> > +#include "acl_match_check.def"
> > +
> I get this lets you make some more code common, but it's just unpleasant to trace
> through. Looking at the definition of __func_match_check__ I don't see anything
> particularly performance sensitive there. What if instead you simply redefined
> __func_match_check__ in a common internal header as acl_match_check (a generic
> function), and had it accept the priority resolution function as an argument? That
> would still give you all the performance enhancements without having to include
> c files in the middle of other c files, and would make the code a bit more
> parseable.
>
> > +/*
> > + * When processing the transition, rather than using if/else
> > + * construct, the offset is calculated for DFA and QRANGE and
> > + * then conditionally added to the address based on node type.
> > + * This is done to avoid branch mis-predictions. Since the
> > + * offset is rather simple calculation it is more efficient
> > + * to do the calculation and do a condition move rather than
> > + * a conditional branch to determine which calculation to do.
> > + */
> > +static inline uint32_t
> > +scan_forward(uint32_t input, uint32_t max)
> > +{
> > + return (input == 0) ? max : rte_bsf32(input);
> > +}
> > + }
> > +}
> ><snip>
> > +
> > +#define __func_resolve_priority__ resolve_priority_sse
> > +#define __func_match_check__ acl_match_check_sse
> > +#include "acl_match_check.def"
> > +
> Same deal as above.
>
> > +/*
> > + * Extract transitions from an XMM register and check for any matches
> > + */
> > +static void
> > +acl_process_matches(xmm_t *indicies, int slot, const struct
rte_acl_ctx *ctx,
> > + struct parms *parms, struct acl_flow_data *flows)
> > +{
> > + uint64_t transition1, transition2;
> > +
> > + /* extract transition from low 64 bits. */
> > + transition1 = MM_CVT64(*indicies);
> > +
> > + /* extract transition from high 64 bits. */
> > + *indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > + transition2 = MM_CVT64(*indicies);
> > +
> > + transition1 = acl_match_check_sse(transition1, slot, ctx,
> > + parms, flows);
> > + transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > + parms, flows);
> > +
> > + /* update indicies with new transitions. */
> > + *indicies = MM_SET64(transition2, transition1);
> > +}
> > +
> > +/*
> > + * Check for a match in 2 transitions (contained in SSE register)
> > + */
> > +static inline void
> > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct
parms *parms,
> > + struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > +{
> > + xmm_t temp;
> > +
> > + temp = MM_AND(match_mask, *indicies);
> > + while (!MM_TESTZ(temp, temp)) {
> > + acl_process_matches(indicies, slot, ctx, parms, flows);
> > + temp = MM_AND(match_mask, *indicies);
> > + }
> > +}
> > +
> > +/*
> > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > + */
> > +static inline void
> > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct
parms *parms,
> > + struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > + xmm_t match_mask)
> > +{
> > + xmm_t temp;
> > +
> > + /* put low 32 bits of each transition into one register */
> > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > + 0x88);
> > + /* test for match node */
> > + temp = MM_AND(match_mask, temp);
> > +
> > + while (!MM_TESTZ(temp, temp)) {
> > + acl_process_matches(indicies1, slot, ctx, parms, flows);
> > + acl_process_matches(indicies2, slot + 2, ctx, parms,
flows);
> > +
> > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > + (__m128)*indicies2,
> > + 0x88);
> > + temp = MM_AND(match_mask, temp);
> > + }
> > +}
> > +
> > +/*
> > + * Calculate the address of the next transition for
> > + * all types of nodes. Note that only DFA nodes and range
> > + * nodes actually transition to another node. Match
> > + * nodes don't move.
> > + */
> > +static inline xmm_t
> > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > + xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > + xmm_t addr, node_types, temp;
> > +
> > + /*
> > + * Note that no transition is done for a match
> > + * node and therefore a stream freezes when
> > + * it reaches a match.
> > + */
> > +
> > + /* Shuffle low 32 into temp and high 32 into indicies2 */
> > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > + 0x88);
> > + *indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > + (__m128)*indicies2, 0xdd);
> > +
> > + /* Calc node type and node addr */
> > + node_types = MM_ANDNOT(index_mask, temp);
> > + addr = MM_AND(index_mask, temp);
> > +
> > + /*
> > + * Calc addr for DFAs - addr = dfa_index + input_byte
> > + */
> > +
> > + /* mask for DFA type (0) nodes */
> > + temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > +
> > + /* add input byte to DFA position */
> > + temp = MM_AND(temp, bytes);
> > + temp = MM_AND(temp, next_input);
> > + addr = MM_ADD32(addr, temp);
> > +
> > + /*
> > + * Calc addr for Range nodes -> range_index + range(input)
> > + */
> > + node_types = MM_CMPEQ32(node_types, type_quad_range);
> > +
> > + /*
> > + * Calculate number of range boundaries that are less than the
> > + * input value. Range boundaries for each node are in signed 8
bit,
> > + * ordered from -128 to 127 in the indicies2 register.
> > + * This is effectively a popcnt of bytes that are greater than the
> > + * input byte.
> > + */
> > +
> > + /* shuffle input byte to all 4 positions of 32 bit value */
> > + temp = MM_SHUFFLE8(next_input, shuffle_input);
> > +
> > + /* check ranges */
> > + temp = MM_CMPGT8(temp, *indicies2);
> > +
> > + /* convert -1 to 1 (bytes greater than input byte */
> > + temp = MM_SIGN8(temp, temp);
> > +
> > + /* horizontal add pairs of bytes into words */
> > + temp = MM_MADD8(temp, temp);
> > +
> > + /* horizontal add pairs of words into dwords */
> > + temp = MM_MADD16(temp, ones_16);
> > +
> > + /* mask to range type nodes */
> > + temp = MM_AND(temp, node_types);
> > +
> > + /* add index into node position */
> > + return MM_ADD32(addr, temp);
> > +}
> > +
> > +/*
> > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > + */
> > +static inline xmm_t
> > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > + const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > + xmm_t addr;
> > + uint64_t trans0, trans2;
> > +
> > + /* Calculate the address (array index) for all 4 transitions. */
> > +
> > + addr = acl_calc_addr(index_mask, next_input, shuffle_input,
ones_16,
> > + bytes, type_quad_range, indicies1, indicies2);
> > +
> > + /* Gather 64 bit transitions and pack back into 2 registers. */
> > +
> > + trans0 = trans[MM_CVT32(addr)];
> > +
> > + /* get slot 2 */
> > +
> > + /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > + trans2 = trans[MM_CVT32(addr)];
> > +
> > + /* get slot 1 */
> > +
> > + /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > +
> > + /* get slot 3 */
> > +
> > + /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > + *indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > +
> > + return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 8 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > + uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > + int n;
> > + struct acl_flow_data flows;
> > + uint64_t index_array[MAX_SEARCHES_SSE8];
> > + struct completion cmplt[MAX_SEARCHES_SSE8];
> > + struct parms parms[MAX_SEARCHES_SSE8];
> > + xmm_t input0, input1;
> > + xmm_t indicies1, indicies2, indicies3, indicies4;
> > +
> > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > + total_packets, categories, ctx->trans_table);
> > +
> > + for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > + cmplt[n].count = 0;
> > + index_array[n] = acl_start_next_trie(&flows, parms, n,
ctx);
> > + }
> > +
> > + /*
> > + * indicies1 contains index_array[0,1]
> > + * indicies2 contains index_array[2,3]
> > + * indicies3 contains index_array[4,5]
> > + * indicies4 contains index_array[6,7]
> > + */
> > +
> > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > + indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > + indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x4(0, ctx, parms, &flows,
> > + &indicies1, &indicies2, mm_match_mask.m);
> > + acl_match_check_x4(4, ctx, parms, &flows,
> > + &indicies3, &indicies4, mm_match_mask.m);
> > +
> > + while (flows.started > 0) {
> > +
> > + /* Gather 4 bytes of input data for each stream. */
> > + input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms,
0),
> > + 0);
> > + input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms,
4),
> > + 0);
> > +
> > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1),
1);
> > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5),
1);
> > +
> > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2),
2);
> > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6),
2);
> > +
> > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3),
3);
> > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7),
3);
> > +
> > + /* Process the 4 bytes of input on each stream. */
> > +
> > + input0 = transition4(mm_index_mask.m, input0,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input1 = transition4(mm_index_mask.m, input1,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies3, &indicies4);
> > +
> > + input0 = transition4(mm_index_mask.m, input0,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input1 = transition4(mm_index_mask.m, input1,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies3, &indicies4);
> > +
> > + input0 = transition4(mm_index_mask.m, input0,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input1 = transition4(mm_index_mask.m, input1,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies3, &indicies4);
> > +
> > + input0 = transition4(mm_index_mask.m, input0,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input1 = transition4(mm_index_mask.m, input1,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies3, &indicies4);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x4(0, ctx, parms, &flows,
> > + &indicies1, &indicies2, mm_match_mask.m);
> > + acl_match_check_x4(4, ctx, parms, &flows,
> > + &indicies3, &indicies4, mm_match_mask.m);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 4 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > + uint32_t *results, int total_packets, uint32_t categories)
> > +{
> > + int n;
> > + struct acl_flow_data flows;
> > + uint64_t index_array[MAX_SEARCHES_SSE4];
> > + struct completion cmplt[MAX_SEARCHES_SSE4];
> > + struct parms parms[MAX_SEARCHES_SSE4];
> > + xmm_t input, indicies1, indicies2;
> > +
> > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > + total_packets, categories, ctx->trans_table);
> > +
> > + for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > + cmplt[n].count = 0;
> > + index_array[n] = acl_start_next_trie(&flows, parms, n,
ctx);
> > + }
> > +
> > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x4(0, ctx, parms, &flows,
> > + &indicies1, &indicies2, mm_match_mask.m);
> > +
> > + while (flows.started > 0) {
> > +
> > + /* Gather 4 bytes of input data for each stream. */
> > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms,
0), 0);
> > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > +
> > + /* Process the 4 bytes of input on each stream. */
> > + input = transition4(mm_index_mask.m, input,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input = transition4(mm_index_mask.m, input,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input = transition4(mm_index_mask.m, input,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + input = transition4(mm_index_mask.m, input,
> > + mm_shuffle_input.m, mm_ones_16.m,
> > + mm_bytes.m, mm_type_quad_range.m,
> > + flows.trans, &indicies1, &indicies2);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x4(0, ctx, parms, &flows,
> > + &indicies1, &indicies2, mm_match_mask.m);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static inline xmm_t
> > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > + const uint64_t *trans, xmm_t *indicies1)
> > +{
> > + uint64_t t;
> > + xmm_t addr, indicies2;
> > +
> > + indicies2 = MM_XOR(ones_16, ones_16);
> > +
> > + addr = acl_calc_addr(index_mask, next_input, shuffle_input,
ones_16,
> > + bytes, type_quad_range, indicies1, &indicies2);
> > +
> > + /* Gather 64 bit transitions and pack 2 per register. */
> > +
> > + t = trans[MM_CVT32(addr)];
> > +
> > + /* get slot 1 */
> > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > +
> > + return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 2 traversals in parallel.
> > + */
> > +static inline int
> > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > + uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > + int n;
> > + struct acl_flow_data flows;
> > + uint64_t index_array[MAX_SEARCHES_SSE2];
> > + struct completion cmplt[MAX_SEARCHES_SSE2];
> > + struct parms parms[MAX_SEARCHES_SSE2];
> > + xmm_t input, indicies;
> > +
> > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > + total_packets, categories, ctx->trans_table);
> > +
> > + for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > + cmplt[n].count = 0;
> > + index_array[n] = acl_start_next_trie(&flows, parms, n,
ctx);
> > + }
> > +
> > + indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x2(0, ctx, parms, &flows, &indicies,
mm_match_mask64.m);
> > +
> > + while (flows.started > 0) {
> > +
> > + /* Gather 4 bytes of input data for each stream. */
> > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms,
0), 0);
> > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > +
> > + /* Process the 4 bytes of input on each stream. */
> > +
> > + input = transition2(mm_index_mask64.m, input,
> > + mm_shuffle_input64.m, mm_ones_16.m,
> > + mm_bytes64.m, mm_type_quad_range64.m,
> > + flows.trans, &indicies);
> > +
> > + input = transition2(mm_index_mask64.m, input,
> > + mm_shuffle_input64.m, mm_ones_16.m,
> > + mm_bytes64.m, mm_type_quad_range64.m,
> > + flows.trans, &indicies);
> > +
> > + input = transition2(mm_index_mask64.m, input,
> > + mm_shuffle_input64.m, mm_ones_16.m,
> > + mm_bytes64.m, mm_type_quad_range64.m,
> > + flows.trans, &indicies);
> > +
> > + input = transition2(mm_index_mask64.m, input,
> > + mm_shuffle_input64.m, mm_ones_16.m,
> > + mm_bytes64.m, mm_type_quad_range64.m,
> > + flows.trans, &indicies);
> > +
> > + /* Check for any matches. */
> > + acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > + mm_match_mask64.m);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +int
> > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t
**data,
> > + uint32_t *results, uint32_t num, uint32_t categories)
> > +{
> > + if (categories != 1 &&
> > + ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > + return -EINVAL;
> > +
> > + if (likely(num >= MAX_SEARCHES_SSE8))
> > + return search_sse_8(ctx, data, results, num, categories);
> > + else if (num >= MAX_SEARCHES_SSE4)
> > + return search_sse_4(ctx, data, results, num, categories);
> > + else
> > + return search_sse_2(ctx, data, results, num, categories);
> > +}
> > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > index 7c288bd..0cde07e 100644
> > --- a/lib/librte_acl/rte_acl.c
> > +++ b/lib/librte_acl/rte_acl.c
> > @@ -38,6 +38,21 @@
> >
> > TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> >
> > +/* by default, use the always available scalar code path. */
> > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> Make this static; the outside world shouldn't need to see it.
>
> > +void __attribute__((constructor(INT16_MAX)))
> > +rte_acl_select_classify(void)
> Make it static; the outside world doesn't need to call this.
>
> > +{
> > + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > + /* SSE version requires SSE4.1 */
> > + rte_acl_default_classify = rte_acl_classify_sse;
> > + } else {
> > + /* reset to scalar version. */
> > + rte_acl_default_classify = rte_acl_classify_scalar;
> Don't need the else clause here, the static initializer has you covered.
> > + }
> > +}
> > +
> > +
> > +/**
> > + * Invokes default rte_acl_classify function.
> > + */
> > +extern rte_acl_classify_t rte_acl_default_classify;
> > +
> Doesn't need to be extern.
> > +#define rte_acl_classify(ctx, data, results, num, categories) \
> > + (*rte_acl_default_classify)(ctx, data, results, num, categories)
> > +
> Not sure why you need this either. The rte_acl_classify_t should be enough, no?
>
> Regards
> Neil
>
>
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
@ 2014-08-07 20:11 4% ` Neil Horman
2014-08-07 20:58 0% ` Vincent JARDIN
2014-08-08 11:49 0% ` Ananyev, Konstantin
2014-08-21 20:15 1% ` [dpdk-dev] [PATCHv3] " Neil Horman
1 sibling, 2 replies; 40+ results
From: Neil Horman @ 2014-08-07 20:11 UTC (permalink / raw)
To: Konstantin Ananyev; +Cc: dev
On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> Make ACL library to build/work on 'default' architecture:
> - make rte_acl_classify_scalar really scalar
> (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> - Provide two versions of rte_acl_classify code path:
> rte_acl_classify_sse() - can be built and used only on systems with sse4.2
> and upper, returns -ENOTSUP on lower arch.
> rte_acl_classify_scalar() - a slower version, but can be built and used
> on all systems.
> - keep common code shared between these two codepaths.
>
> v2 changes:
> run-time selection of the most appropriate code path for the given ISA.
> By default the highest supported one is selected.
> User can still override that selection by manually assigning a new value to
> the global function pointer rte_acl_default_classify.
> rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> points to.
>
>
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
This is a lot better, thank you. A few remaining issues.
> ---
> app/test-acl/main.c | 13 +-
> lib/librte_acl/Makefile | 5 +-
> lib/librte_acl/acl_bld.c | 5 +-
> lib/librte_acl/acl_match_check.def | 92 ++++
> lib/librte_acl/acl_run.c | 944 -------------------------------------
> lib/librte_acl/acl_run.h | 220 +++++++++
> lib/librte_acl/acl_run_scalar.c | 197 ++++++++
> lib/librte_acl/acl_run_sse.c | 630 +++++++++++++++++++++++++
> lib/librte_acl/rte_acl.c | 15 +
> lib/librte_acl/rte_acl.h | 24 +-
> 10 files changed, 1189 insertions(+), 956 deletions(-)
> create mode 100644 lib/librte_acl/acl_match_check.def
> delete mode 100644 lib/librte_acl/acl_run.c
> create mode 100644 lib/librte_acl/acl_run.h
> create mode 100644 lib/librte_acl/acl_run_scalar.c
> create mode 100644 lib/librte_acl/acl_run_sse.c
>
> diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> index d654409..45c6fa6 100644
> --- a/app/test-acl/main.c
> +++ b/app/test-acl/main.c
> @@ -787,6 +787,10 @@ acx_init(void)
> /* perform build. */
> ret = rte_acl_build(config.acx, &cfg);
>
> + /* setup default rte_acl_classify */
> + if (config.scalar)
> + rte_acl_default_classify = rte_acl_classify_scalar;
> +
Exporting this variable as part of the ABI is a bad idea. If the prototype of
the function changes you have to update all your applications. Make the pointer
an internal symbol and set it using a get/set routine with an enum to represent
the path to choose. That will help isolate the ABI from the internal
implementation. It will also let you prevent things like selecting a run time
path that is incompatible with the running system, and prevent path switching
during searches, which may produce unexpected results.
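Roughly what such an interface could look like; the enum values and error handling below are a sketch of the suggestion, not code from the patch (rte_acl_classify_t, rte_acl_classify_scalar/_sse and the CPU-flag check do come from the patch itself):

#include <errno.h>
#include <rte_acl.h>
#include <rte_cpuflags.h>

enum acl_classify_alg {
        ACL_CLASSIFY_DEFAULT = 0,
        ACL_CLASSIFY_SCALAR,
        ACL_CLASSIFY_SSE,
};

/* the function pointer stays internal to rte_acl.c */
static rte_acl_classify_t acl_classify_fn = rte_acl_classify_scalar;

int
rte_acl_select_classify(enum acl_classify_alg alg)
{
        switch (alg) {
        case ACL_CLASSIFY_SSE:
                /* refuse a path the running CPU cannot execute */
                if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
                        return -ENOTSUP;
                acl_classify_fn = rte_acl_classify_sse;
                return 0;
        case ACL_CLASSIFY_DEFAULT:
        case ACL_CLASSIFY_SCALAR:
                acl_classify_fn = rte_acl_classify_scalar;
                return 0;
        default:
                return -EINVAL;
        }
}

rte_acl_classify() would then call through acl_classify_fn, and the exported ABI is a stable function plus an enum rather than a raw function pointer.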
><snip>
> diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> deleted file mode 100644
> index e3d9fc1..0000000
> --- a/lib/librte_acl/acl_run.c
> +++ /dev/null
> @@ -1,944 +0,0 @@
> -/*-
> - * BSD LICENSE
> - *
> - * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> - * All rights reserved.
> - *
> - * Redistribution and use in source and binary forms, with or without
> - * modification, are permitted provided that the following conditions
><snip>
> +
> +#define __func_resolve_priority__ resolve_priority_scalar
> +#define __func_match_check__ acl_match_check_scalar
> +#include "acl_match_check.def"
> +
I get this lets you make some more code common, but it's just unpleasant to trace
through. Looking at the definition of __func_match_check__ I don't see anything
particularly performance sensitive there. What if instead you simply redefined
__func_match_check__ in a common internal header as acl_match_check (a generic
function), and had it accept the priority resolution function as an argument? That
would still give you all the performance enhancements without having to include
c files in the middle of other c files, and would make the code a bit more
parseable.
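For illustration, the shape of that alternative might be as follows; the exact signature is invented here, while struct parms, struct acl_flow_data and acl_start_next_trie() are the existing internals from acl_run.h:

typedef void (*resolve_priority_t)(uint64_t transition, int n,
        const struct rte_acl_ctx *ctx, struct parms *parms,
        struct acl_flow_data *flows);

/* one shared helper in an internal header, instead of per-ISA copies
 * generated by #including a .def file */
static inline uint64_t
acl_match_check(uint64_t transition, int slot, const struct rte_acl_ctx *ctx,
        struct parms *parms, struct acl_flow_data *flows,
        resolve_priority_t resolve_priority)
{
        /* hypothetical test: did this transition land on a match node? */
        if (acl_transition_is_match(transition)) {
                resolve_priority(transition, slot, ctx, parms, flows);
                transition = acl_start_next_trie(flows, parms, slot, ctx);
        }
        return transition;
}

Each code path would then call acl_match_check(t, slot, ctx, parms, flows, resolve_priority_sse) or the scalar equivalent, and the compiler can still inline the resolver at each call site.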
> +/*
> + * When processing the transition, rather than using if/else
> + * construct, the offset is calculated for DFA and QRANGE and
> + * then conditionally added to the address based on node type.
> + * This is done to avoid branch mis-predictions. Since the
> + * offset is rather simple calculation it is more efficient
> + * to do the calculation and do a condition move rather than
> + * a conditional branch to determine which calculation to do.
> + */
> +static inline uint32_t
> +scan_forward(uint32_t input, uint32_t max)
> +{
> + return (input == 0) ? max : rte_bsf32(input);
> +}
> + }
> +}
><snip>
> +
> +#define __func_resolve_priority__ resolve_priority_sse
> +#define __func_match_check__ acl_match_check_sse
> +#include "acl_match_check.def"
> +
Same deal as above.
> +/*
> + * Extract transitions from an XMM register and check for any matches
> + */
> +static void
> +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> + struct parms *parms, struct acl_flow_data *flows)
> +{
> + uint64_t transition1, transition2;
> +
> + /* extract transition from low 64 bits. */
> + transition1 = MM_CVT64(*indicies);
> +
> + /* extract transition from high 64 bits. */
> + *indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> + transition2 = MM_CVT64(*indicies);
> +
> + transition1 = acl_match_check_sse(transition1, slot, ctx,
> + parms, flows);
> + transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> + parms, flows);
> +
> + /* update indicies with new transitions. */
> + *indicies = MM_SET64(transition2, transition1);
> +}
> +
> +/*
> + * Check for a match in 2 transitions (contained in SSE register)
> + */
> +static inline void
> +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> + struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> +{
> + xmm_t temp;
> +
> + temp = MM_AND(match_mask, *indicies);
> + while (!MM_TESTZ(temp, temp)) {
> + acl_process_matches(indicies, slot, ctx, parms, flows);
> + temp = MM_AND(match_mask, *indicies);
> + }
> +}
> +
> +/*
> + * Check for any match in 4 transitions (contained in 2 SSE registers)
> + */
> +static inline void
> +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> + struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> + xmm_t match_mask)
> +{
> + xmm_t temp;
> +
> + /* put low 32 bits of each transition into one register */
> + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> + 0x88);
> + /* test for match node */
> + temp = MM_AND(match_mask, temp);
> +
> + while (!MM_TESTZ(temp, temp)) {
> + acl_process_matches(indicies1, slot, ctx, parms, flows);
> + acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> +
> + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> + (__m128)*indicies2,
> + 0x88);
> + temp = MM_AND(match_mask, temp);
> + }
> +}
> +
> +/*
> + * Calculate the address of the next transition for
> + * all types of nodes. Note that only DFA nodes and range
> + * nodes actually transition to another node. Match
> + * nodes don't move.
> + */
> +static inline xmm_t
> +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> + xmm_t *indicies1, xmm_t *indicies2)
> +{
> + xmm_t addr, node_types, temp;
> +
> + /*
> + * Note that no transition is done for a match
> + * node and therefore a stream freezes when
> + * it reaches a match.
> + */
> +
> + /* Shuffle low 32 into temp and high 32 into indicies2 */
> + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> + 0x88);
> + *indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> + (__m128)*indicies2, 0xdd);
> +
> + /* Calc node type and node addr */
> + node_types = MM_ANDNOT(index_mask, temp);
> + addr = MM_AND(index_mask, temp);
> +
> + /*
> + * Calc addr for DFAs - addr = dfa_index + input_byte
> + */
> +
> + /* mask for DFA type (0) nodes */
> + temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> +
> + /* add input byte to DFA position */
> + temp = MM_AND(temp, bytes);
> + temp = MM_AND(temp, next_input);
> + addr = MM_ADD32(addr, temp);
> +
> + /*
> + * Calc addr for Range nodes -> range_index + range(input)
> + */
> + node_types = MM_CMPEQ32(node_types, type_quad_range);
> +
> + /*
> + * Calculate number of range boundaries that are less than the
> + * input value. Range boundaries for each node are in signed 8 bit,
> + * ordered from -128 to 127 in the indicies2 register.
> + * This is effectively a popcnt of bytes that are greater than the
> + * input byte.
> + */
> +
> + /* shuffle input byte to all 4 positions of 32 bit value */
> + temp = MM_SHUFFLE8(next_input, shuffle_input);
> +
> + /* check ranges */
> + temp = MM_CMPGT8(temp, *indicies2);
> +
> + /* convert -1 to 1 (bytes greater than input byte */
> + temp = MM_SIGN8(temp, temp);
> +
> + /* horizontal add pairs of bytes into words */
> + temp = MM_MADD8(temp, temp);
> +
> + /* horizontal add pairs of words into dwords */
> + temp = MM_MADD16(temp, ones_16);
> +
> + /* mask to range type nodes */
> + temp = MM_AND(temp, node_types);
> +
> + /* add index into node position */
> + return MM_ADD32(addr, temp);
> +}
> +
> +/*
> + * Process 4 transitions (in 2 SIMD registers) in parallel
> + */
> +static inline xmm_t
> +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> + const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> +{
> + xmm_t addr;
> + uint64_t trans0, trans2;
> +
> + /* Calculate the address (array index) for all 4 transitions. */
> +
> + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> + bytes, type_quad_range, indicies1, indicies2);
> +
> + /* Gather 64 bit transitions and pack back into 2 registers. */
> +
> + trans0 = trans[MM_CVT32(addr)];
> +
> + /* get slot 2 */
> +
> + /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> + trans2 = trans[MM_CVT32(addr)];
> +
> + /* get slot 1 */
> +
> + /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> +
> + /* get slot 3 */
> +
> + /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> + *indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> +
> + return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 8 traversals in parallel
> + */
> +static inline int
> +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> + uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> + int n;
> + struct acl_flow_data flows;
> + uint64_t index_array[MAX_SEARCHES_SSE8];
> + struct completion cmplt[MAX_SEARCHES_SSE8];
> + struct parms parms[MAX_SEARCHES_SSE8];
> + xmm_t input0, input1;
> + xmm_t indicies1, indicies2, indicies3, indicies4;
> +
> + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> + total_packets, categories, ctx->trans_table);
> +
> + for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> + cmplt[n].count = 0;
> + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> + }
> +
> + /*
> + * indicies1 contains index_array[0,1]
> + * indicies2 contains index_array[2,3]
> + * indicies3 contains index_array[4,5]
> + * indicies4 contains index_array[6,7]
> + */
> +
> + indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> + indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> + indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> + indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> +
> + /* Check for any matches. */
> + acl_match_check_x4(0, ctx, parms, &flows,
> + &indicies1, &indicies2, mm_match_mask.m);
> + acl_match_check_x4(4, ctx, parms, &flows,
> + &indicies3, &indicies4, mm_match_mask.m);
> +
> + while (flows.started > 0) {
> +
> + /* Gather 4 bytes of input data for each stream. */
> + input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> + 0);
> + input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> + 0);
> +
> + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> +
> + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> +
> + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> +
> + /* Process the 4 bytes of input on each stream. */
> +
> + input0 = transition4(mm_index_mask.m, input0,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies1, &indicies2);
> +
> + input1 = transition4(mm_index_mask.m, input1,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies3, &indicies4);
> +
> + input0 = transition4(mm_index_mask.m, input0,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies1, &indicies2);
> +
> + input1 = transition4(mm_index_mask.m, input1,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies3, &indicies4);
> +
> + input0 = transition4(mm_index_mask.m, input0,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies1, &indicies2);
> +
> + input1 = transition4(mm_index_mask.m, input1,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies3, &indicies4);
> +
> + input0 = transition4(mm_index_mask.m, input0,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies1, &indicies2);
> +
> + input1 = transition4(mm_index_mask.m, input1,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies3, &indicies4);
> +
> + /* Check for any matches. */
> + acl_match_check_x4(0, ctx, parms, &flows,
> + &indicies1, &indicies2, mm_match_mask.m);
> + acl_match_check_x4(4, ctx, parms, &flows,
> + &indicies3, &indicies4, mm_match_mask.m);
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Execute trie traversal with 4 traversals in parallel
> + */
> +static inline int
> +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> + uint32_t *results, int total_packets, uint32_t categories)
> +{
> + int n;
> + struct acl_flow_data flows;
> + uint64_t index_array[MAX_SEARCHES_SSE4];
> + struct completion cmplt[MAX_SEARCHES_SSE4];
> + struct parms parms[MAX_SEARCHES_SSE4];
> + xmm_t input, indicies1, indicies2;
> +
> + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> + total_packets, categories, ctx->trans_table);
> +
> + for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> + cmplt[n].count = 0;
> + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> + }
> +
> + indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> + indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> + /* Check for any matches. */
> + acl_match_check_x4(0, ctx, parms, &flows,
> + &indicies1, &indicies2, mm_match_mask.m);
> +
> + while (flows.started > 0) {
> +
> + /* Gather 4 bytes of input data for each stream. */
> + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> +
> + /* Process the 4 bytes of input on each stream. */
> + input = transition4(mm_index_mask.m, input,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies1, &indicies2);
> +
> + input = transition4(mm_index_mask.m, input,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies1, &indicies2);
> +
> + input = transition4(mm_index_mask.m, input,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies1, &indicies2);
> +
> + input = transition4(mm_index_mask.m, input,
> + mm_shuffle_input.m, mm_ones_16.m,
> + mm_bytes.m, mm_type_quad_range.m,
> + flows.trans, &indicies1, &indicies2);
> +
> + /* Check for any matches. */
> + acl_match_check_x4(0, ctx, parms, &flows,
> + &indicies1, &indicies2, mm_match_mask.m);
> + }
> +
> + return 0;
> +}
> +
> +static inline xmm_t
> +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> + const uint64_t *trans, xmm_t *indicies1)
> +{
> + uint64_t t;
> + xmm_t addr, indicies2;
> +
> + indicies2 = MM_XOR(ones_16, ones_16);
> +
> + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> + bytes, type_quad_range, indicies1, &indicies2);
> +
> + /* Gather 64 bit transitions and pack 2 per register. */
> +
> + t = trans[MM_CVT32(addr)];
> +
> + /* get slot 1 */
> + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> +
> + return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 2 traversals in parallel.
> + */
> +static inline int
> +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> + uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> + int n;
> + struct acl_flow_data flows;
> + uint64_t index_array[MAX_SEARCHES_SSE2];
> + struct completion cmplt[MAX_SEARCHES_SSE2];
> + struct parms parms[MAX_SEARCHES_SSE2];
> + xmm_t input, indicies;
> +
> + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> + total_packets, categories, ctx->trans_table);
> +
> + for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> + cmplt[n].count = 0;
> + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> + }
> +
> + indicies = MM_LOADU((xmm_t *) &index_array[0]);
> +
> + /* Check for any matches. */
> + acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> +
> + while (flows.started > 0) {
> +
> + /* Gather 4 bytes of input data for each stream. */
> + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> +
> + /* Process the 4 bytes of input on each stream. */
> +
> + input = transition2(mm_index_mask64.m, input,
> + mm_shuffle_input64.m, mm_ones_16.m,
> + mm_bytes64.m, mm_type_quad_range64.m,
> + flows.trans, &indicies);
> +
> + input = transition2(mm_index_mask64.m, input,
> + mm_shuffle_input64.m, mm_ones_16.m,
> + mm_bytes64.m, mm_type_quad_range64.m,
> + flows.trans, &indicies);
> +
> + input = transition2(mm_index_mask64.m, input,
> + mm_shuffle_input64.m, mm_ones_16.m,
> + mm_bytes64.m, mm_type_quad_range64.m,
> + flows.trans, &indicies);
> +
> + input = transition2(mm_index_mask64.m, input,
> + mm_shuffle_input64.m, mm_ones_16.m,
> + mm_bytes64.m, mm_type_quad_range64.m,
> + flows.trans, &indicies);
> +
> + /* Check for any matches. */
> + acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> + mm_match_mask64.m);
> + }
> +
> + return 0;
> +}
> +
> +int
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> + uint32_t *results, uint32_t num, uint32_t categories)
> +{
> + if (categories != 1 &&
> + ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> + return -EINVAL;
> +
> + if (likely(num >= MAX_SEARCHES_SSE8))
> + return search_sse_8(ctx, data, results, num, categories);
> + else if (num >= MAX_SEARCHES_SSE4)
> + return search_sse_4(ctx, data, results, num, categories);
> + else
> + return search_sse_2(ctx, data, results, num, categories);
> +}
> diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> index 7c288bd..0cde07e 100644
> --- a/lib/librte_acl/rte_acl.c
> +++ b/lib/librte_acl/rte_acl.c
> @@ -38,6 +38,21 @@
>
> TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
>
> +/* by default, use the always available scalar code path. */
> +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> +
Make this static; the outside world shouldn't need to see it.
> +void __attribute__((constructor(INT16_MAX)))
> +rte_acl_select_classify(void)
Make it static; the outside world doesn't need to call this.
> +{
> + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> + /* SSE version requires SSE4.1 */
> + rte_acl_default_classify = rte_acl_classify_sse;
> + } else {
> + /* reset to scalar version. */
> + rte_acl_default_classify = rte_acl_classify_scalar;
Don't need the else clause here, the static initializer has you covered.
> + }
> +}
> +
> +
> +/**
> + * Invokes default rte_acl_classify function.
> + */
> +extern rte_acl_classify_t rte_acl_default_classify;
> +
Doesn't need to be extern.
> +#define rte_acl_classify(ctx, data, results, num, categories) \
> + (*rte_acl_default_classify)(ctx, data, results, num, categories)
> +
Not sure why you need this either. The rte_acl_classify_t should be enough, no?
Regards
Neil
^ permalink raw reply [relevance 4%]
* Re: [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54)
2014-07-24 14:28 11% [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54) Pablo de Lara
2014-07-24 14:54 0% ` Thomas Monjalon
@ 2014-07-24 15:20 0% ` Chris Wright
1 sibling, 0 replies; 40+ results
From: Chris Wright @ 2014-07-24 15:20 UTC (permalink / raw)
To: Pablo de Lara; +Cc: dev, Patrice Buriez
* Pablo de Lara (pablo.de.lara.guarch@intel.com) wrote:
> Signed-off-by: Patrice Buriez <patrice.buriez@intel.com>
Just a mechanical nitpick on DCO. Pablo, this patch appears to be
written by Patrice. If so, it should begin with "From: Patrice Buriez
<patrice.buriez@intel.com>" and should include your own Signed-off-by.
thanks,
-chris
> ---
> lib/librte_eal/linuxapp/kni/Makefile | 9 +++++++++
> lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h | 16 ++++++++++++++++
> 2 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/lib/librte_eal/linuxapp/kni/Makefile b/lib/librte_eal/linuxapp/kni/Makefile
> index fb9462f..725d3e7 100644
> --- a/lib/librte_eal/linuxapp/kni/Makefile
> +++ b/lib/librte_eal/linuxapp/kni/Makefile
> @@ -44,6 +44,15 @@ MODULE_CFLAGS += -I$(RTE_OUTPUT)/include -I$(SRCDIR)/ethtool/ixgbe -I$(SRCDIR)/e
> MODULE_CFLAGS += -include $(RTE_OUTPUT)/include/rte_config.h
> MODULE_CFLAGS += -Wall -Werror
>
> +ifeq ($(shell type lsb_release >/dev/null 2>&1 && lsb_release -si),Ubuntu)
> +MODULE_CFLAGS += -DUBUNTU_RELEASE_CODE=$(subst .,,$(shell lsb_release -sr))
> +UBUNTU_KERNEL_CODE := $(shell cut -d' ' -f2 /proc/version_signature |cut -d- -f1,2)
> +UBUNTU_KERNEL_CODE := $(subst -,$(comma),$(UBUNTU_KERNEL_CODE))
> +UBUNTU_KERNEL_CODE := $(subst .,$(comma),$(UBUNTU_KERNEL_CODE))
> +MODULE_CFLAGS += -D"UBUNTU_KERNEL_CODE=UBUNTU_KERNEL_VERSION($(UBUNTU_KERNEL_CODE))"
> +endif
> +
> +
> # this lib needs main eal
> DEPDIRS-y += lib/librte_eal/linuxapp/eal
>
> diff --git a/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h b/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
> index 521a35d..5a06383 100644
> --- a/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
> +++ b/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
> @@ -713,6 +713,20 @@ struct _kc_ethtool_pauseparam {
> #define SLE_VERSION_CODE 0
> #endif /* SLE_VERSION_CODE */
>
> +/* Ubuntu release and kernel codes must be specified from Makefile */
> +#ifndef UBUNTU_RELEASE_VERSION
> +#define UBUNTU_RELEASE_VERSION(a,b) (((a) * 100) + (b))
> +#endif
> +#ifndef UBUNTU_KERNEL_VERSION
> +#define UBUNTU_KERNEL_VERSION(a,b,c,abi,upload) (((a) << 40) + ((b) << 32) + ((c) << 24) + ((abi) << 8) + (upload))
> +#endif
> +#ifndef UBUNTU_RELEASE_CODE
> +#define UBUNTU_RELEASE_CODE 0
> +#endif
> +#ifndef UBUNTU_KERNEL_CODE
> +#define UBUNTU_KERNEL_CODE 0
> +#endif
> +
> #ifdef __KLOCWORK__
> #ifdef ARRAY_SIZE
> #undef ARRAY_SIZE
> @@ -3847,6 +3861,7 @@ static inline struct sk_buff *__kc__vlan_hwaccel_put_tag(struct sk_buff *skb,
>
> #if ( LINUX_VERSION_CODE < KERNEL_VERSION(3,14,0) )
> #if (!(RHEL_RELEASE_CODE && RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7,0)))
> +#if (!(UBUNTU_RELEASE_CODE == UBUNTU_RELEASE_VERSION(14,4) && UBUNTU_KERNEL_CODE >= UBUNTU_KERNEL_VERSION(3,13,0,30,54)))
> #ifdef NETIF_F_RXHASH
> #define PKT_HASH_TYPE_L3 0
> static inline void
> @@ -3855,6 +3870,7 @@ skb_set_hash(struct sk_buff *skb, __u32 hash, __always_unused int type)
> skb->rxhash = hash;
> }
> #endif /* NETIF_F_RXHASH */
> +#endif /* < 3.13.0-30.54 (Ubuntu 14.04) */
> #endif /* < RHEL7 */
> #endif /* < 3.14.0 */
>
> --
> 1.7.0.7
>
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54)
2014-07-24 14:54 0% ` Thomas Monjalon
@ 2014-07-24 14:59 0% ` Thomas Monjalon
0 siblings, 0 replies; 40+ results
From: Thomas Monjalon @ 2014-07-24 14:59 UTC (permalink / raw)
To: Patrice Buriez; +Cc: dev
2014-07-24 16:54, Thomas Monjalon:
> > Unlike RHEL_RELEASE_CODE, there is no such UBUNTU_RELEASE_CODE available out of
> > the box, so it needs to be crafted from the Makefile
> > Similarly, UBUNTU_KERNEL_CODE is generated with ABI and upload numbers.
>
> It's quite amazing to see that Linux distributions do backports and do not
> provide a way to check them.
> Anyway, thanks for the fix.
>
> > +ifeq ($(shell type lsb_release >/dev/null 2>&1 && lsb_release -si),Ubuntu)
>
> Why not this simpler form?
> $(shell lsb_release -si 2>/dev/null)
>
> > +MODULE_CFLAGS += -DUBUNTU_RELEASE_CODE=$(subst .,,$(shell lsb_release -sr))
>
> Or you can use | tr -d . instead of subst and keep the flow from left to right.
>
> > +UBUNTU_KERNEL_CODE := $(shell cut -d' ' -f2 /proc/version_signature |cut -d- -f1,2)
> ^
> space missing here
>
> > +UBUNTU_KERNEL_CODE := $(subst -,$(comma),$(UBUNTU_KERNEL_CODE))
> > +UBUNTU_KERNEL_CODE := $(subst .,$(comma),$(UBUNTU_KERNEL_CODE))
>
> Would be simpler with | tr -d .-
Sorry, I mean tr -d .- $(comma)
--
Thomas
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54)
2014-07-24 14:28 11% [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54) Pablo de Lara
@ 2014-07-24 14:54 0% ` Thomas Monjalon
2014-07-24 14:59 0% ` Thomas Monjalon
2014-07-24 15:20 0% ` Chris Wright
1 sibling, 1 reply; 40+ results
From: Thomas Monjalon @ 2014-07-24 14:54 UTC (permalink / raw)
To: Patrice Buriez; +Cc: dev
> Unlike RHEL_RELEASE_CODE, there is no such UBUNTU_RELEASE_CODE available out of
> the box, so it needs to be crafted from the Makefile
> Similarly, UBUNTU_KERNEL_CODE is generated with ABI and upload numbers.
It's quite amazing to see that Linux distributions do backports and do not
provide a way to check them.
Anyway, thanks for the fix.
> +ifeq ($(shell type lsb_release >/dev/null 2>&1 && lsb_release -si),Ubuntu)
Why not this simpler form?
$(shell lsb_release -si 2>/dev/null)
> +MODULE_CFLAGS += -DUBUNTU_RELEASE_CODE=$(subst .,,$(shell lsb_release -sr))
Or you can use | tr -d . instead of subst and keep the flow from left to right.
> +UBUNTU_KERNEL_CODE := $(shell cut -d' ' -f2 /proc/version_signature |cut -d- -f1,2)
^
space missing here
> +UBUNTU_KERNEL_CODE := $(subst -,$(comma),$(UBUNTU_KERNEL_CODE))
> +UBUNTU_KERNEL_CODE := $(subst .,$(comma),$(UBUNTU_KERNEL_CODE))
Would be simpler with | tr -d .-
--
Thomas
^ permalink raw reply [relevance 0%]
* [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54)
@ 2014-07-24 14:28 11% Pablo de Lara
2014-07-24 14:54 0% ` Thomas Monjalon
2014-07-24 15:20 0% ` Chris Wright
0 siblings, 2 replies; 40+ results
From: Pablo de Lara @ 2014-07-24 14:28 UTC (permalink / raw)
To: dev; +Cc: Patrice Buriez
Recent Ubuntu kernel 3.13.0-30.54, although based on Linux kernel 3.13.11,
already provides skb_set_hash() inline function, slightly different than
the one provided by lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
Ubuntu kernel 3.13.0-30.54 provides:
* i40e/i40evf: i40e implementation for skb_set_hash
- https://bugs.launchpad.net/bugs/1328037
- http://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_3.13.0-30.54/changelog
As a result, the implementation provided by kcompat.h must be skipped.
It is not appropriate to test whether LINUX_VERSION_CODE >= KERNEL_VERSION(3,13,11)
because previous Ubuntu kernel 3.13.0-29.53, already based on 3.13.11, needs to
get the implementation provided by kcompat.h
So the full Ubuntu kernel version numbering scheme must be tested:
<base kernel version>-<ABI number>.<upload number>-<flavour>
See "What does a specific Ubuntu kernel version number mean?"
and "How can we determine the version of the running kernel?"
at: https://wiki.ubuntu.com/Kernel/FAQ
Unlike RHEL_RELEASE_CODE, there is no such UBUNTU_RELEASE_CODE available out of
the box, so it needs to be crafted from the Makefile
Similarly, UBUNTU_KERNEL_CODE is generated with ABI and upload numbers.
`lsb_release -si` is first used to check whether we are running Ubuntu
`lsb_release -sr` provides release number 14.04, then converted to integer 1404
/proc/version_signature is parsed to get base kernel version, ABI and upload
numbers, and flavour is dropped
UBUNTU_KERNEL_CODE is indirectly defined using the UBUNTU_KERNEL_VERSION macro,
which in turn is defined in kcompat.h
This makes a single place to define the Ubuntu kernel version numbering scheme,
which is slightly different than the usual "shift by 8" scheme: ABI numbers can
be big (see: https://wiki.ubuntu.com/Kernel/Dev/TopicBranches), so 16-bits have
been reserved for them.
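A quick standalone check of that encoding (our example, not part of the patch) shows that 3.13.0-30.54 compares greater than 3.13.0-29.53, as intended:

#include <stdio.h>

#define UBUNTU_KERNEL_VERSION(a, b, c, abi, upload) \
        (((a) << 40) + ((b) << 32) + ((c) << 24) + ((abi) << 8) + (upload))

int main(void)
{
        unsigned long long v30_54 = UBUNTU_KERNEL_VERSION(3ULL, 13ULL, 0ULL, 30ULL, 54ULL);
        unsigned long long v29_53 = UBUNTU_KERNEL_VERSION(3ULL, 13ULL, 0ULL, 29ULL, 53ULL);

        /* the 16-bit ABI field keeps -30.54 ordered after -29.53 */
        printf("%d\n", v30_54 > v29_53);        /* prints 1 */
        return 0;
}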
Finally, the implementation of skb_set_hash is skipped in kcompat.h if we are
running Ubuntu 14.04 with an Ubuntu kernel >= 3.13.0-30.54
Signed-off-by: Patrice Buriez <patrice.buriez@intel.com>
---
lib/librte_eal/linuxapp/kni/Makefile | 9 +++++++++
lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h | 16 ++++++++++++++++
2 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/lib/librte_eal/linuxapp/kni/Makefile b/lib/librte_eal/linuxapp/kni/Makefile
index fb9462f..725d3e7 100644
--- a/lib/librte_eal/linuxapp/kni/Makefile
+++ b/lib/librte_eal/linuxapp/kni/Makefile
@@ -44,6 +44,15 @@ MODULE_CFLAGS += -I$(RTE_OUTPUT)/include -I$(SRCDIR)/ethtool/ixgbe -I$(SRCDIR)/e
MODULE_CFLAGS += -include $(RTE_OUTPUT)/include/rte_config.h
MODULE_CFLAGS += -Wall -Werror
+ifeq ($(shell type lsb_release >/dev/null 2>&1 && lsb_release -si),Ubuntu)
+MODULE_CFLAGS += -DUBUNTU_RELEASE_CODE=$(subst .,,$(shell lsb_release -sr))
+UBUNTU_KERNEL_CODE := $(shell cut -d' ' -f2 /proc/version_signature |cut -d- -f1,2)
+UBUNTU_KERNEL_CODE := $(subst -,$(comma),$(UBUNTU_KERNEL_CODE))
+UBUNTU_KERNEL_CODE := $(subst .,$(comma),$(UBUNTU_KERNEL_CODE))
+MODULE_CFLAGS += -D"UBUNTU_KERNEL_CODE=UBUNTU_KERNEL_VERSION($(UBUNTU_KERNEL_CODE))"
+endif
+
+
# this lib needs main eal
DEPDIRS-y += lib/librte_eal/linuxapp/eal
diff --git a/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h b/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
index 521a35d..5a06383 100644
--- a/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
+++ b/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
@@ -713,6 +713,20 @@ struct _kc_ethtool_pauseparam {
#define SLE_VERSION_CODE 0
#endif /* SLE_VERSION_CODE */
+/* Ubuntu release and kernel codes must be specified from Makefile */
+#ifndef UBUNTU_RELEASE_VERSION
+#define UBUNTU_RELEASE_VERSION(a,b) (((a) * 100) + (b))
+#endif
+#ifndef UBUNTU_KERNEL_VERSION
+#define UBUNTU_KERNEL_VERSION(a,b,c,abi,upload) (((a) << 40) + ((b) << 32) + ((c) << 24) + ((abi) << 8) + (upload))
+#endif
+#ifndef UBUNTU_RELEASE_CODE
+#define UBUNTU_RELEASE_CODE 0
+#endif
+#ifndef UBUNTU_KERNEL_CODE
+#define UBUNTU_KERNEL_CODE 0
+#endif
+
#ifdef __KLOCWORK__
#ifdef ARRAY_SIZE
#undef ARRAY_SIZE
@@ -3847,6 +3861,7 @@ static inline struct sk_buff *__kc__vlan_hwaccel_put_tag(struct sk_buff *skb,
#if ( LINUX_VERSION_CODE < KERNEL_VERSION(3,14,0) )
#if (!(RHEL_RELEASE_CODE && RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7,0)))
+#if (!(UBUNTU_RELEASE_CODE == UBUNTU_RELEASE_VERSION(14,4) && UBUNTU_KERNEL_CODE >= UBUNTU_KERNEL_VERSION(3,13,0,30,54)))
#ifdef NETIF_F_RXHASH
#define PKT_HASH_TYPE_L3 0
static inline void
@@ -3855,6 +3870,7 @@ skb_set_hash(struct sk_buff *skb, __u32 hash, __always_unused int type)
skb->rxhash = hash;
}
#endif /* NETIF_F_RXHASH */
+#endif /* < 3.13.0-30.54 (Ubuntu 14.04) */
#endif /* < RHEL7 */
#endif /* < 3.14.0 */
--
1.7.0.7
^ permalink raw reply [relevance 11%]
* Re: [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library
2014-05-21 10:21 3% ` Richardson, Bruce
@ 2014-05-21 15:23 3% ` Neil Horman
0 siblings, 0 replies; 40+ results
From: Neil Horman @ 2014-05-21 15:23 UTC (permalink / raw)
To: Richardson, Bruce; +Cc: dev
On Wed, May 21, 2014 at 10:21:26AM +0000, Richardson, Bruce wrote:
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Tuesday, May 20, 2014 7:19 PM
> > To: Richardson, Bruce
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library
> >
> > On Tue, May 20, 2014 at 11:00:55AM +0100, Bruce Richardson wrote:
> > > This adds the code for a new Intel DPDK library for packet distribution.
> > > The distributor is a component which is designed to pass packets
> > > one-at-a-time to workers, with dynamic load balancing. Using the RSS
> > > field in the mbuf as a tag, the distributor tracks what packet tag is
> > > being processed by what worker and then ensures that no two packets with
> > > the same tag are in-flight simultaneously. Once a tag is not in-flight,
> > > then the next packet with that tag will be sent to the next available
> > > core.
> > >
> > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ><snip>
> >
> > ><snip other comments as I agree with your responses to them all save below>
> > Don't need to reserve an extra argument here. You're not ABI safe currently,
> > and if DPDK becomes ABI safe in the future, we will use a linker script to
> > provide versions with backward compatibility easily enough.
We may not have ABI compatibility between releases, but on the other hand we try to reduce the number of code changes that need to be made by our customers who are compiling their code against the libraries - generally linking against static rather than shared libraries. Since we have a reasonable expectation that this field will be needed in a future release, we want to include it now so that when we do need it, no code changes need to be made to upgrade this particular library to a new Intel DPDK version.
I understand why you added the reserved argument, but I still don't think it's a
good idea, especially since you're not ABI safe/stable at the moment. By adding
this argument, you're forcing early users to declare a variable to pass into
your library that they know is unused, and as such likely uninitialized (or at
least initialized to an unknown value). When, in the future, you do make use of
this value, your internal implementation will have to support being
called by both 'old' applications that just pass in any old value, and 'new'
users who pass in valid data, and the implementation won't have any way to
differentiate between the two. You can certainly document a reserved value that
current users must initialize that variable to, so that you can make that
differentiation, but you have to hope they do that correctly and consistently.
It seems to me it would be better to do something like:
1) Not include the reserved parameter
2) When you do add the extra parameter, rename the function as well, and
3) provide a compatibility function that preserves the old API and passes the
reserved value as the new parameter to the renamed function in (2)
That way old applications will run transparently, and you don't have to hope
they code the reserved values properly (note you can also do this with a macro
if you want to save the call instruction)
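For example, the compatibility-wrapper route could look like this; the _v2 name and the extra 'flags' argument are hypothetical placeholders for whatever the future parameter becomes:

/* future entry point with the extra parameter */
struct rte_mbuf *
rte_distributor_get_pkt_v2(struct rte_distributor *d, unsigned worker_id,
        struct rte_mbuf *oldpkt, unsigned flags);

/* old entry point kept for existing callers; forwards a benign default */
static inline struct rte_mbuf *
rte_distributor_get_pkt(struct rte_distributor *d, unsigned worker_id,
        struct rte_mbuf *oldpkt)
{
        return rte_distributor_get_pkt_v2(d, worker_id, oldpkt, 0);
}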
Ideally, you would just do this with a version script during linking, so that
you could include 2 versions of the same function name (v1 without the extra
parameter and v2 with the extra parameter), and old applications linked against
v1 would just continue to work, but dpdk isn't there yet :)
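And, for completeness, a sketch of the symbol-versioning route with the GNU toolchain: both bodies live in the library, a linker version script defining the DPDK_1.0/DPDK_1.1 nodes is required at link time, and all names and version strings here are hypothetical:

struct rte_mbuf *
get_pkt_v2(struct rte_distributor *d, unsigned worker_id,
        struct rte_mbuf *oldpkt, unsigned flags)
{
        (void)d; (void)worker_id; (void)flags;
        return oldpkt;  /* placeholder for the real implementation */
}

struct rte_mbuf *
get_pkt_v1(struct rte_distributor *d, unsigned worker_id,
        struct rte_mbuf *oldpkt)
{
        /* old ABI: forward a default value for the new parameter */
        return get_pkt_v2(d, worker_id, oldpkt, 0);
}

__asm__(".symver get_pkt_v1, rte_distributor_get_pkt@DPDK_1.0");
__asm__(".symver get_pkt_v2, rte_distributor_get_pkt@@DPDK_1.1");

Old binaries keep resolving rte_distributor_get_pkt@DPDK_1.0, while newly built applications bind to the default DPDK_1.1 version.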
Neil
^ permalink raw reply [relevance 3%]
* Re: [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library
2014-05-20 18:18 4% ` Neil Horman
@ 2014-05-21 10:21 3% ` Richardson, Bruce
2014-05-21 15:23 3% ` Neil Horman
0 siblings, 1 reply; 40+ results
From: Richardson, Bruce @ 2014-05-21 10:21 UTC (permalink / raw)
To: Neil Horman; +Cc: dev
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Tuesday, May 20, 2014 7:19 PM
> To: Richardson, Bruce
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library
>
> On Tue, May 20, 2014 at 11:00:55AM +0100, Bruce Richardson wrote:
> > This adds the code for a new Intel DPDK library for packet distribution.
> > The distributor is a component which is designed to pass packets
> > one-at-a-time to workers, with dynamic load balancing. Using the RSS
> > field in the mbuf as a tag, the distributor tracks what packet tag is
> > being processed by what worker and then ensures that no two packets with
> > the same tag are in-flight simultaneously. Once a tag is not in-flight,
> > then the next packet with that tag will be sent to the next available
> > core.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ><snip>
>
> > +#define RTE_DISTRIB_GET_BUF (1)
> > +#define RTE_DISTRIB_RETURN_BUF (2)
> > +
> Can you document the meaning of these bits please, the code makes it
> somewhat
> confusing to differentiate them. As I read the code, GET_BUF is used as a flag
> to indicate that rte_distributor_get_pkt needs to wait while a buffer is
> filled in by the processing thread, while RETURN_BUF indicates that a worker is
> leaving and the buffer needs to be (re)assigned to an alternate worker, is that
> correct?
Pretty much. I'll add additional comments to the code.
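Based on that reading, the documented bits might end up looking roughly like
this (the comments are a sketch of the semantics discussed here, not the
wording that went into the patch):

/* worker has handed back oldpkt and is asking for a new packet; the
 * processing thread clears this bit once it has filled the buffer */
#define RTE_DISTRIB_GET_BUF (1)
/* worker is leaving; its buffer and in-flight tag need to be released
 * so they can be (re)assigned to another worker */
#define RTE_DISTRIB_RETURN_BUF (2)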
>
> > +#define RTE_DISTRIB_BACKLOG_SIZE 8
> > +#define RTE_DISTRIB_BACKLOG_MASK (RTE_DISTRIB_BACKLOG_SIZE - 1)
> > +
> > +#define RTE_DISTRIB_MAX_RETURNS 128
> > +#define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1)
> > +
> > +union rte_distributor_buffer {
> > + volatile int64_t bufptr64;
> > + char pad[CACHE_LINE_SIZE*3];
> Do you need the pad, if you mark the struct as cache aligned?
Yes, for performance reasons we actually want the structure to take up three cache lines, not just one. For instance, this will guarantee that the hardware adjacent-line prefetcher doesn't pull in an additional cache line - belonging to a different worker - when we access the memory.
> > +} __rte_cache_aligned;
> >
> +
> ><snip>
> > +
> > +struct rte_mbuf *
> > +rte_distributor_get_pkt(struct rte_distributor *d,
> > + unsigned worker_id, struct rte_mbuf *oldpkt,
> > + unsigned reserved __rte_unused)
> > +{
> > + union rte_distributor_buffer *buf = &d->bufs[worker_id];
> > + int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS) |
> \
> > + RTE_DISTRIB_GET_BUF;
> > + while (unlikely(buf->bufptr64 & RTE_DISTRIB_FLAGS_MASK))
> > + rte_pause();
> > + buf->bufptr64 = req;
> > + while (buf->bufptr64 & RTE_DISTRIB_GET_BUF)
> > + rte_pause();
> You may want to document the fact that this is deadlock prone. You clearly
> state that only a single thread can run the processing routine, but if a user
> selects a single worker thread to perform double duty, the GET_BUF_FLAG will
> never get cleared here, and no other queues will get processed.
Agreed, I'll update the comments.
>
> > + /* since bufptr64 is a signed value, this should be an arithmetic shift */
> > + int64_t ret = buf->bufptr64 >> RTE_DISTRIB_FLAG_BITS;
> > + return (struct rte_mbuf *)((uintptr_t)ret);
> > +}
> > +
> > +int
> > +rte_distributor_return_pkt(struct rte_distributor *d,
> > + unsigned worker_id, struct rte_mbuf *oldpkt)
> > +{
> Maybe some optional sanity checking, here and above, to ensure that a packet
> returned through get_pkt doesn't also get returned here, mangling the flags
> field?
That actually shouldn't be an issue.
When we return a packet using this call, we just set the in_flight_ids value for the worker to zero (and re-assign the backlog, if any), and move on to the next worker. No checking of the returned packet is done. Also, since get_pkt always returns a new packet, the internal logic will still work ok - all that will happen if you return the wrong packet, e.g. by returning the same packet twice rather than returning the latest packet each time, is that the returns array will have the duplicated pointer in it. Whatever gets passed back by the worker gets stored directly there - it's up to the worker to return the correct pointer to the distributor.
>
> ><snip>
> > +
> > +/* flush the distributor, so that there are no outstanding packets in flight or
> > + * queued up. */
> > +int
> > +rte_distributor_flush(struct rte_distributor *d)
> > +{
> You need to document that this function can only be called by the same thread
> that is running rte_distributor_process, lest you corrupt your queue data.
> Alternatively, it might be nicer to modify this function's internals to set a
> flag in the distributor status bits to make the process routine do the flush
> work when it gets set. That would allow this function to be called by any
> other thread, which seems like a more natural interface.
Agreed. At minimum I'll update the comments, and I'll also look into what would be involved in changing the mechanism like you describe. However, given the limited time to the code freeze date, it may not be possible to do here. [I also don't anticipate this function being much used in normal operations anyway - it was written in order to allow me to write proper unit tests to test the process function. We need a flush function for unit testing to make sure that our packet counts are predictable at the end of each test run, and to eliminate any dependency in the tests on the internal buffer sizes of the distributor.]
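For what it's worth, a rough sketch of the flag-based flush being discussed
might look like this (the status_flags field, the RTE_DISTRIB_FLUSH_REQ bit
and the drain helper are hypothetical, not part of the patch):

#define RTE_DISTRIB_FLUSH_REQ (1 << 0)	/* hypothetical status bit */

/* callable from any thread: just request the flush */
int
rte_distributor_flush(struct rte_distributor *d)
{
	__sync_fetch_and_or(&d->status_flags, RTE_DISTRIB_FLUSH_REQ);
	return 0;
}

/* called from rte_distributor_process() on the single processing thread */
static void
handle_flush_request(struct rte_distributor *d)
{
	if (d->status_flags & RTE_DISTRIB_FLUSH_REQ) {
		drain_all_backlog(d);	/* hypothetical helper */
		__sync_fetch_and_and(&d->status_flags, ~RTE_DISTRIB_FLUSH_REQ);
	}
}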
>
> ><snip>
> > +}
> > +
> > +/* clears the internal returns array in the distributor */
> > +void
> > +rte_distributor_clear_returns(struct rte_distributor *d)
> > +{
> This can also only be called by the same thread that runs the process routine,
> lest the start and count values get mis-assigned.
Agreed. Will update comments.
>
> > + d->returns.start = d->returns.count = 0;
> > +#ifndef __OPTIMIZE__
> > + memset(d->returns.mbufs, 0, sizeof(d->returns.mbufs));
> > +#endif
> > +}
> > +
> > +/* creates a distributor instance */
> > +struct rte_distributor *
> > +rte_distributor_create(const char *name,
> > + unsigned socket_id,
> > + unsigned num_workers,
> > + struct rte_distributor_extra_args *args __rte_unused)
> > +{
> > + struct rte_distributor *d;
> > + struct rte_distributor_list *distributor_list;
> > + char mz_name[RTE_MEMZONE_NAMESIZE];
> > + const struct rte_memzone *mz;
> > +
> > + /* compilation-time checks */
> > + RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0);
> > + RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0);
> > +
> > + if (name == NULL || num_workers >= RTE_MAX_LCORE) {
> > + rte_errno = EINVAL;
> > + return NULL;
> > + }
> > + rte_snprintf(mz_name, sizeof(mz_name), RTE_DISTRIB_PREFIX"%s",
> name);
> > + mz = rte_memzone_reserve(mz_name, sizeof(*d), socket_id,
> NO_FLAGS);
> > + if (mz == NULL) {
> > + rte_errno = ENOMEM;
> > + return NULL;
> > + }
> > +
> > + /* check that we have an initialised tail queue */
> > + if ((distributor_list =
> RTE_TAILQ_LOOKUP_BY_IDX(RTE_TAILQ_DISTRIBUTOR,
> > + rte_distributor_list)) == NULL) {
> > + rte_errno = E_RTE_NO_TAILQ;
> > + return NULL;
> > + }
> > +
> > + d = mz->addr;
> > + rte_snprintf(d->name, sizeof(d->name), "%s", name);
> > + d->num_workers = num_workers;
> > + TAILQ_INSERT_TAIL(distributor_list, d, next);
> You need locking around this list unless you intend to assert that distributor
> creation and destruction must only be performed from a single thread. Also,
> where is the API method to tear down a distributor instance?
Ack re locking, will make this match what's done for other structures.
For tearing down, that's not possible until such time as we get a function to free memzones back. Rings and mempools similarly have no free function.
>
> ><snip>
> > +#endif
> > +
> > +#include <rte_mbuf.h>
> > +
> > +#define RTE_DISTRIBUTOR_NAMESIZE 32 /**< Length of name for instance
> */
> > +
> > +struct rte_distributor;
> > +
> > +struct rte_distributor_extra_args { }; /**< reserved for future use*/
> > +
> You don't need to reserve a struct name for future use. No one will use it
> until you create it.
>
> > +struct rte_mbuf *
> > +rte_distributor_get_pkt(struct rte_distributor *d,
> > + unsigned worker_id, struct rte_mbuf *oldpkt, unsigned
> reserved);
> > +
> Don't need to reserve an extra argument here. You're not ABI safe currently,
> and if DPDK becomes ABI safe in the future, we will use a linker script to
> provide versions with backward compatibility easily enough.
We may not have ABI compatibility between releases, but on the other hand we try to reduce the number of code changes that need to be made by our customers who are compiling their code against the libraries - generally linking against static rather than shared libraries. Since we have a reasonable expectation that this field will be needed in a future release, we want to include it now so that when we do need it, no code changes need to be made to upgrade this particular library to a new Intel DPDK version.
^ permalink raw reply [relevance 3%]
* Re: [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library
@ 2014-05-20 18:18 4% ` Neil Horman
2014-05-21 10:21 3% ` Richardson, Bruce
0 siblings, 1 reply; 40+ results
From: Neil Horman @ 2014-05-20 18:18 UTC (permalink / raw)
To: Bruce Richardson; +Cc: dev
On Tue, May 20, 2014 at 11:00:55AM +0100, Bruce Richardson wrote:
> This adds the code for a new Intel DPDK library for packet distribution.
> The distributor is a component which is designed to pass packets
> one-at-a-time to workers, with dynamic load balancing. Using the RSS
> field in the mbuf as a tag, the distributor tracks what packet tag is
> being processed by what worker and then ensures that no two packets with
> the same tag are in-flight simultaneously. Once a tag is not in-flight,
> then the next packet with that tag will be sent to the next available
> core.
>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
><snip>
> +#define RTE_DISTRIB_GET_BUF (1)
> +#define RTE_DISTRIB_RETURN_BUF (2)
> +
Can you document the meaning of these bits please? The code makes it somewhat
confusing to differentiate them. As I read the code, GET_BUF is used as a flag
to indicate that rte_distributor_get_pkt needs to wait while a buffer is
filled in by the processing thread, while RETURN_BUF indicates that a worker is
leaving and the buffer needs to be (re)assigned to an alternate worker, is that
correct?
> +#define RTE_DISTRIB_BACKLOG_SIZE 8
> +#define RTE_DISTRIB_BACKLOG_MASK (RTE_DISTRIB_BACKLOG_SIZE - 1)
> +
> +#define RTE_DISTRIB_MAX_RETURNS 128
> +#define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1)
> +
> +union rte_distributor_buffer {
> + volatile int64_t bufptr64;
> + char pad[CACHE_LINE_SIZE*3];
Do you need the pad, if you mark the struct as cache aligned?
> +} __rte_cache_aligned;
>
+
><snip>
> +
> +struct rte_mbuf *
> +rte_distributor_get_pkt(struct rte_distributor *d,
> + unsigned worker_id, struct rte_mbuf *oldpkt,
> + unsigned reserved __rte_unused)
> +{
> + union rte_distributor_buffer *buf = &d->bufs[worker_id];
> + int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS) | \
> + RTE_DISTRIB_GET_BUF;
> + while (unlikely(buf->bufptr64 & RTE_DISTRIB_FLAGS_MASK))
> + rte_pause();
> + buf->bufptr64 = req;
> + while (buf->bufptr64 & RTE_DISTRIB_GET_BUF)
> + rte_pause();
You may want to document the fact that this is deadlock prone. You clearly
state that only a single thread can run the processing routine, but if a user
selects a single worker thread to perform double duty, the GET_BUF_FLAG will
never get cleared here, and no other queues will get processed.
> + /* since bufptr64 is a signed value, this should be an arithmetic shift */
> + int64_t ret = buf->bufptr64 >> RTE_DISTRIB_FLAG_BITS;
> + return (struct rte_mbuf *)((uintptr_t)ret);
> +}
> +
> +int
> +rte_distributor_return_pkt(struct rte_distributor *d,
> + unsigned worker_id, struct rte_mbuf *oldpkt)
> +{
Maybe some optional sanity checking, here and above, to ensure that a packet
returned through get_pkt doesn't also get returned here, mangling the flags
field?
><snip>
> +
> +/* flush the distributor, so that there are no outstanding packets in flight or
> + * queued up. */
> +int
> +rte_distributor_flush(struct rte_distributor *d)
> +{
You need to document that this function can only be called by the same thread
that is running rte_distributor_process, lest you corrupt your queue data.
Alternatively, it might be nicer to modify this function's internals to set a
flag in the distributor status bits to make the process routine do the flush
work when it gets set. That would allow this function to be called by any
other thread, which seems like a more natural interface.
><snip>
> +}
> +
> +/* clears the internal returns array in the distributor */
> +void
> +rte_distributor_clear_returns(struct rte_distributor *d)
> +{
This can also only be called by the same thread that runs the process routine,
lest the start and count values get mis-assigned.
> + d->returns.start = d->returns.count = 0;
> +#ifndef __OPTIMIZE__
> + memset(d->returns.mbufs, 0, sizeof(d->returns.mbufs));
> +#endif
> +}
> +
> +/* creates a distributor instance */
> +struct rte_distributor *
> +rte_distributor_create(const char *name,
> + unsigned socket_id,
> + unsigned num_workers,
> + struct rte_distributor_extra_args *args __rte_unused)
> +{
> + struct rte_distributor *d;
> + struct rte_distributor_list *distributor_list;
> + char mz_name[RTE_MEMZONE_NAMESIZE];
> + const struct rte_memzone *mz;
> +
> + /* compilation-time checks */
> + RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0);
> + RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0);
> +
> + if (name == NULL || num_workers >= RTE_MAX_LCORE) {
> + rte_errno = EINVAL;
> + return NULL;
> + }
> + rte_snprintf(mz_name, sizeof(mz_name), RTE_DISTRIB_PREFIX"%s", name);
> + mz = rte_memzone_reserve(mz_name, sizeof(*d), socket_id, NO_FLAGS);
> + if (mz == NULL) {
> + rte_errno = ENOMEM;
> + return NULL;
> + }
> +
> + /* check that we have an initialised tail queue */
> + if ((distributor_list = RTE_TAILQ_LOOKUP_BY_IDX(RTE_TAILQ_DISTRIBUTOR,
> + rte_distributor_list)) == NULL) {
> + rte_errno = E_RTE_NO_TAILQ;
> + return NULL;
> + }
> +
> + d = mz->addr;
> + rte_snprintf(d->name, sizeof(d->name), "%s", name);
> + d->num_workers = num_workers;
> + TAILQ_INSERT_TAIL(distributor_list, d, next);
You need locking around this list unless you intend to assert that distributor
creation and destruction must only be performed from a single thread. Also,
where is the API method to tear down a distributor instance?
><snip>
> +#endif
> +
> +#include <rte_mbuf.h>
> +
> +#define RTE_DISTRIBUTOR_NAMESIZE 32 /**< Length of name for instance */
> +
> +struct rte_distributor;
> +
> +struct rte_distributor_extra_args { }; /**< reserved for future use*/
> +
You don't need to reserve a struct name for future use. No one will use it
until you create it.
> +struct rte_mbuf *
> +rte_distributor_get_pkt(struct rte_distributor *d,
> + unsigned worker_id, struct rte_mbuf *oldpkt, unsigned reserved);
> +
Don't need to reserve an extra argument here. You're not ABI safe currently,
and if DPDK becomes ABI safe in the future, we will use a linker script to
provide versions with backward compatibility easily enough.
Neil
^ permalink raw reply [relevance 4%]
* Re: [dpdk-dev] Heads up: Fedora packaging plans
2014-05-19 10:11 0% ` Thomas Monjalon
@ 2014-05-19 13:18 0% ` Neil Horman
0 siblings, 0 replies; 40+ results
From: Neil Horman @ 2014-05-19 13:18 UTC (permalink / raw)
To: Thomas Monjalon; +Cc: dev
On Mon, May 19, 2014 at 12:11:35PM +0200, Thomas Monjalon wrote:
> Hi Neil,
>
> Thanks for sharing your progress.
>
No worries.
> My main concerns are about naming and extensions.
> We must keep "dpdk-core" naming in order to distinguish it from PMD
> extensions.
I don't see why. We can name packages whatever we want, as long as the spec and
srpm share the same name. It seems to me that the core should be the base name
of the package while the extensions should have some extension on their name.
> And then, packaging of memnic and non-uio paravirtualization PMDs
> (virtio/vmxnet3) are missing.
>
They're in separate repositories; I was planning on packaging them separately at
a later time, since their versioning and development are handled separately.
> 2014-05-13 15:08, Neil Horman:
> > My current effort to do so. I've made some changes from the stock spec file
> > included in dpdk:
>
> We should try to get .spec for Fedora and in-tree .spec as common as possible.
> There are probably some things to push.
>
Ok, sure, just keep in mind that different distributions have different
packaging requirements that may affect the contents of the spec file, and so
attaining parity may not be possible (or even worthwhile).
> > * Modified the version and release values to be separate from the name. I
> > did some reading on requirements for packaging and it seems we can be a bit
> > more lax with ABI version on a pre-release I think, so I setup the N-V-R to
> > use pre-release conventions, which makes sense, given that this is a 1.7.0
> > pre-release. The git tag on the release value will get bumped as we move
> > forward in the patch series.
>
> I thought that we should put version in the name, in order to be able to
> install many versions together. How is it handled by yum?
>
So, I spent some time thinking about this, and I _really_ want to avoid the
inclusion of a version with the package name. Doing so, while it allows yum to
install multiple versions side-by-side, is a real overhead for me, as it
requires that I go through a new package review process for each released
version that we want to package. I do not have time to do that. If someone
from 6wind or intel wants to get involved in the packaging process we can look
at that as a solution, but while I'm doing it, it's really just too much
overhead. This method will allow multiple versions to be installed side by side
as well. The tradeoff is that yum doesn't directly allow that, as it will just
perform an upgrade. The multiple-version solution will require that you
download older versions and install them directly using rpm commands. I think
that's a fair tradeoff.
> > * Added config files to match desired configs for Fedora (i.e. disabled
> > PMD's that require out of tree kernel modules
>
> It would be clearer to make your configuration changes with "sed -i".
> In a near future we would probably need a "configure" script to do it.
>
I really disagree. It's not clearer in my mind at all - in that the final config
file is a product of two pieces of information (the base config file, and the
sed scripts that you run on it), as opposed to one piece (the canonical modified
config specified in the source line). Using sed also implies that you need to
list sed as a BuildRequires (minimal buildroots may not include sed when they
are spun up).
> So you don't package igb_uio but you build it because there is no option to
> disable it currently. We should add such option.
>
Not sure what you mean here. The only uio code I see in the package is the uio
unbind script for igb, which should still work just fine (save for the fact that
we don't have a user space PMD to attach the hardware to). I can certainly
remove the script though so it doesn't appear in the package until such time as
the LAD group integrates the uio code in the upstream driver.
> > * Moved the package target directories to include N-V of the package in the
> > path names. This allows for multiple versions of the dpdk to be installed
> > in parallel (I.e. dpdk-1.7.0 files are in /lib/dpdk-1.7.0,
> > /usr/include/dpdk-1.7.0, etc). This is how java packages allow for
> > multiple version installs, and makes sense given ABI instability in dpdk.
> > It will require that developers add some -I / -L paths to their makefiles
> > to pull the proper version, but I think thats a fair tradeoff.
>
> I don't see version for include directory and bin directory (testpmd).
>
Yup, need to fix that. Thank you!
Neil
> Thanks
> --
> Thomas
>
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] Heads up: Fedora packaging plans
2014-05-13 19:08 4% [dpdk-dev] Heads up: Fedora packaging plans Neil Horman
@ 2014-05-19 10:11 0% ` Thomas Monjalon
2014-05-19 13:18 0% ` Neil Horman
0 siblings, 1 reply; 40+ results
From: Thomas Monjalon @ 2014-05-19 10:11 UTC (permalink / raw)
To: Neil Horman; +Cc: dev
Hi Neil,
Thanks for sharing your progress.
My main concerns are about naming and extensions.
We must keep "dpdk-core" naming in order to distinguish it from PMD
extensions. And then, packaging of memnic and non-uio paravirtualization PMDs
(virtio/vmxnet3) is missing.
2014-05-13 15:08, Neil Horman:
> My current effort to do so. I've made some changes from the stock spec file
> included in dpdk:
We should try to get .spec for Fedora and in-tree .spec as common as possible.
There are probably some things to push.
> * Modified the version and release values to be separate from the name. I
> did some reading on requirements for packaging and it seems we can be a bit
> more lax with ABI version on a pre-release I think, so I setup the N-V-R to
> use pre-release conventions, which makes sense, given that this is a 1.7.0
> pre-release. The git tag on the release value will get bumped as we move
> forward in the patch series.
I thought that we should put version in the name, in order to be able to
install many versions together. How is it handled by yum?
> * Added config files to match desired configs for Fedora (i.e. disabled
> PMD's that require out of tree kernel modules
It would be clearer to make your configuration changes with "sed -i".
In the near future we would probably need a "configure" script to do it.
So you don't package igb_uio but you build it because there is no option to
disable it currently. We should add such an option.
> * Moved the package target directories to include N-V of the package in the
> path names. This allows for multiple versions of the dpdk to be installed
> in parallel (I.e. dpdk-1.7.0 files are in /lib/dpdk-1.7.0,
> /usr/include/dpdk-1.7.0, etc). This is how java packages allow for
> multiple version installs, and makes sense given ABI instability in dpdk.
> It will require that developers add some -I / -L paths to their makefiles
> to pull the proper version, but I think thats a fair tradeoff.
I don't see the version for the include directory and the bin directory (testpmd).
Thanks
--
Thomas
^ permalink raw reply [relevance 0%]
* [dpdk-dev] Heads up: Fedora packaging plans
@ 2014-05-13 19:08 4% Neil Horman
2014-05-19 10:11 0% ` Thomas Monjalon
0 siblings, 1 reply; 40+ results
From: Neil Horman @ 2014-05-13 19:08 UTC (permalink / raw)
To: dev
Hey all-
This isn't really germane to dpdk development, but Thomas and Vincent,
you expressed interest in my progress regarding packaging of dpdk for Fedora, so
I figured I would post here in case others were interested.
Please find here:
http://people.redhat.com/nhorman/dpdk-1.7.0-0.1.gitb20539d68.src.rpm
My current effort to do so. I've made some changes from the stock spec file
included in dpdk:
* Modified the version and release values to be separate from the name. I did
some reading on requirements for packaging and it seems we can be a bit more lax
with ABI version on a pre-release I think, so I setup the N-V-R to use
pre-release conventions, which makes sense, given that this is a 1.7.0
pre-release. The git tag on the release value will get bumped as we move forward
in the patch series.
* Added config files to match desired configs for Fedora (i.e. disabled PMD's
that require out-of-tree kernel modules)
* Removed Packager tag (Fedora doesn't use those)
* Moved the package target directories to include N-V of the package in the path
names. This allows for multiple versions of the dpdk to be installed in
parallel (i.e. dpdk-1.7.0 files are in /lib/dpdk-1.7.0, /usr/include/dpdk-1.7.0,
etc). This is how java packages allow for multiple version installs, and makes
sense given ABI instability in dpdk. It will require that developers add some
-I / -L paths to their makefiles to pull the proper version, but I think that's a
fair tradeoff.
My plan is to go through the review process with this package, and update to
tagged 1.7.0 as soon as it's ready.
Neil
^ permalink raw reply [relevance 4%]
* Re: [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages
2014-05-01 13:14 0% ` Neil Horman
@ 2014-05-01 21:15 0% ` Thomas Monjalon
0 siblings, 0 replies; 40+ results
From: Thomas Monjalon @ 2014-05-01 21:15 UTC (permalink / raw)
To: Neil Horman; +Cc: dev
2014-05-01 09:14, Neil Horman:
> On Wed, Apr 30, 2014 at 02:46:41AM +0200, Thomas Monjalon wrote:
> > The goal of this patch serie is to be able to package DPDK
> > for RPM-based distributions.
> >
> > The file naming currently doesn't allow to install different DPDK
> > versions.
> > But the packaging naming should be ready to manage different DPDK versions
> >
> > having different API/ABI for different applications:
> > - dpdk-core has full version in its name to manage API breaking
> > - extensions have a number as name suffix to manage PMD API breaking.
> >
> > When API/ABI will be stable, package names could be simpler.
> >
> > I suggest to add these .spec files as a starting point for integration
> > in Linux distributions.
> >
> > Changes since v1:
> > - name of .spec file match package name
> > - version in package name
> > - no static library
> > - ldconfig/depmod in scriplets
> >
> > Thanks for your comments/reviews.
>
> I understand that this is holding up the 1.6.0r2 release, as well as the
> 1.7.0 integration. As such, given that my concerns, while valid IMO,
> aren't required for the release:
>
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
Applied for dpdk-1.6.0r2, memnic-1.1, vmxnet3-usermap 1.2
and virtio-net-pmd-1.2.
Thanks Neil and other RedHat people for helping in this first step.
--
Thomas
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages
2014-04-30 0:46 4% [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Thomas Monjalon
2014-04-30 0:46 4% ` [dpdk-dev] [PATCH v2 1/4] pkg: add recipe for RPM Thomas Monjalon
2014-04-30 10:52 0% ` [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Neil Horman
@ 2014-05-01 13:14 0% ` Neil Horman
2014-05-01 21:15 0% ` Thomas Monjalon
2 siblings, 1 reply; 40+ results
From: Neil Horman @ 2014-05-01 13:14 UTC (permalink / raw)
To: Thomas Monjalon; +Cc: dev
On Wed, Apr 30, 2014 at 02:46:41AM +0200, Thomas Monjalon wrote:
> The goal of this patch serie is to be able to package DPDK
> for RPM-based distributions.
>
> The file naming currently doesn't allow to install different DPDK versions.
> But the packaging naming should be ready to manage different DPDK versions
> having different API/ABI for different applications:
> - dpdk-core has full version in its name to manage API breaking
> - extensions have a number as name suffix to manage PMD API breaking.
> When API/ABI will be stable, package names could be simpler.
>
> I suggest to add these .spec files as a starting point for integration
> in Linux distributions.
>
> Changes since v1:
> - name of .spec file match package name
> - version in package name
> - no static library
> - ldconfig/depmod in scriplets
>
> Thanks for your comments/reviews.
> --
> Thomas
>
I understand that this is holding up the 1.6.0r2 release, as well as the 1.7.0
integration. As such, given that my concerns, while valid IMO, aren't required
for the release:
Acked-by: Neil Horman <nhorman@tuxdriver.com>
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages
2014-04-30 0:46 4% [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Thomas Monjalon
2014-04-30 0:46 4% ` [dpdk-dev] [PATCH v2 1/4] pkg: add recipe for RPM Thomas Monjalon
@ 2014-04-30 10:52 0% ` Neil Horman
2014-05-01 13:14 0% ` Neil Horman
2 siblings, 0 replies; 40+ results
From: Neil Horman @ 2014-04-30 10:52 UTC (permalink / raw)
To: Thomas Monjalon; +Cc: dev
On Wed, Apr 30, 2014 at 02:46:41AM +0200, Thomas Monjalon wrote:
> The goal of this patch serie is to be able to package DPDK
> for RPM-based distributions.
>
> The file naming currently doesn't allow to install different DPDK versions.
> But the packaging naming should be ready to manage different DPDK versions
> having different API/ABI for different applications:
> - dpdk-core has full version in its name to manage API breaking
> - extensions have a number as name suffix to manage PMD API breaking.
> When API/ABI will be stable, package names could be simpler.
>
> I suggest to add these .spec files as a starting point for integration
> in Linux distributions.
>
> Changes since v1:
> - name of .spec file match package name
> - version in package name
> - no static library
> - ldconfig/depmod in scriplets
>
> Thanks for your comments/reviews.
> --
> Thomas
>
You should merge these into a single spec file so that you only have to build
once. That also means you only need to adjust the version information in the
spec file once, and the built packages all get the same versioning.
Neil
^ permalink raw reply [relevance 0%]
* [dpdk-dev] [PATCH v2 1/4] pkg: add recipe for RPM
2014-04-30 0:46 4% [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Thomas Monjalon
@ 2014-04-30 0:46 4% ` Thomas Monjalon
2014-04-30 10:52 0% ` [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Neil Horman
2014-05-01 13:14 0% ` Neil Horman
2 siblings, 0 replies; 40+ results
From: Thomas Monjalon @ 2014-04-30 0:46 UTC (permalink / raw)
To: dev
Packages can be built with:
RPM_BUILD_NCPUS=8 rpmbuild -ta dpdk-1.6.0r1.tar.gz
There are packages for runtime and development.
Once devel package is installed, it can be used like this:
make -C /usr/share/dpdk/examples/helloworld RTE_SDK=/usr/share/dpdk
Signed-off-by: Thomas Monjalon <thomas.monjalon@6wind.com>
---
pkg/dpdk-core.spec | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 129 insertions(+)
create mode 100644 pkg/dpdk-core.spec
diff --git a/pkg/dpdk-core.spec b/pkg/dpdk-core.spec
new file mode 100644
index 0000000..77d6c76
--- /dev/null
+++ b/pkg/dpdk-core.spec
@@ -0,0 +1,129 @@
+# Copyright 2014 6WIND S.A.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#
+# - Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# - Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in
+# the documentation and/or other materials provided with the
+# distribution.
+#
+# - Neither the name of 6WIND S.A. nor the names of its
+# contributors may be used to endorse or promote products derived
+# from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+# FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+# COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+# OF THE POSSIBILITY OF SUCH DAMAGE.
+
+# name includes full version because there is no ABI stability yet
+Name: dpdk-core-1.6.0
+Version: r1
+%define fullversion 1.6.0%{version}
+Release: 1
+Packager: packaging@6wind.com
+URL: http://dpdk.org
+Source: http://dpdk.org/browse/dpdk/snapshot/dpdk-%{fullversion}.tar.gz
+
+Summary: Intel(r) Data Plane Development Kit core
+Group: System Environment/Libraries
+License: BSD and LGPLv2 and GPLv2
+
+ExclusiveArch: i686, x86_64
+%define target %{_arch}-default-linuxapp-gcc
+%define machine default
+
+BuildRequires: kernel-devel, kernel-headers, doxygen
+
+%description
+Intel(r) DPDK core includes kernel modules, core libraries and tools.
+testpmd application allows to test fast packet processing environments
+on x86 platforms. For instance, it can be used to check that environment
+can support fast path applications such as 6WINDGate, pktgen, rumptcpip, etc.
+More libraries are available as extensions in other packages.
+
+%package devel
+Summary: Intel(r) Data Plane Development Kit core for development
+%description devel
+Intel(r) DPDK core-devel is a set of makefiles, headers, examples and documentation
+for fast packet processing on x86 platforms.
+More libraries are available as extensions in other packages.
+
+%define destdir %{buildroot}%{_prefix}
+%define moddir /lib/modules/%(uname -r)/extra
+%define datadir %{_datadir}/dpdk
+%define docdir %{_docdir}/dpdk
+
+%prep
+%setup -qn dpdk-%{fullversion}
+
+%build
+make O=%{target} T=%{target} config
+sed -ri 's,(RTE_MACHINE=).*,\1%{machine},' %{target}/.config
+sed -ri 's,(RTE_APP_TEST=).*,\1n,' %{target}/.config
+sed -ri 's,(RTE_BUILD_SHARED_LIB=).*,\1y,' %{target}/.config
+make O=%{target} %{?_smp_mflags}
+make O=%{target} doc
+
+%install
+rm -rf %{buildroot}
+make O=%{target} DESTDIR=%{destdir}
+mkdir -p %{buildroot}%{moddir}
+mv %{destdir}/%{target}/kmod/*.ko %{buildroot}%{moddir}
+rmdir %{destdir}/%{target}/kmod
+mkdir -p %{buildroot}%{_sbindir}
+ln -s %{datadir}/tools/igb_uio_bind.py %{buildroot}%{_sbindir}/igb_uio_bind
+mkdir -p %{buildroot}%{_bindir}
+mv %{destdir}/%{target}/app/testpmd %{buildroot}%{_bindir}
+rmdir %{destdir}/%{target}/app
+mv %{destdir}/%{target}/include %{buildroot}%{_includedir}
+mv %{destdir}/%{target}/lib %{buildroot}%{_libdir}
+mkdir -p %{buildroot}%{docdir}
+mv %{destdir}/%{target}/doc/* %{buildroot}%{docdir}
+rmdir %{destdir}/%{target}/doc
+mkdir -p %{buildroot}%{datadir}
+mv %{destdir}/%{target}/.config %{buildroot}%{datadir}/config
+mv %{destdir}/%{target} %{buildroot}%{datadir}
+mv %{destdir}/mk %{buildroot}%{datadir}
+cp -a examples %{buildroot}%{datadir}
+cp -a tools %{buildroot}%{datadir}
+ln -s %{datadir}/config %{buildroot}%{datadir}/%{target}/.config
+ln -s %{_includedir} %{buildroot}%{datadir}/%{target}/include
+ln -s %{_libdir} %{buildroot}%{datadir}/%{target}/lib
+
+%files
+%dir %{datadir}
+%{datadir}/config
+%{datadir}/tools
+%{moddir}/*
+%{_sbindir}/*
+%{_bindir}/*
+%{_libdir}/*
+
+%files devel
+%{_includedir}/*
+%{datadir}/mk
+%{datadir}/%{target}
+%{datadir}/examples
+%doc %{docdir}
+
+%post
+/sbin/ldconfig
+/sbin/depmod
+
+%postun
+/sbin/ldconfig
+/sbin/depmod
--
1.9.2
^ permalink raw reply [relevance 4%]
* [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages
@ 2014-04-30 0:46 4% Thomas Monjalon
2014-04-30 0:46 4% ` [dpdk-dev] [PATCH v2 1/4] pkg: add recipe for RPM Thomas Monjalon
` (2 more replies)
0 siblings, 3 replies; 40+ results
From: Thomas Monjalon @ 2014-04-30 0:46 UTC (permalink / raw)
To: dev
The goal of this patch series is to be able to package DPDK
for RPM-based distributions.
The file naming currently doesn't allow installing different DPDK versions.
But the packaging naming should be ready to manage different DPDK versions
having different API/ABI for different applications:
- dpdk-core has full version in its name to manage API breaking
- extensions have a number as name suffix to manage PMD API breaking.
When the API/ABI is stable, package names could be simpler.
I suggest adding these .spec files as a starting point for integration
in Linux distributions.
Changes since v1:
- name of .spec file match package name
- version in package name
- no static library
- ldconfig/depmod in scriptlets
Thanks for your comments/reviews.
--
Thomas
^ permalink raw reply [relevance 4%]
* Re: [dpdk-dev] [PATCH 0/19] Separate compile time linkage between eal lib and pmd's
@ 2014-04-15 13:46 3% ` Neil Horman
0 siblings, 0 replies; 40+ results
From: Neil Horman @ 2014-04-15 13:46 UTC (permalink / raw)
To: Thomas Monjalon; +Cc: dev
On Tue, Apr 15, 2014 at 10:31:25AM +0200, Thomas Monjalon wrote:
> 2014-04-12 07:04, Neil Horman:
> > On Thu, Apr 10, 2014 at 04:47:07PM -0400, Neil Horman wrote:
> > > Disconnect compile time linkage between eal library / applications and
> > > pmd's
> > >
> > > I noticed that, while tinkering with dpdk, building for shared libraries
> > > still resulted in all the test applications linking to all the built
> > > pmd's, despite not actually needing them all. We are able to tell an
> > > application at run time (via the -d/--blacklist/--whitelist/--vdev
> > > options) which pmd's we want to use, and so have no need to link them at
> > > all. The only reason they get pulled in is because
> > > rte_eal_non_pci_init_etherdev and rte_pmd_init_all contain static lists
> > > to the individual pmd init functions. The result is that, even when
> > > building as DSO's, we have to load all the pmd libraries, which is space
> > > inefficient and defeating of some of the purpose of shared objects.
> > >
> > > To correct this, I developed this patch series, which introduces two new
> > > macros, PMD_INIT_NONPCI and PMD_INIT. These two macros use constructors
> > > to register their init routines at runtime, either prior to the execution
> > > of main() when linked statically, or when dlopen is called on a DSO at
> > > run time. The result is that PMD's can be loaded at run time without the
> > > application or eal library having to hold a reference to them. They work
> > > in a very similar fashion to the module_init routine in the linux
> > > kernel.
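For readers who haven't seen the trick, constructor-based registration works
roughly like the sketch below (the macro body and the registry call are
illustrative guesses, not the actual contents of the patch):

/* each PMD registers itself when its object is loaded: before main() for a
 * static link, or at dlopen() time for a DSO */
#define PMD_INIT(initfn)						\
static void __attribute__((constructor)) pmdinitfn_##initfn(void)	\
{									\
	register_pmd_init(initfn);	/* hypothetical registry call */	\
}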
> > >
> > > I've tested this feature using the igb and pcap pmd's, both statically and
> > > dynamically linked with the test and testpmd sample applications, and it
> > > seems to work well.
> > >
> > > Note, I encountered a few bugs along the way, which I fixed and noted in
> > > the series.
> > >
> > > Regards
> > > Neil
> >
> > Self NAK on this, based on the conversation Thomas and I had about Oliviers
> > patches from a while back, I'm going to rebase and repost these soon.
> > Neil
>
> I'll be glad to get your fixes soon. So I could apply them for version 1.6.0r2
> and release it.
> But I think you should post API changes (if any) in another series. Then we'll
> think if we want to push it in another branch for next major version.
>
I presume at this point you're fairly close to tagging
1.6.0r2, which, based on what I see in the git tree, is usually the last rc
before you merge to the next major version. Do you want to put this in now,
before that happens, or will you commit it to the first 1.7.0 rc? If the latter,
that seems like the best time to make ABI changes, so you maximize testing.
Neil
> Thanks Neil
> --
> Thomas
>
^ permalink raw reply [relevance 3%]
* Re: [dpdk-dev] DPDK API/ABI Stability
2014-04-09 21:08 4% ` Stephen Hemminger
@ 2014-04-10 10:54 7% ` Neil Horman
0 siblings, 0 replies; 40+ results
From: Neil Horman @ 2014-04-10 10:54 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev
On Wed, Apr 09, 2014 at 02:08:49PM -0700, Stephen Hemminger wrote:
> On Wed, 9 Apr 2014 14:39:52 -0400
> Neil Horman <nhorman@tuxdriver.com> wrote:
>
> > Hey all-
> > I was going to include this as an addendum to the packaging thread on
> > this list, but I can't seem to find it in my inbox, so forgive me starting a new
> > one.
> >
> > I wanted to broach the subject of ABI/API stability on the list here.
> > Given the recent great efforts to make dpdk packageable by distributions, I think
> > we probably need to discuss API stability in more depth and come up with a plan
> > to implement it. Has anyone started looking into this? If not, it seems to me
> > to be reasonable to start by placing a line in the sand with the functions
> > documented here:
> >
> > http://dpdk.org/doc/api/
> >
> > It seems to me we can start reviewing the API library by library, ensuring only
> > those functions are exported, making sure the data types are appropriate for
> > export, and marking them with a linker script to version them appropriately.
>
> To what level? source? binary, internal functions?
>
Well, I was thinking both (hence the API/ABI comment above), but at least API
stability as a start. Stabilizing internal functions doesn't make any sense to
me since, by definition those aren't exposed to users trying to make use of the
library.
> Some of the API's could be stabilized without much impact, but others, such
> as the device driver interface, are incomplete and freezing them would make
> life hard.
But the driver interface isn't listed in the API documentation above. Clearly
we'd need to address that eventually, but as a start it can likely be ignored;
at least we can give applications a modicum of stability.
Neil
^ permalink raw reply [relevance 7%]
* Re: [dpdk-dev] DPDK API/ABI Stability
2014-04-09 18:39 7% [dpdk-dev] DPDK API/ABI Stability Neil Horman
@ 2014-04-09 21:08 4% ` Stephen Hemminger
2014-04-10 10:54 7% ` Neil Horman
0 siblings, 1 reply; 40+ results
From: Stephen Hemminger @ 2014-04-09 21:08 UTC (permalink / raw)
To: Neil Horman; +Cc: dev
On Wed, 9 Apr 2014 14:39:52 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:
> Hey all-
> I was going to include this as an addendum to the packaging thread on
> this list, but I can't seem to find it in my inbox, so forgive me starting a new
> one.
>
> I wanted to broach the subject of ABI/API stability on the list here.
> Given the recent great efforts to make dpdk packageable by distributions, I think
> we probably need to discuss API stability in more depth and come up with a plan
> to implement it. Has anyone started looking into this? If not, it seems to me
> to be reasonable to start by placing a line in the sand with the functions
> documented here:
>
> http://dpdk.org/doc/api/
>
> It seems to me we can start reviewing the API library by library, ensuring only
> those functions are exported, making sure the data types are appropriate for
> export, and marking them with a linker script to version them appropriately.
To what level? source? binary, internal functions?
Some of the API's could be stabilized without much impact, but others, such
as the device driver interface, are incomplete and freezing them would make
life hard.
^ permalink raw reply [relevance 4%]
* [dpdk-dev] DPDK API/ABI Stability
@ 2014-04-09 18:39 7% Neil Horman
2014-04-09 21:08 4% ` Stephen Hemminger
0 siblings, 1 reply; 40+ results
From: Neil Horman @ 2014-04-09 18:39 UTC (permalink / raw)
To: dev
Hey all-
I was going to include this as an addendum to the packaging thread on
this list, but I can't seem to find it in my inbox, so forgive me starting a new
one.
I wanted to broach the subject of ABI/API stability on the list here.
Given the recent great efforts to make dpdk packageable by distributions, I think
we probably need to discuss API stability in more depth and come up with a plan
to implement it. Has anyone started looking into this? If not, it seems to me
to be reasonable to start by placing a line in the sand with the functions
documented here:
http://dpdk.org/doc/api/
It seems to me we can start reviewing the API library by library, ensuring only
those functions are exported, making sure the data types are appropriate for
export, and marking them with a linker script to version them appropriately.
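For illustration, a versioned symbol could be produced with a small map file
plus a .symver alias; everything below (names, version tag, map file) is made
up rather than an actual DPDK interface:

/* librte_foo.map:
 *	DPDK_1.7 { global: rte_foo_fn; local: *; };
 */
int rte_foo_fn_v17(int arg);

int rte_foo_fn_v17(int arg)
{
	return arg;
}
/* expose the implementation as the default version of the public name */
__asm__(".symver rte_foo_fn_v17, rte_foo_fn@@DPDK_1.7");

The object is then linked with something like
-Wl,--version-script=librte_foo.map, and a later DPDK_x.y implementation of
rte_foo_fn can be added alongside without breaking already-linked binaries.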
Thoughts?
Neil
^ permalink raw reply [relevance 7%]
* Re: [dpdk-dev] [PATCH 03/16] pkg: add recipe for RPM
2014-04-02 9:53 3% ` Thomas Monjalon
@ 2014-04-02 11:29 0% ` Neil Horman
0 siblings, 0 replies; 40+ results
From: Neil Horman @ 2014-04-02 11:29 UTC (permalink / raw)
To: Thomas Monjalon; +Cc: dev
On Wed, Apr 02, 2014 at 11:53:51AM +0200, Thomas Monjalon wrote:
> Hello,
>
> Sorry for the long delay.
>
> 2014-02-26 14:07, Thomas Graf:
> > On 02/04/2014 04:54 PM, Thomas Monjalon wrote:
> > > +Version: 1.5.2r1
> > > +Release: 1
> >
> > What kind of upgrade strategy do you have in mind?
> >
> > I'm raising this because Fedora and other distributions will require
> > a unique package name for every version of the package that is not
> > backwards compatible.
> >
> > Typically libraries provide backwards compatibility within a major release,
> > i.e. all 1.x.x releases would be compatible. I realize that this might
> > not be applicable yet but maybe 1.5.x?
> >
> > Depending on the versioning schema the name would be dpdk15, dpdk16, ...
> > or dpdk152, dpdk153, ...
>
> We are working on this but at the moment there is no restriction on API/ABI
> breakage. So I think it's too early to define such a rule.
>
Now that you have DSO builds in place, there's no reason not to take the extra
step of versioning your API's, making backwards compatibility fairly
straightforward. Monolithic builds are still somewhat problematic regarding API
stability, but you could certainly offer stability in the DSOs.
> > > +BuildRequires: kernel-devel, kernel-headers, doxygen
> >
> > Is a python environment required as well?
>
> Python is only needed to run some tools on the target. But it is optional.
> Do you think it should be written somewhere?
>
> > > +%description
> > > +Dummy main package. Make only subpackages.
> >
> > I would just call the main package "libdpdk152" so you don't have to
> > repeat the encoding versioning in all the subpackages.
> >
> > > +
> > > +%package core-runtime
> >
> > What about calling it just "libdpdk"?
>
The version should be left out of the library name, whatever you do.
Packaging can be responsible for versioning.
> In this case, it should be libdpdk-core in order to distinguish it from dpdk
> extensions. But the name of the project is dpdk so it seems simpler to call it
> dpdk-core.
> Is the "lib" prefix mandatory for libraries?
>
Not strictly, but IIRC if you don't add the lib, the linker won't find it with
the -l option, so you'll want to add it.
> > > +%files core-runtime
> > > +%dir %{datadir}
> > > +%{datadir}/config
> > > +%{datadir}/tools
> > > +%{moddir}/*
> > > +%{_sbindir}/*
> > > +%{_bindir}/*
> > > +%{_libdir}/*.so
> >
> > This brings up the question of multiple parallel DPDK installations.
> > A specific application linking to library version X will also require
> > tools of version X, right? A second application linking against version
> > Y will require tools version Y. Right now, these could not be installed
> > in parallel. Any chance we can make the runtime version independent?
>
> Are you thinking about installing different major versions? In my
> understanding, we cannot install 2 different minor versions of a package.
> As long as there is no stable API, there is no major versions defined.
> So don't you think we should speak about it later?
>
If the versioning is done properly (i.e. shared libraries get version ids
attached to the library files), you can install as many library
versions as you like. You can only install a single -devel package, since it
links lib<name>.so to a specific version.
> > Same applies to header files. A good option here would be to install
> > them to /usr/include/libdpdk{version}/ and have a dpdk-1.5.2.pc which
> > provides Cflags: -I${includedir}/libdpdk${version}
>
> Yes same applies :)
> I agree that a .pc file would be a good idea. But we also must allow to build
> with the DPDK framework.
>
> > You'll also need for all packages and subpackages installing shared
> > libraries:
> >
> > %post -p /sbin/ldconfig
> > %postun -p /sbin/ldconfig
>
> OK
>
> Thanks for the review
> --
> Thomas
>
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] [PATCH 03/16] pkg: add recipe for RPM
@ 2014-04-02 9:53 3% ` Thomas Monjalon
2014-04-02 11:29 0% ` Neil Horman
0 siblings, 1 reply; 40+ results
From: Thomas Monjalon @ 2014-04-02 9:53 UTC (permalink / raw)
To: Thomas Graf; +Cc: dev
Hello,
Sorry for the long delay.
2014-02-26 14:07, Thomas Graf:
> On 02/04/2014 04:54 PM, Thomas Monjalon wrote:
> > +Version: 1.5.2r1
> > +Release: 1
>
> What kind of upgrade strategy do you have in mind?
>
> I'm raising this because Fedora and other distributions will require
> a unique package name for every version of the package that is not
> backwards compatible.
>
> Typically libraries provide backwards compatibility within a major release,
> i.e. all 1.x.x releases would be compatible. I realize that this might
> not be applicable yet but maybe 1.5.x?
>
> Depending on the versioning schema the name would be dpdk15, dpdk16, ...
> or dpdk152, dpdk153, ...
We are working on this but at the moment there is no restriction on API/ABI
breakage. So I think it's too early to define such a rule.
> > +BuildRequires: kernel-devel, kernel-headers, doxygen
>
> Is a python environment required as well?
Python is only needed to run some tools on the target. But it is optional.
Do you think it should be written somewhere?
> > +%description
> > +Dummy main package. Make only subpackages.
>
> I would just call the main package "libdpdk152" so you don't have to
> repeat encoding the version in all the subpackages.
>
> > +
> > +%package core-runtime
>
> What about calling it just "libdpdk"?
In this case, it should be libdpdk-core in order to distinguish it from dpdk
extensions. But the name of the project is dpdk so it seems simpler to call it
dpdk-core.
Is the "lib" prefix mandatory for libraries?
> > +%files core-runtime
> > +%dir %{datadir}
> > +%{datadir}/config
> > +%{datadir}/tools
> > +%{moddir}/*
> > +%{_sbindir}/*
> > +%{_bindir}/*
> > +%{_libdir}/*.so
>
> This brings up the question of multiple parallel DPDK installations.
> A specific application linking to library version X will also require
> tools of version X, right? A second application linking against version
> Y will require tools version Y. Right now, these could not be installed
> in parallel. Any chance we can make the runtime version independent?
Are you thinking about installing different major versions? In my
understanding, we cannot install 2 different minor versions of a package.
As long as there is no stable API, there are no major versions defined.
So don't you think we should speak about it later?
> Same applies to header files. A good option here would be to install
> them to /usr/include/libdpdk{version}/ and have a dpdk-1.5.2.pc which
> provides Cflags: -I${includedir}/libdpdk${version}
Yes same applies :)
I agree that a .pc file would be a good idea. But we must also allow building
with the DPDK framework.
> You'll also need for all packages and subpackages installing shared
> libraries:
>
> %post -p /sbin/ldconfig
> %postun -p /sbin/ldconfig
OK
Thanks for the review
--
Thomas
^ permalink raw reply [relevance 3%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
2014-01-29 20:47 0% ` François-Frédéric Ozog
2014-01-29 23:15 3% ` Thomas Graf
@ 2014-03-13 7:37 0% ` David Nyström
1 sibling, 0 replies; 40+ results
From: David Nyström @ 2014-03-13 7:37 UTC (permalink / raw)
To: François-Frédéric Ozog, 'Thomas Graf',
'Vincent JARDIN'
Cc: dev, dev, dpdk-ovs
On 2014-01-29 21:47, François-Frédéric Ozog wrote:
>>> First and easy answer: it is open source, so anyone can recompile. So,
>>> what's the issue?
>>
>> I'm talking from a pure distribution perspective here: Requiring to
>> recompile all DPDK based applications to distribute a bugfix or to add
>> support for a new PMD is not ideal.
>
>>
>> So ideally OVS would have the possibility to link against the shared
>> library long term.
>
> I agree that distribution of DPDK apps is not covered properly at present.
> Identifying the proper scheme requires a specific analysis based on the
> constraints of the Telecom/Cloud/Networking markets.
>
> In the telecom world, if you fix the underlying framework of an app, you
> will still have to validate the solution, ie app/framework. In addition, the
> idea of shared libraries introduces the implied requirement to validate apps
> against diverse versions of DPDK shared libraries. This translates into
> development and support costs.
>
> I also expect many DPDK applications to tackle core networking features,
> with sub-microsecond packet handling delays and even lower than 200ns
> (NAT64...). The lazy binding based on ELF PLT represents quite a cost, not
> to mention that optimization stops at shared library boundaries (gcc
> whole program optimization can be very effective...). Microsoft DLL linkage
> is an order of magnitude faster. If Linux were to provide that, I would
> probably revise my judgment. (I haven't checked Linux dynamic linking
> implementation for some time so my understanding of Linux dynamic linking
> may be outdated).
>
>
>>
>>> I get lost: do you mean ABI + API toward the PMDs or towards the
>>> applications using the librte ?
>>
>> Towards the PMDs is more straight forward at first so it seems logical to
>> focus on that first.
>
> I don't think it is so straightforward. Many recent cards such as Chelsio
> and Myricom have a very different "packet memory layout" that does not fit
> so easily into the current DPDK architecture.
>
> 1) "traditional" architecture: the driver reserves X buffers and provide the
> card with descriptors of those buffers. Each packet is DMA'ed into exactly
> one buffer. Typically you have 2K buffers, a 64 byte packet consumes exactly
> one buffer
>
> 2) "alternative" new architecture: the driver reserves a memory zone, say
> 4MB, without any structure, and provides a single zone description and a
> ring buffer to the card (there are no individual buffer descriptors any more).
> The card fills the memory zone with packets, one next to the other and
> specifies where the packets are by updating the supplied ring. Among the
> many issues in fitting this scheme into DPDK, you cannot free a single mbuf:
> you have to maintain a ref count to the memory zone so that, when all mbufs
> have been "released", the memory zone can be freed.
> That's quite a stretch from the current paradigm.
>
> Apart from this aspect, managing RSS is too tied to Intel's flow director
> concepts and cannot directly accommodate smarter or dumber RSS mechanisms.
>
> That said, I fully agree PMD API should be revisited.
Hi,
Sorry for jumping in late.
Perhaps you are already aware of OpenDataPlane, which can use DPDK as
its south bound NIC interface.
>
> Cordially,
>
> François-Frédéric
>
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
2014-01-29 20:47 0% ` François-Frédéric Ozog
@ 2014-01-29 23:15 3% ` Thomas Graf
2014-03-13 7:37 0% ` David Nyström
1 sibling, 0 replies; 40+ results
From: Thomas Graf @ 2014-01-29 23:15 UTC (permalink / raw)
To: François-Frédéric Ozog, 'Vincent JARDIN'
Cc: dev, dev, 'Gerald Rogers', dpdk-ovs
On 01/29/2014 09:47 PM, François-Frédéric Ozog wrote:
> In the telecom world, if you fix the underlying framework of an app, you
> will still have to validate the solution, ie app/framework. In addition, the
> idea of shared libraries introduces the implied requirement to validate apps
> against diverse versions of DPDK shared libraries. This translates into
> development and support costs.
>
> I also expect many DPDK applications to tackle core networking features,
> with sub-microsecond packet handling delays and even lower than 200ns
> (NAT64...). The lazy binding based on ELF PLT represents quite a cost, not
> to mention that optimization stops at shared library boundaries (gcc
> whole program optimization can be very effective...). Microsoft DLL linkage
> is an order of magnitude faster. If Linux were to provide that, I would
> probably revise my judgment. (I haven't checked Linux dynamic linking
> implementation for some time so my understanding of Linux dynamic linking
> may be outdated).
All very valid points and I am not suggesting to stop offering the
static linking option in any way. Dynamic linking will by design result
in more cycles. My sole point is that for a core platform component
like OVS, the shared library benefits _might_ outweigh the performance
difference. In order for a shared library to be effective, some form of
ABI compatibility must be guaranteed though.
> I don't think it is so straightforward. Many recent cards such as Chelsio
> and Myricom have a very different "packet memory layout" that does not fit
> so easily into the current DPDK architecture.
>
> 1) "traditional" architecture: the driver reserves X buffers and provide the
> card with descriptors of those buffers. Each packet is DMA'ed into exactly
> one buffer. Typically you have 2K buffers, a 64 byte packet consumes exactly
> one buffer
>
> 2) "alternative" new architecture: the driver reserves a memory zone, say
> 4MB, without any structure, and provide a a single zone description and a
> ring buffer to the card. (there no individual buffer descriptors any more).
> The card fills the memory zone with packets, one next to the other and
> specifies where the packets are by updating the supplied ring. Out of the
> many issues fitting this scheme into DPDK, you cannot free a single mbuf:
> you have to maintain a ref count to the memory zone so that, when all mbufs
> have been "released", the memory zone can be freed.
> That's quite a stretch from actual paradigm.
>
> Apart from this aspect, managing RSS is two tied to Intel's flow director
> concepts and cannot accommodate directly smarter or dumber RSS mechanisms.
>
> That said, I fully agree PMD API should be revisited.
Fair enough. I don't see a reason why multiple interfaces could not
coexist in order to support multiple memory layouts. What I'm hearing
so far is that while there is no objection to bringing stability to the
APIs, it should not result in performance side effects, and it is still too
early to nail down the still-evolving APIs.
^ permalink raw reply [relevance 3%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
2014-01-29 17:14 3% ` Thomas Graf
2014-01-29 18:42 4% ` Stephen Hemminger
@ 2014-01-29 20:47 0% ` François-Frédéric Ozog
2014-01-29 23:15 3% ` Thomas Graf
2014-03-13 7:37 0% ` David Nyström
1 sibling, 2 replies; 40+ results
From: François-Frédéric Ozog @ 2014-01-29 20:47 UTC (permalink / raw)
To: 'Thomas Graf', 'Vincent JARDIN'
Cc: dev, dev, 'Gerald Rogers', dpdk-ovs
> > First and easy answer: it is open source, so anyone can recompile. So,
> > what's the issue?
>
> I'm talking from a pure distribution perspective here: requiring a
> recompile of all DPDK-based applications to distribute a bugfix or to add
> support for a new PMD is not ideal.
>
> So ideally OVS would have the possibility to link against the shared
> library long term.
I agree that distribution of DPDK apps is not covered properly at present.
Identifying the proper scheme requires a specific analysis based on the
constraints of the Telecom/Cloud/Networking markets.
In the telecom world, if you fix the underlying framework of an app, you
will still have to validate the solution, i.e. app/framework. In addition, the
idea of shared libraries introduces the implied requirement to validate apps
against diverse versions of DPDK shared libraries. This translates into
development and support costs.
I also expect many DPDK applications to tackle core networking features,
with sub-microsecond packet handling delays and even lower than 200ns
(NAT64...). The lazy binding based on the ELF PLT represents quite a cost, not
to mention that optimization stops at shared library boundaries (gcc
whole-program optimization can be very effective...). Microsoft DLL linkage
is an order of magnitude faster. If Linux were to provide that, I would
probably revise my judgment. (I haven't checked the Linux dynamic linking
implementation for some time, so my understanding of Linux dynamic linking
may be outdated.)
>
> > I get lost: do you mean ABI + API toward the PMDs or towards the
> > applications using the librte ?
>
> Towards the PMDs is more straightforward at first, so it seems logical to
> focus on that first.
I don't think it is so straightforward. Many recent cards such as Chelsio
and Myricom have a very different "packet memory layout" that does not fit
so easily into the current DPDK architecture.
1) "traditional" architecture: the driver reserves X buffers and provide the
card with descriptors of those buffers. Each packet is DMA'ed into exactly
one buffer. Typically you have 2K buffers, a 64 byte packet consumes exactly
one buffer
2) "alternative" new architecture: the driver reserves a memory zone, say
4MB, without any structure, and provide a a single zone description and a
ring buffer to the card. (there no individual buffer descriptors any more).
The card fills the memory zone with packets, one next to the other and
specifies where the packets are by updating the supplied ring. Out of the
many issues fitting this scheme into DPDK, you cannot free a single mbuf:
you have to maintain a ref count to the memory zone so that, when all mbufs
have been "released", the memory zone can be freed.
That's quite a stretch from actual paradigm.
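As a rough illustration of that stretch, a minimal sketch of the reference
counting such a zone-based layout implies before the zone can be recycled;
the types and the repost hook are hypothetical, not existing DPDK or driver API:

#include <stdint.h>
#include <stddef.h>
#include <stdatomic.h>

/* Hypothetical zone descriptor: one large DMA region shared by many packets. */
struct pkt_zone {
	void        *base;    /* start of the (e.g. 4MB) region given to the NIC */
	size_t       len;     /* size of the region */
	atomic_uint  refcnt;  /* packets currently carved out of this zone */
};

/* Hypothetical per-packet handle pointing into a zone. */
struct pkt {
	struct pkt_zone *zone;
	void            *data;
	uint16_t         data_len;
};

/* The driver attaches each received packet to the zone it was DMA'ed into. */
static inline void pkt_attach(struct pkt *p, struct pkt_zone *z,
			      void *data, uint16_t len)
{
	atomic_fetch_add(&z->refcnt, 1);
	p->zone = z;
	p->data = data;
	p->data_len = len;
}

/* "Freeing" one packet only drops a reference; the zone itself can be
 * reused or re-posted to the NIC only once the last packet is released. */
static inline void pkt_free(struct pkt *p)
{
	struct pkt_zone *z = p->zone;

	if (atomic_fetch_sub(&z->refcnt, 1) == 1) {
		/* last reference gone: zone_repost_to_nic(z); (hypothetical hook) */
	}
}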
Apart from this aspect, managing RSS is too tied to Intel's flow director
concepts and cannot directly accommodate smarter or dumber RSS mechanisms.
That said, I fully agree PMD API should be revisited.
Cordially,
François-Frédéric
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
2014-01-29 17:14 3% ` Thomas Graf
@ 2014-01-29 18:42 4% ` Stephen Hemminger
2014-01-29 20:47 0% ` François-Frédéric Ozog
1 sibling, 0 replies; 40+ results
From: Stephen Hemminger @ 2014-01-29 18:42 UTC (permalink / raw)
To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
On Wed, 29 Jan 2014 18:14:01 +0100
Thomas Graf <tgraf@redhat.com> wrote:
> On 01/29/2014 05:34 PM, Vincent JARDIN wrote:
> > Thomas,
> >
> > First and easy answer: it is open source, so anyone can recompile. So,
> > what's the issue?
>
> I'm talking from a pure distribution perspective here: requiring a
> recompile of all DPDK-based applications to distribute a bugfix or to
> add support for a new PMD is not ideal.
>
> So ideally OVS would have the possibility to link against the shared
> library long term.
>
> > I get lost: do you mean ABI + API toward the PMDs or towards the
> > applications using the librte ?
>
> Towards the PMDs is more straightforward at first, so it seems logical
> to focus on that first.
>
> A stable API and ABI for librte seems required as well long term as
> DPDK does offer shared libraries but I realize that this is a stretch
> goal in the initial phase.
I would hate to see the API/ABI nailed down. We have lots of bug fixes
and new drivers that are ready to contribute, but most of them have some
change to the existing ABI.
^ permalink raw reply [relevance 4%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
2014-01-29 16:34 3% ` Vincent JARDIN
@ 2014-01-29 17:14 3% ` Thomas Graf
2014-01-29 18:42 4% ` Stephen Hemminger
2014-01-29 20:47 0% ` François-Frédéric Ozog
0 siblings, 2 replies; 40+ results
From: Thomas Graf @ 2014-01-29 17:14 UTC (permalink / raw)
To: Vincent JARDIN; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
On 01/29/2014 05:34 PM, Vincent JARDIN wrote:
> Thomas,
>
> First and easy answer: it is open source, so anyone can recompile. So,
> what's the issue?
I'm talking from a pure distribution perspective here: requiring a
recompile of all DPDK-based applications to distribute a bugfix or to
add support for a new PMD is not ideal.
So ideally OVS would have the possibility to link against the shared
library long term.
> I get lost: do you mean ABI + API toward the PMDs or towards the
> applications using the librte ?
Towards the PMDs is more straightforward at first, so it seems logical
to focus on that first.
A stable API and ABI for librte seems required as well long term as
DPDK does offer shared libraries but I realize that this is a stretch
goal in the initial phase.
^ permalink raw reply [relevance 3%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
2014-01-29 11:14 4% ` Thomas Graf
@ 2014-01-29 16:34 3% ` Vincent JARDIN
2014-01-29 17:14 3% ` Thomas Graf
0 siblings, 1 reply; 40+ results
From: Vincent JARDIN @ 2014-01-29 16:34 UTC (permalink / raw)
To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
Thomas,
First and easy answer: it is open source, so anyone can recompile. So,
what's the issue?
> Without a concept of stable interfaces, it will be difficult to
> package and distribute RTE libraries, PMD, and DPDK applications. Right
> now, the obvious path would include packaging the PMD bits together
> with each DPDK application depending on the version of DPDK the binary
> was compiled against. This is clearly not ideal.
>
>> I agree that some areas could be improved since they are not in the
>> critical datapath of packets, but still other areas remain very CPU
>> constrained. For instance:
>> http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
>>
>> is bad:
>> struct eth_dev_ops
>> is churned, no comment, and a #ifdef that changes the structure
>> according to compilation!
>
> This is a very good example as it outlines the difference between
> control structures and the fast path. We have this same exact trade off
> in the kernel a lot where we have highly optimized internal APIs
> towards modules and drivers but want to provide binary compatibility to
> a certain extent.
As long as we agree on this limited scope, we'll think about it and
provide a proposal on dev@dpdk.org mailing list.
> As for the specific example you mention, it is relatively trivial to
> make eth_dev_ops backwards compatible by appending appropriate padding
> to the struct before a new major release and ensuring that new members
> are added by replacing the padding accordingly. Obviously no ifdefs
> would be allowed anymore.
Of course, it is basic C!
>> Should an application use the librte libraries of the DPDK:
>> - you can use RTE_VERSION and RTE_VERSION_NUM :
>> http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
>
> Right. This would be more or less identical to requiring a specific
> DPDK version in OVS_CHECK_DPDK. It's not ideal to require applications to
> clutter their code with #ifdefs all over for every new minor release
> though.
>
>> - you can write your own wrapper (with CPU overhead) in order to have
>> a stable ABI; that wrapper should be tied to the versions of the librte
>> => the overhead is part of your application instead of the DPDK,
>> - *otherwise recompile your software, it is opensource, what's the
>> issue?*
>>
>> We are open to any suggestion to have a stable ABI, but it should never
>> remove the option of fast/efficient compilation/CPU execution
>> processing.
>
> Absolutely agreed. We also don't want to add tons of abstraction and
> overcomplicate everything. Still, I strongly believe that the definition
> of stable interfaces towards applications and especially PMD is
> essential.
>
> I'm not proposing to standardize all the APIs towards applications on
> the level of POSIX. DPDK is in early stages and disruptive changes will
> come along. What I would propose on an abstract level is:
>
> 1. Extend but not break API between minor releases. Postpone API
> breakages to the next major release. High cadence of major
> releases initially, lower cadence as DPDK matures.
>
> 2. Define ABI stability towards PMD for minor releases to allow
> isolated packaging of PMD by padding control structures and keeping
> functions ABI stable.
I get lost: do you mean ABI + API toward the PMDs or towards the
applications using the librte ?
Best regards,
Vincent
^ permalink raw reply [relevance 3%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
2014-01-29 10:26 5% ` Vincent JARDIN
@ 2014-01-29 11:14 4% ` Thomas Graf
2014-01-29 16:34 3% ` Vincent JARDIN
0 siblings, 1 reply; 40+ results
From: Thomas Graf @ 2014-01-29 11:14 UTC (permalink / raw)
To: Vincent JARDIN; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
Vincent,
On 01/29/2014 11:26 AM, Vincent JARDIN wrote:
> DPDK's ABIs are not Kernel's ABIs, they are not POSIX, there is no
> standard. Currently, there is no such plan to have a stable ABI since we
> need to keep freedom to chase CPU cycles over having a stable ABI. For
> instance, some applications on top of the DPDK process the packets in
> less than 150 CPU cycles (have a look at testpmd:
> http://dpdk.org/browse/dpdk/tree/app/test-pmd )
I understand the requirement to not introduce overhead with wrappers
or shim layers. No problem with that. I believe this is mainly a policy
and release process issue.
Without a concept of stable interfaces, it will be difficult to
package and distribute RTE libraries, PMD, and DPDK applications. Right
now, the obvious path would include packaging the PMD bits together
with each DPDK application depending on the version of DPDK the binary
was compiled against. This is clearly not ideal.
> I agree that some areas could be improved since they are not in the
> critical datapath of packets, but still other areas remain very CPU
> constrained. For instance:
> http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
>
> is bad:
> struct eth_dev_ops
> is churned, no comment, and a #ifdef that changes the structure
> according to compilation!
This is a very good example as it outlines the difference between
control structures and the fast path. We have this same exact trade off
in the kernel a lot where we have highly optimized internal APIs
towards modules and drivers but want to provide binary compatibility to
a certain extent.
As for the specific example you mention, it is relatively trivial to
make eth_dev_ops backwards compatible by appending appropriate padding
to the struct before a new major release and ensuring that new members
are added by replacing the padding accordingly. Obviously no ifdefs
would be allowed anymore.
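A minimal sketch of that padding idea (an illustrative struct, not the real
eth_dev_ops definition):

struct example_dev_ops {
	int  (*dev_configure)(void *dev);
	int  (*dev_start)(void *dev);
	void (*dev_stop)(void *dev);
	/* ... existing callbacks ... */

	/* Reserved slots appended before the major release. A later minor
	 * release may turn one slot into a real callback: the struct size and
	 * the offsets of every existing member stay the same, so PMDs and
	 * applications built against the older header keep working. */
	void (*reserved[8])(void);
};

The obvious cost is that the number of reserved slots has to be guessed up
front for the whole stable series.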
> Should an application use the librte libraries of the DPDK:
> - you can use RTE_VERSION and RTE_VERSION_NUM :
> http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
Right. This would be more or less identical to requiring a specific
DPDK version in OVS_CHECK_DPDK. It's not ideal to require applications to
clutter their code with #ifdefs all over for every new minor release
though.
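For reference, the clutter being referred to looks roughly like this; the
version numbers below are arbitrary examples, not a real compatibility
boundary:

#include <rte_version.h>

#if RTE_VERSION >= RTE_VERSION_NUM(1, 6, 0, 0)
#define HAVE_NEWER_ETHDEV_API 1   /* pick the newer code path */
#else
#define HAVE_NEWER_ETHDEV_API 0   /* fall back to the older one */
#endif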
> - you can write your own wrapper (with CPU overhead) in order to have
> a stable ABI; that wrapper should be tied to the versions of the librte
> => the overhead is part of your application instead of the DPDK,
> - *otherwise recompile your software, it is opensource, what's the
> issue?*
>
> We are open to any suggestion to have a stable ABI, but it should never
> remove the option of fast/efficient compilation/CPU execution
> processing.
Absolutely agreed. We also don't want to add tons of abstraction and
overcomplicate everything. Still, I strongly believe that the definition
of stable interfaces towards applications and especially PMD is
essential.
I'm not proposing to standardize all the APIs towards applications on
the level of POSIX. DPDK is in early stages and disruptive changes will
come along. What I would propose on an abstract level is:
1. Extend but not break API between minor releases. Postpone API
breakages to the next major release. High cadence of major
releases initially, lower cadence as DPDK matures.
2. Define ABI stability towards PMD for minor releases to allow
isolated packaging of PMD by padding control structures and keeping
functions ABI stable.
I realize that this might be less trivial than it seems without
sacrificing performance but I consider it effort well spent.
Thomas
^ permalink raw reply [relevance 4%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
2014-01-29 8:15 3% ` Thomas Graf
@ 2014-01-29 10:26 5% ` Vincent JARDIN
2014-01-29 11:14 4% ` Thomas Graf
0 siblings, 1 reply; 40+ results
From: Vincent JARDIN @ 2014-01-29 10:26 UTC (permalink / raw)
To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
Hi Thomas,
On 29/01/2014 09:15, Thomas Graf wrote:
> The obvious and usual best practice would be for DPDK to guarantee
> ABI stability between minor releases.
>
> Since dpdk-dev is copied as well, any comments?
DPDK's ABIs are not Kernel's ABIs, they are not POSIX, there is no
standard. Currently, there is no such plan to have a stable ABI since we
need to keep freedom to chase CPU cycles over having a stable ABI. For
instance, some applications on top of the DPDK process the packets in
less than 150 CPU cycles (have a look at testpmd:
http://dpdk.org/browse/dpdk/tree/app/test-pmd )
I agree that some areas could be improved since they are not in the
critical datapath of packets, but still other areas remain very CPU
constrained. For instance:
http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
is bad:
struct eth_dev_ops
is churned, no comment, and a #ifdef that changes the structure
according to compilation!
Should an application use the librte libraries of the DPDK:
- you can use RTE_VERSION and RTE_VERSION_NUM :
http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
- you can write your own wrapper (with CPU overhead) in order to have
a stable ABI; that wrapper should be tied to the versions of the librte
=> the overhead is part of your application instead of the DPDK,
- *otherwise recompile your software, it is opensource, what's the
issue?*
We are open to any suggestion to have a stable ABI, but it should never
remove the option of fast/efficient compilation/CPU execution
processing.
Best regards,
Vincent
^ permalink raw reply [relevance 5%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
2014-01-28 18:17 0% ` [dpdk-dev] [ovs-dev] " Pravin Shelar
@ 2014-01-29 8:15 3% ` Thomas Graf
2014-01-29 10:26 5% ` Vincent JARDIN
0 siblings, 1 reply; 40+ results
From: Thomas Graf @ 2014-01-29 8:15 UTC (permalink / raw)
To: Pravin Shelar; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
On 01/28/2014 07:17 PM, Pravin Shelar wrote:
> Right, version mismatch will not work. The APIs provided by DPDK are not
> stable, so OVS has to be built for different releases for now.
>
> I do not see how we can fix it from the OVS side. DPDK needs to
> standardize its API. Actually, OVS also needs more APIs, like DPDK
> initialization, mempool destroy, etc.
Agreed. It's not fixable from the OVS side. I also don't want to
object to including this. I'm just raising awareness of the issue
as this will become essential for distribution.
The obvious and usual best practice would be for DPDK to guarantee
ABI stability between minor releases.
Since dpdk-dev is copied as well, any comments?
^ permalink raw reply [relevance 3%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
[not found] ` <52E7D13B.9020404@redhat.com>
@ 2014-01-28 18:17 0% ` Pravin Shelar
2014-01-29 8:15 3% ` Thomas Graf
0 siblings, 1 reply; 40+ results
From: Pravin Shelar @ 2014-01-28 18:17 UTC (permalink / raw)
To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
On Tue, Jan 28, 2014 at 7:48 AM, Thomas Graf <tgraf@redhat.com> wrote:
> On 01/28/2014 02:48 AM, pshelar@nicira.com wrote:
>>
>> From: Pravin B Shelar <pshelar@nicira.com>
>>
>> The following patch adds a DPDK netdev-class to the userspace datapath.
>> The approach taken in this patch differs from Intel® DPDK vSwitch,
>> where DPDK datapath switching is done in a separate process. This
>> patch adds support for a DPDK-type port and uses the OVS userspace
>> datapath for switching. Therefore all DPDK processing and flow
>> miss handling is done in a single process. This also avoids code
>> duplication by reusing the OVS userspace datapath switching, and
>> therefore it supports all flow matching and actions that the
>> user-space datapath supports. Refer to the INSTALL.DPDK doc for
>> further info.
>>
>> With this patch I got similar performance for netperf TCP_STREAM
>> tests compared to the kernel datapath.
>
>
> I'm happy to see this happen!
>
>
>
>> +static const struct rte_eth_conf port_conf = {
>> +    .rxmode = {
>> +        .mq_mode = ETH_MQ_RX_RSS,
>> +        .split_hdr_size = 0,
>> +        .header_split = 0, /* Header Split disabled */
>> +        .hw_ip_checksum = 0, /* IP checksum offload disabled */
>> +        .hw_vlan_filter = 0, /* VLAN filtering disabled */
>> +        .jumbo_frame = 0, /* Jumbo Frame Support disabled */
>> +        .hw_strip_crc = 0, /* CRC not stripped by hardware */
>> +    },
>> +    .rx_adv_conf = {
>> +        .rss_conf = {
>> +            .rss_key = NULL,
>> +            .rss_hf = ETH_RSS_IPV4_TCP | ETH_RSS_IPV4 | ETH_RSS_IPV6,
>> +        },
>> +    },
>> +    .txmode = {
>> +        .mq_mode = ETH_MQ_TX_NONE,
>> +    },
>> +};
>
>
> I realize this is an RFC patch but I will ask anyway:
>
> What are the plans for managing runtime dependencies between a DPDK-enabled OVS
> and DPDK itself? Will an OVS built against DPDK 1.5.2 work with
> drivers written for 1.5.3?
>
> Based on the above use of struct rte_eth_conf it would seem that once
> released, rte_eth_conf cannot be extended anymore without breaking
> ABI compatibility. The same applies to many of the other user
> structures. I see various structure changes between minor releases,
> for example dpdk.org ed2c69c3ef7 between 1.5.1 and 1.5.2.
>
Right, version mismatch will not work. The APIs provided by DPDK are not
stable, so OVS has to be built for different releases for now.
I do not see how we can fix it from the OVS side. DPDK needs to
standardize its API. Actually, OVS also needs more APIs, like DPDK
initialization, mempool destroy, etc.
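A toy illustration of the breakage being described (made-up structs, not the
actual DPDK definitions): inserting a field into a public structure changes
its size and the offsets behind it, so a binary built against the old layout
reads and writes the wrong bytes when run against the new library.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Layout an application was compiled against (release N). */
struct conf_v1 {
	uint32_t mq_mode;
	uint32_t jumbo_frame;
};

/* Layout after a field is inserted in release N+1. */
struct conf_v2 {
	uint32_t mq_mode;
	uint32_t split_hdr_size;  /* new field inserted in the middle */
	uint32_t jumbo_frame;
};

int main(void)
{
	printf("v1: size=%zu, jumbo_frame at offset %zu\n",
	       sizeof(struct conf_v1), offsetof(struct conf_v1, jumbo_frame));
	printf("v2: size=%zu, jumbo_frame at offset %zu\n",
	       sizeof(struct conf_v2), offsetof(struct conf_v2, jumbo_frame));
	return 0;
}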
^ permalink raw reply [relevance 0%]
* Re: [dpdk-dev] [PATCH 4/7] eal: support different modules
@ 2013-06-03 17:25 3% ` Stephen Hemminger
0 siblings, 0 replies; 40+ results
From: Stephen Hemminger @ 2013-06-03 17:25 UTC (permalink / raw)
To: Thomas Monjalon; +Cc: dev
On Mon, 3 Jun 2013 18:29:02 +0200
Thomas Monjalon <thomas.monjalon@6wind.com> wrote:
> 03/06/2013 18:08, Antti Kantee :
> > On 03.06.2013 10:58, Damien Millescamps wrote:
> > >> -/** Device needs igb_uio kernel module */
> > >> -#define RTE_PCI_DRV_NEED_IGB_UIO 0x0001
> > >>
> > >> /** Device driver must be registered several times until failure */
> > >>
> > >> -#define RTE_PCI_DRV_MULTIPLE 0x0002
> > >> +#define RTE_PCI_DRV_MULTIPLE 0x0001
> > >
> > > You are breaking a public API here, and I don't see any technical reason
> > > to do so. The RTE_PCI_DRV_NEED_IGB_UIO flag could be deprecated, but
> > > there is no way its value could be recycled into an already existing
> > > flag.
> >
> > Is breaking the API a bad thing in this context? IMHO the
> > initialization APIs need work before they're general enough and
> > perpetually supporting the current ones seems like an unnecessary
> > burden. I'm trying to understand the general guidelines of the project.
> >
> > (and nittily, recycling flag values is fine for static-only libs as long
> > as you remove the old macro, but of course removal is the API breakage
> > you mentioned)
>
> Yes, DPDK is a young project, but breaking the API should always be justified.
> In this case it is not mandatory to change it.
>
This is a source project, there is no fixed ABI.
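For comparison, a sketch of a deprecation-friendly layout for the flags
discussed above; the values and the new flag name are illustrative only,
not a proposed patch:

/* Keep the old macro (and its bit) so already-compiled users keep their
 * meaning; mark it deprecated rather than recycling the value. */
#define RTE_PCI_DRV_NEED_IGB_UIO  0x0001  /* deprecated, bit stays reserved */
#define RTE_PCI_DRV_MULTIPLE      0x0002  /* value unchanged */
/* New flags take fresh bits instead of reusing retired ones. */
#define RTE_PCI_DRV_EXAMPLE_NEW   0x0004  /* hypothetical future flag */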
^ permalink raw reply [relevance 3%]
-- links below jump to the message on this page --
2013-05-30 17:12 [dpdk-dev] [PATCH 0/7] Vyatta patches Stephen Hemminger
2013-06-03 8:58 ` [dpdk-dev] [PATCH 4/7] eal: support different modules Damien Millescamps
2013-06-03 16:08 ` Antti Kantee
2013-06-03 16:29 ` Thomas Monjalon
2013-06-03 17:25 3% ` Stephen Hemminger
2014-01-28 1:48 [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports pshelar
[not found] ` <52E7D13B.9020404@redhat.com>
2014-01-28 18:17 0% ` [dpdk-dev] [ovs-dev] " Pravin Shelar
2014-01-29 8:15 3% ` Thomas Graf
2014-01-29 10:26 5% ` Vincent JARDIN
2014-01-29 11:14 4% ` Thomas Graf
2014-01-29 16:34 3% ` Vincent JARDIN
2014-01-29 17:14 3% ` Thomas Graf
2014-01-29 18:42 4% ` Stephen Hemminger
2014-01-29 20:47 0% ` François-Frédéric Ozog
2014-01-29 23:15 3% ` Thomas Graf
2014-03-13 7:37 0% ` David Nyström
2014-02-04 15:54 [dpdk-dev] [PATCH 00/16] recipes for RPM packages Thomas Monjalon
2014-02-04 15:54 ` [dpdk-dev] [PATCH 03/16] pkg: add recipe for RPM Thomas Monjalon
2014-02-26 13:07 ` Thomas Graf
2014-04-02 9:53 3% ` Thomas Monjalon
2014-04-02 11:29 0% ` Neil Horman
2014-04-09 18:39 7% [dpdk-dev] DPDK API/ABI Stability Neil Horman
2014-04-09 21:08 4% ` Stephen Hemminger
2014-04-10 10:54 7% ` Neil Horman
2014-04-10 20:47 [dpdk-dev] [PATCH 0/19] Separate compile time linkage between eal lib and pmd's Neil Horman
2014-04-12 11:04 ` Neil Horman
2014-04-15 8:31 ` Thomas Monjalon
2014-04-15 13:46 3% ` Neil Horman
2014-04-30 0:46 4% [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Thomas Monjalon
2014-04-30 0:46 4% ` [dpdk-dev] [PATCH v2 1/4] pkg: add recipe for RPM Thomas Monjalon
2014-04-30 10:52 0% ` [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Neil Horman
2014-05-01 13:14 0% ` Neil Horman
2014-05-01 21:15 0% ` Thomas Monjalon
2014-05-13 19:08 4% [dpdk-dev] Heads up: Fedora packaging plans Neil Horman
2014-05-19 10:11 0% ` Thomas Monjalon
2014-05-19 13:18 0% ` Neil Horman
2014-05-20 10:00 [dpdk-dev] [PATCH 0/4] New library: rte_distributor Bruce Richardson
2014-05-20 10:00 ` [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library Bruce Richardson
2014-05-20 18:18 4% ` Neil Horman
2014-05-21 10:21 3% ` Richardson, Bruce
2014-05-21 15:23 3% ` Neil Horman
2014-07-24 14:28 11% [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54) Pablo de Lara
2014-07-24 14:54 0% ` Thomas Monjalon
2014-07-24 14:59 0% ` Thomas Monjalon
2014-07-24 15:20 0% ` Chris Wright
2014-08-07 18:31 [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target Konstantin Ananyev
2014-08-07 20:11 4% ` Neil Horman
2014-08-07 20:58 0% ` Vincent JARDIN
2014-08-08 11:49 0% ` Ananyev, Konstantin
2014-08-08 12:25 4% ` Neil Horman
2014-08-08 13:09 3% ` Ananyev, Konstantin
2014-08-08 14:30 3% ` Neil Horman
2014-08-11 22:23 0% ` Thomas Monjalon
2014-08-21 20:15 1% ` [dpdk-dev] [PATCHv3] " Neil Horman