From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id A9F98A04BB; Tue, 6 Oct 2020 17:12:38 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 067891BBE6; Tue, 6 Oct 2020 17:08:40 +0200 (CEST) Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by dpdk.org (Postfix) with ESMTP id 0DAF71BB9A for ; Tue, 6 Oct 2020 17:08:37 +0200 (CEST) IronPort-SDR: YHDhOKq+jaOoYHVkKx4OyiENGvSlDKXgWB+Mt4YcuuFHe8YzhAa3I0D/0RnTdJ/aEaOP7NZPhX elXzHqmzGR/w== X-IronPort-AV: E=McAfee;i="6000,8403,9765"; a="143919677" X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="143919677" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Oct 2020 08:03:52 -0700 IronPort-SDR: Ck/dHGNw8Vfi2XcBwh7odJJKJFe5+QylH59Bl7ZhHjMYhbO7/9d0MVWsDLlcTyIlr2YA2eyomI /Z8urss2bQSA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="518315459" Received: from sivswdev08.ir.intel.com ([10.237.217.47]) by fmsmga005.fm.intel.com with ESMTP; 06 Oct 2020 08:03:49 -0700 From: Konstantin Ananyev To: dev@dpdk.org Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev Date: Tue, 6 Oct 2020 16:03:13 +0100 Message-Id: <20201006150316.5776-12-konstantin.ananyev@intel.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com> References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com> Subject: [dpdk-dev] [PATCH v4 11/14] acl: for AVX512 classify use 4B load whenever possible X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" With current ACL implementation first field in the rule definition has always to be one byte long. Though for optimising classify implementation it might be useful to do 4B reads (as we do for rest of the fields). So at build phase, check user provided field definitions to determine is it safe to do 4B loads for first ACL field. Then at run-time this information can be used to choose classify behavior. Signed-off-by: Konstantin Ananyev --- lib/librte_acl/acl.h | 1 + lib/librte_acl/acl_bld.c | 34 ++++++++++++++++++++++++++++++ lib/librte_acl/acl_run_avx512.c | 2 ++ lib/librte_acl/acl_run_avx512x16.h | 8 +++---- lib/librte_acl/acl_run_avx512x8.h | 8 +++---- lib/librte_acl/rte_acl.c | 1 + 6 files changed, 46 insertions(+), 8 deletions(-) diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h index 7ac0d12f08..4089ab2a04 100644 --- a/lib/librte_acl/acl.h +++ b/lib/librte_acl/acl.h @@ -169,6 +169,7 @@ struct rte_acl_ctx { int32_t socket_id; /** Socket ID to allocate memory from. */ enum rte_acl_classify_alg alg; + uint32_t first_load_sz; void *rules; uint32_t max_rules; uint32_t rule_sz; diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c index d1f920b09c..da10864cd8 100644 --- a/lib/librte_acl/acl_bld.c +++ b/lib/librte_acl/acl_bld.c @@ -1581,6 +1581,37 @@ acl_check_bld_param(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg) return 0; } +/* + * With current ACL implementation first field in the rule definition + * has always to be one byte long. Though for optimising *classify* + * implementation it might be useful to be able to use 4B reads + * (as we do for rest of the fields). + * This function checks input config to determine is it safe to do 4B + * loads for first ACL field. For that we need to make sure that + * first field in our rule definition doesn't have the biggest offset, + * i.e. we still do have other fields located after the first one. + * Contrary if first field has the largest offset, then it means + * first field can occupy the very last byte in the input data buffer, + * and we have to do single byte load for it. + */ +static uint32_t +get_first_load_size(const struct rte_acl_config *cfg) +{ + uint32_t i, max_ofs, ofs; + + ofs = 0; + max_ofs = 0; + + for (i = 0; i != cfg->num_fields; i++) { + if (cfg->defs[i].field_index == 0) + ofs = cfg->defs[i].offset; + else if (max_ofs < cfg->defs[i].offset) + max_ofs = cfg->defs[i].offset; + } + + return (ofs < max_ofs) ? sizeof(uint32_t) : sizeof(uint8_t); +} + int rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg) { @@ -1618,6 +1649,9 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg) /* set data indexes. */ acl_set_data_indexes(ctx); + /* determine can we always do 4B load */ + ctx->first_load_sz = get_first_load_size(cfg); + /* copy in build config. */ ctx->config = *cfg; } diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c index 74698fa2ea..3fd1e33c3f 100644 --- a/lib/librte_acl/acl_run_avx512.c +++ b/lib/librte_acl/acl_run_avx512.c @@ -11,6 +11,7 @@ struct acl_flow_avx512 { uint32_t num_packets; /* number of packets processed */ uint32_t total_packets; /* max number of packets to process */ uint32_t root_index; /* current root index */ + uint32_t first_load_sz; /* first load size for new packet */ const uint64_t *trans; /* transition table */ const uint32_t *data_index; /* input data indexes */ const uint8_t **idata; /* input data */ @@ -24,6 +25,7 @@ acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx, { flow->num_packets = 0; flow->total_packets = total_packets; + flow->first_load_sz = ctx->first_load_sz; flow->root_index = ctx->trie[trie].root_index; flow->trans = ctx->trans_table; flow->data_index = ctx->trie[trie].data_index; diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h index 981f8d16da..a39df8f3c0 100644 --- a/lib/librte_acl/acl_run_avx512x16.h +++ b/lib/librte_acl/acl_run_avx512x16.h @@ -460,7 +460,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2], if (n[0] != 0) { inp[0] = get_next_bytes_avx512x16(flow, &pdata[0], - rm[0], &di[0], sizeof(uint8_t)); + rm[0], &di[0], flow->first_load_sz); first_trans16(flow, inp[0], rm[0], &tr_lo[0], &tr_hi[0]); rm[0] = _mm512_test_epi32_mask(tr_lo[0], @@ -469,7 +469,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2], if (n[1] != 0) { inp[1] = get_next_bytes_avx512x16(flow, &pdata[2], - rm[1], &di[1], sizeof(uint8_t)); + rm[1], &di[1], flow->first_load_sz); first_trans16(flow, inp[1], rm[1], &tr_lo[1], &tr_hi[1]); rm[1] = _mm512_test_epi32_mask(tr_lo[1], @@ -494,9 +494,9 @@ search_trie_avx512x16x2(struct acl_flow_avx512 *flow) start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]); in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0], - sizeof(uint8_t)); + flow->first_load_sz); in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1], - sizeof(uint8_t)); + flow->first_load_sz); first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]); first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]); diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h index cfba0299ed..fedd79b9ae 100644 --- a/lib/librte_acl/acl_run_avx512x8.h +++ b/lib/librte_acl/acl_run_avx512x8.h @@ -418,7 +418,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2], if (n[0] != 0) { inp[0] = get_next_bytes_avx512x8(flow, &pdata[0], - rm[0], &di[0], sizeof(uint8_t)); + rm[0], &di[0], flow->first_load_sz); first_trans8(flow, inp[0], rm[0], &tr_lo[0], &tr_hi[0]); rm[0] = _mm256_test_epi32_mask(tr_lo[0], @@ -427,7 +427,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2], if (n[1] != 0) { inp[1] = get_next_bytes_avx512x8(flow, &pdata[2], - rm[1], &di[1], sizeof(uint8_t)); + rm[1], &di[1], flow->first_load_sz); first_trans8(flow, inp[1], rm[1], &tr_lo[1], &tr_hi[1]); rm[1] = _mm256_test_epi32_mask(tr_lo[1], @@ -452,9 +452,9 @@ search_trie_avx512x8x2(struct acl_flow_avx512 *flow) start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[2], &idx[1], &di[1]); in[0] = get_next_bytes_avx512x8(flow, &pdata[0], UINT8_MAX, &di[0], - sizeof(uint8_t)); + flow->first_load_sz); in[1] = get_next_bytes_avx512x8(flow, &pdata[2], UINT8_MAX, &di[1], - sizeof(uint8_t)); + flow->first_load_sz); first_trans8(flow, in[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]); first_trans8(flow, in[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]); diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c index 245af672ee..f1474038e5 100644 --- a/lib/librte_acl/rte_acl.c +++ b/lib/librte_acl/rte_acl.c @@ -500,6 +500,7 @@ rte_acl_dump(const struct rte_acl_ctx *ctx) printf("acl context <%s>@%p\n", ctx->name, ctx); printf(" socket_id=%"PRId32"\n", ctx->socket_id); printf(" alg=%"PRId32"\n", ctx->alg); + printf(" first_load_sz=%"PRIu32"\n", ctx->first_load_sz); printf(" max_rules=%"PRIu32"\n", ctx->max_rules); printf(" rule_size=%"PRIu32"\n", ctx->rule_sz); printf(" num_rules=%"PRIu32"\n", ctx->num_rules); -- 2.17.1