From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ray Kinsella <mdr@ashroe.eu>
To: Pavan Nikhilesh Bhagavatula, Jerin Jacob Kollanukkaran, Nithin Kumar Dabilpuram
Cc: dev@dpdk.org, thomas@monjalon.net, david.marchand@redhat.com, mattias.ronnblom@ericsson.com, Kiran Kumar Kokkilagadda
References: <20200318213551.3489504-1-jerinj@marvell.com> <20200318213551.3489504-21-jerinj@marvell.com> <02c4c25a-83ba-dac5-20e6-7b140cbcb4f1@ashroe.eu>
Message-ID: <5a99e696-3853-5782-0a4c-0debcc74faa8@ashroe.eu>
Date: Thu, 19 Mar 2020 15:50:36 +0000
Subject: Re: [dpdk-dev] [EXT] Re: [PATCH v1 20/26] node: ipv4 lookup for x86

On 19/03/2020 14:22, Pavan Nikhilesh Bhagavatula wrote:
>> On 18/03/2020 21:35, jerinj@marvell.com wrote:
>>> From: Pavan Nikhilesh
>>>
>>> Add IPv4 lookup process function for ip4_lookup
>>> rte_node. This node performs an LPM lookup using the x86_64
>>> vector (SSE) RTE_LPM API on every packet received
>>> and forwards it to the next node identified by the
>>> lookup result.
>>>
>>> Signed-off-by: Pavan Nikhilesh
>>> Signed-off-by: Nithin Dabilpuram
>>> Signed-off-by: Kiran Kumar K
>>> ---
>>>  lib/librte_node/ip4_lookup.c | 245 +++++++++++++++++++++++++++++++++++
>>>  1 file changed, 245 insertions(+)
>>>
>>> diff --git a/lib/librte_node/ip4_lookup.c b/lib/librte_node/ip4_lookup.c
>>> index d7fcd1158..c003e9c91 100644
>>> --- a/lib/librte_node/ip4_lookup.c
>>> +++ b/lib/librte_node/ip4_lookup.c
>>> @@ -264,6 +264,251 @@ ip4_lookup_node_process(struct rte_graph *graph, struct rte_node *node,
>>>          return nb_objs;
>>>  }
>>>
>>> +#elif defined(RTE_ARCH_X86)
>>> +
>>> +/* X86 SSE */
>>> +static uint16_t
>>> +ip4_lookup_node_process(struct rte_graph *graph, struct rte_node *node,
>>> +                        void **objs, uint16_t nb_objs)
>>> +{
>>> +        struct rte_mbuf *mbuf0, *mbuf1, *mbuf2, *mbuf3, **pkts;
>>> +        rte_edge_t next0, next1, next2, next3, next_index;
>>> +        struct rte_ipv4_hdr *ipv4_hdr;
>>> +        struct rte_ether_hdr *eth_hdr;
>>> +        uint32_t ip0, ip1, ip2, ip3;
>>> +        void **to_next, **from;
>>> +        uint16_t last_spec = 0;
>>> +        uint16_t n_left_from;
>>> +        struct rte_lpm *lpm;
>>> +        uint16_t held = 0;
>>> +        uint32_t drop_nh;
>>> +        rte_xmm_t dst;
>>> +        __m128i dip; /* SSE register */
>>> +        int rc, i;
>>> +
>>> +        /* Speculative next */
>>> +        next_index = RTE_NODE_IP4_LOOKUP_NEXT_REWRITE;
>>> +        /* Drop node */
>>> +        drop_nh = ((uint32_t)RTE_NODE_IP4_LOOKUP_NEXT_PKT_DROP) << 16;
>>> +
>>> +        /* Get socket specific LPM from ctx */
>>> +        lpm = *((struct rte_lpm **)node->ctx);
>>> +
>>> +        pkts = (struct rte_mbuf **)objs;
>>> +        from = objs;
>>> +        n_left_from = nb_objs;
>>
>> I doubt this initial prefetch of the first 4 packets has any benefit.
>
> Ack will remove in v2 for x86.
>
>>
>>> +        if (n_left_from >= 4) {
>>> +                for (i = 0; i < 4; i++) {
>>> +                        rte_prefetch0(rte_pktmbuf_mtod(pkts[i],
>>> +                                struct rte_ether_hdr *) + 1);
>>> +                }
>>> +        }
>>> +
>>> +        /* Get stream for the speculated next node */
>>> +        to_next = rte_node_next_stream_get(graph, node, next_index, nb_objs);
>>
>> Suggest you don't reuse the hand-unrolling optimization from FD.io VPP.
>> I have never found any performance benefit from them, and they
>> make the code unnecessarily verbose.
>>
>
> How would we take the benefit of rte_lpm_lookupx4 without unrolling the loop?
> Also, in future, if we are using rte_rib and rte_fib with a CPU supporting wider
> SIMD, we might need to unroll further (AVX 256 and 512; currently
> rte_lpm_lookup uses only 128 bits since it only uses the SSE extension).

Let the compiler do it for you, by using a constant vector length.

        for (int i = 0; i < 4; ++i) {
                ...
        }
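Roughly this shape for the header handling - just a sketch, untested, reusing the
locals from the quoted patch; the dips[] temporary is mine:

        uint32_t dips[4];

        for (i = 0; i < 4; i++) {
                eth_hdr = rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
                ipv4_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);

                /* Constant trip count, the compiler can unroll this itself. */
                dips[i] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
                rte_node_mbuf_priv1(pkts[i])->cksum = ipv4_hdr->hdr_checksum;
                rte_node_mbuf_priv1(pkts[i])->ttl = ipv4_hdr->time_to_live;
        }

        /* The x4 lookup is unchanged; the byte swap moved into the loop
         * (rte_be_to_cpu_32), so the shuffle mask is no longer needed. */
        dip = _mm_set_epi32(dips[3], dips[2], dips[1], dips[0]);
        rte_lpm_lookupx4(lpm, dip, dst.u32, drop_nh);

You still get the benefit of rte_lpm_lookupx4; you just don't hand-unroll the
per-mbuf header extraction around it.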
>
>>
>>> +        while (n_left_from >= 4) {
>>> +                /* Prefetch next-next mbufs */
>>> +                if (likely(n_left_from >= 11)) {
>>> +                        rte_prefetch0(pkts[8]);
>>> +                        rte_prefetch0(pkts[9]);
>>> +                        rte_prefetch0(pkts[10]);
>>> +                        rte_prefetch0(pkts[11]);
>>> +                }
>>> +
>>> +                /* Prefetch next mbuf data */
>>> +                if (likely(n_left_from >= 7)) {
>>> +                        rte_prefetch0(rte_pktmbuf_mtod(pkts[4],
>>> +                                struct rte_ether_hdr *) + 1);
>>> +                        rte_prefetch0(rte_pktmbuf_mtod(pkts[5],
>>> +                                struct rte_ether_hdr *) + 1);
>>> +                        rte_prefetch0(rte_pktmbuf_mtod(pkts[6],
>>> +                                struct rte_ether_hdr *) + 1);
>>> +                        rte_prefetch0(rte_pktmbuf_mtod(pkts[7],
>>> +                                struct rte_ether_hdr *) + 1);
>>> +                }
>>> +
>>> +                mbuf0 = pkts[0];
>>> +                mbuf1 = pkts[1];
>>> +                mbuf2 = pkts[2];
>>> +                mbuf3 = pkts[3];
>>> +
>>> +                pkts += 4;
>>> +                n_left_from -= 4;
>>> +
>>> +                /* Extract DIP of mbuf0 */
>>> +                eth_hdr = rte_pktmbuf_mtod(mbuf0, struct rte_ether_hdr *);
>>> +                ipv4_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
>>> +                ip0 = ipv4_hdr->dst_addr;
>>> +                /* Extract cksum, ttl as ipv4 hdr is in cache */
>>> +                rte_node_mbuf_priv1(mbuf0)->cksum = ipv4_hdr->hdr_checksum;
>>> +                rte_node_mbuf_priv1(mbuf0)->ttl = ipv4_hdr->time_to_live;
>>> +
>>> +                /* Extract DIP of mbuf1 */
>>> +                eth_hdr = rte_pktmbuf_mtod(mbuf1, struct rte_ether_hdr *);
>>> +                ipv4_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
>>> +                ip1 = ipv4_hdr->dst_addr;
>>> +                /* Extract cksum, ttl as ipv4 hdr is in cache */
>>> +                rte_node_mbuf_priv1(mbuf1)->cksum = ipv4_hdr->hdr_checksum;
>>> +                rte_node_mbuf_priv1(mbuf1)->ttl = ipv4_hdr->time_to_live;
>>> +
>>> +                /* Extract DIP of mbuf2 */
>>> +                eth_hdr = rte_pktmbuf_mtod(mbuf2, struct rte_ether_hdr *);
>>> +                ipv4_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
>>> +                ip2 = ipv4_hdr->dst_addr;
>>> +                /* Extract cksum, ttl as ipv4 hdr is in cache */
>>> +                rte_node_mbuf_priv1(mbuf2)->cksum = ipv4_hdr->hdr_checksum;
>>> +                rte_node_mbuf_priv1(mbuf2)->ttl = ipv4_hdr->time_to_live;
>>> +
>>> +                /* Extract DIP of mbuf3 */
>>> +                eth_hdr = rte_pktmbuf_mtod(mbuf3, struct rte_ether_hdr *);
>>> +                ipv4_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
>>> +                ip3 = ipv4_hdr->dst_addr;
>>> +
>>> +                /* Prepare for lookup x4 */
>>> +                dip = _mm_set_epi32(ip3, ip2, ip1, ip0);
>>> +
>>> +                /* Byte swap 4 IPV4 addresses. */
>>> +                const __m128i bswap_mask = _mm_set_epi8(
>>> +                        12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3);
>>> +                dip = _mm_shuffle_epi8(dip, bswap_mask);
>>> +
>>> +                /* Extract cksum, ttl as ipv4 hdr is in cache */
>>> +                rte_node_mbuf_priv1(mbuf3)->cksum = ipv4_hdr->hdr_checksum;
>>> +                rte_node_mbuf_priv1(mbuf3)->ttl = ipv4_hdr->time_to_live;
>>> +
>>> +                /* Perform LPM lookup to get NH and next node */
>>> +                rte_lpm_lookupx4(lpm, dip, dst.u32, drop_nh);
>>> +
>>> +                /* Extract next node id and NH */
>>> +                rte_node_mbuf_priv1(mbuf0)->nh = dst.u32[0] & 0xFFFF;
>>> +                next0 = (dst.u32[0] >> 16);
>>> +
>>> +                rte_node_mbuf_priv1(mbuf1)->nh = dst.u32[1] & 0xFFFF;
>>> +                next1 = (dst.u32[1] >> 16);
>>> +
>>> +                rte_node_mbuf_priv1(mbuf2)->nh = dst.u32[2] & 0xFFFF;
>>> +                next2 = (dst.u32[2] >> 16);
>>> +
>>> +                rte_node_mbuf_priv1(mbuf3)->nh = dst.u32[3] & 0xFFFF;
>>> +                next3 = (dst.u32[3] >> 16);
>>> +
>>> +                /* Enqueue four to next node */
>>> +                rte_edge_t fix_spec =
>>> +                        (next_index ^ next0) | (next_index ^ next1) |
>>> +                        (next_index ^ next2) | (next_index ^ next3);
>>> +
>>> +                if (unlikely(fix_spec)) {
>>> +                        /* Copy things successfully speculated till now */
>>> +                        rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
>>> +                        from += last_spec;
>>> +                        to_next += last_spec;
>>> +                        held += last_spec;
>>> +                        last_spec = 0;
>>> +
>>> +                        /* Next0 */
>>> +                        if (next_index == next0) {
>>> +                                to_next[0] = from[0];
>>> +                                to_next++;
>>> +                                held++;
>>> +                        } else {
>>> +                                rte_node_enqueue_x1(graph, node, next0, from[0]);
>>> +                        }
>>> +
>>> +                        /* Next1 */
>>> +                        if (next_index == next1) {
>>> +                                to_next[0] = from[1];
>>> +                                to_next++;
>>> +                                held++;
>>> +                        } else {
>>> +                                rte_node_enqueue_x1(graph, node, next1, from[1]);
>>> +                        }
>>> +
>>> +                        /* Next2 */
>>> +                        if (next_index == next2) {
>>> +                                to_next[0] = from[2];
>>> +                                to_next++;
>>> +                                held++;
>>> +                        } else {
>>> +                                rte_node_enqueue_x1(graph, node, next2, from[2]);
>>> +                        }
>>> +
>>> +                        /* Next3 */
>>> +                        if (next_index == next3) {
>>> +                                to_next[0] = from[3];
>>> +                                to_next++;
>>> +                                held++;
>>> +                        } else {
>>> +                                rte_node_enqueue_x1(graph, node, next3, from[3]);
>>> +                        }
>>> +
>>> +                        from += 4;
>>> +
>>> +                } else {
>>> +                        last_spec += 4;
>>> +                }
>>> +        }
>>> +
>>> +        while (n_left_from > 0) {
>>> +                uint32_t next_hop;
>>> +
>>> +                mbuf0 = pkts[0];
>>> +
>>> +                pkts += 1;
>>> +                n_left_from -= 1;
>>> +
>>> +                /* Extract DIP of mbuf0 */
>>> +                eth_hdr = rte_pktmbuf_mtod(mbuf0, struct rte_ether_hdr *);
>>> +                ipv4_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
>>> +                /* Extract cksum, ttl as ipv4 hdr is in cache */
>>> +                rte_node_mbuf_priv1(mbuf0)->cksum = ipv4_hdr->hdr_checksum;
>>> +                rte_node_mbuf_priv1(mbuf0)->ttl = ipv4_hdr->time_to_live;
>>> +
>>> +                rc = rte_lpm_lookup(lpm, rte_be_to_cpu_32(ipv4_hdr->dst_addr),
>>> +                                    &next_hop);
>>> +                next_hop = (rc == 0) ? next_hop : drop_nh;
>>> +
>>> +                rte_node_mbuf_priv1(mbuf0)->nh = next_hop & 0xFFFF;
>>> +                next0 = (next_hop >> 16);
>>> +
>>> +                if (unlikely(next_index ^ next0)) {
>>> +                        /* Copy things successfully speculated till now */
>>> +                        rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
>>> +                        from += last_spec;
>>> +                        to_next += last_spec;
>>> +                        held += last_spec;
>>> +                        last_spec = 0;
>>> +
>>> +                        rte_node_enqueue_x1(graph, node, next0, from[0]);
>>> +                        from += 1;
>>> +                } else {
>>> +                        last_spec += 1;
>>> +                }
>>> +        }
>>> +
>>> +        /* !!! Home run !!! */
>>> +        if (likely(last_spec == nb_objs)) {
>>> +                rte_node_next_stream_move(graph, node, next_index);
>>> +                return nb_objs;
>>> +        }
>>> +
>>> +        held += last_spec;
>>> +        /* Copy things successfully speculated till now */
>>> +        rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
>>> +        rte_node_next_stream_put(graph, node, next_index, held);
>>> +
>>> +        return nb_objs;
>>> +}
>>> +
>>>  #else
>>>
>>>  static uint16_t
>>>
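One more thought on the verbosity point: the per-packet header handling above is
repeated four times in the main loop and once more in the remainder loop. A small
helper would remove most of that duplication - purely illustrative, the name
ip4_lookup_extract is mine, not something in the patch:

        static __rte_always_inline uint32_t
        ip4_lookup_extract(struct rte_mbuf *m)
        {
                struct rte_ether_hdr *eth =
                        rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
                struct rte_ipv4_hdr *hdr = (struct rte_ipv4_hdr *)(eth + 1);

                /* Stash cksum/ttl while the header is hot in cache. */
                rte_node_mbuf_priv1(m)->cksum = hdr->hdr_checksum;
                rte_node_mbuf_priv1(m)->ttl = hdr->time_to_live;

                /* Destination address in host byte order, ready for the LPM. */
                return rte_be_to_cpu_32(hdr->dst_addr);
        }

Both the four-wide loop and the scalar tail then collapse to calls to this helper,
and the only difference left between them is rte_lpm_lookupx4 versus
rte_lpm_lookup.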