From mboxrd@z Thu Jan 1 00:00:00 1970
To: Pavan Nikhilesh Bhagavatula, Jerin Jacob Kollanukkaran, Nithin Kumar Dabilpuram
Cc: dev@dpdk.org, thomas@monjalon.net, david.marchand@redhat.com, mattias.ronnblom@ericsson.com, Kiran Kumar Kokkilagadda
References: <20200318213551.3489504-1-jerinj@marvell.com> <20200318213551.3489504-21-jerinj@marvell.com> <02c4c25a-83ba-dac5-20e6-7b140cbcb4f1@ashroe.eu> <5a99e696-3853-5782-0a4c-0debcc74faa8@ashroe.eu> <20a3cb35-d57b-4799-f084-919f3f55da6f@ashroe.eu> <7a6842d9-43fd-1746-113f-887d6afb16e5@ashroe.eu>
From: Ray Kinsella
Date: Thu, 26 Mar 2020 16:54:14 +0000
Subject: Re: [dpdk-dev] [EXT] Re: [PATCH v1 20/26] node: ipv4 lookup for x86
List-Id: DPDK patches and discussions

On 26/03/2020 09:56, Pavan Nikhilesh Bhagavatula wrote:
>
>> -----Original Message-----
>> From: Ray Kinsella
>> Sent: Tuesday, March 24, 2020 8:08 PM
>> To: Pavan Nikhilesh Bhagavatula; Jerin Jacob Kollanukkaran; Nithin Kumar Dabilpuram
>> Cc: dev@dpdk.org; thomas@monjalon.net; david.marchand@redhat.com; mattias.ronnblom@ericsson.com; Kiran Kumar Kokkilagadda
>> Subject: Re: [EXT] Re: [dpdk-dev] [PATCH v1 20/26] node: ipv4 lookup for x86
>>
>>
>> On 24/03/2020 09:40, Pavan Nikhilesh Bhagavatula wrote:
>>> Hi Ray,
>>>
>>> I have tried to avoid hand unrolling loops and found the following observations.
>>>
>>> 1. Although it decreases LOC it also takes away readability too.
>>> Example:
>>> Avoiding unrolled code below
>> [SNIP]
>>> Which is kind of unreadable.
>>
>> I am confused - isn't it exactly the same code?
>> You still haven't completely unrolled the loop either?
>>
>> I don't know how one is readable and the other is not.
>
> I guess it’s a matter of personal preference.

You have a rare preference for verbosity, sir.

>
>>
>>>
>>> 2. Not all compilers are made equal. I found that most of the compilers don’t
>>> unroll the loop above even when compiled with `-funroll-all-loops`.
>>> I have checked with the following compilers:
>>> GCC 9.2.0
>>> Clang 9.0.1
>>> Aarch64 GCC 7.3.0
>>> Aarch64 GCC 9.2.0
>>
>> Compilers have been unrolling fixed-length loops for a long time - this
>> isn't new technology.
>>
>
> In theory, I agree with your view, but even the latest compiler is not doing a decent
> job of unrolling the loop.
> We can revisit this scheme if and when the compiler is smart enough to do this, as just unrolling
> the loops is not good enough. It has to do it better than hand unrolling.
> For example, on arm64 GCC doesn’t merge loads/stores into load/store pairs when unrolling.
> (Both ldr/str and ldp/stp latency is 3cyc, and effectively the ldp cost is halved.)
>
> Even on x86 we see an extra ~100 cycles, as mentioned in [1].
>
>> If the compiler isn't unrolling, you are doing something that makes it
>> think it is a bad idea.
>> Hand unrolling the loop isn't the solution; understanding what the
>> compiler is doing is a better idea.
>>
>> Insert the following in front of your for loop, to indicate to the compiler what you
>> want to do:
>> #pragma unroll BUF_PER_LOOP
>
> Can you check which versions of compiler this pragma works on?
https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html
https://clang.llvm.org/docs/AttributeReference.html

>
> Most of the gcc versions that I have tried just spit out the following warning.
> ../lib/librte_node/ip4_rewrite.c:59: warning: ignoring #pragma unroll BUF_PER_LOOP [-Wunknown-pragmas]
>    59 | #pragma unroll BUF_PER_LOOP

Not sure on this one ...

>>
>> With clang you can ask it why it is not unrolling the loop with the
>> following switches.
>> (output is verbose, but the reason is in there).
>>
>> -Rpass=loop-unroll -Rpass-missed=loop-unroll
>>
>
> ../lib/librte_node/ip4_rewrite.c:57:2: remark: unrolled loop by a factor of 4 with run-time trip count [-Rpass=loop-unroll]
>         for (i = 0; i < nb_objs; i++) {
>         ^

That is good ...

>>>
>>> 3. Performance-wise I see a lot of degradation on our platform, at least 13%.
>>
>> Is the loop being unrolled?
>>
>
> Yes, in a suboptimal way. https://pastebin.com/nkXvzMiW
>
> We decided to stick with hand unrolling based on the test result, instead of depending on the compiler's mercy to do it for us [2].

Why not code in assembler then, taking it to its logical conclusion?

> If you think it can be improved then please submit a patch.

I don't have an ARM platform to test on.

From my own work in FD.io VPP, which you are clearly borrowing heavily from, to put it kindly, I have never seen an advantage from hand-unrolling compared to letting the compiler do it. Perhaps this case is the exception; I would be suspicious, though, that the compiler is making bad decisions for some reason.

In any case, if you (and presumably the maintainer) are happy with verbose code like this making its way into the codebase, I won't argue with it.

> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
>>> On IA with a Broadwell (Xeon E5-2690) and i40e the performance
>>> remains the same w.r.t. Rx/Tx, since the
>>> hotspot is in the Tx path of the driver which limits the per core
>>> But the performance difference in number of cycles per node can
>>> be seen below:
>>>
>
> [1]
>
>>> Hand unrolling:
>>> +-------------------------------+---------------+---------------+---------------+---------------+---------------+-----------+
>>> |Node                           |calls          |objs           |realloc_count  |objs/call      |objs/sec(10E6) |cycles/call|
>>> +-------------------------------+---------------+---------------+---------------+---------------+---------------+-----------+
>>> |ip4_lookup                     |7765918        |248509344      |1              |32.000         |27.725408      |779.0000   |
>>> |ip4_rewrite                    |7765925        |248509568      |1              |32.000         |27.725408      |425.0000   |
>>> |ethdev_tx-1                    |7765927        |204056223      |1              |26.000         |22.762720      |597.0000   |
>>> |pkt_drop                       |1389170        |44453409       |1              |32.000         |4.962688       |298.0000   |
>>> |ethdev_rx-0-0                  |63604111       |248509792      |2              |32.000         |27.725408      |982.0000   |
>>> +-------------------------------+---------------+---------------+---------------+---------------+---------------+-----------+
>>>
>>> W/o unrolling:
>>>
>>> +-------------------------------+---------------+---------------+---------------+---------------+---------------+-----------+
>>> |Node                           |calls          |objs           |realloc_count  |objs/call      |objs/sec(10E6) |cycles/call|
>>> +-------------------------------+---------------+---------------+---------------+---------------+---------------+-----------+
>>> |ip4_lookup                     |18864640       |603668448      |1              |32.000         |26.051328      |828.0000   |
>>> |ip4_rewrite                    |18864646       |603668640      |1              |32.000         |26.051328      |534.0000   |
>>> |ethdev_tx-1                    |18864648       |527874175      |1              |27.000         |22.780256      |633.0000   |
>>> |pkt_drop                       |2368580        |75794529       |1              |32.000         |3.271072       |286.0000   |
>>> |ethdev_rx-0-0                  |282058226      |603668864      |2              |32.000         |26.051328      |994.0000   |
>>> +-------------------------------+---------------+---------------+---------------+---------------+---------------+-----------+
>>>
>>> Considering the above findings we would like to continue unrolling
>>> the loops by hand.
>>>
>>> Regards,
>>> Pavan.
>>>
>>>> -----Original Message-----
>>>> From: Ray Kinsella
>>>> Sent: Friday, March 20, 2020 2:44 PM
>>>> Subject: Re: [EXT] Re: [dpdk-dev] [PATCH v1 20/26] node: ipv4 lookup for x86
>>>>
>>>> On 19/03/2020 16:13, Pavan Nikhilesh Bhagavatula wrote:
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ray Kinsella
>>>>>> Sent: Thursday, March 19, 2020 9:21 PM
>>>>>> Subject: Re: [EXT] Re: [dpdk-dev] [PATCH v1 20/26] node: ipv4 lookup for x86
>>>>>>
>>>>>> On 19/03/2020 14:22, Pavan Nikhilesh Bhagavatula wrote:
>>>>>>>> On 18/03/2020 21:35, jerinj@marvell.com wrote:
>>>>>>>>> From: Pavan Nikhilesh
>>>>>>>>>
>>>>>>>>> Add IPv4 lookup process function for ip4_lookup
>>>>>>>>> rte_node. This node performs LPM lookup using the x86_64
>>>>>>>>> vector-supported RTE_LPM API on every packet received
>>>>>>>>> and forwards it to a next node that is identified by
>>>>>>>>> lookup result.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Pavan Nikhilesh
>>>>>>>>> Signed-off-by: Nithin Dabilpuram
>>>>>>>>> Signed-off-by: Kiran Kumar K
>>>>>>>>> ---
>>>>>>>>>  lib/librte_node/ip4_lookup.c | 245 +++++++++++++++++++++++++++++++++++
>>>>>>>>>  1 file changed, 245 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/lib/librte_node/ip4_lookup.c b/lib/librte_node/ip4_lookup.c
>>>>>>>>> index d7fcd1158..c003e9c91 100644
>>>>>>>>> --- a/lib/librte_node/ip4_lookup.c
>>>>>>>>> +++ b/lib/librte_node/ip4_lookup.c
>>>>>>>>> @@ -264,6 +264,251 @@ ip4_lookup_node_process(struct rte_graph *graph, struct rte_node *node,
>>>>>>>>>  	return nb_objs;
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> +#elif defined(RTE_ARCH_X86)
>>>>>>>>> +
>>>>>>>>> +/* X86 SSE */
>>>>>>>>> +static uint16_t
>>>>>>>>> +ip4_lookup_node_process(struct rte_graph *graph, struct rte_node *node,
>>>>>>>>> +			void **objs, uint16_t nb_objs)
>>>>>>>>> +{
>>>>>>>>> +	struct rte_mbuf *mbuf0, *mbuf1, *mbuf2, *mbuf3, **pkts;
>>>>>>>>> +	rte_edge_t next0, next1, next2, next3, next_index;
>>>>>>>>> +	struct rte_ipv4_hdr *ipv4_hdr;
>>>>>>>>> +	struct rte_ether_hdr *eth_hdr;
>>>>>>>>> +	uint32_t ip0, ip1, ip2, ip3;
>>>>>>>>> +	void **to_next, **from;
>>>>>>>>> +	uint16_t last_spec = 0;
>>>>>>>>> +	uint16_t n_left_from;
>>>>>>>>> +	struct rte_lpm *lpm;
>>>>>>>>> +	uint16_t held = 0;
>>>>>>>>> +	uint32_t drop_nh;
>>>>>>>>> +	rte_xmm_t dst;
>>>>>>>>> +	__m128i dip; /* SSE register */
>>>>>>>>> +	int rc, i;
>>>>>>>>> +
>>>>>>>>> +	/* Speculative next */
>>>>>>>>> +	next_index = RTE_NODE_IP4_LOOKUP_NEXT_REWRITE;
>>>>>>>>> +	/* Drop node */
>>>>>>>>> +	drop_nh = ((uint32_t)RTE_NODE_IP4_LOOKUP_NEXT_PKT_DROP) << 16;
>>>>>>>>> +
>>>>>>>>> +	/* Get socket specific LPM from ctx */
>>>>>>>>> +	lpm = *((struct rte_lpm **)node->ctx);
>>>>>>>>> +
>>>>>>>>> +	pkts = (struct rte_mbuf **)objs;
>>>>>>>>> +	from = objs;
>>>>>>>>> +	n_left_from = nb_objs;
>>>>>>>>
>>>>>>>> I doubt this initial
prefetch of the first 4 packets has any benefit.
>>>>>>>
>>>>>>> Ack, will remove in v2 for x86.
>>>>>>>
>>>>>>>>
>>>>>>>>> +	if (n_left_from >= 4) {
>>>>>>>>> +		for (i = 0; i < 4; i++) {
>>>>>>>>> +			rte_prefetch0(rte_pktmbuf_mtod(pkts[i],
>>>>>>>>> +					struct rte_ether_hdr *) + 1);
>>>>>>>>> +		}
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	/* Get stream for the speculated next node */
>>>>>>>>> +	to_next = rte_node_next_stream_get(graph, node, next_index, nb_objs);
>>>>>>>>
>>>>>>>> Suggest you don't reuse the hand-unrolling optimization from FD.io VPP.
>>>>>>>> I have never found any performance benefit from them, and they
>>>>>>>> make the code unnecessarily verbose.
>>>>>>>>
>>>>>>>
>>>>>>> How would we take the benefit of rte_lpm_lookupx4 without unrolling the loop?
>>>>>>> Also, in future, if we are using rte_rib and fib with a CPU supporting
>>>>>>> wider SIMD we might need to unroll them further (AVX256 and 512; currently
>>>>>>> rte_lpm_lookup uses only 128bit since it only uses the SSE extension).
>>>>>>
>>>>>> Let the compiler do it for you, but using a constant vector length.
>>>>>> for (int i = 0; i < 4; ++i) { ... }
>>>>>>
>>>>>
>>>>> Ok, I think I misunderstood the previous comment.
>>>>> It was only for the prefetches in the loop, right?
>>>>
>>>> No, it was for all the needless repetition.
>>>> Hand-unrolling loops serve no purpose but to add verbosity.
>>>>
>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> +	while (n_left_from >= 4) {
>>>>>>>>> +		/* Prefetch next-next mbufs */
>>>>>>>>> +		if (likely(n_left_from >= 11)) {
>>>>>>>>> +			rte_prefetch0(pkts[8]);
>>>>>>>>> +			rte_prefetch0(pkts[9]);
>>>>>>>>> +			rte_prefetch0(pkts[10]);
>>>>>>>>> +			rte_prefetch0(pkts[11]);
>>>>>>>>> +		}
>>>>>>>>> +
>>>>>>>>> +		/* Prefetch next mbuf data */
>>>>>>>>> +		if (likely(n_left_from >= 7)) {
>>>>>>>>> +			rte_prefetch0(rte_pktmbuf_mtod(pkts[4],
>>>>>>>>> +					struct rte_ether_hdr *) + 1);
>>>>>>>>> +			rte_prefetch0(rte_pktmbuf_mtod(pkts[5],
>>>>>>>>> +					struct rte_ether_hdr *) + 1);
>>>>>>>>> +			rte_prefetch0(rte_pktmbuf_mtod(pkts[6],
>>>>>>>>> +					struct rte_ether_hdr *) + 1);
>>>>>>>>> +			rte_prefetch0(rte_pktmbuf_mtod(pkts[7],
>>>>>>>>> +					struct rte_ether_hdr *) + 1);
>>>>>>>>> +		}
>>>>>>>>> +
>>>>>>>>> +		mbuf0 = pkts[0];
>>>>>>>>> +		mbuf1 = pkts[1];
>>>>>>>>> +		mbuf2 = pkts[2];
>>>>>>>>> +		mbuf3 = pkts[3];
>>>>>>>>> +
>>>>>>>>> +		pkts += 4;
>>>>>>>>> +		n_left_from -= 4;
>>>>>>>>> +
>>>>>>>>> +		/* Extract DIP of mbuf0 */
>>>>>>>>> +		eth_hdr = rte_pktmbuf_mtod(mbuf0, struct rte_ether_hdr *);
>>>>>>>>> +		ipv4_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
>>>>>>>>> +		ip0 = ipv4_hdr->dst_addr;
>>>>>>>>> +		/* Extract cksum, ttl as ipv4 hdr is in cache */
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf0)->cksum = ipv4_hdr->hdr_checksum;
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf0)->ttl = ipv4_hdr->time_to_live;
>>>>>>>>> +
>>>>>>>>> +		/* Extract DIP of mbuf1 */
>>>>>>>>> +		eth_hdr = rte_pktmbuf_mtod(mbuf1, struct rte_ether_hdr *);
>>>>>>>>> +		ipv4_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
>>>>>>>>> +		ip1 = ipv4_hdr->dst_addr;
>>>>>>>>> +		/* Extract cksum, ttl as ipv4 hdr is in cache */
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf1)->cksum = ipv4_hdr->hdr_checksum;
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf1)->ttl
= ipv4_hdr->time_to_live;
>>>>>>>>> +
>>>>>>>>> +		/* Extract DIP of mbuf2 */
>>>>>>>>> +		eth_hdr = rte_pktmbuf_mtod(mbuf2, struct rte_ether_hdr *);
>>>>>>>>> +		ipv4_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
>>>>>>>>> +		ip2 = ipv4_hdr->dst_addr;
>>>>>>>>> +		/* Extract cksum, ttl as ipv4 hdr is in cache */
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf2)->cksum = ipv4_hdr->hdr_checksum;
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf2)->ttl = ipv4_hdr->time_to_live;
>>>>>>>>> +
>>>>>>>>> +		/* Extract DIP of mbuf3 */
>>>>>>>>> +		eth_hdr = rte_pktmbuf_mtod(mbuf3, struct rte_ether_hdr *);
>>>>>>>>> +		ipv4_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
>>>>>>>>> +		ip3 = ipv4_hdr->dst_addr;
>>>>>>>>> +
>>>>>>>>> +		/* Prepare for lookup x4 */
>>>>>>>>> +		dip = _mm_set_epi32(ip3, ip2, ip1, ip0);
>>>>>>>>> +
>>>>>>>>> +		/* Byte swap 4 IPV4 addresses. */
>>>>>>>>> +		const __m128i bswap_mask = _mm_set_epi8(
>>>>>>>>> +			12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3);
>>>>>>>>> +		dip = _mm_shuffle_epi8(dip, bswap_mask);
>>>>>>>>> +
>>>>>>>>> +		/* Extract cksum, ttl as ipv4 hdr is in cache */
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf3)->cksum = ipv4_hdr->hdr_checksum;
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf3)->ttl = ipv4_hdr->time_to_live;
>>>>>>>>> +
>>>>>>>>> +		/* Perform LPM lookup to get NH and next node */
>>>>>>>>> +		rte_lpm_lookupx4(lpm, dip, dst.u32, drop_nh);
>>>>>>>>> +
>>>>>>>>> +		/* Extract next node id and NH */
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf0)->nh = dst.u32[0] & 0xFFFF;
>>>>>>>>> +		next0 = (dst.u32[0] >> 16);
>>>>>>>>> +
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf1)->nh = dst.u32[1] & 0xFFFF;
>>>>>>>>> +		next1 = (dst.u32[1] >> 16);
>>>>>>>>> +
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf2)->nh = dst.u32[2] & 0xFFFF;
>>>>>>>>> +		next2 = (dst.u32[2] >> 16);
>>>>>>>>> +
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf3)->nh = dst.u32[3] & 0xFFFF;
>>>>>>>>> +		next3 =
(dst.u32[3] >> 16);
>>>>>>>>> +
>>>>>>>>> +		/* Enqueue four to next node */
>>>>>>>>> +		rte_edge_t fix_spec =
>>>>>>>>> +			(next_index ^ next0) | (next_index ^ next1) |
>>>>>>>>> +			(next_index ^ next2) | (next_index ^ next3);
>>>>>>>>> +
>>>>>>>>> +		if (unlikely(fix_spec)) {
>>>>>>>>> +			/* Copy things successfully speculated till now */
>>>>>>>>> +			rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
>>>>>>>>> +			from += last_spec;
>>>>>>>>> +			to_next += last_spec;
>>>>>>>>> +			held += last_spec;
>>>>>>>>> +			last_spec = 0;
>>>>>>>>> +
>>>>>>>>> +			/* Next0 */
>>>>>>>>> +			if (next_index == next0) {
>>>>>>>>> +				to_next[0] = from[0];
>>>>>>>>> +				to_next++;
>>>>>>>>> +				held++;
>>>>>>>>> +			} else {
>>>>>>>>> +				rte_node_enqueue_x1(graph, node, next0, from[0]);
>>>>>>>>> +			}
>>>>>>>>> +
>>>>>>>>> +			/* Next1 */
>>>>>>>>> +			if (next_index == next1) {
>>>>>>>>> +				to_next[0] = from[1];
>>>>>>>>> +				to_next++;
>>>>>>>>> +				held++;
>>>>>>>>> +			} else {
>>>>>>>>> +				rte_node_enqueue_x1(graph, node, next1, from[1]);
>>>>>>>>> +			}
>>>>>>>>> +
>>>>>>>>> +			/* Next2 */
>>>>>>>>> +			if (next_index == next2) {
>>>>>>>>> +				to_next[0] = from[2];
>>>>>>>>> +				to_next++;
>>>>>>>>> +				held++;
>>>>>>>>> +			} else {
>>>>>>>>> +				rte_node_enqueue_x1(graph, node, next2, from[2]);
>>>>>>>>> +			}
>>>>>>>>> +
>>>>>>>>> +			/* Next3 */
>>>>>>>>> +			if (next_index == next3) {
>>>>>>>>> +				to_next[0] = from[3];
>>>>>>>>> +				to_next++;
>>>>>>>>> +				held++;
>>>>>>>>> +			} else {
>>>>>>>>> +				rte_node_enqueue_x1(graph, node, next3, from[3]);
>>>>>>>>> +			}
>>>>>>>>> +
>>>>>>>>> +			from += 4;
>>>>>>>>> +
>>>>>>>>> +		} else {
>>>>>>>>> +			last_spec += 4;
>>>>>>>>> +		}
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	while (n_left_from > 0) {
>>>>>>>>> +		uint32_t next_hop;
>>>>>>>>> +
>>>>>>>>> +		mbuf0 = pkts[0];
>>>>>>>>> +
>>>>>>>>> +		pkts += 1;
>>>>>>>>> +		n_left_from -= 1;
>>>>>>>>> +
>>>>>>>>> +		/*
Extract DIP of mbuf0 */
>>>>>>>>> +		eth_hdr = rte_pktmbuf_mtod(mbuf0, struct rte_ether_hdr *);
>>>>>>>>> +		ipv4_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
>>>>>>>>> +		/* Extract cksum, ttl as ipv4 hdr is in cache */
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf0)->cksum = ipv4_hdr->hdr_checksum;
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf0)->ttl = ipv4_hdr->time_to_live;
>>>>>>>>> +
>>>>>>>>> +		rc = rte_lpm_lookup(lpm, rte_be_to_cpu_32(ipv4_hdr->dst_addr),
>>>>>>>>> +				    &next_hop);
>>>>>>>>> +		next_hop = (rc == 0) ? next_hop : drop_nh;
>>>>>>>>> +
>>>>>>>>> +		rte_node_mbuf_priv1(mbuf0)->nh = next_hop & 0xFFFF;
>>>>>>>>> +		next0 = (next_hop >> 16);
>>>>>>>>> +
>>>>>>>>> +		if (unlikely(next_index ^ next0)) {
>>>>>>>>> +			/* Copy things successfully speculated till now */
>>>>>>>>> +			rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
>>>>>>>>> +			from += last_spec;
>>>>>>>>> +			to_next += last_spec;
>>>>>>>>> +			held += last_spec;
>>>>>>>>> +			last_spec = 0;
>>>>>>>>> +
>>>>>>>>> +			rte_node_enqueue_x1(graph, node, next0, from[0]);
>>>>>>>>> +			from += 1;
>>>>>>>>> +		} else {
>>>>>>>>> +			last_spec += 1;
>>>>>>>>> +		}
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	/* !!! Home run !!! */
>>>>>>>>> +	if (likely(last_spec == nb_objs)) {
>>>>>>>>> +		rte_node_next_stream_move(graph, node, next_index);
>>>>>>>>> +		return nb_objs;
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	held += last_spec;
>>>>>>>>> +	/* Copy things successfully speculated till now */
>>>>>>>>> +	rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
>>>>>>>>> +	rte_node_next_stream_put(graph, node, next_index, held);
>>>>>>>>> +
>>>>>>>>> +	return nb_objs;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> #else
>>>>>>>>>
>>>>>>>>> static uint16_t
>>>>>>>>>