From: Jianbo Liu
Date: Thu, 11 May 2017 11:16:14 +0800
To: "Sekhar, Ashwin"
Cc:
"tomasz.kantecki@intel.com", "Jacob, Jerin", "dev@dpdk.org"
Subject: Re: [dpdk-dev] [PATCH v2 5/7] examples/l3fwd: add neon support for l3fwd

Hi Ashwin,

On 10 May 2017 at 23:00, Sekhar, Ashwin wrote:
> Hi Jianbo,
>
> Thanks for version v2. The addition of the prefetch instructions is
> definitely helping performance on ThunderX, but performance is still
> slightly below that of the scalar version.
>
> I tried a few small tweaks which helped improve performance on my
> ThunderX setup. For details see the comments inline.
>
> On Wed, 2017-05-10 at 10:30 +0800, Jianbo Liu wrote:
>> Use ARM NEON intrinsics to accelerate L3 forwarding.
>>
>> Signed-off-by: Jianbo Liu
>> ---
>>  examples/l3fwd/l3fwd_em.c            |   4 +-
>>  examples/l3fwd/l3fwd_em_hlm.h        |  19 ++-
>>  examples/l3fwd/l3fwd_em_hlm_neon.h   |  74 ++++++++++
>>  examples/l3fwd/l3fwd_em_sequential.h |  20 ++-
>>  examples/l3fwd/l3fwd_lpm.c           |   4 +-
>>  examples/l3fwd/l3fwd_lpm_neon.h      | 165 ++++++++++++++++++++++
>>  examples/l3fwd/l3fwd_neon.h          | 259 +++++++++++++++++++++++++++++++++++
>>  7 files changed, 539 insertions(+), 6 deletions(-)
>>  create mode 100644 examples/l3fwd/l3fwd_em_hlm_neon.h
>>  create mode 100644 examples/l3fwd/l3fwd_lpm_neon.h
>>  create mode 100644 examples/l3fwd/l3fwd_neon.h
>>
>> [...]
>> diff --git a/examples/l3fwd/l3fwd_em_hlm.h
>> b/examples/l3fwd/l3fwd_em_hlm.h
>> index 636dea4..4ec600a 100644
>> --- a/examples/l3fwd/l3fwd_em_hlm.h
>> +++ b/examples/l3fwd/l3fwd_em_hlm.h
>> @@ -35,8 +35,13 @@
>>  #ifndef __L3FWD_EM_HLM_H__
>>  #define __L3FWD_EM_HLM_H__
>>
>> +#if defined(__SSE4_1__)
>>  #include "l3fwd_sse.h"
>>  #include "l3fwd_em_hlm_sse.h"
>> +#elif defined(RTE_MACHINE_CPUFLAG_NEON)
>> +#include "l3fwd_neon.h"
>> +#include "l3fwd_em_hlm_neon.h"
>> +#endif
>>
>>  static inline __attribute__((always_inline)) void
>>  em_get_dst_port_ipv4x8(struct lcore_conf *qconf, struct rte_mbuf *m[8],
>> @@ -238,7 +243,7 @@ static inline __attribute__((always_inline)) uint16_t
>>  l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>>                        uint8_t portid, struct lcore_conf *qconf)
>>  {
>> -        int32_t j;
>> +        int32_t i, j, pos;
>>          uint16_t dst_port[MAX_PKT_BURST];
>>
>>          /*
>> @@ -247,6 +252,12 @@ static inline __attribute__((always_inline)) uint16_t
>>           */
>>          int32_t n = RTE_ALIGN_FLOOR(nb_rx, 8);
>>
>> +        for (j = 0; j < 8 && j < nb_rx; j++) {
>> +                rte_prefetch0(pkts_burst[j]);
>
> The above prefetch of the rte_mbuf struct is unnecessary. With it we won't
> see any performance improvement, as the contents of the rte_mbuf (buf_addr
> and data_off) are used in the very next instruction. Removing this
> prefetch and the similar prefetches in several other places improved
> performance on my ThunderX setup.

Yes, I will remove them.

>
>> +                rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
>> +                                struct ether_hdr *) + 1);
>
> Better to prefetch eth_hdr itself and not eth_hdr + 1, since eth_header
> is accessed in process_packet in l3fwd_neon.h.
>

But the IP headers are used right away in each 8/FWDSTEP loop. Since the IP
headers are accessed first, we should prefetch eth_hdr + 1 first. Only after
all nb_rx packets have been handled in the small loop above are their
ethernet headers accessed again, in processx4_step3.
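The prefetch-ahead scheme under discussion can be sketched in stand-alone C. This is a minimal model, not the patch itself: `__builtin_prefetch` stands in for `rte_prefetch0`, and `struct pkt` / `process_group` are hypothetical stand-ins for the DPDK mbuf and the per-group forwarding work.

```c
#include <assert.h>
#include <stdint.h>

#define STEP 8   /* packets handled per iteration, as in the patch */

/* Hypothetical stand-in for struct rte_mbuf: just a header buffer. */
struct pkt {
        uint8_t hdr[64];
};

/* Hypothetical stand-in for the per-group forwarding work: the headers
 * of packets [at, at + STEP) are read here. */
static int process_group(struct pkt **burst, int at, int nb_rx)
{
        int done = 0;
        for (int j = at; j < at + STEP && j < nb_rx; j++) {
                (void)burst[j]->hdr[0];  /* header actually read here */
                done++;
        }
        return done;
}

/* Software-pipelined burst loop: while group j is processed, the headers
 * of group j + STEP are prefetched. */
int burst_loop(struct pkt **burst, int nb_rx)
{
        int i, j, pos, done = 0;

        /* Warm-up: prefetch the headers of the first group. */
        for (j = 0; j < STEP && j < nb_rx; j++)
                __builtin_prefetch(burst[j]->hdr);

        for (j = 0; j < nb_rx; j += STEP) {
                /* Prefetch the next group while this one is processed. */
                for (i = 0, pos = j + STEP; i < STEP && pos < nb_rx; i++, pos++)
                        __builtin_prefetch(burst[pos]->hdr);
                done += process_group(burst, j, nb_rx);
        }
        return done;
}
```

Each iteration works on a group whose headers were prefetched on the previous iteration, which is why the address read first by the group processing (in the patch, eth_hdr + 1, i.e. the IPv4 header) is the one worth prefetching.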
I'm not sure prefetching eth_hdr would still work if we prefetched it in the
first step, as the cache may already be filled with new data by that time.

>> +        }
>> +
>>          for (j = 0; j < n; j += 8) {
>>
>>                  uint32_t pkt_type =
>> @@ -263,6 +274,12 @@ static inline __attribute__((always_inline)) uint16_t
>>                  uint32_t tcp_or_udp = pkt_type &
>>                          (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
>>
>> +                for (i = 0, pos = j + 8; i < 8 && pos < nb_rx; i++, pos++) {
>> +                        rte_prefetch0(pkts_burst[pos]);
>
> The above prefetch of the rte_mbuf struct is unnecessary.
>
>> +                        rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos],
>> +                                        struct ether_hdr *) + 1);
>
> Better to prefetch eth_hdr itself and not eth_hdr + 1.
>
>> +                }
>> +
>>                  if (tcp_or_udp && (l3_type == RTE_PTYPE_L3_IPV4)) {
>>
>>                          em_get_dst_port_ipv4x8(qconf, &pkts_burst[j], portid,
>>
>> [...]
>
>> diff --git a/examples/l3fwd/l3fwd_lpm_neon.h
>> b/examples/l3fwd/l3fwd_lpm_neon.h
>> new file mode 100644
>> index 0000000..2f047b3
>> --- /dev/null
>> +++ b/examples/l3fwd/l3fwd_lpm_neon.h
>>
>> [...]
>
>> +/*
>> + * Buffer optimized handling of packets, invoked
>> + * from main_loop.
>> + */
>> +static inline void
>> +l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>> +                        uint8_t portid, struct lcore_conf *qconf)
>> +{
>> +        int32_t i, j, pos;
>> +        uint16_t dst_port[MAX_PKT_BURST];
>> +        int32x4_t dip[MAX_PKT_BURST / FWDSTEP];
>
> If you look carefully, we don't need an array of dip; we just need a
> single element. The dip value is calculated in processx4_step1 and
> consumed in processx4_step2, and that's it. There is no need to save it
> in an array.
>

Will change, thanks!

>> +        uint32_t ipv4_flag[MAX_PKT_BURST / FWDSTEP];
>
> Same as dip. We don't need an array of ipv4_flag.
>
>> +        const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
>> +
>> +        for (j = 0; j < FWDSTEP && j < nb_rx; j++) {
>> +                rte_prefetch0(pkts_burst[j]);
>
> The above prefetch of the rte_mbuf struct is unnecessary.
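The burst-splitting arithmetic being built here — process k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP) packets four at a time, then classify the remaining 1-3 one by one — can be modelled in stand-alone C. `ALIGN_FLOOR` and `split_burst` below are hypothetical stand-ins, not part of l3fwd.

```c
#include <assert.h>
#include <stdint.h>

#define FWDSTEP 4
/* Same arithmetic as RTE_ALIGN_FLOOR for a power-of-two alignment. */
#define ALIGN_FLOOR(v, a) ((v) & ~((a) - 1))

/* Count how many packets of a burst take the 4-at-a-time path and how
 * many fall into the one-by-one tail. */
void split_burst(int nb_rx, int *vec, int *tail)
{
        int j;
        const int k = ALIGN_FLOOR(nb_rx, FWDSTEP);

        *vec = 0;
        for (j = 0; j != k; j += FWDSTEP)
                *vec += FWDSTEP;        /* processx4_step1/step2 go here */

        *tail = 0;
        switch (nb_rx % FWDSTEP) {      /* classify last up to 3 one by one */
        case 3:
                (*tail)++;
                /* fallthrough */
        case 2:
                (*tail)++;
                /* fallthrough */
        case 1:
                (*tail)++;
        }
}
```

Every packet lands in exactly one of the two paths: the fallthrough switch executes `nb_rx % FWDSTEP` increments, so `*vec + *tail == nb_rx` for any burst size.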
>
>> +                rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
>> +                                struct ether_hdr *) + 1);
>
> Better to prefetch eth_hdr itself and not eth_hdr + 1.
>
>> +        }
>> +
>> +        for (j = 0; j != k; j += FWDSTEP) {
>> +                for (i = 0, pos = j + FWDSTEP; i < FWDSTEP && pos < nb_rx;
>> +                                i++, pos++) {
>> +                        rte_prefetch0(pkts_burst[pos]);
>
> The above prefetch of the rte_mbuf struct is unnecessary.
>
>> +                        rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos],
>> +                                        struct ether_hdr *) + 1);
>
> Better to prefetch eth_hdr itself and not eth_hdr + 1.
>
>> +                }
>> +                processx4_step1(&pkts_burst[j], &dip[j / FWDSTEP],
>> +                                &ipv4_flag[j / FWDSTEP]);
>> +
>> +                processx4_step2(qconf, dip[j / FWDSTEP],
>> +                                ipv4_flag[j / FWDSTEP], portid, &pkts_burst[j],
>> +                                &dst_port[j]);
>> +        }
>> +
>> +        /* Classify last up to 3 packets one by one */
>> +        switch (nb_rx % FWDSTEP) {
>> +        case 3:
>> +                dst_port[j] = lpm_get_dst_port(qconf, pkts_burst[j], portid);
>> +                j++;
>> +                /* fallthrough */
>> +        case 2:
>> +                dst_port[j] = lpm_get_dst_port(qconf, pkts_burst[j], portid);
>> +                j++;
>> +                /* fallthrough */
>> +        case 1:
>> +                dst_port[j] = lpm_get_dst_port(qconf, pkts_burst[j], portid);
>> +                j++;
>> +        }
>> +
>> +        send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
>> +}
>> +
>> +#endif /* __L3FWD_LPM_NEON_H__ */
>> diff --git a/examples/l3fwd/l3fwd_neon.h
>> b/examples/l3fwd/l3fwd_neon.h
>> new file mode 100644
>> index 0000000..75c8976
>> --- /dev/null
>> +++ b/examples/l3fwd/l3fwd_neon.h
>>
>> [...]
>
>> +/**
>> + * Process one packet:
>> + * Update source and destination MAC addresses in the ethernet header.
>> + * Perform RFC1812 checks and updates for IPV4 packets.
>> + */
>> +static inline void
>> +process_packet(struct rte_mbuf *pkt, uint16_t *dst_port)
>> +{
>> +        struct ether_hdr *eth_hdr;
>> +        uint32x4_t te, ve;
>> +
>> +        eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
>
> eth_hdr is accessed here, hence the earlier comments about prefetching
> the eth header.
>

process_packet is called only for the last 1-3 packets; most packets are
handled in processx4_step3. As these two functions access the packets again
starting from the first one, the prefetch may not help. Please see my
explanation above.

>> +
>> +        te = vld1q_u32((uint32_t *)eth_hdr);
>> +        ve = vreinterpretq_u32_s32(val_eth[dst_port[0]]);
>> +
>> +        rfc1812_process((struct ipv4_hdr *)(eth_hdr + 1), dst_port,
>> +                        pkt->packet_type);
>> +
>> +        ve = vsetq_lane_u32(vgetq_lane_u32(te, 3), ve, 3);
>
> Use vcopyq_laneq_u32 to do the above more easily.
>

Will change. Thanks!

>> +        vst1q_u32((uint32_t *)eth_hdr, ve);
>> +}
>> +
>> [...]
>> +#endif /* _L3FWD_NEON_H_ */
>
> Combining all the above comments, I made some changes on top of your
> patch. These changes are giving a 3-4% improvement over your version.
>
> You may find the changes at
> https://gist.github.com/ashwinyes/34cbdd999784402c859c71613587fafc
>

Is it correct that at line 103/104 you process only one packet out of the
last FWDSTEP packets?
Actually, I don't like your change in l3fwd_lpm_send_packets, as it makes
the simple logic complicated, and I don't think it can improve performance.
:-)

> Please check it out and let me know your comments.
>
> Thanks
> Ashwin
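For readers without an ARM toolchain at hand: vcopyq_laneq_u32(ve, 3, te, 3) does in one intrinsic what the vsetq_lane_u32(vgetq_lane_u32(te, 3), ve, 3) pair does, i.e. copy lane 3 of te into lane 3 of ve. Since the NEON types only build on ARM, the lane semantics are modelled below with plain C (the names u32x4, get_lane, set_lane, and copy_lane are invented for this sketch).

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 128-bit NEON register as four 32-bit lanes. */
typedef struct {
        uint32_t lane[4];
} u32x4;

/* Models of vgetq_lane_u32 and vsetq_lane_u32. */
static uint32_t get_lane(u32x4 v, int n) { return v.lane[n]; }
static u32x4 set_lane(uint32_t x, u32x4 v, int n) { v.lane[n] = x; return v; }

/* Model of vcopyq_laneq_u32(dst, dn, src, sn): copy one lane from src
 * into dst in a single operation. */
static u32x4 copy_lane(u32x4 dst, int dn, u32x4 src, int sn)
{
        dst.lane[dn] = src.lane[sn];
        return dst;
}
```

In process_packet, lane 3 of te holds the last 4 bytes of the 16-byte load from eth_hdr (the ethertype plus the first two bytes of the IPv4 header), which must be preserved while the MAC addresses in the lower lanes are replaced from val_eth.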