DPDK patches and discussions
 help / color / mirror / Atom feed
From: "Sekhar, Ashwin" <Ashwin.Sekhar@cavium.com>
To: "Sekhar, Ashwin" <Ashwin.Sekhar@cavium.com>,
	"jianbo.liu@linaro.org" <jianbo.liu@linaro.org>
Cc: "Jacob,  Jerin" <Jerin.JacobKollanukkaran@cavium.com>,
	"tomasz.kantecki@intel.com" <tomasz.kantecki@intel.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH v2 5/7] examples/l3fwd: add neon support for l3fwd
Date: Thu, 11 May 2017 04:14:13 +0000	[thread overview]
Message-ID: <1494476052.2563.10.camel@caviumnetworks.com> (raw)
In-Reply-To: <CAP4Qi39cBj+kPtPjs_QOwbvixMrEExUjUeq2FWUasiCWm20O8w@mail.gmail.com>

On Thu, 2017-05-11 at 11:16 +0800, Jianbo Liu wrote:
> Hi Ashwin,
> 
> On 10 May 2017 at 23:00, Sekhar, Ashwin <Ashwin.Sekhar@cavium.com>
> wrote:
> > 
> > Hi Jianbo,
> > 
> > Thanks for version v2. Addition of the prefetch instructions is
> > definitely helping performance on ThunderX. But still performance
> > is
> > slightly less than that of scalar.
> > 
> > I tried few small tweaks which helped improve performance on my
> > Thunderx setup. For details see comments inline.
> > 
> > 
> > On Wed, 2017-05-10 at 10:30 +0800, Jianbo Liu wrote:
> > > 
> > > Use ARM NEON intrinsics to accelerate l3 fowarding.
> > > 
> > > Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> > > ---
> > >  examples/l3fwd/l3fwd_em.c            |   4 +-
> > >  examples/l3fwd/l3fwd_em_hlm.h        |  19 ++-
> > >  examples/l3fwd/l3fwd_em_hlm_neon.h   |  74 ++++++++++
> > >  examples/l3fwd/l3fwd_em_sequential.h |  20 ++-
> > >  examples/l3fwd/l3fwd_lpm.c           |   4 +-
> > >  examples/l3fwd/l3fwd_lpm_neon.h      | 165
> > > ++++++++++++++++++++++
> > >  examples/l3fwd/l3fwd_neon.h          | 259
> > > +++++++++++++++++++++++++++++++++++
> > >  7 files changed, 539 insertions(+), 6 deletions(-)
> > >  create mode 100644 examples/l3fwd/l3fwd_em_hlm_neon.h
> > >  create mode 100644 examples/l3fwd/l3fwd_lpm_neon.h
> > >  create mode 100644 examples/l3fwd/l3fwd_neon.h
> > > 
> > > [...]
> > > diff --git a/examples/l3fwd/l3fwd_em_hlm.h
> > > b/examples/l3fwd/l3fwd_em_hlm.h
> > > index 636dea4..4ec600a 100644
> > > --- a/examples/l3fwd/l3fwd_em_hlm.h
> > > +++ b/examples/l3fwd/l3fwd_em_hlm.h
> > > @@ -35,8 +35,13 @@
> > >  #ifndef __L3FWD_EM_HLM_H__
> > >  #define __L3FWD_EM_HLM_H__
> > > 
> > > +#if defined(__SSE4_1__)
> > >  #include "l3fwd_sse.h"
> > >  #include "l3fwd_em_hlm_sse.h"
> > > +#elif defined(RTE_MACHINE_CPUFLAG_NEON)
> > > +#include "l3fwd_neon.h"
> > > +#include "l3fwd_em_hlm_neon.h"
> > > +#endif
> > > 
> > >  static inline __attribute__((always_inline)) void
> > >  em_get_dst_port_ipv4x8(struct lcore_conf *qconf, struct rte_mbuf
> > > *m[8],
> > > @@ -238,7 +243,7 @@ static inline __attribute__((always_inline))
> > > uint16_t
> > >  l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> > >               uint8_t portid, struct lcore_conf *qconf)
> > >  {
> > > -     int32_t j;
> > > +     int32_t i, j, pos;
> > >       uint16_t dst_port[MAX_PKT_BURST];
> > > 
> > >       /*
> > > @@ -247,6 +252,12 @@ static inline __attribute__((always_inline))
> > > uint16_t
> > >        */
> > >       int32_t n = RTE_ALIGN_FLOOR(nb_rx, 8);
> > > 
> > > +     for (j = 0; j < 8 && j < nb_rx; j++) {
> > > +             rte_prefetch0(pkts_burst[j]);
> > The above prefetch of rte_mbuf struct is unnecessary. With this we
> > wont
> > see any performance improvement as the contents of rte_mbuf
> > (buf_addr
> > and data_off) is used in right next instruction. Removing the above
> > prefetch and similar prefetches at multiple places was improving
> > performance on my ThunderX setup.
> Yes, will remove them.
> 
> > 
> > 
> > > 
> > > +             rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> > > +                                            struct ether_hdr *)
> > > +
> > > 1);
> > Better to prefetch at eth_hdr itself and not at eth_hdr + 1. In
> > process_packet in l3fwd_neon.h, eth_header is accessed in
> > 
> But ip headers are used right in each 8/FWDSTEP loop.
> Since ip headers are accessed first, we should prefetch eth_hdr + 1
> first.
> After all nb_rx packets are handled in above small loop, their
> eth_header are then accessed in processx4_step3 over again.
> I'm not sure prefretching eth_hdr still works if we prefetch eth_hdr
> in first step,  as cache may be already filled with new data at that
> time.
> 
Okay. 
Also, I guess if the ethernet header and ip header falls in the same
cache line (which I think would be the case mostly as I hope the packet
data will be cache aligned), it doesn't make much of a  difference
whether you prefetch at ethernet header address or ip header address.
> > 
> > > 
> > > +     }
> > > +
> > >       for (j = 0; j < n; j += 8) {
> > > 
> > >               uint32_t pkt_type =
> > > @@ -263,6 +274,12 @@ static inline __attribute__((always_inline))
> > > uint16_t
> > >               uint32_t tcp_or_udp = pkt_type &
> > >                       (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
> > > 
> > > +             for (i = 0, pos = j + 8; i < 8 && pos < nb_rx; i++,
> > > pos++) {
> > > +                     rte_prefetch0(pkts_burst[pos]);
> > The above prefetch of rte_mbuf struct is unnecessary.
> > 
> > > 
> > > +                     rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[p
> > > o
> > > s],
> > > +                                                    struct
> > > ether_hdr *) + 1);
> > Better to prefetch at eth_hdr itself and not at eth_hdr + 1
> > 
> > > 
> > > +             }
> > > +
> > >               if (tcp_or_udp && (l3_type == RTE_PTYPE_L3_IPV4)) {
> > > 
> > >                       em_get_dst_port_ipv4x8(qconf,
> > > &pkts_burst[j], portid,
> > > 
> > > [...]
> ....
> 
> > 
> > > 
> > > diff --git a/examples/l3fwd/l3fwd_lpm_neon.h
> > > b/examples/l3fwd/l3fwd_lpm_neon.h
> > > new file mode 100644
> > > index 0000000..2f047b3
> > > --- /dev/null
> > > +++ b/examples/l3fwd/l3fwd_lpm_neon.h
> > > 
> > > [...]
> > > 
> > > +/*
> > > + * Buffer optimized handling of packets, invoked
> > > + * from main_loop.
> > > + */
> > > +static inline void
> > > +l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> > > +                     uint8_t portid, struct lcore_conf *qconf)
> > > +{
> > > +     int32_t i, j, pos;
> > > +     uint16_t dst_port[MAX_PKT_BURST];
> > > +     int32x4_t dip[MAX_PKT_BURST / FWDSTEP];
> > If you see carefully, we dont need an array of dip. We just need a
> > single element. dip value is calculated in processx4_step1 and
> > consumed
> > in processx4_step2, and thats it. No need to save it in an array.
> > 
> Will change, thanks!
> 
> > 
> > > 
> > > +     uint32_t ipv4_flag[MAX_PKT_BURST / FWDSTEP];
> > Same as dip. We dont need an array of ipv4_flag.
> > 
> > > 
> > > +     const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
> > > +
> > > +     for (j = 0; j < FWDSTEP && j < nb_rx; j++) {
> > > +             rte_prefetch0(pkts_burst[j]);
> > The above prefetch of rte_mbuf struct is unnecessary.
> > 
> > > 
> > > +             rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> > > +                                            struct ether_hdr *)
> > > +
> > > 1);
> > Better to prefetch at eth_hdr itself and not at eth_hdr + 1
> > 
> > > 
> > > +     }
> > > +
> > > +     for (j = 0; j != k; j += FWDSTEP) {
> > > +             for (i = 0, pos = j + FWDSTEP; i < FWDSTEP && pos <
> > > nb_rx;
> > > +                  i++, pos++) {
> > > +                     rte_prefetch0(pkts_burst[pos]);
> > The above prefetch of rte_mbuf struct is unnecessary.
> > 
> > > 
> > > +                     rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[p
> > > o
> > > s],
> > > +                                                    struct
> > > ether_hdr *) + 1);
> > Better to prefetch at eth_hdr itself and not at eth_hdr + 1
> > 
> > > 
> > > +             }
> > > +             processx4_step1(&pkts_burst[j], &dip[j / FWDSTEP],
> > > +                             &ipv4_flag[j / FWDSTEP]);
> > > +
> > > +             processx4_step2(qconf, dip[j / FWDSTEP],
> > > +                             ipv4_flag[j / FWDSTEP], portid,
> > > &pkts_burst[j],
> > > +                             &dst_port[j]);
> > > +     }
> > > +
> > > +     /* Classify last up to 3 packets one by one */
> > > +     switch (nb_rx % FWDSTEP) {
> > > +     case 3:
> > > +             dst_port[j] = lpm_get_dst_port(qconf,
> > > pkts_burst[j],
> > > portid);
> > > +             j++;
> > > +             /* fallthrough */
> > > +     case 2:
> > > +             dst_port[j] = lpm_get_dst_port(qconf,
> > > pkts_burst[j],
> > > portid);
> > > +             j++;
> > > +             /* fallthrough */
> > > +     case 1:
> > > +             dst_port[j] = lpm_get_dst_port(qconf,
> > > pkts_burst[j],
> > > portid);
> > > +             j++;
> > > +     }
> > > +
> > > +     send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
> > > +}
> > > +
> > > +#endif /* __L3FWD_LPM_NEON_H__ */
> > > diff --git a/examples/l3fwd/l3fwd_neon.h
> > > b/examples/l3fwd/l3fwd_neon.h
> > > new file mode 100644
> > > index 0000000..75c8976
> > > --- /dev/null
> > > +++ b/examples/l3fwd/l3fwd_neon.h
> > > [...]
> > > 
> > > +
> > > +/**
> > > + * Process one packet:
> > > + * Update source and destination MAC addresses in the ethernet
> > > header.
> > > + * Perform RFC1812 checks and updates for IPV4 packets.
> > > + */
> > > +static inline void
> > > +process_packet(struct rte_mbuf *pkt, uint16_t *dst_port)
> > > +{
> > > +     struct ether_hdr *eth_hdr;
> > > +     uint32x4_t te, ve;
> > > +
> > > +     eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
> > eth_hdr accessed here. Hence the earlier comments about prefetching
> > at
> > eth header.
> > 
> process_packet is called only for the last 1-3 packets, most are
> handled in processx4_step3.
> As these 2 functions access packets from the first one once again,
> the
> prefetch may not work.
> Please see my explanation in the above...
> 
Okay.
> > 
> > > 
> > > +
> > > +     te = vld1q_u32((uint32_t *)eth_hdr);
> > > +     ve = vreinterpretq_u32_s32(val_eth[dst_port[0]]);
> > > +
> > > +
> > > +     rfc1812_process((struct ipv4_hdr *)(eth_hdr + 1), dst_port,
> > > +                     pkt->packet_type);
> > > +
> > > +     ve = vsetq_lane_u32(vgetq_lane_u32(te, 3), ve, 3);
> > Use vcopyq_laneq_u32 for easily doing the above.
> > 
> Will change. Thanks!
> 
> > 
> > > 
> > > +     vst1q_u32((uint32_t *)eth_hdr, ve);
> > > +}
> > > +
> > > [...]
> > > +#endif /* _L3FWD_NEON_H_ */
> > Combining all the above comments, I made some changes on top of
> > your
> > patch. These changes are giving 3-4% improvement over your version.
> > 
> > You may find the changes at
> > https://gist.github.com/ashwinyes/34cbdd999784402c859c71613587fafc
> > 
> Is the correct in Line 103/104, you only process one packets in the
> last FWDSTEP packets?
Its doing processx4_* there. So its processing 4 packets.

> Actually, I don't like your change in l3fwd_lpm_send_packets, making
> the simple logic complicated. And I don't think it can help to
> improve
> performance. :-)
Its not making it complicated. The number of lines of code may be
higher by may be 10 lines, but the conditions of the loops are
simplified which reduces the number of branch instructions and helps
the processor to go through them faster.

If possible, please try it out on your machine.
> 
> > 
> > Please check it out and let me know your comments.
> > 
> > Thanks
> > Ashwin

  reply	other threads:[~2017-05-11  4:14 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-02  7:14 [dpdk-dev] [PATCH 1/5] examples/l3fwd: extract arch independent code from multi hash lookup Jianbo Liu
2017-05-02  7:14 ` [dpdk-dev] [PATCH 2/5] examples/l3fwd: rename l3fwd_em_sse.h to l3fwd_em_single.h Jianbo Liu
2017-05-02  9:40   ` Sekhar, Ashwin
2017-05-02  7:14 ` [dpdk-dev] [PATCH 3/5] examples/l3fwd: extract common code from multi packet send Jianbo Liu
2017-05-02  7:14 ` [dpdk-dev] [PATCH 4/5] examples/l3fwd: rearrange the code for lpm_l3fwd Jianbo Liu
2017-05-02  7:14 ` [dpdk-dev] [PATCH 5/5] examples/l3fwd: add neon support for l3fwd Jianbo Liu
2017-05-02 11:20   ` Sekhar, Ashwin
2017-05-02 11:47   ` Sekhar, Ashwin
2017-05-03  5:24     ` Jianbo Liu
2017-05-04  8:42       ` Jianbo Liu
2017-05-05  4:24         ` Sekhar, Ashwin
2017-05-05  5:43           ` Jianbo Liu
2017-05-09  8:10             ` Sekhar, Ashwin
2017-05-10  2:39               ` Jianbo Liu
2017-05-10  2:30 ` [dpdk-dev] [PATCH v2 0/7] accelerate examples/l3fwd with NEON on ARM64 platform Jianbo Liu
2017-05-10  2:30   ` [dpdk-dev] [PATCH v2 1/7] examples/l3fwd: extract arch independent code from multi hash lookup Jianbo Liu
2017-05-10  2:30   ` [dpdk-dev] [PATCH v2 2/7] examples/l3fwd: rename l3fwd_em_sse.h to l3fwd_em_sequential.h Jianbo Liu
2017-05-10  2:30   ` [dpdk-dev] [PATCH v2 3/7] examples/l3fwd: extract common code from multi packet send Jianbo Liu
2017-05-10  2:30   ` [dpdk-dev] [PATCH v2 4/7] examples/l3fwd: rearrange the code for lpm_l3fwd Jianbo Liu
2017-05-10  2:30   ` [dpdk-dev] [PATCH v2 5/7] examples/l3fwd: add neon support for l3fwd Jianbo Liu
2017-05-10 15:00     ` Sekhar, Ashwin
2017-05-11  3:16       ` Jianbo Liu
2017-05-11  4:14         ` Sekhar, Ashwin [this message]
2017-05-11  4:27           ` Sekhar, Ashwin
2017-05-11  6:11             ` Jianbo Liu
2017-05-10  2:30   ` [dpdk-dev] [PATCH v2 6/7] examples/l3fwd: add the times of hash multi-lookup for different Archs Jianbo Liu
2017-05-10  2:30   ` [dpdk-dev] [PATCH v2 7/7] examples/l3fwd: change the guard micro name for header file Jianbo Liu
2017-05-10 11:57     ` Sekhar, Ashwin
2017-05-11  9:25 ` [dpdk-dev] [PATCH v3 0/7] accelerate examples/l3fwd with NEON on ARM64 platform Jianbo Liu
2017-05-11  9:25   ` [dpdk-dev] [PATCH v3 1/7] examples/l3fwd: extract arch independent code from multi hash lookup Jianbo Liu
2017-05-11  9:25   ` [dpdk-dev] [PATCH v3 2/7] examples/l3fwd: rename l3fwd_em_sse.h to l3fwd_em_sequential.h Jianbo Liu
2017-05-11  9:25   ` [dpdk-dev] [PATCH v3 3/7] examples/l3fwd: extract common code from multi packet send Jianbo Liu
2017-05-11  9:25   ` [dpdk-dev] [PATCH v3 4/7] examples/l3fwd: rearrange the code for lpm_l3fwd Jianbo Liu
2017-05-11  9:25   ` [dpdk-dev] [PATCH v3 5/7] examples/l3fwd: add neon support for l3fwd Jianbo Liu
2017-05-11  9:49     ` Sekhar, Ashwin
2017-05-11 10:01       ` Jianbo Liu
2017-05-11 10:27         ` Sekhar, Ashwin
2017-05-12  2:40           ` Jianbo Liu
2017-05-11  9:25   ` [dpdk-dev] [PATCH v3 6/7] examples/l3fwd: add the times of hash multi-lookup for different Archs Jianbo Liu
2017-05-11  9:25   ` [dpdk-dev] [PATCH v3 7/7] examples/l3fwd: change the guard macro name for header file Jianbo Liu
2017-05-15  3:34 ` [dpdk-dev] [PATCH v4 0/8] accelerate examples/l3fwd with NEON on ARM64 platform Jianbo Liu
2017-05-15  3:34   ` [dpdk-dev] [PATCH v4 1/8] examples/l3fwd: extract arch independent code from multi hash lookup Jianbo Liu
2017-05-15  3:34   ` [dpdk-dev] [PATCH v4 2/8] examples/l3fwd: rename l3fwd_em_sse.h to l3fwd_em_sequential.h Jianbo Liu
2017-05-15  3:34   ` [dpdk-dev] [PATCH v4 3/8] examples/l3fwd: extract common code from multi packet send Jianbo Liu
2017-05-15  3:34   ` [dpdk-dev] [PATCH v4 4/8] examples/l3fwd: rearrange the code for lpm_l3fwd Jianbo Liu
2017-05-15  3:34   ` [dpdk-dev] [PATCH v4 5/8] arch/arm: add vcopyq_laneq_u32 for old version of gcc Jianbo Liu
2017-05-15  4:01     ` Jerin Jacob
2017-05-15  3:34   ` [dpdk-dev] [PATCH v4 6/8] examples/l3fwd: add neon support for l3fwd Jianbo Liu
2017-05-15  5:22     ` Sekhar, Ashwin
2017-05-15  3:34   ` [dpdk-dev] [PATCH v4 7/8] examples/l3fwd: add the times of hash multi-lookup for different Archs Jianbo Liu
2017-05-15  3:34   ` [dpdk-dev] [PATCH v4 8/8] examples/l3fwd: change the guard macro name for header file Jianbo Liu
2017-07-03 21:02   ` [dpdk-dev] [PATCH v4 0/8] accelerate examples/l3fwd with NEON on ARM64 platform Thomas Monjalon
2017-07-04 10:23 ` [dpdk-dev] [PATCH v5 " Jianbo Liu
2017-07-04 10:23   ` [dpdk-dev] [PATCH v5 1/8] examples/l3fwd: extract arch independent code from multi hash lookup Jianbo Liu
2017-07-04 10:23   ` [dpdk-dev] [PATCH v5 2/8] examples/l3fwd: rename l3fwd_em_sse.h to l3fwd_em_sequential.h Jianbo Liu
2017-07-04 10:24   ` [dpdk-dev] [PATCH v5 3/8] examples/l3fwd: extract common code from multi packet send Jianbo Liu
2017-07-04 10:24   ` [dpdk-dev] [PATCH v5 4/8] examples/l3fwd: rearrange the code for lpm_l3fwd Jianbo Liu
2017-07-04 10:24   ` [dpdk-dev] [PATCH v5 5/8] arch/arm: add vcopyq_laneq_u32 for old version of gcc Jianbo Liu
2017-07-04 10:24   ` [dpdk-dev] [PATCH v5 6/8] examples/l3fwd: add neon support for l3fwd Jianbo Liu
2017-07-04 10:24   ` [dpdk-dev] [PATCH v5 7/8] examples/l3fwd: add the times of hash multi-lookup for different Archs Jianbo Liu
2017-07-04 10:24   ` [dpdk-dev] [PATCH v5 8/8] examples/l3fwd: change the guard macro name for header file Jianbo Liu
2017-07-04 15:11   ` [dpdk-dev] [PATCH v5 0/8] accelerate examples/l3fwd with NEON on ARM64 platform Thomas Monjalon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1494476052.2563.10.camel@caviumnetworks.com \
    --to=ashwin.sekhar@cavium.com \
    --cc=Jerin.JacobKollanukkaran@cavium.com \
    --cc=dev@dpdk.org \
    --cc=jianbo.liu@linaro.org \
    --cc=tomasz.kantecki@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).