From: "Zhang, Qi Z"
To: Jianbo Liu
CC: "Zhang, Helin", "Wu, Jingjing", jerin.jacob@caviumnetworks.com, dev@dpdk.org
Date: Wed, 12 Oct 2016 02:46:22 +0000
Message-ID: <039ED4275CED7440929022BC67E7061150659198@SHSMSX103.ccr.corp.intel.com>
References: <1472032425-16136-1-git-send-email-jianbo.liu@linaro.org> <1472032425-16136-3-git-send-email-jianbo.liu@linaro.org>
In-Reply-To: <1472032425-16136-3-git-send-email-jianbo.liu@linaro.org>
Subject: Re: [dpdk-dev] [PATCH 2/5] i40e: implement vector PMD for ARM architecture

Hi Jianbo:

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jianbo Liu
> Sent: Wednesday, August 24, 2016 5:54 PM
> To: Zhang, Helin; Wu, Jingjing; jerin.jacob@caviumnetworks.com; dev@dpdk.org
> Cc: Jianbo Liu
> Subject: [dpdk-dev] [PATCH 2/5] i40e: implement vector PMD for ARM
> architecture
>
> Use ARM NEON intrinsics to implement the i40e vPMD.
>
> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> ---
>  drivers/net/i40e/Makefile             |   4 +
>  drivers/net/i40e/i40e_rxtx_vec_neon.c | 581 ++++++++++++++++++++++++++++++++++
>  2 files changed, 585 insertions(+)
>  create mode 100644 drivers/net/i40e/i40e_rxtx_vec_neon.c
>
> diff --git a/drivers/net/i40e/Makefile b/drivers/net/i40e/Makefile
> index 53fe145..9e92b38 100644
> --- a/drivers/net/i40e/Makefile
> +++ b/drivers/net/i40e/Makefile
> @@ -97,7 +97,11 @@ SRCS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += i40e_dcb.c
>
>  SRCS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += i40e_ethdev.c
>  SRCS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += i40e_rxtx.c
> +ifeq ($(CONFIG_RTE_ARCH_ARM64),y)
> +SRCS-$(CONFIG_RTE_LIBRTE_I40E_INC_VECTOR) += i40e_rxtx_vec_neon.c
> +else
>  SRCS-$(CONFIG_RTE_LIBRTE_I40E_INC_VECTOR) += i40e_rxtx_vec.c
> +endif
>  SRCS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += i40e_ethdev_vf.c
>  SRCS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += i40e_pf.c
>  SRCS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += i40e_fdir.c
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> new file mode 100644
> index 0000000..015fa9f
> --- /dev/null
> +++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> @@ -0,0 +1,581 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
> + *   Copyright(c) 2016, Linaro Limited
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <stdint.h>
> +#include <rte_ethdev.h>
> +#include <rte_malloc.h>
> +
> +#include "base/i40e_prototype.h"
> +#include "base/i40e_type.h"
> +#include "i40e_ethdev.h"
> +#include "i40e_rxtx.h"
> +#include "i40e_rxtx_vec_common.h"
> +
> +#include <arm_neon.h>
> +
> +#pragma GCC diagnostic ignored "-Wcast-qual"
> +
> +static inline void
> +i40e_rxq_rearm(struct i40e_rx_queue *rxq)
> +{
> +	int i;
> +	uint16_t rx_id;
> +	volatile union i40e_rx_desc *rxdp;
> +	struct i40e_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +	struct rte_mbuf *mb0, *mb1;
> +	uint64x2_t dma_addr0, dma_addr1;
> +	uint64x2_t zero = vdupq_n_u64(0);
> +	uint64_t paddr;
> +	uint8x8_t p;
> +
> +	rxdp = rxq->rx_ring + rxq->rxrearm_start;
> +
> +	/* Pull 'n' more MBUFs into the software ring */
> +	if (unlikely(rte_mempool_get_bulk(rxq->mp,
> +					  (void *)rxep,
> +					  RTE_I40E_RXQ_REARM_THRESH) < 0)) {
> +		if (rxq->rxrearm_nb + RTE_I40E_RXQ_REARM_THRESH >=
> +		    rxq->nb_rx_desc) {
> +			for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
> +				rxep[i].mbuf = &rxq->fake_mbuf;
> +				vst1q_u64((uint64_t *)&rxdp[i].read, zero);
> +			}
> +		}
> +		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
> +			RTE_I40E_RXQ_REARM_THRESH;
> +		return;
> +	}
> +
> +	p = vld1_u8((uint8_t *)&rxq->mbuf_initializer);
> +
> +	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
> +	for (i = 0; i < RTE_I40E_RXQ_REARM_THRESH; i += 2, rxep += 2) {
> +		mb0 = rxep[0].mbuf;
> +		mb1 = rxep[1].mbuf;
> +
> +		/* Flush mbuf with pkt template.
> +		 * Data to be rearmed is 6 bytes long.
> +		 * Though, RX will overwrite ol_flags that are coming next
> +		 * anyway. So overwrite whole 8 bytes with one load:
> +		 * 6 bytes of rearm_data plus first 2 bytes of ol_flags.
> +		 */
> +		vst1_u8((uint8_t *)&mb0->rearm_data, p);
> +		paddr = mb0->buf_physaddr + RTE_PKTMBUF_HEADROOM;
> +		dma_addr0 = vdupq_n_u64(paddr);
> +
> +		/* flush desc with pa dma_addr */
> +		vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
> +
> +		vst1_u8((uint8_t *)&mb1->rearm_data, p);
> +		paddr = mb1->buf_physaddr + RTE_PKTMBUF_HEADROOM;
> +		dma_addr1 = vdupq_n_u64(paddr);
> +		vst1q_u64((uint64_t *)&rxdp++->read, dma_addr1);
> +	}
> +
> +	rxq->rxrearm_start += RTE_I40E_RXQ_REARM_THRESH;
> +	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
> +		rxq->rxrearm_start = 0;
> +
> +	rxq->rxrearm_nb -= RTE_I40E_RXQ_REARM_THRESH;
> +
> +	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
> +			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
> +
> +	/* Update the tail pointer on the NIC */
> +	I40E_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
> +}
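
A side note for readers following the NEON port: vdupq_n_u64() broadcasts
the buffer address into both 64-bit lanes, so each vst1q_u64() above fills
pkt_addr and hdr_addr of the read descriptor in one store, and the single
8-byte store of `p` covers rearm_data plus the first two bytes of ol_flags.
A scalar sketch of what the pair of stores does per mbuf (illustration only;
rearm_one() is a hypothetical helper, not part of the patch):

	static inline void
	rearm_one(volatile union i40e_rx_desc *rxdp, struct rte_mbuf *mb,
		  uint64_t mbuf_init)
	{
		uint64_t paddr = mb->buf_physaddr + RTE_PKTMBUF_HEADROOM;

		/* 8-byte template: 6 bytes of rearm_data + 2 of ol_flags */
		*(uint64_t *)&mb->rearm_data = mbuf_init;

		/* the same address lands in both halves of the descriptor */
		rxdp->read.pkt_addr = rte_cpu_to_le_64(paddr);
		rxdp->read.hdr_addr = rte_cpu_to_le_64(paddr);
	}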
> +
> +/* Handling the offload flags (olflags) field takes computation
> + * time when receiving packets. Therefore we provide a flag to disable
> + * the processing of the olflags field when they are not needed. This
> + * gives improved performance, at the cost of losing the offload info
> + * in the received packet.
> + */
> +#ifdef RTE_LIBRTE_I40E_RX_OLFLAGS_ENABLE
> +
> +static inline void
> +desc_to_olflags_v(uint16x8_t staterr, struct rte_mbuf **rx_pkts)
> +{
> +	uint16x8_t vlan0, vlan1, rss;
> +	union {
> +		uint16_t e[4];
> +		uint64_t dword;
> +	} vol;
> +
> +	/* mask everything except RSS, flow director and VLAN flags:
> +	 * bit 2 is for VLAN tag, bit 11 for flow director indication,
> +	 * bits 13:12 for RSS indication.
> +	 */
> +	const uint16x8_t rss_vlan_msk = {
> +			0x3804, 0x3804, 0x3804, 0x3804,
> +			0x0000, 0x0000, 0x0000, 0x0000};
> +
> +	/* map rss and vlan type to rss hash and vlan flag */
> +	const uint8x16_t vlan_flags = {
> +			0, 0, 0, 0,
> +			PKT_RX_VLAN_PKT | PKT_RX_VLAN_STRIPPED, 0, 0, 0,
> +			0, 0, 0, 0,
> +			0, 0, 0, 0};
> +
> +	const uint8x16_t rss_flags = {
> +			0, PKT_RX_FDIR, 0, 0,
> +			0, 0, PKT_RX_RSS_HASH, PKT_RX_RSS_HASH | PKT_RX_FDIR,
> +			0, 0, 0, 0,
> +			0, 0, 0, 0};
> +
> +	vlan1 = vandq_u16(staterr, rss_vlan_msk);
> +	vlan0 = vreinterpretq_u16_u8(vqtbl1q_u8(vlan_flags,
> +						vreinterpretq_u8_u16(vlan1)));
> +
> +	rss = vshrq_n_u16(vlan1, 11);
> +	rss = vreinterpretq_u16_u8(vqtbl1q_u8(rss_flags,
> +					      vreinterpretq_u8_u16(rss)));
> +
> +	vlan0 = vorrq_u16(vlan0, rss);
> +	vol.dword = vgetq_lane_u64(vreinterpretq_u64_u16(vlan0), 0);
> +
> +	rx_pkts[0]->ol_flags = vol.e[0];
> +	rx_pkts[1]->ol_flags = vol.e[1];
> +	rx_pkts[2]->ol_flags = vol.e[2];
> +	rx_pkts[3]->ol_flags = vol.e[3];
> +}
> +#else
> +#define desc_to_olflags_v(staterr, rx_pkts) do {} while (0)
> +#endif
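
To make the two table lookups concrete: after masking with 0x3804, only
bit 2 (VLAN), bit 11 (FDIR) and bits 13:12 (RSS) survive in each 16-bit
lane. The low byte of the lane then indexes vlan_flags (bit 2 set gives
index 4), while the lane shifted right by 11 gives a 0..7 index into
rss_flags; vqtbl1q_u8() returns 0 for out-of-range indices, so all other
bytes map to 0. A scalar view of one lane, using the same flag tables
(illustration only; olflags_one_lane() is a hypothetical helper):

	static inline uint16_t
	olflags_one_lane(uint16_t staterr, const uint8_t vlan_tbl[16],
			 const uint8_t rss_tbl[16])
	{
		uint16_t masked = staterr & 0x3804; /* bits 2, 11, 13:12 */
		uint8_t lo = masked & 0xFF;         /* 0 or 4 (VLAN)     */
		uint8_t hi = masked >> 11;          /* 0..7 (FDIR/RSS)   */

		return (uint16_t)(vlan_tbl[lo] | rss_tbl[hi]);
	}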
> +
> +#define PKTLEN_SHIFT 10
> +
> +#define I40E_VPMD_DESC_DD_MASK	0x0001000100010001ULL
> +
> +/* Notice:
> + * - nb_pkts < RTE_I40E_DESCS_PER_LOOP, just return no packet
> + * - nb_pkts > RTE_I40E_VPMD_RX_BURST, only scan RTE_I40E_VPMD_RX_BURST
> + *   numbers of DD bits
> + */
> +static inline uint16_t
> +_recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
> +		   uint16_t nb_pkts, uint8_t *split_packet)
> +{
> +	volatile union i40e_rx_desc *rxdp;
> +	struct i40e_rx_entry *sw_ring;
> +	uint16_t nb_pkts_recd;
> +	int pos;
> +	uint64_t var;
> +
> +	/* mask to shuffle from desc. to mbuf */
> +	uint8x16_t shuf_msk = {
> +		0xFF, 0xFF,   /* pkt_type set as unknown */
> +		0xFF, 0xFF,   /* pkt_type set as unknown */
> +		14, 15,       /* octets 15~14, low 16 bits pkt_len */
> +		0xFF, 0xFF,   /* skip high 16 bits pkt_len, zero out */
> +		14, 15,       /* octets 15~14, 16 bits data_len */
> +		2, 3,         /* octets 2~3, low 16 bits vlan_macip */
> +		4, 5, 6, 7    /* octets 4~7, 32 bits rss */
> +		};
> +
> +	uint8x16_t eop_check = {
> +		0x02, 0x00, 0x02, 0x00,
> +		0x02, 0x00, 0x02, 0x00,
> +		0x00, 0x00, 0x00, 0x00,
> +		0x00, 0x00, 0x00, 0x00
> +		};
> +
> +	uint16x8_t crc_adjust = {
> +		0, 0,         /* ignore pkt_type field */
> +		rxq->crc_len, /* sub crc on pkt_len */
> +		0,            /* ignore high-16bits of pkt_len */
> +		rxq->crc_len, /* sub crc on data_len */
> +		0, 0, 0       /* ignore non-length fields */
> +		};
> +
> +	/* nb_pkts shall be no greater than RTE_I40E_MAX_RX_BURST */
> +	nb_pkts = RTE_MIN(nb_pkts, RTE_I40E_MAX_RX_BURST);
> +
> +	/* nb_pkts has to be floor-aligned to RTE_I40E_DESCS_PER_LOOP */
> +	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_I40E_DESCS_PER_LOOP);
> +
> +	/* Just the act of getting into the function from the application is
> +	 * going to cost about 7 cycles
> +	 */
> +	rxdp = rxq->rx_ring + rxq->rx_tail;
> +
> +	rte_prefetch_non_temporal(rxdp);
> +
> +	/* See if we need to rearm the RX queue - gives the prefetch a bit
> +	 * of time to act
> +	 */
> +	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
> +		i40e_rxq_rearm(rxq);
> +
> +	/* Before we start moving massive data around, check to see if
> +	 * there is actually a packet available
> +	 */
> +	if (!(rxdp->wb.qword1.status_error_len &
> +	      rte_cpu_to_le_32(1 << I40E_RX_DESC_STATUS_DD_SHIFT)))
> +		return 0;
> +
> +	/* Cache is empty -> need to scan the buffer rings, but first move
> +	 * the next 'n' mbufs into the cache
> +	 */
> +	sw_ring = &rxq->sw_ring[rxq->rx_tail];
> +
> +	/* A. load 4 packets in one loop
> +	 * [A*. mask out 4 unused dirty fields in desc]
> +	 * B. copy 4 mbuf pointers from sw_ring to rx_pkts
> +	 * C. calc the number of DD bits among the 4 packets
> +	 * [C*. extract the end-of-packet bit, if requested]
> +	 * D. fill info. from desc to mbuf
> +	 */
> +
> +	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
> +	     pos += RTE_I40E_DESCS_PER_LOOP,
> +	     rxdp += RTE_I40E_DESCS_PER_LOOP) {
> +		uint64x2_t descs[RTE_I40E_DESCS_PER_LOOP];
> +		uint8x16_t pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
> +		uint16x8x2_t sterr_tmp1, sterr_tmp2;
> +		uint64x2_t mbp1, mbp2;
> +		uint16x8_t staterr;
> +		uint16x8_t tmp;
> +		uint64_t stat;
> +
> +		int32x4_t len_shl = {0, 0, 0, PKTLEN_SHIFT};
> +
> +		/* B.1 load 2 mbuf pointers */
> +		mbp1 = vld1q_u64((uint64_t *)&sw_ring[pos]);
> +		/* Read desc statuses backwards to avoid race condition */
> +		/* A.1 load desc[3] */
> +		descs[3] = vld1q_u64((uint64_t *)(rxdp + 3));
> +		rte_rmb();
> +
> +		/* B.2 copy 2 mbuf pointers into rx_pkts */
> +		vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
> +
> +		/* B.1 load 2 more mbuf pointers */
> +		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
> +
> +		/* A.1 load desc[2-0] */
> +		descs[2] = vld1q_u64((uint64_t *)(rxdp + 2));
> +		descs[1] = vld1q_u64((uint64_t *)(rxdp + 1));
> +		descs[0] = vld1q_u64((uint64_t *)(rxdp));
> +
> +		/* B.2 copy 2 mbuf pointers into rx_pkts */
> +		vst1q_u64((uint64_t *)&rx_pkts[pos + 2], mbp2);
> +
> +		if (split_packet) {
> +			rte_mbuf_prefetch_part2(rx_pkts[pos]);
> +			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
> +			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
> +			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
> +		}
> +
> +		/* avoid compiler reorder optimization */
> +		rte_compiler_barrier();
> +
> +		/* pkt 3,4 shift the pktlen field to be 16-bit aligned */
> +		uint32x4_t len3 = vshlq_u32(vreinterpretq_u32_u64(descs[3]),
> +					    len_shl);
> +		descs[3] = vreinterpretq_u64_u32(len3);
> +		uint32x4_t len2 = vshlq_u32(vreinterpretq_u32_u64(descs[2]),
> +					    len_shl);
> +		descs[2] = vreinterpretq_u64_u32(len2);
> +
> +		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
> +		pkt_mb4 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[3]), shuf_msk);
> +		pkt_mb3 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[2]), shuf_msk);
> +
> +		/* C.1 4=>2 filter staterr info only */
> +		sterr_tmp2 = vzipq_u16(vreinterpretq_u16_u64(descs[1]),
> +				       vreinterpretq_u16_u64(descs[3]));
> +		/* C.1 4=>2 filter staterr info only */
> +		sterr_tmp1 = vzipq_u16(vreinterpretq_u16_u64(descs[0]),
> +				       vreinterpretq_u16_u64(descs[2]));
> +
> +		/* C.2 get 4 pkts staterr value */
> +		staterr = vzipq_u16(sterr_tmp1.val[1],
> +				    sterr_tmp2.val[1]).val[0];
> +		stat = vgetq_lane_u64(vreinterpretq_u64_u16(staterr), 0);
> +
> +		desc_to_olflags_v(staterr, &rx_pkts[pos]);
> +
> +		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
> +		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb4), crc_adjust);
> +		pkt_mb4 = vreinterpretq_u8_u16(tmp);
> +		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb3), crc_adjust);
> +		pkt_mb3 = vreinterpretq_u8_u16(tmp);
> +
> +		/* pkt 1,2 shift the pktlen field to be 16-bit aligned */
> +		uint32x4_t len1 = vshlq_u32(vreinterpretq_u32_u64(descs[1]),
> +					    len_shl);
> +		descs[1] = vreinterpretq_u64_u32(len1);
> +		uint32x4_t len0 = vshlq_u32(vreinterpretq_u32_u64(descs[0]),
> +					    len_shl);
> +		descs[0] = vreinterpretq_u64_u32(len0);
> +
> +		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
> +		pkt_mb2 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[1]), shuf_msk);
> +		pkt_mb1 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[0]), shuf_msk);
> +
> +		/* D.3 copy final 3,4 data to rx_pkts */
> +		vst1q_u8((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
> +			 pkt_mb4);
> +		vst1q_u8((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
> +			 pkt_mb3);
> +
> +		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
> +		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb2), crc_adjust);
> +		pkt_mb2 = vreinterpretq_u8_u16(tmp);
> +		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb1), crc_adjust);
> +		pkt_mb1 = vreinterpretq_u8_u16(tmp);
> +
> +		/* C* extract and record EOP bit */
> +		if (split_packet) {
> +			uint8x16_t eop_shuf_mask = {
> +					0x00, 0x02, 0x04, 0x06,
> +					0xFF, 0xFF, 0xFF, 0xFF,
> +					0xFF, 0xFF, 0xFF, 0xFF,
> +					0xFF, 0xFF, 0xFF, 0xFF};
> +			uint8x16_t eop_bits;
> +
> +			/* and with mask to extract bits, flipping 1-0 */
> +			eop_bits = vmvnq_u8(vreinterpretq_u8_u16(staterr));
> +			eop_bits = vandq_u8(eop_bits, eop_check);
> +			/* the staterr values are not in order, as the count
> +			 * of dd bits doesn't care. However, for end of
> +			 * packet tracking, we do care, so shuffle. This also
> +			 * compresses the 32-bit values to 8-bit
> +			 */
> +			eop_bits = vqtbl1q_u8(eop_bits, eop_shuf_mask);
> +
> +			/* store the resulting 32-bit value */
> +			vst1q_lane_u32((uint32_t *)split_packet,
> +				       vreinterpretq_u32_u8(eop_bits), 0);
> +			split_packet += RTE_I40E_DESCS_PER_LOOP;
> +
> +			/* zero-out next pointers */
> +			rx_pkts[pos]->next = NULL;
> +			rx_pkts[pos + 1]->next = NULL;
> +			rx_pkts[pos + 2]->next = NULL;
> +			rx_pkts[pos + 3]->next = NULL;
> +		}
> +
> +		rte_prefetch_non_temporal(rxdp + RTE_I40E_DESCS_PER_LOOP);
> +
> +		/* D.3 copy final 1,2 data to rx_pkts */
> +		vst1q_u8((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
> +			 pkt_mb2);
> +		vst1q_u8((void *)&rx_pkts[pos]->rx_descriptor_fields1,
> +			 pkt_mb1);
> +		/* C.4 calc available number of desc */
> +		var = __builtin_popcountll(stat & I40E_VPMD_DESC_DD_MASK);
> +		nb_pkts_recd += var;
> +		if (likely(var != RTE_I40E_DESCS_PER_LOOP))
> +			break;
> +	}
> +
> +	/* Update our internal tail pointer */
> +	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
> +	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
> +	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
> +
> +	return nb_pkts_recd;
> +}
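
The DD accounting at the bottom of the loop may be easier to see in scalar
form: after the two vzipq_u16() passes, `stat` holds the four 16-bit STATERR
words side by side, so masking with I40E_VPMD_DESC_DD_MASK leaves one DD bit
per descriptor, and the popcount is the number of completed descriptors in
this group of four. A standalone demo (a sketch with a local copy of the
mask, not part of the patch):

	#include <stdint.h>
	#include <stdio.h>

	#define DESC_DD_MASK 0x0001000100010001ULL

	int main(void)
	{
		/* DD (bit 0) set for descriptors 0 and 1 only */
		uint64_t stat = 0x0000000000010001ULL;
		int done = __builtin_popcountll(stat & DESC_DD_MASK);

		/* prints 2, so the RX loop would stop after 2 packets */
		printf("%d descriptors done\n", done);
		return 0;
	}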
> +
> +/* Notice:
> + * - nb_pkts < RTE_I40E_DESCS_PER_LOOP, just return no packet
> + * - nb_pkts > RTE_I40E_VPMD_RX_BURST, only scan RTE_I40E_VPMD_RX_BURST
> + *   numbers of DD bits
> + */
> +uint16_t
> +i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> +		   uint16_t nb_pkts)
> +{
> +	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
> +}
> +
> +/* vPMD receive routine that reassembles scattered packets
> + * Notice:
> + * - nb_pkts < RTE_I40E_DESCS_PER_LOOP, just return no packet
> + * - nb_pkts > RTE_I40E_VPMD_RX_BURST, only scan RTE_I40E_VPMD_RX_BURST
> + *   numbers of DD bits
> + */
> +uint16_t
> +i40e_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> +			     uint16_t nb_pkts)
> +{
> +	struct i40e_rx_queue *rxq = rx_queue;
> +	uint8_t split_flags[RTE_I40E_VPMD_RX_BURST] = {0};
> +
> +	/* get some new buffers */
> +	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
> +					      split_flags);
> +	if (nb_bufs == 0)
> +		return 0;
> +
> +	/* happy day case, full burst + no packets to be joined */
> +	const uint64_t *split_fl64 = (uint64_t *)split_flags;
> +
> +	if (rxq->pkt_first_seg == NULL &&
> +	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
> +	    split_fl64[2] == 0 && split_fl64[3] == 0)
> +		return nb_bufs;
> +
> +	/* reassemble any packets that need reassembly */
> +	unsigned i = 0;
> +
> +	if (rxq->pkt_first_seg == NULL) {
> +		/* find the first split flag, and only reassemble then */
> +		while (i < nb_bufs && !split_flags[i])
> +			i++;
> +		if (i == nb_bufs)
> +			return nb_bufs;
> +	}
> +	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
> +				      &split_flags[i]);
> +}
> +
> +static inline void
> +vtx1(volatile struct i40e_tx_desc *txdp,
> +     struct rte_mbuf *pkt, uint64_t flags)
> +{
> +	uint64_t high_qw = (I40E_TX_DESC_DTYPE_DATA |
> +			    ((uint64_t)flags << I40E_TXD_QW1_CMD_SHIFT) |
> +			    ((uint64_t)pkt->data_len <<
> +			     I40E_TXD_QW1_TX_BUF_SZ_SHIFT));
> +
> +	uint64x2_t descriptor = {pkt->buf_physaddr + pkt->data_off, high_qw};
> +	vst1q_u64((uint64_t *)txdp, descriptor);
> +}
> +
> +static inline void
> +vtx(volatile struct i40e_tx_desc *txdp,
> +    struct rte_mbuf **pkt, uint16_t nb_pkts, uint64_t flags)
> +{
> +	int i;
> +
> +	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
> +		vtx1(txdp, *pkt, flags);
> +}
> +
> +uint16_t
> +i40e_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
> +		   uint16_t nb_pkts)
> +{
> +	struct i40e_tx_queue *txq = (struct i40e_tx_queue *)tx_queue;
> +	volatile struct i40e_tx_desc *txdp;
> +	struct i40e_tx_entry *txep;
> +	uint16_t n, nb_commit, tx_id;
> +	uint64_t flags = I40E_TD_CMD;
> +	uint64_t rs = I40E_TX_DESC_CMD_RS | I40E_TD_CMD;
> +	int i;
> +
> +	/* cross rx_thresh boundary is not allowed */
> +	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
> +
> +	if (txq->nb_tx_free < txq->tx_free_thresh)
> +		i40e_tx_free_bufs(txq);
> +
> +	nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	tx_id = txq->tx_tail;
> +	txdp = &txq->tx_ring[tx_id];
> +	txep = &txq->sw_ring[tx_id];
> +
> +	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
> +
> +	n = (uint16_t)(txq->nb_tx_desc - tx_id);
> +	if (nb_commit >= n) {
> +		tx_backlog_entry(txep, tx_pkts, n);
> +
> +		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
> +			vtx1(txdp, *tx_pkts, flags);
> +
> +		vtx1(txdp, *tx_pkts++, rs);
> +
> +		nb_commit = (uint16_t)(nb_commit - n);
> +
> +		tx_id = 0;
> +		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
> +
> +		/* avoid reach the end of ring */
> +		txdp = &txq->tx_ring[tx_id];
> +		txep = &txq->sw_ring[tx_id];
> +	}
> +
> +	tx_backlog_entry(txep, tx_pkts, nb_commit);
> +
> +	vtx(txdp, tx_pkts, nb_commit, flags);
> +
> +	tx_id = (uint16_t)(tx_id + nb_commit);
> +	if (tx_id > txq->tx_next_rs) {
> +		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
> +			rte_cpu_to_le_64(((uint64_t)I40E_TX_DESC_CMD_RS) <<
> +					 I40E_TXD_QW1_CMD_SHIFT);
> +		txq->tx_next_rs =
> +			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
> +	}
> +
> +	txq->tx_tail = tx_id;
> +
> +	I40E_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
> +
> +	return nb_pkts;
> +}
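
For anyone comparing this against the SSE version: the single vst1q_u64()
in vtx1() writes the whole 16-byte descriptor, with the second qword
assembled from the descriptor type, the command flags and the data length.
A scalar equivalent using the same shifts (a sketch, assuming the
i40e_tx_desc layout from the base code; fill_tx_desc() is a hypothetical
helper):

	static inline void
	fill_tx_desc(volatile struct i40e_tx_desc *txdp,
		     struct rte_mbuf *pkt, uint64_t flags)
	{
		/* qword 0: DMA address of the packet data */
		txdp->buffer_addr = rte_cpu_to_le_64(pkt->buf_physaddr +
						     pkt->data_off);

		/* qword 1: DTYPE | CMD | buffer size */
		txdp->cmd_type_offset_bsz = rte_cpu_to_le_64(
			I40E_TX_DESC_DTYPE_DATA |
			((uint64_t)flags << I40E_TXD_QW1_CMD_SHIFT) |
			((uint64_t)pkt->data_len <<
			 I40E_TXD_QW1_TX_BUF_SZ_SHIFT));
	}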
> +
> +void __attribute__((cold))
> +i40e_rx_queue_release_mbufs_vec(struct i40e_rx_queue *rxq)
> +{
> +	_i40e_rx_queue_release_mbufs_vec(rxq);
> +}
> +
> +int __attribute__((cold))
> +i40e_rxq_vec_setup(struct i40e_rx_queue *rxq)
> +{
> +	return i40e_rxq_vec_setup_default(rxq);
> +}
> +
> +int __attribute__((cold))
> +i40e_txq_vec_setup(struct i40e_tx_queue __rte_unused *txq)
> +{
> +	return 0;
> +}
> +
> +int __attribute__((cold))
> +i40e_rx_vec_dev_conf_condition_check(struct rte_eth_dev *dev)
> +{
> +	return i40e_rx_vec_dev_conf_condition_check_default(dev);
> +}
> --
> 2.4.11

ptype and bad checksum offload are enabled by the patches below:
http://dpdk.org/dev/patchwork/patch/16394
http://dpdk.org/dev/patchwork/patch/16395
You may want to take a look and see whether it is necessary to enable them
for ARM as well.

Thanks!
Qi