DPDK patches and discussions
* [PATCH 0/2] Wangxun support vector Rx/Tx
@ 2024-02-01  3:00 Jiawen Wu
  2024-02-01  3:00 ` [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx Jiawen Wu
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Jiawen Wu @ 2024-02-01  3:00 UTC (permalink / raw)
  To: dev; +Cc: Jiawen Wu

Add SSE/NEON vector instructions for the TXGBE and NGBE drivers to
process packets.
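
As a quick guide for review, the run-time selection of the vector Rx path
boils down to the following condensed sketch (illustrative only, not the
literal driver code; the names match the ones used in the patches below):

    /* condensed from the txgbe_set_rx_function() changes */
    if (txgbe_rx_vec_dev_conf_condition_check(dev) ||
        !adapter->rx_bulk_alloc_allowed ||
        rte_vect_get_max_simd_bitwidth() < RTE_VECT_SIMD_128)
            adapter->rx_vec_allowed = false;

    if (adapter->rx_vec_allowed)
            dev->rx_pkt_burst = dev->data->scattered_rx ?
                    txgbe_recv_scattered_pkts_vec : txgbe_recv_pkts_vec;
    /* otherwise the existing bulk-alloc/single-buffer paths are kept */

Patch 2/2 applies the same scheme to ngbe.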

Jiawen Wu (2):
  net/txgbe: add vectorized functions for Rx/Tx
  net/ngbe: add vectorized functions for Rx/Tx

 drivers/net/ngbe/meson.build              |   6 +
 drivers/net/ngbe/ngbe_ethdev.c            |   6 +
 drivers/net/ngbe/ngbe_ethdev.h            |   1 +
 drivers/net/ngbe/ngbe_rxtx.c              | 161 ++++-
 drivers/net/ngbe/ngbe_rxtx.h              |  32 +-
 drivers/net/ngbe/ngbe_rxtx_vec_common.h   | 296 +++++++++
 drivers/net/ngbe/ngbe_rxtx_vec_neon.c     | 604 ++++++++++++++++++
 drivers/net/ngbe/ngbe_rxtx_vec_sse.c      | 692 ++++++++++++++++++++
 drivers/net/txgbe/meson.build             |   6 +
 drivers/net/txgbe/txgbe_ethdev.c          |   6 +
 drivers/net/txgbe/txgbe_ethdev.h          |   1 +
 drivers/net/txgbe/txgbe_ethdev_vf.c       |   1 +
 drivers/net/txgbe/txgbe_rxtx.c            | 150 ++++-
 drivers/net/txgbe/txgbe_rxtx.h            |  18 +
 drivers/net/txgbe/txgbe_rxtx_vec_common.h | 301 +++++++++
 drivers/net/txgbe/txgbe_rxtx_vec_neon.c   | 604 ++++++++++++++++++
 drivers/net/txgbe/txgbe_rxtx_vec_sse.c    | 736 ++++++++++++++++++++++
 17 files changed, 3611 insertions(+), 10 deletions(-)
 create mode 100644 drivers/net/ngbe/ngbe_rxtx_vec_common.h
 create mode 100644 drivers/net/ngbe/ngbe_rxtx_vec_neon.c
 create mode 100644 drivers/net/ngbe/ngbe_rxtx_vec_sse.c
 create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_common.h
 create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_neon.c
 create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_sse.c

-- 
2.27.0



* [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx
  2024-02-01  3:00 [PATCH 0/2] Wangxun support vector Rx/Tx Jiawen Wu
@ 2024-02-01  3:00 ` Jiawen Wu
  2024-02-07  3:13   ` Ferruh Yigit
  2024-02-01  3:00 ` [PATCH 2/2] net/ngbe: " Jiawen Wu
  2024-02-06  1:50 ` [PATCH 0/2] Wangxun support vector Rx/Tx Jiawen Wu
  2 siblings, 1 reply; 10+ messages in thread
From: Jiawen Wu @ 2024-02-01  3:00 UTC (permalink / raw)
  To: dev; +Cc: Jiawen Wu

To optimize the Rx/Tx burst process, add SSE/NEON vector instructions
for the x86 and Arm architectures.
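
On the Tx side, the vector burst is only used on the existing "simple"
Tx path. Roughly (a sketch of the conditions, not the exact code;
"simple_tx_ok" below stands for the driver's current no-offload check
and is not a real symbol):

    if (simple_tx_ok &&
        txq->tx_free_thresh >= RTE_PMD_TXGBE_TX_MAX_BURST &&
        txq->tx_free_thresh <= RTE_TXGBE_TX_MAX_FREE_BUF_SZ &&
        txgbe_txq_vec_setup(txq) == 0)
            dev->tx_pkt_burst = txgbe_xmit_pkts_vec;
    else
            dev->tx_pkt_burst = txgbe_xmit_pkts_simple;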

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
---
 drivers/net/txgbe/meson.build             |   6 +
 drivers/net/txgbe/txgbe_ethdev.c          |   6 +
 drivers/net/txgbe/txgbe_ethdev.h          |   1 +
 drivers/net/txgbe/txgbe_ethdev_vf.c       |   1 +
 drivers/net/txgbe/txgbe_rxtx.c            | 150 ++++-
 drivers/net/txgbe/txgbe_rxtx.h            |  18 +
 drivers/net/txgbe/txgbe_rxtx_vec_common.h | 301 +++++++++
 drivers/net/txgbe/txgbe_rxtx_vec_neon.c   | 604 ++++++++++++++++++
 drivers/net/txgbe/txgbe_rxtx_vec_sse.c    | 736 ++++++++++++++++++++++
 9 files changed, 1817 insertions(+), 6 deletions(-)
 create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_common.h
 create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_neon.c
 create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_sse.c

diff --git a/drivers/net/txgbe/meson.build b/drivers/net/txgbe/meson.build
index 14729a6cf3..ba7167a511 100644
--- a/drivers/net/txgbe/meson.build
+++ b/drivers/net/txgbe/meson.build
@@ -24,6 +24,12 @@ sources = files(
 
 deps += ['hash', 'security']
 
+if arch_subdir == 'x86'
+    sources += files('txgbe_rxtx_vec_sse.c')
+elif arch_subdir == 'arm'
+    sources += files('txgbe_rxtx_vec_neon.c')
+endif
+
 includes += include_directories('base')
 
 install_headers('rte_pmd_txgbe.h')
diff --git a/drivers/net/txgbe/txgbe_ethdev.c b/drivers/net/txgbe/txgbe_ethdev.c
index 6bc231a130..2d5b935002 100644
--- a/drivers/net/txgbe/txgbe_ethdev.c
+++ b/drivers/net/txgbe/txgbe_ethdev.c
@@ -1544,6 +1544,7 @@ txgbe_dev_configure(struct rte_eth_dev *dev)
 	 * allocation Rx preconditions we will reset it.
 	 */
 	adapter->rx_bulk_alloc_allowed = true;
+	adapter->rx_vec_allowed = true;
 
 	return 0;
 }
@@ -2735,6 +2736,11 @@ txgbe_dev_supported_ptypes_get(struct rte_eth_dev *dev)
 	    dev->rx_pkt_burst == txgbe_recv_pkts_bulk_alloc)
 		return txgbe_get_supported_ptypes();
 
+#if defined(RTE_ARCH_X86) || defined(__ARM_NEON)
+	if (dev->rx_pkt_burst == txgbe_recv_pkts_vec ||
+	    dev->rx_pkt_burst == txgbe_recv_scattered_pkts_vec)
+		return txgbe_get_supported_ptypes();
+#endif
 	return NULL;
 }
 
diff --git a/drivers/net/txgbe/txgbe_ethdev.h b/drivers/net/txgbe/txgbe_ethdev.h
index 7feb45d0cf..7718ad4819 100644
--- a/drivers/net/txgbe/txgbe_ethdev.h
+++ b/drivers/net/txgbe/txgbe_ethdev.h
@@ -364,6 +364,7 @@ struct txgbe_adapter {
 	struct txgbe_ipsec          ipsec;
 #endif
 	bool rx_bulk_alloc_allowed;
+	bool rx_vec_allowed;
 	struct rte_timecounter      systime_tc;
 	struct rte_timecounter      rx_tstamp_tc;
 	struct rte_timecounter      tx_tstamp_tc;
diff --git a/drivers/net/txgbe/txgbe_ethdev_vf.c b/drivers/net/txgbe/txgbe_ethdev_vf.c
index f1341fbf7e..7d8327e7ad 100644
--- a/drivers/net/txgbe/txgbe_ethdev_vf.c
+++ b/drivers/net/txgbe/txgbe_ethdev_vf.c
@@ -603,6 +603,7 @@ txgbevf_dev_configure(struct rte_eth_dev *dev)
 	 * allocation or vector Rx preconditions we will reset it.
 	 */
 	adapter->rx_bulk_alloc_allowed = true;
+	adapter->rx_vec_allowed = true;
 
 	return 0;
 }
diff --git a/drivers/net/txgbe/txgbe_rxtx.c b/drivers/net/txgbe/txgbe_rxtx.c
index 1cd4b25965..310310b686 100644
--- a/drivers/net/txgbe/txgbe_rxtx.c
+++ b/drivers/net/txgbe/txgbe_rxtx.c
@@ -36,6 +36,7 @@
 #include <rte_errno.h>
 #include <rte_ip.h>
 #include <rte_net.h>
+#include <rte_vect.h>
 
 #include "txgbe_logs.h"
 #include "base/txgbe.h"
@@ -314,6 +315,27 @@ txgbe_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
 	return nb_tx;
 }
 
+static uint16_t
+txgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		    uint16_t nb_pkts)
+{
+	struct txgbe_tx_queue *txq = (struct txgbe_tx_queue *)tx_queue;
+	uint16_t nb_tx = 0;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_free_thresh);
+		ret = txgbe_xmit_fixed_burst_vec(tx_queue, &tx_pkts[nb_tx], num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
+
 static inline void
 txgbe_set_xmit_ctx(struct txgbe_tx_queue *txq,
 		volatile struct txgbe_tx_ctx_desc *ctx_txd,
@@ -2198,8 +2220,15 @@ txgbe_set_tx_function(struct rte_eth_dev *dev, struct txgbe_tx_queue *txq)
 #endif
 			txq->tx_free_thresh >= RTE_PMD_TXGBE_TX_MAX_BURST) {
 		PMD_INIT_LOG(DEBUG, "Using simple tx code path");
-		dev->tx_pkt_burst = txgbe_xmit_pkts_simple;
 		dev->tx_pkt_prepare = NULL;
+		if (txq->tx_free_thresh <= RTE_TXGBE_TX_MAX_FREE_BUF_SZ &&
+				(rte_eal_process_type() != RTE_PROC_PRIMARY ||
+					txgbe_txq_vec_setup(txq) == 0)) {
+			PMD_INIT_LOG(DEBUG, "Vector tx enabled.");
+			dev->tx_pkt_burst = txgbe_xmit_pkts_vec;
+		} else {
+			dev->tx_pkt_burst = txgbe_xmit_pkts_simple;
+		}
 	} else {
 		PMD_INIT_LOG(DEBUG, "Using full-featured tx code path");
 		PMD_INIT_LOG(DEBUG,
@@ -2425,6 +2454,12 @@ txgbe_rx_queue_release_mbufs(struct txgbe_rx_queue *rxq)
 {
 	unsigned int i;
 
+	/* The vector (SSE/NEON) Rx path has a different way of releasing mbufs. */
+	if (rxq->rx_using_sse) {
+		txgbe_rx_queue_release_mbufs_vec(rxq);
+		return;
+	}
+
 	if (rxq->sw_ring != NULL) {
 		for (i = 0; i < rxq->nb_rx_desc; i++) {
 			if (rxq->sw_ring[i].mbuf != NULL) {
@@ -2553,6 +2588,11 @@ txgbe_reset_rx_queue(struct txgbe_adapter *adapter, struct txgbe_rx_queue *rxq)
 	rxq->nb_rx_hold = 0;
 	rxq->pkt_first_seg = NULL;
 	rxq->pkt_last_seg = NULL;
+
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+	rxq->rxrearm_start = 0;
+	rxq->rxrearm_nb = 0;
+#endif
 }
 
 int __rte_cold
@@ -2706,6 +2746,16 @@ txgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		     rxq->sw_ring, rxq->sw_sc_ring, rxq->rx_ring,
 		     rxq->rx_ring_phys_addr);
 
+	if (!rte_is_power_of_2(nb_desc)) {
+		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
+				    "preconditions - canceling the feature for "
+				    "the whole port[%d]",
+			     rxq->queue_id, rxq->port_id);
+		adapter->rx_vec_allowed = false;
+	} else {
+		txgbe_rxq_vec_setup(rxq);
+	}
+
 	dev->data->rx_queues[queue_idx] = rxq;
 
 	txgbe_reset_rx_queue(adapter, rxq);
@@ -2747,7 +2797,12 @@ txgbe_dev_rx_descriptor_status(void *rx_queue, uint16_t offset)
 	if (unlikely(offset >= rxq->nb_rx_desc))
 		return -EINVAL;
 
-	nb_hold = rxq->nb_rx_hold;
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+	if (rxq->rx_using_sse)
+		nb_hold = rxq->rxrearm_nb;
+	else
+#endif
+		nb_hold = rxq->nb_rx_hold;
 	if (offset >= rxq->nb_rx_desc - nb_hold)
 		return RTE_ETH_RX_DESC_UNAVAIL;
 
@@ -4216,9 +4271,23 @@ txgbe_set_rsc(struct rte_eth_dev *dev)
 void __rte_cold
 txgbe_set_rx_function(struct rte_eth_dev *dev)
 {
-	uint16_t i;
+	uint16_t i, rx_using_sse;
 	struct txgbe_adapter *adapter = TXGBE_DEV_ADAPTER(dev);
 
+	/*
+	 * In order to allow Vector Rx there are a few configuration
+	 * conditions to be met and Rx Bulk Allocation should be allowed.
+	 */
+	if (txgbe_rx_vec_dev_conf_condition_check(dev) ||
+	    !adapter->rx_bulk_alloc_allowed ||
+			rte_vect_get_max_simd_bitwidth() < RTE_VECT_SIMD_128) {
+		PMD_INIT_LOG(DEBUG, "Port[%d] doesn't meet Vector Rx "
+				    "preconditions",
+			     dev->data->port_id);
+
+		adapter->rx_vec_allowed = false;
+	}
+
 	/*
 	 * Initialize the appropriate LRO callback.
 	 *
@@ -4241,7 +4310,12 @@ txgbe_set_rx_function(struct rte_eth_dev *dev)
 		 * Set the non-LRO scattered callback: there are bulk and
 		 * single allocation versions.
 		 */
-		if (adapter->rx_bulk_alloc_allowed) {
+		if (adapter->rx_vec_allowed) {
+			PMD_INIT_LOG(DEBUG, "Using Vector Scattered Rx "
+					    "callback (port=%d).",
+				     dev->data->port_id);
+			dev->rx_pkt_burst = txgbe_recv_scattered_pkts_vec;
+		} else if (adapter->rx_bulk_alloc_allowed) {
 			PMD_INIT_LOG(DEBUG, "Using a Scattered with bulk "
 					   "allocation callback (port=%d).",
 				     dev->data->port_id);
@@ -4259,9 +4333,16 @@ txgbe_set_rx_function(struct rte_eth_dev *dev)
 	 * Below we set "simple" callbacks according to port/queues parameters.
 	 * If parameters allow we are going to choose between the following
 	 * callbacks:
+	 *    - Vector
 	 *    - Bulk Allocation
 	 *    - Single buffer allocation (the simplest one)
 	 */
+	} else if (adapter->rx_vec_allowed) {
+		PMD_INIT_LOG(DEBUG, "Vector rx enabled, please make sure RX "
+				    "burst size is no less than %d (port=%d).",
+			     RTE_TXGBE_DESCS_PER_LOOP,
+			     dev->data->port_id);
+		dev->rx_pkt_burst = txgbe_recv_pkts_vec;
 	} else if (adapter->rx_bulk_alloc_allowed) {
 		PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions are "
 				    "satisfied. Rx Burst Bulk Alloc function "
@@ -4278,14 +4359,18 @@ txgbe_set_rx_function(struct rte_eth_dev *dev)
 		dev->rx_pkt_burst = txgbe_recv_pkts;
 	}
 
-#ifdef RTE_LIB_SECURITY
+	rx_using_sse = (dev->rx_pkt_burst == txgbe_recv_scattered_pkts_vec ||
+			dev->rx_pkt_burst == txgbe_recv_pkts_vec);
+
 	for (i = 0; i < dev->data->nb_rx_queues; i++) {
 		struct txgbe_rx_queue *rxq = dev->data->rx_queues[i];
 
+		rxq->rx_using_sse = rx_using_sse;
+#ifdef RTE_LIB_SECURITY
 		rxq->using_ipsec = !!(dev->data->dev_conf.rxmode.offloads &
 				RTE_ETH_RX_OFFLOAD_SECURITY);
-	}
 #endif
+	}
 }
 
 /*
@@ -5122,3 +5207,56 @@ txgbe_config_rss_filter(struct rte_eth_dev *dev,
 
 	return 0;
 }
+
+/* Stubs needed for linkage when RTE_ARCH_PPC_64, RTE_ARCH_RISCV or
+ * RTE_ARCH_LOONGARCH is set.
+ */
+#if defined(RTE_ARCH_PPC_64) || defined(RTE_ARCH_RISCV) || \
+	defined(RTE_ARCH_LOONGARCH)
+int
+txgbe_rx_vec_dev_conf_condition_check(__rte_unused struct rte_eth_dev *dev)
+{
+	return -1;
+}
+
+uint16_t
+txgbe_recv_pkts_vec(__rte_unused void *rx_queue,
+		    __rte_unused struct rte_mbuf **rx_pkts,
+		    __rte_unused uint16_t nb_pkts)
+{
+	return 0;
+}
+
+uint16_t
+txgbe_recv_scattered_pkts_vec(__rte_unused void *rx_queue,
+			      __rte_unused struct rte_mbuf **rx_pkts,
+			      __rte_unused uint16_t nb_pkts)
+{
+	return 0;
+}
+
+int
+txgbe_rxq_vec_setup(__rte_unused struct txgbe_rx_queue *rxq)
+{
+	return -1;
+}
+
+uint16_t
+txgbe_xmit_fixed_burst_vec(__rte_unused void *tx_queue,
+			   __rte_unused struct rte_mbuf **tx_pkts,
+			   __rte_unused uint16_t nb_pkts)
+{
+	return 0;
+}
+
+int
+txgbe_txq_vec_setup(__rte_unused struct txgbe_tx_queue *txq)
+{
+	return -1;
+}
+
+void
+txgbe_rx_queue_release_mbufs_vec(__rte_unused struct txgbe_rx_queue *rxq)
+{
+}
+#endif
diff --git a/drivers/net/txgbe/txgbe_rxtx.h b/drivers/net/txgbe/txgbe_rxtx.h
index 27d4c842c0..336f060633 100644
--- a/drivers/net/txgbe/txgbe_rxtx.h
+++ b/drivers/net/txgbe/txgbe_rxtx.h
@@ -237,6 +237,8 @@ struct txgbe_tx_desc {
 #define RTE_PMD_TXGBE_RX_MAX_BURST 32
 #define RTE_TXGBE_TX_MAX_FREE_BUF_SZ 64
 
+#define RTE_TXGBE_DESCS_PER_LOOP    4
+
 #define RX_RING_SZ ((TXGBE_RING_DESC_MAX + RTE_PMD_TXGBE_RX_MAX_BURST) * \
 		    sizeof(struct txgbe_rx_desc))
 
@@ -297,6 +299,12 @@ struct txgbe_rx_queue {
 #ifdef RTE_LIB_SECURITY
 	uint8_t            using_ipsec;
 	/**< indicates that IPsec RX feature is in use */
+#endif
+	uint64_t	    mbuf_initializer; /**< value to init mbufs */
+	uint8_t             rx_using_sse;
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+	uint16_t	    rxrearm_nb;     /**< number of remaining to be re-armed */
+	uint16_t	    rxrearm_start;  /**< the idx we start the re-arming from */
 #endif
 	uint16_t            rx_free_thresh; /**< max free RX desc to hold. */
 	uint16_t            queue_id; /**< RX queue index. */
@@ -417,6 +425,16 @@ struct txgbe_txq_ops {
 void txgbe_set_tx_function(struct rte_eth_dev *dev, struct txgbe_tx_queue *txq);
 
 void txgbe_set_rx_function(struct rte_eth_dev *dev);
+uint16_t txgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+uint16_t txgbe_recv_scattered_pkts_vec(void *rx_queue,
+		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
+int txgbe_rx_vec_dev_conf_condition_check(struct rte_eth_dev *dev);
+int txgbe_rxq_vec_setup(struct txgbe_rx_queue *rxq);
+void txgbe_rx_queue_release_mbufs_vec(struct txgbe_rx_queue *rxq);
+uint16_t txgbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+				    uint16_t nb_pkts);
+int txgbe_txq_vec_setup(struct txgbe_tx_queue *txq);
 int txgbe_dev_tx_done_cleanup(void *tx_queue, uint32_t free_cnt);
 
 uint64_t txgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
diff --git a/drivers/net/txgbe/txgbe_rxtx_vec_common.h b/drivers/net/txgbe/txgbe_rxtx_vec_common.h
new file mode 100644
index 0000000000..cf67df66d8
--- /dev/null
+++ b/drivers/net/txgbe/txgbe_rxtx_vec_common.h
@@ -0,0 +1,301 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2015-2024 Beijing WangXun Technology Co., Ltd.
+ * Copyright(c) 2010-2015 Intel Corporation
+ */
+
+#ifndef _TXGBE_RXTX_VEC_COMMON_H_
+#define _TXGBE_RXTX_VEC_COMMON_H_
+#include <stdint.h>
+
+#include "txgbe_ethdev.h"
+#include "txgbe_rxtx.h"
+
+#define TXGBE_RXD_PTID_SHIFT 9
+
+#define RTE_TXGBE_RXQ_REARM_THRESH      32
+#define RTE_TXGBE_MAX_RX_BURST          RTE_TXGBE_RXQ_REARM_THRESH
+
+static inline uint16_t
+reassemble_packets(struct txgbe_rx_queue *rxq, struct rte_mbuf **rx_bufs,
+		   uint16_t nb_bufs, uint8_t *split_flags)
+{
+	struct rte_mbuf *pkts[nb_bufs]; /*finished pkts*/
+	struct rte_mbuf *start = rxq->pkt_first_seg;
+	struct rte_mbuf *end =  rxq->pkt_last_seg;
+	unsigned int pkt_idx, buf_idx;
+
+	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
+		if (end != NULL) {
+			/* processing a split packet */
+			end->next = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+
+			start->nb_segs++;
+			start->pkt_len += rx_bufs[buf_idx]->data_len;
+			end = end->next;
+
+			if (!split_flags[buf_idx]) {
+				/* it's the last packet of the set */
+				start->hash = end->hash;
+				start->ol_flags = end->ol_flags;
+				/* we need to strip crc for the whole packet */
+				start->pkt_len -= rxq->crc_len;
+				if (end->data_len > rxq->crc_len) {
+					end->data_len -= rxq->crc_len;
+				} else {
+					/* free up last mbuf */
+					struct rte_mbuf *secondlast = start;
+
+					start->nb_segs--;
+					while (secondlast->next != end)
+						secondlast = secondlast->next;
+					secondlast->data_len -= (rxq->crc_len -
+							end->data_len);
+					secondlast->next = NULL;
+					rte_pktmbuf_free_seg(end);
+				}
+				pkts[pkt_idx++] = start;
+				start = NULL;
+				end = NULL;
+			}
+		} else {
+			/* not processing a split packet */
+			if (!split_flags[buf_idx]) {
+				/* not a split packet, save and skip */
+				pkts[pkt_idx++] = rx_bufs[buf_idx];
+				continue;
+			}
+			start = rx_bufs[buf_idx];
+			end = start;
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
+		}
+	}
+
+	/* save the partial packet for next time */
+	rxq->pkt_first_seg = start;
+	rxq->pkt_last_seg = end;
+	memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
+	return pkt_idx;
+}
+
+static __rte_always_inline int
+txgbe_tx_free_bufs(struct txgbe_tx_queue *txq)
+{
+	struct txgbe_tx_entry_v *txep;
+	uint32_t status;
+	uint32_t n;
+	uint32_t i;
+	int nb_free = 0;
+	struct rte_mbuf *m, *free[RTE_TXGBE_TX_MAX_FREE_BUF_SZ];
+
+	/* check DD bit on threshold descriptor */
+	status = txq->tx_ring[txq->tx_next_dd].dw3;
+	if (!(status & TXGBE_TXD_DD)) {
+		if (txq->nb_tx_free >> 1 < txq->tx_free_thresh)
+			txgbe_set32_masked(txq->tdc_reg_addr,
+				TXGBE_TXCFG_FLUSH, TXGBE_TXCFG_FLUSH);
+		return 0;
+	}
+
+	n = txq->tx_free_thresh;
+
+	/*
+	 * first buffer to free from S/W ring is at index
+	 * tx_next_dd - (tx_free_thresh - 1)
+	 */
+	txep = &txq->sw_ring_v[txq->tx_next_dd - (n - 1)];
+	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
+	if (likely(m != NULL)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (likely(m != NULL)) {
+				if (likely(m->pool == free[0]->pool)) {
+					free[nb_free++] = m;
+				} else {
+					rte_mempool_put_bulk(free[0]->pool,
+							(void *)free, nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	} else {
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m != NULL)
+				rte_mempool_put(m->pool, m);
+		}
+	}
+
+	/* buffers were freed, update counters */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_free_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_free_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_free_thresh - 1);
+
+	return txq->tx_free_thresh;
+}
+
+static __rte_always_inline void
+tx_backlog_entry(struct txgbe_tx_entry_v *txep,
+		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	int i;
+
+	for (i = 0; i < (int)nb_pkts; ++i)
+		txep[i].mbuf = tx_pkts[i];
+}
+
+static inline void
+_txgbe_tx_queue_release_mbufs_vec(struct txgbe_tx_queue *txq)
+{
+	unsigned int i;
+	struct txgbe_tx_entry_v *txe;
+	const uint16_t max_desc = (uint16_t)(txq->nb_tx_desc - 1);
+
+	if (txq->sw_ring == NULL || txq->nb_tx_free == max_desc)
+		return;
+
+	/* release the used mbufs in sw_ring */
+	for (i = txq->tx_next_dd - (txq->tx_free_thresh - 1);
+	     i != txq->tx_tail;
+	     i = (i + 1) % txq->nb_tx_desc) {
+		txe = &txq->sw_ring_v[i];
+		rte_pktmbuf_free_seg(txe->mbuf);
+	}
+	txq->nb_tx_free = max_desc;
+
+	/* reset tx_entry */
+	for (i = 0; i < txq->nb_tx_desc; i++) {
+		txe = &txq->sw_ring_v[i];
+		txe->mbuf = NULL;
+	}
+}
+
+static inline void
+_txgbe_rx_queue_release_mbufs_vec(struct txgbe_rx_queue *rxq)
+{
+	unsigned int i;
+
+	if (rxq->sw_ring == NULL || rxq->rxrearm_nb >= rxq->nb_rx_desc)
+		return;
+
+	/* free all mbufs that are valid in the ring */
+	if (rxq->rxrearm_nb == 0) {
+		for (i = 0; i < rxq->nb_rx_desc; i++) {
+			if (rxq->sw_ring[i].mbuf != NULL)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	} else {
+		for (i = rxq->rx_tail;
+		     i != rxq->rxrearm_start;
+		     i = (i + 1) % rxq->nb_rx_desc) {
+			if (rxq->sw_ring[i].mbuf != NULL)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	}
+
+	rxq->rxrearm_nb = rxq->nb_rx_desc;
+
+	/* set all entries to NULL */
+	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
+}
+
+static inline void
+_txgbe_tx_free_swring_vec(struct txgbe_tx_queue *txq)
+{
+	if (txq == NULL)
+		return;
+
+	if (txq->sw_ring != NULL) {
+		rte_free(txq->sw_ring_v - 1);
+		txq->sw_ring_v = NULL;
+	}
+}
+
+static inline void
+_txgbe_reset_tx_queue_vec(struct txgbe_tx_queue *txq)
+{
+	static const struct txgbe_tx_desc zeroed_desc = {0};
+	struct txgbe_tx_entry_v *txe = txq->sw_ring_v;
+	uint16_t i;
+
+	/* Zero out HW ring memory */
+	for (i = 0; i < txq->nb_tx_desc; i++)
+		txq->tx_ring[i] = zeroed_desc;
+
+	/* Initialize SW ring entries */
+	for (i = 0; i < txq->nb_tx_desc; i++) {
+		volatile struct txgbe_tx_desc *txd = &txq->tx_ring[i];
+
+		txd->dw3 = TXGBE_TXD_DD;
+		txe[i].mbuf = NULL;
+	}
+
+	txq->tx_next_dd = (uint16_t)(txq->tx_free_thresh - 1);
+
+	txq->tx_tail = 0;
+	/*
+	 * Always allow 1 descriptor to be un-allocated to avoid
+	 * a H/W race condition
+	 */
+	txq->last_desc_cleaned = (uint16_t)(txq->nb_tx_desc - 1);
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_desc - 1);
+	txq->ctx_curr = 0;
+	memset((void *)&txq->ctx_cache, 0,
+		TXGBE_CTX_NUM * sizeof(struct txgbe_ctx_info));
+}
+
+static inline int
+txgbe_rxq_vec_setup_default(struct txgbe_rx_queue *rxq)
+{
+	uintptr_t p;
+	struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+	mb_def.nb_segs = 1;
+	mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+	mb_def.port = rxq->port_id;
+	rte_mbuf_refcnt_set(&mb_def, 1);
+
+	/* prevent compiler reordering: rearm_data covers previous fields */
+	rte_compiler_barrier();
+	p = (uintptr_t)&mb_def.rearm_data;
+	rxq->mbuf_initializer = *(uint64_t *)p;
+	return 0;
+}
+
+static inline int
+txgbe_txq_vec_setup_default(struct txgbe_tx_queue *txq,
+			    const struct txgbe_txq_ops *txq_ops)
+{
+	if (txq->sw_ring_v == NULL)
+		return -1;
+
+	/* leave the first one for overflow */
+	txq->sw_ring_v = txq->sw_ring_v + 1;
+	txq->ops = txq_ops;
+
+	return 0;
+}
+
+static inline int
+txgbe_rx_vec_dev_conf_condition_check_default(struct rte_eth_dev *dev)
+{
+#ifndef RTE_LIBRTE_IEEE1588
+	struct rte_eth_fdir_conf *fconf = TXGBE_DEV_FDIR_CONF(dev);
+
+	/* no fdir support */
+	if (fconf->mode != RTE_FDIR_MODE_NONE)
+		return -1;
+
+	return 0;
+#else
+	RTE_SET_USED(dev);
+	return -1;
+#endif
+}
+#endif
diff --git a/drivers/net/txgbe/txgbe_rxtx_vec_neon.c b/drivers/net/txgbe/txgbe_rxtx_vec_neon.c
new file mode 100644
index 0000000000..5018fbc0b8
--- /dev/null
+++ b/drivers/net/txgbe/txgbe_rxtx_vec_neon.c
@@ -0,0 +1,604 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2015-2024 Beijing WangXun Technology Co., Ltd.
+ * Copyright(c) 2010-2015 Intel Corporation
+ */
+
+#include <ethdev_driver.h>
+#include <rte_malloc.h>
+#include <rte_vect.h>
+
+#include "txgbe_ethdev.h"
+#include "txgbe_rxtx.h"
+#include "txgbe_rxtx_vec_common.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+static inline void
+txgbe_rxq_rearm(struct txgbe_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile struct txgbe_rx_desc *rxdp;
+	struct txgbe_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+	struct rte_mbuf *mb0, *mb1;
+	uint64x2_t dma_addr0, dma_addr1;
+	uint64x2_t zero = vdupq_n_u64(0);
+	uint64_t paddr;
+	uint8x8_t p;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (unlikely(rte_mempool_get_bulk(rxq->mb_pool,
+					  (void *)rxep,
+					  RTE_TXGBE_RXQ_REARM_THRESH) < 0)) {
+		if (rxq->rxrearm_nb + RTE_TXGBE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			for (i = 0; i < RTE_TXGBE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				vst1q_u64((uint64_t *)&rxdp[i], zero);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			RTE_TXGBE_RXQ_REARM_THRESH;
+		return;
+	}
+
+	p = vld1_u8((uint8_t *)&rxq->mbuf_initializer);
+
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < RTE_TXGBE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/*
+		 * Flush mbuf with pkt template.
+		 * Data to be rearmed is 6 bytes long.
+		 */
+		vst1_u8((uint8_t *)&mb0->rearm_data, p);
+		paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
+		dma_addr0 = vsetq_lane_u64(paddr, zero, 0);
+		/* flush desc with pa dma_addr */
+		vst1q_u64((uint64_t *)rxdp++, dma_addr0);
+
+		vst1_u8((uint8_t *)&mb1->rearm_data, p);
+		paddr = mb1->buf_iova + RTE_PKTMBUF_HEADROOM;
+		dma_addr1 = vsetq_lane_u64(paddr, zero, 0);
+		vst1q_u64((uint64_t *)rxdp++, dma_addr1);
+	}
+
+	rxq->rxrearm_start += RTE_TXGBE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= RTE_TXGBE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			     (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	txgbe_set32(rxq->rdt_reg_addr, rx_id);
+}
+
+static inline void
+desc_to_olflags_v(uint8x16x2_t sterr_tmp1, uint8x16x2_t sterr_tmp2,
+		  uint8x16_t staterr, uint8_t vlan_flags,
+		  struct rte_mbuf **rx_pkts)
+{
+	uint8x16_t ptype;
+	uint8x16_t vtag_lo, vtag_hi, vtag;
+	uint8x16_t temp_csum, temp_vp;
+	uint8x16_t vtag_mask = vdupq_n_u8(0x0F);
+	uint32x4_t csum = {0, 0, 0, 0};
+
+	union {
+		uint16_t e[4];
+		uint64_t word;
+	} vol;
+
+	const uint8x16_t rsstype_msk = {
+			0x0F, 0x0F, 0x0F, 0x0F,
+			0x00, 0x00, 0x00, 0x00,
+			0x00, 0x00, 0x00, 0x00,
+			0x00, 0x00, 0x00, 0x00};
+
+	const uint8x16_t rss_flags = {
+			0, RTE_MBUF_F_RX_RSS_HASH, RTE_MBUF_F_RX_RSS_HASH, RTE_MBUF_F_RX_RSS_HASH,
+			0, RTE_MBUF_F_RX_RSS_HASH, 0, RTE_MBUF_F_RX_RSS_HASH,
+			RTE_MBUF_F_RX_RSS_HASH, 0, 0, 0,
+			0, 0, 0, RTE_MBUF_F_RX_FDIR};
+
+	/* mask everything except vlan present and l4/ip csum error */
+	const uint8x16_t vlan_csum_msk = {
+			TXGBE_RXD_STAT_VLAN, TXGBE_RXD_STAT_VLAN,
+			TXGBE_RXD_STAT_VLAN, TXGBE_RXD_STAT_VLAN,
+			0, 0, 0, 0,
+			0, 0, 0, 0,
+			(TXGBE_RXD_ERR_L4CS | TXGBE_RXD_ERR_IPCS) >> 24,
+			(TXGBE_RXD_ERR_L4CS | TXGBE_RXD_ERR_IPCS) >> 24,
+			(TXGBE_RXD_ERR_L4CS | TXGBE_RXD_ERR_IPCS) >> 24,
+			(TXGBE_RXD_ERR_L4CS | TXGBE_RXD_ERR_IPCS) >> 24};
+
+	/* map vlan present and l4/ip csum error to ol_flags */
+	const uint8x16_t vlan_csum_map_lo = {
+			RTE_MBUF_F_RX_IP_CKSUM_GOOD,
+			RTE_MBUF_F_RX_IP_CKSUM_GOOD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+			RTE_MBUF_F_RX_IP_CKSUM_BAD,
+			RTE_MBUF_F_RX_IP_CKSUM_BAD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+			0, 0, 0, 0,
+			vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_GOOD,
+			vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_GOOD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+			vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_BAD,
+			vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_BAD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+			0, 0, 0, 0};
+
+	const uint8x16_t vlan_csum_map_hi = {
+			RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+			RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+			0, 0, 0, 0,
+			RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+			RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+			0, 0, 0, 0};
+
+	ptype = vzipq_u8(sterr_tmp1.val[0], sterr_tmp2.val[0]).val[0];
+
+	ptype = vandq_u8(ptype, rsstype_msk);
+	ptype = vqtbl1q_u8(rss_flags, ptype);
+
+	/* extract vlan_flags and csum_error from staterr */
+	vtag = vandq_u8(staterr, vlan_csum_msk);
+
+	/* csum bits are in the most significant, to use shuffle we need to
+	 * shift them. Change mask from 0xc0 to 0x03.
+	 */
+	temp_csum = vshrq_n_u8(vtag, 6);
+
+	/* Change vlan present mask from 0x20 to 0x08.
+	 */
+	temp_vp = vshrq_n_u8(vtag, 2);
+
+	/* 'OR' the most significant 32 bits containing the checksum flags with
+	 * the vlan present flags. Then the bit layout of each lane (8 bits) is
+	 * 'xxxx,VLAN,x,ERR_IPCS,ERR_L4CS'
+	 */
+	csum = vsetq_lane_u32(vgetq_lane_u32(vreinterpretq_u32_u8(temp_csum), 3), csum, 0);
+	vtag = vorrq_u8(vreinterpretq_u8_u32(csum), vtag);
+	vtag = vorrq_u8(vtag, temp_vp);
+	vtag = vandq_u8(vtag, vtag_mask);
+
+	/* convert L4 checksum correct type to vtag_hi */
+	vtag_hi = vqtbl1q_u8(vlan_csum_map_hi, vtag);
+	vtag_hi = vshrq_n_u8(vtag_hi, 7);
+
+	/* convert VP, IPE, L4E to vtag_lo */
+	vtag_lo = vqtbl1q_u8(vlan_csum_map_lo, vtag);
+	vtag_lo = vorrq_u8(ptype, vtag_lo);
+
+	vtag = vzipq_u8(vtag_lo, vtag_hi).val[0];
+	vol.word = vgetq_lane_u64(vreinterpretq_u64_u8(vtag), 0);
+
+	rx_pkts[0]->ol_flags = vol.e[0];
+	rx_pkts[1]->ol_flags = vol.e[1];
+	rx_pkts[2]->ol_flags = vol.e[2];
+	rx_pkts[3]->ol_flags = vol.e[3];
+}
+
+#define TXGBE_VPMD_DESC_EOP_MASK	0x02020202
+#define TXGBE_UINT8_BIT			(CHAR_BIT * sizeof(uint8_t))
+
+static inline void
+desc_to_ptype_v(uint64x2_t descs[4], uint16_t pkt_type_mask,
+		struct rte_mbuf **rx_pkts)
+{
+	uint32x4_t ptype_mask = vdupq_n_u32((uint32_t)pkt_type_mask);
+	uint32x4_t ptype0 = vzipq_u32(vreinterpretq_u32_u64(descs[0]),
+				vreinterpretq_u32_u64(descs[2])).val[0];
+	uint32x4_t ptype1 = vzipq_u32(vreinterpretq_u32_u64(descs[1]),
+				vreinterpretq_u32_u64(descs[3])).val[0];
+
+	/* interleave low 32 bits,
+	 * now we have 4 ptypes in a NEON register
+	 */
+	ptype0 = vzipq_u32(ptype0, ptype1).val[0];
+
+	/* shift right by TXGBE_RXD_PTID_SHIFT, and apply ptype mask */
+	ptype0 = vandq_u32(vshrq_n_u32(ptype0, TXGBE_RXD_PTID_SHIFT), ptype_mask);
+
+	rx_pkts[0]->packet_type = txgbe_decode_ptype(vgetq_lane_u32(ptype0, 0));
+	rx_pkts[1]->packet_type = txgbe_decode_ptype(vgetq_lane_u32(ptype0, 1));
+	rx_pkts[2]->packet_type = txgbe_decode_ptype(vgetq_lane_u32(ptype0, 2));
+	rx_pkts[3]->packet_type = txgbe_decode_ptype(vgetq_lane_u32(ptype0, 3));
+}
+
+/**
+ * vPMD raw receive routine, only accepts nb_pkts >= RTE_TXGBE_DESCS_PER_LOOP
+ *
+ * Notice:
+ * - nb_pkts < RTE_TXGBE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts is floor-aligned to a multiple of RTE_TXGBE_DESCS_PER_LOOP
+ */
+static inline uint16_t
+_recv_raw_pkts_vec(struct txgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		   uint16_t nb_pkts, uint8_t *split_packet)
+{
+	volatile struct txgbe_rx_desc *rxdp;
+	struct txgbe_rx_entry *sw_ring;
+	uint16_t nb_pkts_recd;
+	int pos;
+	uint8x16_t shuf_msk = {
+		0xFF, 0xFF,
+		0xFF, 0xFF,  /* skip 32 bits pkt_type */
+		12, 13,      /* octet 12~13, low 16 bits pkt_len */
+		0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+		12, 13,      /* octet 12~13, 16 bits data_len */
+		14, 15,      /* octet 14~15, low 16 bits vlan_macip */
+		4, 5, 6, 7  /* octet 4~7, 32bits rss */
+		};
+	uint16x8_t crc_adjust = {0, 0, rxq->crc_len, 0,
+				 rxq->crc_len, 0, 0, 0};
+	uint8_t vlan_flags;
+
+	/* nb_pkts has to be floor-aligned to RTE_TXGBE_DESCS_PER_LOOP */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_TXGBE_DESCS_PER_LOOP);
+
+	/* Just the act of getting into the function from the application is
+	 * going to cost about 7 cycles
+	 */
+	rxdp = rxq->rx_ring + rxq->rx_tail;
+
+	rte_prefetch_non_temporal(rxdp);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > RTE_TXGBE_RXQ_REARM_THRESH)
+		txgbe_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->qw1.lo.status & rte_cpu_to_le_32(TXGBE_RXD_STAT_DD)))
+		return 0;
+
+	/* Cache is empty -> need to scan the buffer rings, but first move
+	 * the next 'n' mbufs into the cache
+	 */
+	sw_ring = &rxq->sw_ring[rxq->rx_tail];
+
+	/* ensure these 2 flags are in the lower 8 bits */
+	RTE_BUILD_BUG_ON((RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED) > UINT8_MAX);
+	vlan_flags = rxq->vlan_flags & UINT8_MAX;
+
+	/* A. load 4 packet in one loop
+	 * B. copy 4 mbuf point from swring to rx_pkts
+	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
+	 * D. fill info. from desc to mbuf
+	 */
+	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
+			pos += RTE_TXGBE_DESCS_PER_LOOP,
+			rxdp += RTE_TXGBE_DESCS_PER_LOOP) {
+		uint64x2_t descs[RTE_TXGBE_DESCS_PER_LOOP];
+		uint8x16_t pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
+		uint8x16x2_t sterr_tmp1, sterr_tmp2;
+		uint64x2_t mbp1, mbp2;
+		uint8x16_t staterr;
+		uint16x8_t tmp;
+		uint32_t stat;
+
+		/* B.1 load 2 mbuf point */
+		mbp1 = vld1q_u64((uint64_t *)&sw_ring[pos]);
+
+		/* B.2 copy 2 mbuf point into rx_pkts  */
+		vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
+
+		/* B.1 load 2 mbuf point */
+		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
+
+		/* A. load 4 pkts descs */
+		descs[0] =  vld1q_u64((uint64_t *)(rxdp));
+		descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
+		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
+		descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
+
+		/* B.2 copy 2 mbuf point into rx_pkts  */
+		vst1q_u64((uint64_t *)&rx_pkts[pos + 2], mbp2);
+
+		if (split_packet) {
+			rte_mbuf_prefetch_part2(rx_pkts[pos]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+		}
+
+		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
+		pkt_mb4 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[3]), shuf_msk);
+		pkt_mb3 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[2]), shuf_msk);
+
+		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		pkt_mb2 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[1]), shuf_msk);
+		pkt_mb1 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[0]), shuf_msk);
+
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = vzipq_u8(vreinterpretq_u8_u64(descs[1]),
+				      vreinterpretq_u8_u64(descs[3]));
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp1 = vzipq_u8(vreinterpretq_u8_u64(descs[0]),
+				      vreinterpretq_u8_u64(descs[2]));
+
+		/* C.2 get 4 pkts staterr value  */
+		staterr = vzipq_u8(sterr_tmp1.val[1], sterr_tmp2.val[1]).val[0];
+
+		/* set ol_flags with vlan packet type */
+		desc_to_olflags_v(sterr_tmp1, sterr_tmp2, staterr, vlan_flags,
+				  &rx_pkts[pos]);
+
+		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb4), crc_adjust);
+		pkt_mb4 = vreinterpretq_u8_u16(tmp);
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb3), crc_adjust);
+		pkt_mb3 = vreinterpretq_u8_u16(tmp);
+
+		/* D.3 copy final 3,4 data to rx_pkts */
+		vst1q_u8((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+			 pkt_mb4);
+		vst1q_u8((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+			 pkt_mb3);
+
+		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb2), crc_adjust);
+		pkt_mb2 = vreinterpretq_u8_u16(tmp);
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb1), crc_adjust);
+		pkt_mb1 = vreinterpretq_u8_u16(tmp);
+
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			stat = vgetq_lane_u32(vreinterpretq_u32_u8(staterr), 0);
+			/* and with mask to extract bits, flipping 1-0 */
+			*(int *)split_packet = ~stat & TXGBE_VPMD_DESC_EOP_MASK;
+
+			split_packet += RTE_TXGBE_DESCS_PER_LOOP;
+		}
+
+		/* C.4 expand DD bit to saturate UINT8 */
+		staterr = vshlq_n_u8(staterr, TXGBE_UINT8_BIT - 1);
+		staterr = vreinterpretq_u8_s8(vshrq_n_s8(vreinterpretq_s8_u8(staterr),
+					      TXGBE_UINT8_BIT - 1));
+		stat = ~vgetq_lane_u32(vreinterpretq_u32_u8(staterr), 0);
+
+		rte_prefetch_non_temporal(rxdp + RTE_TXGBE_DESCS_PER_LOOP);
+
+		/* D.3 copy final 1,2 data to rx_pkts */
+		vst1q_u8((uint8_t *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+			 pkt_mb2);
+		vst1q_u8((uint8_t *)&rx_pkts[pos]->rx_descriptor_fields1,
+			 pkt_mb1);
+
+		desc_to_ptype_v(descs, rxq->pkt_type_mask, &rx_pkts[pos]);
+
+		/* C.5 calc available number of desc */
+		if (unlikely(stat == 0)) {
+			nb_pkts_recd += RTE_TXGBE_DESCS_PER_LOOP;
+		} else {
+			nb_pkts_recd += rte_ctz32(stat) / TXGBE_UINT8_BIT;
+			break;
+		}
+	}
+
+	/* Update our internal tail pointer */
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
+	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
+
+	return nb_pkts_recd;
+}
+
+/**
+ * vPMD receive routine, only accepts nb_pkts >= RTE_TXGBE_DESCS_PER_LOOP
+ *
+ * Notice:
+ * - nb_pkts < RTE_TXGBE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts is floor-aligned to a multiple of RTE_TXGBE_DESCS_PER_LOOP
+ */
+uint16_t
+txgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets
+ *
+ * Notice:
+ * - nb_pkts < RTE_TXGBE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts is floor-aligned to a multiple of RTE_TXGBE_DESCS_PER_LOOP
+ */
+static uint16_t
+txgbe_recv_scattered_burst_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			       uint16_t nb_pkts)
+{
+	struct txgbe_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[RTE_TXGBE_MAX_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+			split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+	if (rxq->pkt_first_seg == NULL &&
+			split_fl64[0] == 0 && split_fl64[1] == 0 &&
+			split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly */
+	unsigned int i = 0;
+	if (rxq->pkt_first_seg == NULL) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+		rxq->pkt_first_seg = rx_pkts[i];
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+		&split_flags[i]);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets.
+ */
+uint16_t
+txgbe_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			      uint16_t nb_pkts)
+{
+	uint16_t retval = 0;
+
+	while (nb_pkts > RTE_TXGBE_MAX_RX_BURST) {
+		uint16_t burst;
+
+		burst = txgbe_recv_scattered_burst_vec(rx_queue,
+						       rx_pkts + retval,
+						       RTE_TXGBE_MAX_RX_BURST);
+		retval += burst;
+		nb_pkts -= burst;
+		if (burst < RTE_TXGBE_MAX_RX_BURST)
+			return retval;
+	}
+
+	return retval + txgbe_recv_scattered_burst_vec(rx_queue,
+						       rx_pkts + retval,
+						       nb_pkts);
+}
+
+static inline void
+vtx1(volatile struct txgbe_tx_desc *txdp,
+		struct rte_mbuf *pkt, uint64_t flags)
+{
+	uint64x2_t descriptor = {
+			pkt->buf_iova + pkt->data_off,
+			(uint64_t)pkt->pkt_len << 45 | flags | pkt->data_len};
+
+	vst1q_u64((uint64_t *)txdp, descriptor);
+}
+
+static inline void
+vtx(volatile struct txgbe_tx_desc *txdp,
+		struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
+		vtx1(txdp, *pkt, flags);
+}
+
+uint16_t
+txgbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			   uint16_t nb_pkts)
+{
+	struct txgbe_tx_queue *txq = (struct txgbe_tx_queue *)tx_queue;
+	volatile struct txgbe_tx_desc *txdp;
+	struct txgbe_tx_entry_v *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = TXGBE_TXD_FLAGS;
+	uint64_t rs = TXGBE_TXD_FLAGS;
+	int i;
+
+	/* cross rx_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_free_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		txgbe_tx_free_bufs(txq);
+
+	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring_v[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	nb_commit = nb_pkts;
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
+			vtx1(txdp, *tx_pkts, flags);
+
+		vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+
+		/* avoid reach the end of ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring_v[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+
+	txq->tx_tail = tx_id;
+
+	txgbe_set32(txq->tdt_reg_addr, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+static void __rte_cold
+txgbe_tx_queue_release_mbufs_vec(struct txgbe_tx_queue *txq)
+{
+	_txgbe_tx_queue_release_mbufs_vec(txq);
+}
+
+void __rte_cold
+txgbe_rx_queue_release_mbufs_vec(struct txgbe_rx_queue *rxq)
+{
+	_txgbe_rx_queue_release_mbufs_vec(rxq);
+}
+
+static void __rte_cold
+txgbe_tx_free_swring(struct txgbe_tx_queue *txq)
+{
+	_txgbe_tx_free_swring_vec(txq);
+}
+
+static void __rte_cold
+txgbe_reset_tx_queue(struct txgbe_tx_queue *txq)
+{
+	_txgbe_reset_tx_queue_vec(txq);
+}
+
+static const struct txgbe_txq_ops vec_txq_ops = {
+	.release_mbufs = txgbe_tx_queue_release_mbufs_vec,
+	.free_swring = txgbe_tx_free_swring,
+	.reset = txgbe_reset_tx_queue,
+};
+
+int __rte_cold
+txgbe_rxq_vec_setup(struct txgbe_rx_queue *rxq)
+{
+	return txgbe_rxq_vec_setup_default(rxq);
+}
+
+int __rte_cold
+txgbe_txq_vec_setup(struct txgbe_tx_queue *txq)
+{
+	return txgbe_txq_vec_setup_default(txq, &vec_txq_ops);
+}
+
+int __rte_cold
+txgbe_rx_vec_dev_conf_condition_check(struct rte_eth_dev *dev)
+{
+	return txgbe_rx_vec_dev_conf_condition_check_default(dev);
+}
diff --git a/drivers/net/txgbe/txgbe_rxtx_vec_sse.c b/drivers/net/txgbe/txgbe_rxtx_vec_sse.c
new file mode 100644
index 0000000000..dec4b56e0a
--- /dev/null
+++ b/drivers/net/txgbe/txgbe_rxtx_vec_sse.c
@@ -0,0 +1,736 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2015-2024 Beijing WangXun Technology Co., Ltd.
+ * Copyright(c) 2010-2015 Intel Corporation
+ */
+
+#include <ethdev_driver.h>
+#include <rte_malloc.h>
+
+#include "txgbe_ethdev.h"
+#include "txgbe_rxtx.h"
+#include "txgbe_rxtx_vec_common.h"
+
+#include <tmmintrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+txgbe_rxq_rearm(struct txgbe_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile struct txgbe_rx_desc *rxdp;
+	struct txgbe_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+	struct rte_mbuf *mb0, *mb1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+			RTE_PKTMBUF_HEADROOM);
+	__m128i dma_addr0, dma_addr1;
+
+	const __m128i hba_msk = _mm_set_epi64x(0, UINT64_MAX);
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mb_pool,
+				 (void *)rxep,
+				 RTE_TXGBE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + RTE_TXGBE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < RTE_TXGBE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i],
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			RTE_TXGBE_RXQ_REARM_THRESH;
+		return;
+	}
+
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < RTE_TXGBE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* set Header Buffer Address to zero */
+		dma_addr0 =  _mm_and_si128(dma_addr0, hba_msk);
+		dma_addr1 =  _mm_and_si128(dma_addr1, hba_msk);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)rxdp++, dma_addr0);
+		_mm_store_si128((__m128i *)rxdp++, dma_addr1);
+	}
+
+	rxq->rxrearm_start += RTE_TXGBE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= RTE_TXGBE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	txgbe_set32(rxq->rdt_reg_addr, rx_id);
+}
+
+#ifdef RTE_LIB_SECURITY
+static inline void
+desc_to_olflags_v_ipsec(__m128i descs[4], struct rte_mbuf **rx_pkts)
+{
+	__m128i sterr, rearm, tmp_e, tmp_p;
+	uint32_t *rearm0 = (uint32_t *)rx_pkts[0]->rearm_data + 2;
+	uint32_t *rearm1 = (uint32_t *)rx_pkts[1]->rearm_data + 2;
+	uint32_t *rearm2 = (uint32_t *)rx_pkts[2]->rearm_data + 2;
+	uint32_t *rearm3 = (uint32_t *)rx_pkts[3]->rearm_data + 2;
+	const __m128i ipsec_sterr_msk =
+			_mm_set1_epi32(TXGBE_RXD_STAT_SECP |
+				       TXGBE_RXD_ERR_SECERR);
+	const __m128i ipsec_proc_msk  =
+			_mm_set1_epi32(TXGBE_RXD_STAT_SECP);
+	const __m128i ipsec_err_flag  =
+			_mm_set1_epi32(RTE_MBUF_F_RX_SEC_OFFLOAD_FAILED |
+				       RTE_MBUF_F_RX_SEC_OFFLOAD);
+	const __m128i ipsec_proc_flag = _mm_set1_epi32(RTE_MBUF_F_RX_SEC_OFFLOAD);
+
+	rearm = _mm_set_epi32(*rearm3, *rearm2, *rearm1, *rearm0);
+	sterr = _mm_set_epi32(_mm_extract_epi32(descs[3], 2),
+			      _mm_extract_epi32(descs[2], 2),
+			      _mm_extract_epi32(descs[1], 2),
+			      _mm_extract_epi32(descs[0], 2));
+	sterr = _mm_and_si128(sterr, ipsec_sterr_msk);
+	tmp_e = _mm_cmpeq_epi32(sterr, ipsec_sterr_msk);
+	tmp_p = _mm_cmpeq_epi32(sterr, ipsec_proc_msk);
+	sterr = _mm_or_si128(_mm_and_si128(tmp_e, ipsec_err_flag),
+				_mm_and_si128(tmp_p, ipsec_proc_flag));
+	rearm = _mm_or_si128(rearm, sterr);
+	*rearm0 = _mm_extract_epi32(rearm, 0);
+	*rearm1 = _mm_extract_epi32(rearm, 1);
+	*rearm2 = _mm_extract_epi32(rearm, 2);
+	*rearm3 = _mm_extract_epi32(rearm, 3);
+}
+#endif
+
+static inline void
+desc_to_olflags_v(__m128i descs[4], __m128i mbuf_init, uint8_t vlan_flags,
+	struct rte_mbuf **rx_pkts)
+{
+	__m128i ptype0, ptype1, vtag0, vtag1, csum, vp;
+	__m128i rearm0, rearm1, rearm2, rearm3;
+
+	/* mask everything except rss type */
+	const __m128i rsstype_msk = _mm_set_epi16(0x0000, 0x0000, 0x0000, 0x0000,
+						  0x000F, 0x000F, 0x000F, 0x000F);
+
+	/* mask the lower byte of ol_flags */
+	const __m128i ol_flags_msk = _mm_set_epi16(0x0000, 0x0000, 0x0000, 0x0000,
+						   0x00FF, 0x00FF, 0x00FF, 0x00FF);
+
+	/* map rss type to rss hash flag */
+	const __m128i rss_flags = _mm_set_epi8(RTE_MBUF_F_RX_FDIR, 0, 0, 0,
+			0, 0, 0, RTE_MBUF_F_RX_RSS_HASH,
+			RTE_MBUF_F_RX_RSS_HASH, 0, RTE_MBUF_F_RX_RSS_HASH, 0,
+			RTE_MBUF_F_RX_RSS_HASH, RTE_MBUF_F_RX_RSS_HASH, RTE_MBUF_F_RX_RSS_HASH, 0);
+
+	/* mask everything except vlan present and l4/ip csum error */
+	const __m128i vlan_csum_msk =
+		_mm_set_epi16((TXGBE_RXD_ERR_L4CS | TXGBE_RXD_ERR_IPCS) >> 16,
+			      (TXGBE_RXD_ERR_L4CS | TXGBE_RXD_ERR_IPCS) >> 16,
+			      (TXGBE_RXD_ERR_L4CS | TXGBE_RXD_ERR_IPCS) >> 16,
+			      (TXGBE_RXD_ERR_L4CS | TXGBE_RXD_ERR_IPCS) >> 16,
+			      TXGBE_RXD_STAT_VLAN, TXGBE_RXD_STAT_VLAN,
+			      TXGBE_RXD_STAT_VLAN, TXGBE_RXD_STAT_VLAN);
+
+	/* map vlan present and l4/ip csum error to ol_flags */
+	const __m128i vlan_csum_map_lo = _mm_set_epi8(0, 0, 0, 0,
+		vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_BAD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+		vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_BAD,
+		vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_GOOD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+		vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_GOOD,
+		0, 0, 0, 0,
+		RTE_MBUF_F_RX_IP_CKSUM_BAD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+		RTE_MBUF_F_RX_IP_CKSUM_BAD,
+		RTE_MBUF_F_RX_IP_CKSUM_GOOD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+		RTE_MBUF_F_RX_IP_CKSUM_GOOD);
+
+	const __m128i vlan_csum_map_hi = _mm_set_epi8(0, 0, 0, 0,
+		0, RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+		RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t),
+		0, 0, 0, 0,
+		0, RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+		RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t));
+
+	const __m128i vtag_msk = _mm_set_epi16(0x0000, 0x0000, 0x0000, 0x0000,
+					       0x000F, 0x000F, 0x000F, 0x000F);
+
+	ptype0 = _mm_unpacklo_epi16(descs[0], descs[1]);
+	ptype1 = _mm_unpacklo_epi16(descs[2], descs[3]);
+	vtag0 = _mm_unpackhi_epi16(descs[0], descs[1]);
+	vtag1 = _mm_unpackhi_epi16(descs[2], descs[3]);
+
+	ptype0 = _mm_unpacklo_epi32(ptype0, ptype1);
+	ptype0 = _mm_and_si128(ptype0, rsstype_msk);
+	ptype0 = _mm_shuffle_epi8(rss_flags, ptype0);
+
+	vtag1 = _mm_unpacklo_epi32(vtag0, vtag1);
+	vtag1 = _mm_and_si128(vtag1, vlan_csum_msk);
+
+	/* csum bits are in the most significant, to use shuffle we need to
+	 * shift them. Change mask from 0xc000 to 0x0003.
+	 */
+	csum = _mm_srli_epi16(vtag1, 14);
+
+	/* Change vlan present mask from 0x20 to 0x08. */
+	vp = _mm_srli_epi16(vtag1, 2);
+
+	/* now or the most significant 64 bits containing the checksum
+	 * flags with the vlan present flags.
+	 */
+	csum = _mm_srli_si128(csum, 8);
+	vtag1 = _mm_or_si128(csum, vtag1);
+	vtag1 = _mm_or_si128(vtag1, vp);
+	vtag1 = _mm_and_si128(vtag1, vtag_msk);
+
+	/* convert STAT_VLAN, ERR_IPCS, ERR_L4CS to ol_flags */
+	vtag0 = _mm_shuffle_epi8(vlan_csum_map_hi, vtag1);
+	vtag0 = _mm_slli_epi16(vtag0, sizeof(uint8_t));
+
+	vtag1 = _mm_shuffle_epi8(vlan_csum_map_lo, vtag1);
+	vtag1 = _mm_and_si128(vtag1, ol_flags_msk);
+	vtag1 = _mm_or_si128(vtag0, vtag1);
+
+	vtag1 = _mm_or_si128(ptype0, vtag1);
+
+	/*
+	 * At this point, we have the 4 sets of flags in the low 64-bits
+	 * of vtag1 (4x16).
+	 * We want to extract these, and merge them with the mbuf init data
+	 * so we can do a single 16-byte write to the mbuf to set the flags
+	 * and all the other initialization fields. Extracting the
+	 * appropriate flags means that we have to do a shift and blend for
+	 * each mbuf before we do the write.
+	 */
+	rearm0 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vtag1, 8), 0x10);
+	rearm1 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vtag1, 6), 0x10);
+	rearm2 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vtag1, 4), 0x10);
+	rearm3 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vtag1, 2), 0x10);
+
+	/* write the rearm data and the olflags in one write */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+			offsetof(struct rte_mbuf, rearm_data) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+			RTE_ALIGN(offsetof(struct rte_mbuf, rearm_data), 16));
+	_mm_store_si128((__m128i *)&rx_pkts[0]->rearm_data, rearm0);
+	_mm_store_si128((__m128i *)&rx_pkts[1]->rearm_data, rearm1);
+	_mm_store_si128((__m128i *)&rx_pkts[2]->rearm_data, rearm2);
+	_mm_store_si128((__m128i *)&rx_pkts[3]->rearm_data, rearm3);
+}
+
+static inline void
+desc_to_ptype_v(__m128i descs[4], uint16_t pkt_type_mask,
+		struct rte_mbuf **rx_pkts)
+{
+	__m128i ptype_mask = _mm_set_epi32(pkt_type_mask, pkt_type_mask,
+					pkt_type_mask, pkt_type_mask);
+
+	__m128i ptype0 = _mm_unpacklo_epi32(descs[0], descs[2]);
+	__m128i ptype1 = _mm_unpacklo_epi32(descs[1], descs[3]);
+
+	/* interleave low 32 bits,
+	 * now we have 4 ptypes in a XMM register
+	 */
+	ptype0 = _mm_unpacklo_epi32(ptype0, ptype1);
+
+	/* shift right by TXGBE_RXD_PTID_SHIFT, and apply ptype mask */
+	ptype0 = _mm_and_si128(_mm_srli_epi32(ptype0, TXGBE_RXD_PTID_SHIFT),
+			       ptype_mask);
+
+	rx_pkts[0]->packet_type = txgbe_decode_ptype(_mm_extract_epi32(ptype0, 0));
+	rx_pkts[1]->packet_type = txgbe_decode_ptype(_mm_extract_epi32(ptype0, 1));
+	rx_pkts[2]->packet_type = txgbe_decode_ptype(_mm_extract_epi32(ptype0, 2));
+	rx_pkts[3]->packet_type = txgbe_decode_ptype(_mm_extract_epi32(ptype0, 3));
+}
+
+/*
+ * vPMD raw receive routine, only accepts nb_pkts >= RTE_TXGBE_DESCS_PER_LOOP
+ *
+ * Notice:
+ * - nb_pkts < RTE_TXGBE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts is floor-aligned to a multiple of RTE_TXGBE_DESCS_PER_LOOP
+ */
+static inline uint16_t
+_recv_raw_pkts_vec(struct txgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts, uint8_t *split_packet)
+{
+	volatile struct txgbe_rx_desc *rxdp;
+	struct txgbe_rx_entry *sw_ring;
+	uint16_t nb_pkts_recd;
+#ifdef RTE_LIB_SECURITY
+	uint8_t use_ipsec = rxq->using_ipsec;
+#endif
+	int pos;
+	uint64_t var;
+	__m128i shuf_msk;
+	__m128i crc_adjust = _mm_set_epi16(0, 0, 0, /* ignore non-length fields */
+				-rxq->crc_len, /* sub crc on data_len */
+				0,             /* ignore high-16bits of pkt_len */
+				-rxq->crc_len, /* sub crc on pkt_len */
+				0, 0);         /* ignore pkt_type field */
+
+	/*
+	 * compile-time check the above crc_adjust layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi16
+	 * call above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	__m128i dd_check, eop_check;
+	__m128i mbuf_init;
+	uint8_t vlan_flags;
+
+	/*
+	 * If `rx_tail` wraps back to zero and advances faster than
+	 * `rxrearm_start`, it can catch up with `rxrearm_start` and
+	 * surpass it. This may cause some mbufs to be reused by the
+	 * application.
+	 *
+	 * So restrict the burst size to ensure that `rx_tail` will
+	 * not exceed `rxrearm_start`.
+	 */
+	nb_pkts = RTE_MIN(nb_pkts, RTE_TXGBE_RXQ_REARM_THRESH);
+
+	/* nb_pkts has to be floor-aligned to RTE_TXGBE_DESCS_PER_LOOP */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_TXGBE_DESCS_PER_LOOP);
+
+	/* Just the act of getting into the function from the application is
+	 * going to cost about 7 cycles
+	 */
+	rxdp = rxq->rx_ring + rxq->rx_tail;
+
+	rte_prefetch0(rxdp);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > RTE_TXGBE_RXQ_REARM_THRESH)
+		txgbe_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->qw1.lo.status &
+				rte_cpu_to_le_32(TXGBE_RXD_STAT_DD)))
+		return 0;
+
+	/* 4 packets DD mask */
+	dd_check = _mm_set_epi64x(0x0000000100000001LL, 0x0000000100000001LL);
+
+	/* 4 packets EOP mask */
+	eop_check = _mm_set_epi64x(0x0000000200000002LL, 0x0000000200000002LL);
+
+	/* mask to shuffle from desc. to mbuf */
+	shuf_msk = _mm_set_epi8(7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+		15, 14,      /* octet 14~15, low 16 bits vlan_macip */
+		13, 12,      /* octet 12~13, 16 bits data_len */
+		0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+		13, 12,      /* octet 12~13, low 16 bits pkt_len */
+		0xFF, 0xFF,  /* skip 32 bit pkt_type */
+		0xFF, 0xFF);
+	/*
+	 * Compile-time verify the shuffle mask
+	 * NOTE: some field positions already verified above, but duplicated
+	 * here for completeness in case of future modifications.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	mbuf_init = _mm_set_epi64x(0, rxq->mbuf_initializer);
+
+	/* Cache is empty -> need to scan the buffer rings, but first move
+	 * the next 'n' mbufs into the cache
+	 */
+	sw_ring = &rxq->sw_ring[rxq->rx_tail];
+
+	/* ensure these 2 flags are in the lower 8 bits */
+	RTE_BUILD_BUG_ON((RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED) > UINT8_MAX);
+	vlan_flags = rxq->vlan_flags & UINT8_MAX;
+
+	/* A. load 4 packet in one loop
+	 * [A*. mask out 4 unused dirty field in desc]
+	 * B. copy 4 mbuf point from swring to rx_pkts
+	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
+	 * D. fill info. from desc to mbuf
+	 */
+	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
+			pos += RTE_TXGBE_DESCS_PER_LOOP,
+			rxdp += RTE_TXGBE_DESCS_PER_LOOP) {
+		__m128i descs[RTE_TXGBE_DESCS_PER_LOOP];
+		__m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
+		__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
+		/* 2 64 bit or 4 32 bit mbuf pointers in one XMM reg. */
+		__m128i mbp1;
+#if defined(RTE_ARCH_X86_64)
+		__m128i mbp2;
+#endif
+
+		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf points */
+		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
+
+		/* Read desc statuses backwards to avoid race condition */
+		/* A.1 load desc[3] */
+		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
+		rte_compiler_barrier();
+
+		/* B.2 copy 2 64 bit or 4 32 bit mbuf pointers into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos], mbp1);
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.1 load 2 64 bit mbuf pointers */
+		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
+#endif
+
+		/* A.1 load desc[2-0] */
+		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
+		rte_compiler_barrier();
+		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
+		rte_compiler_barrier();
+		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.2 copy 2 mbuf pointers into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos + 2], mbp2);
+#endif
+
+		if (split_packet) {
+			rte_mbuf_prefetch_part2(rx_pkts[pos]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+		}
+
+		/* avoid compiler reorder optimization */
+		rte_compiler_barrier();
+
+		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
+		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
+		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
+
+		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
+		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
+
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]);
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp1 = _mm_unpackhi_epi32(descs[1], descs[0]);
+
+		/* set ol_flags with vlan packet type */
+		desc_to_olflags_v(descs, mbuf_init, vlan_flags, &rx_pkts[pos]);
+
+#ifdef RTE_LIB_SECURITY
+		if (unlikely(use_ipsec))
+			desc_to_olflags_v_ipsec(descs, &rx_pkts[pos]);
+#endif
+
+		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
+		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
+		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
+
+		/* C.2 get 4 pkts staterr value  */
+		zero = _mm_xor_si128(dd_check, dd_check);
+		staterr = _mm_unpacklo_epi32(sterr_tmp1, sterr_tmp2);
+
+		/* D.3 copy final 3,4 data to rx_pkts */
+		_mm_storeu_si128((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+				pkt_mb4);
+		_mm_storeu_si128((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+				pkt_mb3);
+
+		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
+		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
+
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			__m128i eop_shuf_mask =
+				_mm_set_epi8(0xFF, 0xFF, 0xFF, 0xFF,
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     0x04, 0x0C, 0x00, 0x08);
+
+			/* and with mask to extract bits, flipping 1-0 */
+			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
+			/* the staterr values are not in order: the DD-bit
+			 * count does not depend on order, but end-of-packet
+			 * tracking does, so shuffle. This also compresses
+			 * the 32-bit values to 8-bit
+			 */
+			eop_bits = _mm_shuffle_epi8(eop_bits, eop_shuf_mask);
+			/* store the resulting 32-bit value */
+			*(int *)split_packet = _mm_cvtsi128_si32(eop_bits);
+			split_packet += RTE_TXGBE_DESCS_PER_LOOP;
+		}
+
+		/* C.3 calc available number of desc */
+		staterr = _mm_and_si128(staterr, dd_check);
+		staterr = _mm_packs_epi32(staterr, zero);
+
+		/* D.3 copy final 1,2 data to rx_pkts */
+		_mm_storeu_si128((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+				pkt_mb2);
+		_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
+				pkt_mb1);
+
+		desc_to_ptype_v(descs, TXGBE_PTID_MASK, &rx_pkts[pos]);
+
+		/* C.4 calc available number of desc */
+		var = rte_popcount64(_mm_cvtsi128_si64(staterr));
+		nb_pkts_recd += var;
+		if (likely(var != RTE_TXGBE_DESCS_PER_LOOP))
+			break;
+	}
+
+	/* Update our internal tail pointer */
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
+	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
+
+	return nb_pkts_recd;
+}
+
+/*
+ * vPMD receive routine, only accepts nb_pkts >= RTE_TXGBE_DESCS_PER_LOOP
+ *
+ * Notice:
+ * - if nb_pkts < RTE_TXGBE_DESCS_PER_LOOP, no packet is returned
+ * - nb_pkts is floor-aligned to a multiple of RTE_TXGBE_DESCS_PER_LOOP
+ */
+uint16_t
+txgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets
+ *
+ * Notice:
+ * - if nb_pkts < RTE_TXGBE_DESCS_PER_LOOP, no packet is returned
+ * - nb_pkts is floor-aligned to a multiple of RTE_TXGBE_DESCS_PER_LOOP
+ */
+static uint16_t
+txgbe_recv_scattered_burst_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			       uint16_t nb_pkts)
+{
+	struct txgbe_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[RTE_TXGBE_MAX_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+			split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+	if (rxq->pkt_first_seg == NULL &&
+			split_fl64[0] == 0 && split_fl64[1] == 0 &&
+			split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly */
+	unsigned int i = 0;
+	if (rxq->pkt_first_seg == NULL) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+		rxq->pkt_first_seg = rx_pkts[i];
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+		&split_flags[i]);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets.
+ */
+uint16_t
+txgbe_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			      uint16_t nb_pkts)
+{
+	uint16_t retval = 0;
+
+	while (nb_pkts > RTE_TXGBE_MAX_RX_BURST) {
+		uint16_t burst;
+
+		burst = txgbe_recv_scattered_burst_vec(rx_queue,
+						       rx_pkts + retval,
+						       RTE_TXGBE_MAX_RX_BURST);
+		retval += burst;
+		nb_pkts -= burst;
+		if (burst < RTE_TXGBE_MAX_RX_BURST)
+			return retval;
+	}
+
+	return retval + txgbe_recv_scattered_burst_vec(rx_queue,
+						       rx_pkts + retval,
+						       nb_pkts);
+}
+
+static inline void
+vtx1(volatile struct txgbe_tx_desc *txdp,
+		struct rte_mbuf *pkt, uint64_t flags)
+{
+	__m128i descriptor = _mm_set_epi64x((uint64_t)pkt->pkt_len << 45 |
+			flags | pkt->data_len,
+			pkt->buf_iova + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+vtx(volatile struct txgbe_tx_desc *txdp,
+		struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
+		vtx1(txdp, *pkt, flags);
+}
+
+uint16_t
+txgbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			   uint16_t nb_pkts)
+{
+	struct txgbe_tx_queue *txq = (struct txgbe_tx_queue *)tx_queue;
+	volatile struct txgbe_tx_desc *txdp;
+	struct txgbe_tx_entry_v *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = TXGBE_TXD_FLAGS;
+	uint64_t rs = TXGBE_TXD_FLAGS;
+	int i;
+
+	/* crossing the tx_free_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_free_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		txgbe_tx_free_bufs(txq);
+
+	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring_v[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	nb_commit = nb_pkts;
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
+			vtx1(txdp, *tx_pkts, flags);
+
+		vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring_v[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+
+	txq->tx_tail = tx_id;
+
+	txgbe_set32(txq->tdt_reg_addr, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+static void __rte_cold
+txgbe_tx_queue_release_mbufs_vec(struct txgbe_tx_queue *txq)
+{
+	_txgbe_tx_queue_release_mbufs_vec(txq);
+}
+
+void __rte_cold
+txgbe_rx_queue_release_mbufs_vec(struct txgbe_rx_queue *rxq)
+{
+	_txgbe_rx_queue_release_mbufs_vec(rxq);
+}
+
+static void __rte_cold
+txgbe_tx_free_swring(struct txgbe_tx_queue *txq)
+{
+	_txgbe_tx_free_swring_vec(txq);
+}
+
+static void __rte_cold
+txgbe_reset_tx_queue(struct txgbe_tx_queue *txq)
+{
+	_txgbe_reset_tx_queue_vec(txq);
+}
+
+static const struct txgbe_txq_ops vec_txq_ops = {
+	.release_mbufs = txgbe_tx_queue_release_mbufs_vec,
+	.free_swring = txgbe_tx_free_swring,
+	.reset = txgbe_reset_tx_queue,
+};
+
+int __rte_cold
+txgbe_rxq_vec_setup(struct txgbe_rx_queue *rxq)
+{
+	return txgbe_rxq_vec_setup_default(rxq);
+}
+
+int __rte_cold
+txgbe_txq_vec_setup(struct txgbe_tx_queue *txq)
+{
+	return txgbe_txq_vec_setup_default(txq, &vec_txq_ops);
+}
+
+int __rte_cold
+txgbe_rx_vec_dev_conf_condition_check(struct rte_eth_dev *dev)
+{
+	return txgbe_rx_vec_dev_conf_condition_check_default(dev);
+}
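
For illustration, a minimal sketch of how an application exercises the
vector paths once these callbacks are selected (port/queue ids and the
burst size of 32 are arbitrary, device and mempool setup are omitted):

	struct rte_mbuf *pkts[32];
	uint16_t nb_rx, nb_tx, i;

	/* the vector Rx/Tx callbacks registered above are invoked
	 * transparently through the generic burst API
	 */
	nb_rx = rte_eth_rx_burst(port_id, 0, pkts, 32);
	if (nb_rx != 0) {
		nb_tx = rte_eth_tx_burst(port_id, 0, pkts, nb_rx);
		for (i = nb_tx; i < nb_rx; i++)
			rte_pktmbuf_free(pkts[i]); /* drop what Tx refused */
	}
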
-- 
2.27.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/2] net/ngbe: add vectorized functions for Rx/Tx
  2024-02-01  3:00 [PATCH 0/2] Wangxun support vector Rx/Tx Jiawen Wu
  2024-02-01  3:00 ` [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx Jiawen Wu
@ 2024-02-01  3:00 ` Jiawen Wu
  2024-02-06  1:50 ` [PATCH 0/2] Wangxun support vector Rx/Tx Jiawen Wu
  2 siblings, 0 replies; 10+ messages in thread
From: Jiawen Wu @ 2024-02-01  3:00 UTC (permalink / raw)
  To: dev; +Cc: Jiawen Wu

To optimize Rx/Tx burst process, add SSE/NEON vector instructions on
x86/arm architecture.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
---
 drivers/net/ngbe/meson.build            |   6 +
 drivers/net/ngbe/ngbe_ethdev.c          |   6 +
 drivers/net/ngbe/ngbe_ethdev.h          |   1 +
 drivers/net/ngbe/ngbe_rxtx.c            | 161 +++++-
 drivers/net/ngbe/ngbe_rxtx.h            |  32 +-
 drivers/net/ngbe/ngbe_rxtx_vec_common.h | 296 ++++++++++
 drivers/net/ngbe/ngbe_rxtx_vec_neon.c   | 604 +++++++++++++++++++++
 drivers/net/ngbe/ngbe_rxtx_vec_sse.c    | 692 ++++++++++++++++++++++++
 8 files changed, 1794 insertions(+), 4 deletions(-)
 create mode 100644 drivers/net/ngbe/ngbe_rxtx_vec_common.h
 create mode 100644 drivers/net/ngbe/ngbe_rxtx_vec_neon.c
 create mode 100644 drivers/net/ngbe/ngbe_rxtx_vec_sse.c

diff --git a/drivers/net/ngbe/meson.build b/drivers/net/ngbe/meson.build
index 8b5195aab3..5d395ee17f 100644
--- a/drivers/net/ngbe/meson.build
+++ b/drivers/net/ngbe/meson.build
@@ -19,4 +19,10 @@ sources = files(
 
 deps += ['hash']
 
+if arch_subdir == 'x86'
+	sources += files('ngbe_rxtx_vec_sse.c')
+elif arch_subdir == 'arm'
+	sources += files('ngbe_rxtx_vec_neon.c')
+endif
+
 includes += include_directories('base')
diff --git a/drivers/net/ngbe/ngbe_ethdev.c b/drivers/net/ngbe/ngbe_ethdev.c
index 478da014b2..39306e353c 100644
--- a/drivers/net/ngbe/ngbe_ethdev.c
+++ b/drivers/net/ngbe/ngbe_ethdev.c
@@ -932,6 +932,7 @@ ngbe_dev_configure(struct rte_eth_dev *dev)
 	 * allocation Rx preconditions we will reset it.
 	 */
 	adapter->rx_bulk_alloc_allowed = true;
+	adapter->rx_vec_allowed = true;
 
 	return 0;
 }
@@ -1872,6 +1873,11 @@ ngbe_dev_supported_ptypes_get(struct rte_eth_dev *dev)
 	    dev->rx_pkt_burst == ngbe_recv_pkts_bulk_alloc)
 		return ngbe_get_supported_ptypes();
 
+#if defined(RTE_ARCH_X86) || defined(__ARM_NEON)
+	if (dev->rx_pkt_burst == ngbe_recv_pkts_vec ||
+	    dev->rx_pkt_burst == ngbe_recv_scattered_pkts_vec)
+		return ngbe_get_supported_ptypes();
+#endif
 	return NULL;
 }
 
diff --git a/drivers/net/ngbe/ngbe_ethdev.h b/drivers/net/ngbe/ngbe_ethdev.h
index 3cde7c8750..8aef6553a0 100644
--- a/drivers/net/ngbe/ngbe_ethdev.h
+++ b/drivers/net/ngbe/ngbe_ethdev.h
@@ -130,6 +130,7 @@ struct ngbe_adapter {
 	struct ngbe_vf_info        *vfdata;
 	struct ngbe_uta_info       uta_info;
 	bool                       rx_bulk_alloc_allowed;
+	bool                       rx_vec_allowed;
 	struct rte_timecounter     systime_tc;
 	struct rte_timecounter     rx_tstamp_tc;
 	struct rte_timecounter     tx_tstamp_tc;
diff --git a/drivers/net/ngbe/ngbe_rxtx.c b/drivers/net/ngbe/ngbe_rxtx.c
index 8a873b858e..f3e741cbad 100644
--- a/drivers/net/ngbe/ngbe_rxtx.c
+++ b/drivers/net/ngbe/ngbe_rxtx.c
@@ -10,6 +10,7 @@
 #include <ethdev_driver.h>
 #include <rte_malloc.h>
 #include <rte_net.h>
+#include <rte_vect.h>
 
 #include "ngbe_logs.h"
 #include "base/ngbe.h"
@@ -267,6 +268,27 @@ ngbe_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
 	return nb_tx;
 }
 
+static uint16_t
+ngbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		   uint16_t nb_pkts)
+{
+	struct ngbe_tx_queue *txq = (struct ngbe_tx_queue *)tx_queue;
+	uint16_t nb_tx = 0;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_free_thresh);
+		ret = ngbe_xmit_fixed_burst_vec(tx_queue, &tx_pkts[nb_tx], num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
+
 static inline void
 ngbe_set_xmit_ctx(struct ngbe_tx_queue *txq,
 		volatile struct ngbe_tx_ctx_desc *ctx_txd,
@@ -1858,8 +1880,15 @@ ngbe_set_tx_function(struct rte_eth_dev *dev, struct ngbe_tx_queue *txq)
 	if (txq->offloads == 0 &&
 			txq->tx_free_thresh >= RTE_PMD_NGBE_TX_MAX_BURST) {
 		PMD_INIT_LOG(DEBUG, "Using simple tx code path");
-		dev->tx_pkt_burst = ngbe_xmit_pkts_simple;
 		dev->tx_pkt_prepare = NULL;
+		if (txq->tx_free_thresh <= RTE_NGBE_TX_MAX_FREE_BUF_SZ &&
+				(rte_eal_process_type() != RTE_PROC_PRIMARY ||
+					ngbe_txq_vec_setup(txq) == 0)) {
+			PMD_INIT_LOG(DEBUG, "Vector tx enabled.");
+			dev->tx_pkt_burst = ngbe_xmit_pkts_vec;
+		} else {
+			dev->tx_pkt_burst = ngbe_xmit_pkts_simple;
+		}
 	} else {
 		PMD_INIT_LOG(DEBUG, "Using full-featured tx code path");
 		PMD_INIT_LOG(DEBUG,
@@ -1880,6 +1909,11 @@ static const struct {
 } ngbe_tx_burst_infos[] = {
 	{ ngbe_xmit_pkts_simple,   "Scalar Simple"},
 	{ ngbe_xmit_pkts,          "Scalar"},
+#ifdef RTE_ARCH_X86
+	{ ngbe_xmit_pkts_vec,      "Vector SSE" },
+#elif defined(RTE_ARCH_ARM64)
+	{ ngbe_xmit_pkts_vec,      "Vector Neon" },
+#endif
 };
 
 int
@@ -2066,6 +2100,12 @@ ngbe_rx_queue_release_mbufs(struct ngbe_rx_queue *rxq)
 {
 	unsigned int i;
 
+	/* The vector Rx driver releases mbufs in a different way. */
+	if (rxq->rx_using_sse) {
+		ngbe_rx_queue_release_mbufs_vec(rxq);
+		return;
+	}
+
 	if (rxq->sw_ring != NULL) {
 		for (i = 0; i < rxq->nb_rx_desc; i++) {
 			if (rxq->sw_ring[i].mbuf != NULL) {
@@ -2189,6 +2229,11 @@ ngbe_reset_rx_queue(struct ngbe_adapter *adapter, struct ngbe_rx_queue *rxq)
 	rxq->nb_rx_hold = 0;
 	rxq->pkt_first_seg = NULL;
 	rxq->pkt_last_seg = NULL;
+
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+	rxq->rxrearm_start = 0;
+	rxq->rxrearm_nb = 0;
+#endif
 }
 
 uint64_t
@@ -2339,6 +2384,16 @@ ngbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		     rxq->sw_ring, rxq->sw_sc_ring, rxq->rx_ring,
 		     rxq->rx_ring_phys_addr);
 
+	if (!rte_is_power_of_2(nb_desc)) {
+		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
+				    "preconditions - canceling the feature for "
+				    "the whole port[%d]",
+			     rxq->queue_id, rxq->port_id);
+		adapter->rx_vec_allowed = false;
+	} else {
+		ngbe_rxq_vec_setup(rxq);
+	}
+
 	dev->data->rx_queues[queue_idx] = rxq;
 
 	ngbe_reset_rx_queue(adapter, rxq);
@@ -2379,7 +2434,12 @@ ngbe_dev_rx_descriptor_status(void *rx_queue, uint16_t offset)
 	if (unlikely(offset >= rxq->nb_rx_desc))
 		return -EINVAL;
 
-	nb_hold = rxq->nb_rx_hold;
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+	if (rxq->rx_using_sse)
+		nb_hold = rxq->rxrearm_nb;
+	else
+#endif
+		nb_hold = rxq->nb_rx_hold;
 	if (offset >= rxq->nb_rx_desc - nb_hold)
 		return RTE_ETH_RX_DESC_UNAVAIL;
 
@@ -2740,14 +2800,33 @@ ngbe_dev_mq_rx_configure(struct rte_eth_dev *dev)
 void
 ngbe_set_rx_function(struct rte_eth_dev *dev)
 {
+	uint16_t i, rx_using_sse;
 	struct ngbe_adapter *adapter = ngbe_dev_adapter(dev);
 
+	/*
+	 * In order to allow Vector Rx there are a few configuration
+	 * conditions to be met and Rx Bulk Allocation should be allowed.
+	 */
+	if (ngbe_rx_vec_dev_conf_condition_check(dev) ||
+	    !adapter->rx_bulk_alloc_allowed ||
+			rte_vect_get_max_simd_bitwidth() < RTE_VECT_SIMD_128) {
+		PMD_INIT_LOG(DEBUG,
+			     "Port[%d] doesn't meet Vector Rx preconditions",
+			     dev->data->port_id);
+		adapter->rx_vec_allowed = false;
+	}
+
 	if (dev->data->scattered_rx) {
 		/*
 		 * Set the scattered callback: there are bulk and
 		 * single allocation versions.
 		 */
-		if (adapter->rx_bulk_alloc_allowed) {
+		if (adapter->rx_vec_allowed) {
+			PMD_INIT_LOG(DEBUG,
+				     "Using Vector Scattered Rx callback (port=%d).",
+				     dev->data->port_id);
+			dev->rx_pkt_burst = ngbe_recv_scattered_pkts_vec;
+		} else if (adapter->rx_bulk_alloc_allowed) {
 			PMD_INIT_LOG(DEBUG, "Using a Scattered with bulk "
 					   "allocation callback (port=%d).",
 				     dev->data->port_id);
@@ -2765,9 +2844,16 @@ ngbe_set_rx_function(struct rte_eth_dev *dev)
 	 * Below we set "simple" callbacks according to port/queues parameters.
 	 * If parameters allow we are going to choose between the following
 	 * callbacks:
+	 *    - Vector
 	 *    - Bulk Allocation
 	 *    - Single buffer allocation (the simplest one)
 	 */
+	} else if (adapter->rx_vec_allowed) {
+		PMD_INIT_LOG(DEBUG, "Vector Rx enabled, please make sure the Rx "
+				    "burst size is no less than %d (port=%d).",
+			     RTE_NGBE_DESCS_PER_LOOP,
+			     dev->data->port_id);
+		dev->rx_pkt_burst = ngbe_recv_pkts_vec;
 	} else if (adapter->rx_bulk_alloc_allowed) {
 		PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions are "
 				    "satisfied. Rx Burst Bulk Alloc function "
@@ -2783,6 +2869,15 @@ ngbe_set_rx_function(struct rte_eth_dev *dev)
 
 		dev->rx_pkt_burst = ngbe_recv_pkts;
 	}
+
+	rx_using_sse = (dev->rx_pkt_burst == ngbe_recv_scattered_pkts_vec ||
+			dev->rx_pkt_burst == ngbe_recv_pkts_vec);
+
+	for (i = 0; i < dev->data->nb_rx_queues; i++) {
+		struct ngbe_rx_queue *rxq = dev->data->rx_queues[i];
+
+		rxq->rx_using_sse = rx_using_sse;
+	}
 }
 
 static const struct {
@@ -2793,6 +2888,13 @@ static const struct {
 	{ ngbe_recv_pkts_sc_bulk_alloc,      "Scalar Scattered Bulk Alloc"},
 	{ ngbe_recv_pkts_bulk_alloc,         "Scalar Bulk Alloc"},
 	{ ngbe_recv_pkts,                    "Scalar"},
+#ifdef RTE_ARCH_X86
+	{ ngbe_recv_scattered_pkts_vec,      "Vector SSE Scattered" },
+	{ ngbe_recv_pkts_vec,                "Vector SSE" },
+#elif defined(RTE_ARCH_ARM64)
+	{ ngbe_recv_scattered_pkts_vec,      "Vector Neon Scattered" },
+	{ ngbe_recv_pkts_vec,                "Vector Neon" },
+#endif
 };
 
 int
@@ -3311,3 +3413,56 @@ ngbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	qinfo->conf.offloads = txq->offloads;
 	qinfo->conf.tx_deferred_start = txq->tx_deferred_start;
 }
+
+/* Stubs needed for linkage when RTE_ARCH_PPC_64, RTE_ARCH_RISCV or
+ * RTE_ARCH_LOONGARCH is set.
+ */
+#if defined(RTE_ARCH_PPC_64) || defined(RTE_ARCH_RISCV) || \
+	defined(RTE_ARCH_LOONGARCH)
+int
+ngbe_rx_vec_dev_conf_condition_check(__rte_unused struct rte_eth_dev *dev)
+{
+	return -1;
+}
+
+uint16_t
+ngbe_recv_pkts_vec(__rte_unused void *rx_queue,
+		   __rte_unused struct rte_mbuf **rx_pkts,
+		   __rte_unused uint16_t nb_pkts)
+{
+	return 0;
+}
+
+uint16_t
+ngbe_recv_scattered_pkts_vec(__rte_unused void *rx_queue,
+			     __rte_unused struct rte_mbuf **rx_pkts,
+			     __rte_unused uint16_t nb_pkts)
+{
+	return 0;
+}
+
+int
+ngbe_rxq_vec_setup(__rte_unused struct ngbe_rx_queue *rxq)
+{
+	return -1;
+}
+
+uint16_t
+ngbe_xmit_fixed_burst_vec(__rte_unused void *tx_queue,
+			  __rte_unused struct rte_mbuf **tx_pkts,
+			  __rte_unused uint16_t nb_pkts)
+{
+	return 0;
+}
+
+int
+ngbe_txq_vec_setup(__rte_unused struct ngbe_tx_queue *txq)
+{
+	return -1;
+}
+
+void
+ngbe_rx_queue_release_mbufs_vec(__rte_unused struct ngbe_rx_queue *rxq)
+{
+}
+#endif
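
Since ngbe_rx_burst_infos[] and ngbe_tx_burst_infos[] above now carry the
vector entries, the datapath actually chosen can be verified at run time
through the generic burst-mode API; a minimal sketch, assuming port 0,
queue 0 and that these tables back the driver's burst-mode reporting:

	struct rte_eth_burst_mode mode;

	/* expected to report e.g. "Vector SSE" on x86 or "Vector Neon"
	 * on arm64 when the vector callbacks are in use
	 */
	if (rte_eth_rx_burst_mode_get(0, 0, &mode) == 0)
		printf("Rx burst mode: %s\n", mode.info);
	if (rte_eth_tx_burst_mode_get(0, 0, &mode) == 0)
		printf("Tx burst mode: %s\n", mode.info);
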
diff --git a/drivers/net/ngbe/ngbe_rxtx.h b/drivers/net/ngbe/ngbe_rxtx.h
index 9130f9d0df..41580ba0b9 100644
--- a/drivers/net/ngbe/ngbe_rxtx.h
+++ b/drivers/net/ngbe/ngbe_rxtx.h
@@ -203,6 +203,8 @@ struct ngbe_tx_desc {
 #define RTE_PMD_NGBE_RX_MAX_BURST 32
 #define RTE_NGBE_TX_MAX_FREE_BUF_SZ 64
 
+#define RTE_NGBE_DESCS_PER_LOOP    4
+
 #define RX_RING_SZ ((NGBE_RING_DESC_MAX + RTE_PMD_NGBE_RX_MAX_BURST) * \
 		    sizeof(struct ngbe_rx_desc))
 
@@ -237,6 +239,13 @@ struct ngbe_tx_entry {
 	uint16_t last_id; /**< Index of last scattered descriptor. */
 };
 
+/**
+ * Structure associated with each descriptor of the Tx ring of a Tx queue.
+ */
+struct ngbe_tx_entry_v {
+	struct rte_mbuf *mbuf; /**< mbuf associated with Tx desc, if any. */
+};
+
 /**
  * Structure associated with each Rx queue.
  */
@@ -254,6 +263,7 @@ struct ngbe_rx_queue {
 
 	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet */
 	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet */
+	uint64_t        mbuf_initializer; /**< value to init mbufs */
 	uint16_t        nb_rx_desc; /**< number of Rx descriptors */
 	uint16_t        rx_tail;  /**< current value of RDT register */
 	uint16_t        nb_rx_hold; /**< number of held free Rx desc */
@@ -262,6 +272,11 @@ struct ngbe_rx_queue {
 	uint16_t rx_next_avail; /**< idx of next staged pkt to ret to app */
 	uint16_t rx_free_trigger; /**< triggers rx buffer allocation */
 
+	uint8_t         rx_using_sse;   /**< indicates that vector Rx is in use */
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+	uint16_t        rxrearm_nb;     /**< number of remaining to be re-armed */
+	uint16_t        rxrearm_start;  /**< the idx we start the re-arming from */
+#endif
 	uint16_t        rx_free_thresh; /**< max free Rx desc to hold */
 	uint16_t        queue_id; /**< RX queue index */
 	uint16_t        reg_idx;  /**< RX queue register index */
@@ -325,7 +340,12 @@ struct ngbe_tx_queue {
 	volatile struct ngbe_tx_desc *tx_ring;
 
 	uint64_t             tx_ring_phys_addr; /**< Tx ring DMA address */
-	struct ngbe_tx_entry *sw_ring; /**< address of SW ring for scalar PMD */
+	union {
+		/**< address of SW ring for scalar PMD. */
+		struct ngbe_tx_entry *sw_ring;
+		/**< address of SW ring for vector PMD */
+		struct ngbe_tx_entry_v *sw_ring_v;
+	};
 	volatile uint32_t    *tdt_reg_addr; /**< Address of TDT register */
 	volatile uint32_t    *tdc_reg_addr; /**< Address of TDC register */
 	uint16_t             nb_tx_desc;    /**< number of Tx descriptors */
@@ -368,6 +388,16 @@ struct ngbe_txq_ops {
 void ngbe_set_tx_function(struct rte_eth_dev *dev, struct ngbe_tx_queue *txq);
 
 void ngbe_set_rx_function(struct rte_eth_dev *dev);
+uint16_t ngbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts);
+uint16_t ngbe_recv_scattered_pkts_vec(void *rx_queue,
+		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
+int ngbe_rx_vec_dev_conf_condition_check(struct rte_eth_dev *dev);
+int ngbe_rxq_vec_setup(struct ngbe_rx_queue *rxq);
+void ngbe_rx_queue_release_mbufs_vec(struct ngbe_rx_queue *rxq);
+uint16_t ngbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+				   uint16_t nb_pkts);
+int ngbe_txq_vec_setup(struct ngbe_tx_queue *txq);
 int ngbe_dev_tx_done_cleanup(void *tx_queue, uint32_t free_cnt);
 
 uint64_t ngbe_get_tx_port_offloads(struct rte_eth_dev *dev);
diff --git a/drivers/net/ngbe/ngbe_rxtx_vec_common.h b/drivers/net/ngbe/ngbe_rxtx_vec_common.h
new file mode 100644
index 0000000000..1ce175de52
--- /dev/null
+++ b/drivers/net/ngbe/ngbe_rxtx_vec_common.h
@@ -0,0 +1,296 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2015-2024 Beijing WangXun Technology Co., Ltd.
+ * Copyright(c) 2010-2015 Intel Corporation
+ */
+
+#ifndef _NGBE_RXTX_VEC_COMMON_H_
+#define _NGBE_RXTX_VEC_COMMON_H_
+#include <stdint.h>
+
+#include "ngbe_ethdev.h"
+#include "ngbe_rxtx.h"
+
+#define NGBE_RXD_PTID_SHIFT 9
+
+#define RTE_NGBE_RXQ_REARM_THRESH      32
+#define RTE_NGBE_MAX_RX_BURST          RTE_NGBE_RXQ_REARM_THRESH
+
+static inline uint16_t
+reassemble_packets(struct ngbe_rx_queue *rxq, struct rte_mbuf **rx_bufs,
+		   uint16_t nb_bufs, uint8_t *split_flags)
+{
+	struct rte_mbuf *pkts[nb_bufs]; /*finished pkts*/
+	struct rte_mbuf *start = rxq->pkt_first_seg;
+	struct rte_mbuf *end =  rxq->pkt_last_seg;
+	unsigned int pkt_idx, buf_idx;
+
+	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
+		if (end != NULL) {
+			/* processing a split packet */
+			end->next = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+
+			start->nb_segs++;
+			start->pkt_len += rx_bufs[buf_idx]->data_len;
+			end = end->next;
+
+			if (!split_flags[buf_idx]) {
+				/* it's the last packet of the set */
+				start->hash = end->hash;
+				start->ol_flags = end->ol_flags;
+				/* we need to strip crc for the whole packet */
+				start->pkt_len -= rxq->crc_len;
+				if (end->data_len > rxq->crc_len) {
+					end->data_len -= rxq->crc_len;
+				} else {
+					/* free up last mbuf */
+					struct rte_mbuf *secondlast = start;
+
+					start->nb_segs--;
+					while (secondlast->next != end)
+						secondlast = secondlast->next;
+					secondlast->data_len -= (rxq->crc_len -
+							end->data_len);
+					secondlast->next = NULL;
+					rte_pktmbuf_free_seg(end);
+				}
+				pkts[pkt_idx++] = start;
+				start = NULL;
+				end = NULL;
+			}
+		} else {
+			/* not processing a split packet */
+			if (!split_flags[buf_idx]) {
+				/* not a split packet, save and skip */
+				pkts[pkt_idx++] = rx_bufs[buf_idx];
+				continue;
+			}
+			start = rx_bufs[buf_idx];
+			end = start;
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
+		}
+	}
+
+	/* save the partial packet for next time */
+	rxq->pkt_first_seg = start;
+	rxq->pkt_last_seg = end;
+	memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
+	return pkt_idx;
+}
+
+static __rte_always_inline int
+ngbe_tx_free_bufs(struct ngbe_tx_queue *txq)
+{
+	struct ngbe_tx_entry_v *txep;
+	uint32_t status;
+	uint32_t n;
+	uint32_t i;
+	int nb_free = 0;
+	struct rte_mbuf *m, *free[RTE_NGBE_TX_MAX_FREE_BUF_SZ];
+
+	/* check DD bit on threshold descriptor */
+	status = txq->tx_ring[txq->tx_next_dd].dw3;
+	if (!(status & NGBE_TXD_DD)) {
+		if (txq->nb_tx_free >> 1 < txq->tx_free_thresh)
+			ngbe_set32_masked(txq->tdc_reg_addr,
+				NGBE_TXCFG_FLUSH, NGBE_TXCFG_FLUSH);
+		return 0;
+	}
+
+	n = txq->tx_free_thresh;
+
+	/*
+	 * first buffer to free from the S/W ring is at index
+	 * tx_next_dd - (tx_free_thresh - 1)
+	 */
+	txep = &txq->sw_ring_v[txq->tx_next_dd - (n - 1)];
+	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
+	if (likely(m != NULL)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (likely(m != NULL)) {
+				if (likely(m->pool == free[0]->pool)) {
+					free[nb_free++] = m;
+				} else {
+					rte_mempool_put_bulk(free[0]->pool,
+							(void *)free, nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	} else {
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m != NULL)
+				rte_mempool_put(m->pool, m);
+		}
+	}
+
+	/* buffers were freed, update counters */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_free_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_free_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_free_thresh - 1);
+
+	return txq->tx_free_thresh;
+}
+
+static __rte_always_inline void
+tx_backlog_entry(struct ngbe_tx_entry_v *txep,
+		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	int i;
+
+	for (i = 0; i < (int)nb_pkts; ++i)
+		txep[i].mbuf = tx_pkts[i];
+}
+
+static inline void
+_ngbe_tx_queue_release_mbufs_vec(struct ngbe_tx_queue *txq)
+{
+	unsigned int i;
+	struct ngbe_tx_entry_v *txe;
+	const uint16_t max_desc = (uint16_t)(txq->nb_tx_desc - 1);
+
+	if (txq->sw_ring == NULL || txq->nb_tx_free == max_desc)
+		return;
+
+	/* release the used mbufs in sw_ring */
+	for (i = txq->tx_next_dd - (txq->tx_free_thresh - 1);
+	     i != txq->tx_tail;
+	     i = (i + 1) % txq->nb_tx_desc) {
+		txe = &txq->sw_ring_v[i];
+		rte_pktmbuf_free_seg(txe->mbuf);
+	}
+	txq->nb_tx_free = max_desc;
+
+	/* reset tx_entry */
+	for (i = 0; i < txq->nb_tx_desc; i++) {
+		txe = &txq->sw_ring_v[i];
+		txe->mbuf = NULL;
+	}
+}
+
+static inline void
+_ngbe_rx_queue_release_mbufs_vec(struct ngbe_rx_queue *rxq)
+{
+	unsigned int i;
+
+	if (rxq->sw_ring == NULL || rxq->rxrearm_nb >= rxq->nb_rx_desc)
+		return;
+
+	/* free all mbufs that are valid in the ring */
+	if (rxq->rxrearm_nb == 0) {
+		for (i = 0; i < rxq->nb_rx_desc; i++) {
+			if (rxq->sw_ring[i].mbuf != NULL)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	} else {
+		for (i = rxq->rx_tail;
+		     i != rxq->rxrearm_start;
+		     i = (i + 1) % rxq->nb_rx_desc) {
+			if (rxq->sw_ring[i].mbuf != NULL)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	}
+
+	rxq->rxrearm_nb = rxq->nb_rx_desc;
+
+	/* set all entries to NULL */
+	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
+}
+
+static inline void
+_ngbe_tx_free_swring_vec(struct ngbe_tx_queue *txq)
+{
+	if (txq == NULL)
+		return;
+
+	if (txq->sw_ring != NULL) {
+		rte_free(txq->sw_ring_v - 1);
+		txq->sw_ring_v = NULL;
+	}
+}
+
+static inline void
+_ngbe_reset_tx_queue_vec(struct ngbe_tx_queue *txq)
+{
+	static const struct ngbe_tx_desc zeroed_desc = {0};
+	struct ngbe_tx_entry_v *txe = txq->sw_ring_v;
+	uint16_t i;
+
+	/* Zero out HW ring memory */
+	for (i = 0; i < txq->nb_tx_desc; i++)
+		txq->tx_ring[i] = zeroed_desc;
+
+	/* Initialize SW ring entries */
+	for (i = 0; i < txq->nb_tx_desc; i++) {
+		volatile struct ngbe_tx_desc *txd = &txq->tx_ring[i];
+
+		txd->dw3 = NGBE_TXD_DD;
+		txe[i].mbuf = NULL;
+	}
+
+	txq->tx_next_dd = (uint16_t)(txq->tx_free_thresh - 1);
+
+	txq->tx_tail = 0;
+	/*
+	 * Always allow 1 descriptor to be un-allocated to avoid
+	 * a H/W race condition
+	 */
+	txq->last_desc_cleaned = (uint16_t)(txq->nb_tx_desc - 1);
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_desc - 1);
+	txq->ctx_curr = 0;
+	memset((void *)&txq->ctx_cache, 0,
+		NGBE_CTX_NUM * sizeof(struct ngbe_ctx_info));
+}
+
+static inline int
+ngbe_rxq_vec_setup_default(struct ngbe_rx_queue *rxq)
+{
+	uintptr_t p;
+	struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+	mb_def.nb_segs = 1;
+	mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+	mb_def.port = rxq->port_id;
+	rte_mbuf_refcnt_set(&mb_def, 1);
+
+	/* prevent compiler reordering: rearm_data covers previous fields */
+	rte_compiler_barrier();
+	p = (uintptr_t)&mb_def.rearm_data;
+	rxq->mbuf_initializer = *(uint64_t *)p;
+	return 0;
+}
+
+static inline int
+ngbe_txq_vec_setup_default(struct ngbe_tx_queue *txq,
+			    const struct ngbe_txq_ops *txq_ops)
+{
+	if (txq->sw_ring_v == NULL)
+		return -1;
+
+	/* leave the first one for overflow */
+	txq->sw_ring_v = txq->sw_ring_v + 1;
+	txq->ops = txq_ops;
+
+	return 0;
+}
+
+static inline int
+ngbe_rx_vec_dev_conf_condition_check_default(struct rte_eth_dev *dev)
+{
+	RTE_SET_USED(dev);
+#ifndef RTE_LIBRTE_IEEE1588
+
+	return 0;
+#else
+	return -1;
+#endif
+}
+#endif
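
A short note on mbuf_initializer: ngbe_rxq_vec_setup_default() above
snapshots the 8-byte rearm_data area (data_off, refcnt, nb_segs, port) of
a template mbuf so that the rearm path can restore all four fields with a
single 64-bit store. A scalar sketch of the same idea, with a hypothetical
helper name:

	/* scalar equivalent of the vectorized rearm store: one 64-bit
	 * write re-initializes data_off/refcnt/nb_segs/port at once
	 */
	static inline void
	ngbe_rearm_one_mbuf(struct ngbe_rx_queue *rxq, struct rte_mbuf *mb)
	{
		*(uint64_t *)&mb->rearm_data = rxq->mbuf_initializer;
	}
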
diff --git a/drivers/net/ngbe/ngbe_rxtx_vec_neon.c b/drivers/net/ngbe/ngbe_rxtx_vec_neon.c
new file mode 100644
index 0000000000..e08fa56dbd
--- /dev/null
+++ b/drivers/net/ngbe/ngbe_rxtx_vec_neon.c
@@ -0,0 +1,604 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2015-2024 Beijing WangXun Technology Co., Ltd.
+ * Copyright(c) 2010-2015 Intel Corporation
+ */
+
+#include <ethdev_driver.h>
+#include <rte_malloc.h>
+#include <rte_vect.h>
+
+#include "ngbe_type.h"
+#include "ngbe_ethdev.h"
+#include "ngbe_rxtx.h"
+#include "ngbe_rxtx_vec_common.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+static inline void
+ngbe_rxq_rearm(struct ngbe_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile struct ngbe_rx_desc *rxdp;
+	struct ngbe_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+	struct rte_mbuf *mb0, *mb1;
+	uint64x2_t dma_addr0, dma_addr1;
+	uint64x2_t zero = vdupq_n_u64(0);
+	uint64_t paddr;
+	uint8x8_t p;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (unlikely(rte_mempool_get_bulk(rxq->mb_pool,
+					  (void *)rxep,
+					  RTE_NGBE_RXQ_REARM_THRESH) < 0)) {
+		if (rxq->rxrearm_nb + RTE_NGBE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			for (i = 0; i < RTE_NGBE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				vst1q_u64((uint64_t *)&rxdp[i], zero);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			RTE_NGBE_RXQ_REARM_THRESH;
+		return;
+	}
+
+	p = vld1_u8((uint8_t *)&rxq->mbuf_initializer);
+
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < RTE_NGBE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/*
+		 * Flush mbuf with pkt template.
+		 * Data to be rearmed is 6 bytes long.
+		 */
+		vst1_u8((uint8_t *)&mb0->rearm_data, p);
+		paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
+		dma_addr0 = vsetq_lane_u64(paddr, zero, 0);
+		/* flush desc with pa dma_addr */
+		vst1q_u64((uint64_t *)rxdp++, dma_addr0);
+
+		vst1_u8((uint8_t *)&mb1->rearm_data, p);
+		paddr = mb1->buf_iova + RTE_PKTMBUF_HEADROOM;
+		dma_addr1 = vsetq_lane_u64(paddr, zero, 0);
+		vst1q_u64((uint64_t *)rxdp++, dma_addr1);
+	}
+
+	rxq->rxrearm_start += RTE_NGBE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= RTE_NGBE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			     (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ngbe_set32(rxq->rdt_reg_addr, rx_id);
+}
+
+static inline void
+desc_to_olflags_v(uint8x16x2_t sterr_tmp1, uint8x16x2_t sterr_tmp2,
+		  uint8x16_t staterr, uint8_t vlan_flags,
+		  struct rte_mbuf **rx_pkts)
+{
+	uint8x16_t ptype;
+	uint8x16_t vtag_lo, vtag_hi, vtag;
+	uint8x16_t temp_csum, temp_vp;
+	uint8x16_t vtag_mask = vdupq_n_u8(0x0F);
+	uint32x4_t csum = {0, 0, 0, 0};
+
+	union {
+		uint16_t e[4];
+		uint64_t word;
+	} vol;
+
+	const uint8x16_t rsstype_msk = {
+			0x0F, 0x0F, 0x0F, 0x0F,
+			0x00, 0x00, 0x00, 0x00,
+			0x00, 0x00, 0x00, 0x00,
+			0x00, 0x00, 0x00, 0x00};
+
+	const uint8x16_t rss_flags = {
+			0, RTE_MBUF_F_RX_RSS_HASH, RTE_MBUF_F_RX_RSS_HASH, RTE_MBUF_F_RX_RSS_HASH,
+			0, RTE_MBUF_F_RX_RSS_HASH, 0, RTE_MBUF_F_RX_RSS_HASH,
+			RTE_MBUF_F_RX_RSS_HASH, 0, 0, 0,
+			0, 0, 0, RTE_MBUF_F_RX_FDIR};
+
+	/* mask everything except vlan present and l4/ip csum error */
+	const uint8x16_t vlan_csum_msk = {
+			NGBE_RXD_STAT_VLAN, NGBE_RXD_STAT_VLAN,
+			NGBE_RXD_STAT_VLAN, NGBE_RXD_STAT_VLAN,
+			0, 0, 0, 0,
+			0, 0, 0, 0,
+			(NGBE_RXD_ERR_L4CS | NGBE_RXD_ERR_IPCS) >> 24,
+			(NGBE_RXD_ERR_L4CS | NGBE_RXD_ERR_IPCS) >> 24,
+			(NGBE_RXD_ERR_L4CS | NGBE_RXD_ERR_IPCS) >> 24,
+			(NGBE_RXD_ERR_L4CS | NGBE_RXD_ERR_IPCS) >> 24};
+
+	/* map vlan present and l4/ip csum error to ol_flags */
+	const uint8x16_t vlan_csum_map_lo = {
+			RTE_MBUF_F_RX_IP_CKSUM_GOOD,
+			RTE_MBUF_F_RX_IP_CKSUM_GOOD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+			RTE_MBUF_F_RX_IP_CKSUM_BAD,
+			RTE_MBUF_F_RX_IP_CKSUM_BAD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+			0, 0, 0, 0,
+			vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_GOOD,
+			vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_GOOD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+			vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_BAD,
+			vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_BAD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+			0, 0, 0, 0};
+
+	const uint8x16_t vlan_csum_map_hi = {
+			RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+			RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+			0, 0, 0, 0,
+			RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+			RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+			0, 0, 0, 0};
+
+	ptype = vzipq_u8(sterr_tmp1.val[0], sterr_tmp2.val[0]).val[0];
+	ptype = vandq_u8(ptype, rsstype_msk);
+	ptype = vqtbl1q_u8(rss_flags, ptype);
+
+	/* extract vlan_flags and csum_error from staterr */
+	vtag = vandq_u8(staterr, vlan_csum_msk);
+
+	/* csum bits are in the most significant bits; to use shuffle we need
+	 * to shift them. Change mask from 0xc0 to 0x03.
+	 */
+	temp_csum = vshrq_n_u8(vtag, 6);
+
+	/* Change vlan present mask from 0x20 to 0x08.
+	 */
+	temp_vp = vshrq_n_u8(vtag, 2);
+
+	/* 'OR' the most significant 32 bits containing the checksum flags with
+	 * the vlan present flags. Then the bit layout of each lane (8 bits) is
+	 * 'xxxx,VLAN,x,ERR_IPCS,ERR_L4CS'
+	 */
+	csum = vsetq_lane_u32(vgetq_lane_u32(vreinterpretq_u32_u8(temp_csum), 3), csum, 0);
+	vtag = vorrq_u8(vreinterpretq_u8_u32(csum), vtag);
+	vtag = vorrq_u8(vtag, temp_vp);
+	vtag = vandq_u8(vtag, vtag_mask);
+
+	/* convert L4 checksum correct type to vtag_hi */
+	vtag_hi = vqtbl1q_u8(vlan_csum_map_hi, vtag);
+	vtag_hi = vshrq_n_u8(vtag_hi, 7);
+
+	/* convert VP, IPE, L4E to vtag_lo */
+	vtag_lo = vqtbl1q_u8(vlan_csum_map_lo, vtag);
+	vtag_lo = vorrq_u8(ptype, vtag_lo);
+
+	vtag = vzipq_u8(vtag_lo, vtag_hi).val[0];
+	vol.word = vgetq_lane_u64(vreinterpretq_u64_u8(vtag), 0);
+
+	rx_pkts[0]->ol_flags = vol.e[0];
+	rx_pkts[1]->ol_flags = vol.e[1];
+	rx_pkts[2]->ol_flags = vol.e[2];
+	rx_pkts[3]->ol_flags = vol.e[3];
+}
+
+#define NGBE_VPMD_DESC_EOP_MASK	0x02020202
+#define NGBE_UINT8_BIT			(CHAR_BIT * sizeof(uint8_t))
+
+static inline void
+desc_to_ptype_v(uint64x2_t descs[4], uint16_t pkt_type_mask,
+		struct rte_mbuf **rx_pkts)
+{
+	uint32x4_t ptype_mask = vdupq_n_u32((uint32_t)pkt_type_mask);
+	uint32x4_t ptype0 = vzipq_u32(vreinterpretq_u32_u64(descs[0]),
+				vreinterpretq_u32_u64(descs[2])).val[0];
+	uint32x4_t ptype1 = vzipq_u32(vreinterpretq_u32_u64(descs[1]),
+				vreinterpretq_u32_u64(descs[3])).val[0];
+
+	/* interleave low 32 bits,
+	 * now we have 4 ptypes in a NEON register
+	 */
+	ptype0 = vzipq_u32(ptype0, ptype1).val[0];
+
+	/* shift right by NGBE_RXD_PTID_SHIFT, and apply ptype mask */
+	ptype0 = vandq_u32(vshrq_n_u32(ptype0, NGBE_RXD_PTID_SHIFT), ptype_mask);
+
+	rx_pkts[0]->packet_type = ngbe_decode_ptype(vgetq_lane_u32(ptype0, 0));
+	rx_pkts[1]->packet_type = ngbe_decode_ptype(vgetq_lane_u32(ptype0, 1));
+	rx_pkts[2]->packet_type = ngbe_decode_ptype(vgetq_lane_u32(ptype0, 2));
+	rx_pkts[3]->packet_type = ngbe_decode_ptype(vgetq_lane_u32(ptype0, 3));
+}
+
+/**
+ * vPMD raw receive routine, only accepts nb_pkts >= RTE_NGBE_DESCS_PER_LOOP
+ *
+ * Notice:
+ * - if nb_pkts < RTE_NGBE_DESCS_PER_LOOP, no packet is returned
+ * - nb_pkts is floor-aligned to a multiple of RTE_NGBE_DESCS_PER_LOOP
+ */
+static inline uint16_t
+_recv_raw_pkts_vec(struct ngbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		   uint16_t nb_pkts, uint8_t *split_packet)
+{
+	volatile struct ngbe_rx_desc *rxdp;
+	struct ngbe_rx_entry *sw_ring;
+	uint16_t nb_pkts_recd;
+	int pos;
+	uint8x16_t shuf_msk = {
+		0xFF, 0xFF,
+		0xFF, 0xFF,  /* skip 32 bits pkt_type */
+		12, 13,      /* octet 12~13, low 16 bits pkt_len */
+		0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+		12, 13,      /* octet 12~13, 16 bits data_len */
+		14, 15,      /* octet 14~15, low 16 bits vlan_macip */
+		4, 5, 6, 7  /* octet 4~7, 32bits rss */
+		};
+	uint16x8_t crc_adjust = {0, 0, rxq->crc_len, 0,
+				 rxq->crc_len, 0, 0, 0};
+	uint8_t vlan_flags;
+
+	/* nb_pkts has to be floor-aligned to RTE_NGBE_DESCS_PER_LOOP */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_NGBE_DESCS_PER_LOOP);
+
+	/* Just the act of getting into the function from the application is
+	 * going to cost about 7 cycles
+	 */
+	rxdp = rxq->rx_ring + rxq->rx_tail;
+
+	rte_prefetch_non_temporal(rxdp);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > RTE_NGBE_RXQ_REARM_THRESH)
+		ngbe_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->qw1.lo.status & rte_cpu_to_le_32(NGBE_RXD_STAT_DD)))
+		return 0;
+
+	/* Cache is empty -> need to scan the buffer rings, but first move
+	 * the next 'n' mbufs into the cache
+	 */
+	sw_ring = &rxq->sw_ring[rxq->rx_tail];
+
+	/* ensure these 2 flags are in the lower 8 bits */
+	RTE_BUILD_BUG_ON((RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED) > UINT8_MAX);
+	vlan_flags = rxq->vlan_flags & UINT8_MAX;
+
+	/* A. load 4 packets in one loop
+	 * B. copy 4 mbuf pointers from sw_ring to rx_pkts
+	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
+	 * D. fill info from desc to mbuf
+	 */
+	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
+			pos += RTE_NGBE_DESCS_PER_LOOP,
+			rxdp += RTE_NGBE_DESCS_PER_LOOP) {
+		uint64x2_t descs[RTE_NGBE_DESCS_PER_LOOP];
+		uint8x16_t pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
+		uint8x16x2_t sterr_tmp1, sterr_tmp2;
+		uint64x2_t mbp1, mbp2;
+		uint8x16_t staterr;
+		uint16x8_t tmp;
+		uint32_t stat;
+
+		/* B.1 load 2 mbuf pointers */
+		mbp1 = vld1q_u64((uint64_t *)&sw_ring[pos]);
+
+		/* B.2 copy 2 mbuf pointers into rx_pkts */
+		vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
+
+		/* B.1 load 2 mbuf pointers */
+		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
+
+		/* A. load 4 pkts descs */
+		descs[0] =  vld1q_u64((uint64_t *)(rxdp));
+		descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
+		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
+		descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
+
+		/* B.2 copy 2 mbuf pointers into rx_pkts */
+		vst1q_u64((uint64_t *)&rx_pkts[pos + 2], mbp2);
+
+		if (split_packet) {
+			rte_mbuf_prefetch_part2(rx_pkts[pos]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+		}
+
+		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
+		pkt_mb4 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[3]), shuf_msk);
+		pkt_mb3 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[2]), shuf_msk);
+
+		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		pkt_mb2 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[1]), shuf_msk);
+		pkt_mb1 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[0]), shuf_msk);
+
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = vzipq_u8(vreinterpretq_u8_u64(descs[1]),
+				      vreinterpretq_u8_u64(descs[3]));
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp1 = vzipq_u8(vreinterpretq_u8_u64(descs[0]),
+				      vreinterpretq_u8_u64(descs[2]));
+
+		/* C.2 get 4 pkts staterr value  */
+		staterr = vzipq_u8(sterr_tmp1.val[1], sterr_tmp2.val[1]).val[0];
+
+		/* set ol_flags with vlan packet type */
+		desc_to_olflags_v(sterr_tmp1, sterr_tmp2, staterr, vlan_flags,
+				  &rx_pkts[pos]);
+
+		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb4), crc_adjust);
+		pkt_mb4 = vreinterpretq_u8_u16(tmp);
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb3), crc_adjust);
+		pkt_mb3 = vreinterpretq_u8_u16(tmp);
+
+		/* D.3 copy final 3,4 data to rx_pkts */
+		vst1q_u8((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+			 pkt_mb4);
+		vst1q_u8((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+			 pkt_mb3);
+
+		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb2), crc_adjust);
+		pkt_mb2 = vreinterpretq_u8_u16(tmp);
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb1), crc_adjust);
+		pkt_mb1 = vreinterpretq_u8_u16(tmp);
+
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			stat = vgetq_lane_u32(vreinterpretq_u32_u8(staterr), 0);
+			/* and with mask to extract bits, flipping 1-0 */
+			*(int *)split_packet = ~stat & NGBE_VPMD_DESC_EOP_MASK;
+
+			split_packet += RTE_NGBE_DESCS_PER_LOOP;
+		}
+
+		/* C.4 expand DD bit to saturate UINT8 */
+		staterr = vshlq_n_u8(staterr, NGBE_UINT8_BIT - 1);
+		staterr = vreinterpretq_u8_s8(vshrq_n_s8(vreinterpretq_s8_u8(staterr),
+					      NGBE_UINT8_BIT - 1));
+		stat = ~vgetq_lane_u32(vreinterpretq_u32_u8(staterr), 0);
+
+		rte_prefetch_non_temporal(rxdp + RTE_NGBE_DESCS_PER_LOOP);
+
+		/* D.3 copy final 1,2 data to rx_pkts */
+		vst1q_u8((uint8_t *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+			 pkt_mb2);
+		vst1q_u8((uint8_t *)&rx_pkts[pos]->rx_descriptor_fields1,
+			 pkt_mb1);
+
+		desc_to_ptype_v(descs, NGBE_PTID_MASK, &rx_pkts[pos]);
+
+		/* C.5 calc available number of desc */
+		if (unlikely(stat == 0)) {
+			nb_pkts_recd += RTE_NGBE_DESCS_PER_LOOP;
+		} else {
+			nb_pkts_recd += rte_ctz32(stat) / NGBE_UINT8_BIT;
+			break;
+		}
+	}
+
+	/* Update our internal tail pointer */
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
+	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
+
+	return nb_pkts_recd;
+}
+
+/**
+ * vPMD receive routine, only accepts nb_pkts >= RTE_NGBE_DESCS_PER_LOOP
+ *
+ * Notice:
+ * - if nb_pkts < RTE_NGBE_DESCS_PER_LOOP, no packet is returned
+ * - nb_pkts is floor-aligned to a multiple of RTE_NGBE_DESCS_PER_LOOP
+ */
+uint16_t
+ngbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets
+ *
+ * Notice:
+ * - if nb_pkts < RTE_NGBE_DESCS_PER_LOOP, no packet is returned
+ * - nb_pkts is floor-aligned to a multiple of RTE_NGBE_DESCS_PER_LOOP
+ */
+static uint16_t
+ngbe_recv_scattered_burst_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			      uint16_t nb_pkts)
+{
+	struct ngbe_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[RTE_NGBE_MAX_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+			split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+	if (rxq->pkt_first_seg == NULL &&
+			split_fl64[0] == 0 && split_fl64[1] == 0 &&
+			split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly */
+	unsigned int i = 0;
+	if (rxq->pkt_first_seg == NULL) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+		rxq->pkt_first_seg = rx_pkts[i];
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+		&split_flags[i]);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets.
+ */
+uint16_t
+ngbe_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			     uint16_t nb_pkts)
+{
+	uint16_t retval = 0;
+
+	while (nb_pkts > RTE_NGBE_MAX_RX_BURST) {
+		uint16_t burst;
+
+		burst = ngbe_recv_scattered_burst_vec(rx_queue,
+						      rx_pkts + retval,
+						      RTE_NGBE_MAX_RX_BURST);
+		retval += burst;
+		nb_pkts -= burst;
+		if (burst < RTE_NGBE_MAX_RX_BURST)
+			return retval;
+	}
+
+	return retval + ngbe_recv_scattered_burst_vec(rx_queue,
+						      rx_pkts + retval,
+						      nb_pkts);
+}
+
+static inline void
+vtx1(volatile struct ngbe_tx_desc *txdp,
+		struct rte_mbuf *pkt, uint64_t flags)
+{
+	uint64x2_t descriptor = {
+			pkt->buf_iova + pkt->data_off,
+			(uint64_t)pkt->pkt_len << 45 | flags | pkt->data_len};
+
+	vst1q_u64((uint64_t *)txdp, descriptor);
+}
+
+static inline void
+vtx(volatile struct ngbe_tx_desc *txdp,
+		struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
+		vtx1(txdp, *pkt, flags);
+}
+
+uint16_t
+ngbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			  uint16_t nb_pkts)
+{
+	struct ngbe_tx_queue *txq = (struct ngbe_tx_queue *)tx_queue;
+	volatile struct ngbe_tx_desc *txdp;
+	struct ngbe_tx_entry_v *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = NGBE_TXD_FLAGS;
+	uint64_t rs = NGBE_TXD_FLAGS;
+	int i;
+
+	/* crossing the tx_free_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_free_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ngbe_tx_free_bufs(txq);
+
+	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring_v[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	nb_commit = nb_pkts;
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
+			vtx1(txdp, *tx_pkts, flags);
+
+		vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring_v[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+
+	txq->tx_tail = tx_id;
+
+	ngbe_set32(txq->tdt_reg_addr, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+static void __rte_cold
+ngbe_tx_queue_release_mbufs_vec(struct ngbe_tx_queue *txq)
+{
+	_ngbe_tx_queue_release_mbufs_vec(txq);
+}
+
+void __rte_cold
+ngbe_rx_queue_release_mbufs_vec(struct ngbe_rx_queue *rxq)
+{
+	_ngbe_rx_queue_release_mbufs_vec(rxq);
+}
+
+static void __rte_cold
+ngbe_tx_free_swring(struct ngbe_tx_queue *txq)
+{
+	_ngbe_tx_free_swring_vec(txq);
+}
+
+static void __rte_cold
+ngbe_reset_tx_queue(struct ngbe_tx_queue *txq)
+{
+	_ngbe_reset_tx_queue_vec(txq);
+}
+
+static const struct ngbe_txq_ops vec_txq_ops = {
+	.release_mbufs = ngbe_tx_queue_release_mbufs_vec,
+	.free_swring = ngbe_tx_free_swring,
+	.reset = ngbe_reset_tx_queue,
+};
+
+int __rte_cold
+ngbe_rxq_vec_setup(struct ngbe_rx_queue *rxq)
+{
+	return ngbe_rxq_vec_setup_default(rxq);
+}
+
+int __rte_cold
+ngbe_txq_vec_setup(struct ngbe_tx_queue *txq)
+{
+	return ngbe_txq_vec_setup_default(txq, &vec_txq_ops);
+}
+
+int __rte_cold
+ngbe_rx_vec_dev_conf_condition_check(struct rte_eth_dev *dev)
+{
+	return ngbe_rx_vec_dev_conf_condition_check_default(dev);
+}
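
A brief gloss on the Tx qword built by vtx1() above: the packet length is
shifted into the upper bits (<< 45), OR'ed with the fixed NGBE_TXD_FLAGS
command bits and with data_len in the low bits, and written next to the
buffer DMA address as one 16-byte descriptor. A scalar sketch of the same
packing, as a hypothetical helper:

	/* mirrors the second 64-bit lane written by vtx1() */
	static inline uint64_t
	ngbe_tx_cmd_qword(const struct rte_mbuf *pkt, uint64_t flags)
	{
		return ((uint64_t)pkt->pkt_len << 45) | flags | pkt->data_len;
	}
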
diff --git a/drivers/net/ngbe/ngbe_rxtx_vec_sse.c b/drivers/net/ngbe/ngbe_rxtx_vec_sse.c
new file mode 100644
index 0000000000..7d9e7db615
--- /dev/null
+++ b/drivers/net/ngbe/ngbe_rxtx_vec_sse.c
@@ -0,0 +1,692 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2015-2024 Beijing WangXun Technology Co., Ltd.
+ * Copyright(c) 2010-2015 Intel Corporation
+ */
+
+#include <ethdev_driver.h>
+#include <rte_malloc.h>
+
+#include "ngbe_type.h"
+#include "ngbe_ethdev.h"
+#include "ngbe_rxtx.h"
+#include "ngbe_rxtx_vec_common.h"
+
+#include <tmmintrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ngbe_rxq_rearm(struct ngbe_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile struct ngbe_rx_desc *rxdp;
+	struct ngbe_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+	struct rte_mbuf *mb0, *mb1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+			RTE_PKTMBUF_HEADROOM);
+	__m128i dma_addr0, dma_addr1;
+
+	const __m128i hba_msk = _mm_set_epi64x(0, UINT64_MAX);
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mb_pool,
+				 (void *)rxep,
+				 RTE_NGBE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + RTE_NGBE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < RTE_NGBE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i],
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			RTE_NGBE_RXQ_REARM_THRESH;
+		return;
+	}
+
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < RTE_NGBE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* set Header Buffer Address to zero */
+		dma_addr0 =  _mm_and_si128(dma_addr0, hba_msk);
+		dma_addr1 =  _mm_and_si128(dma_addr1, hba_msk);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)rxdp++, dma_addr0);
+		_mm_store_si128((__m128i *)rxdp++, dma_addr1);
+	}
+
+	rxq->rxrearm_start += RTE_NGBE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= RTE_NGBE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ngbe_set32(rxq->rdt_reg_addr, rx_id);
+}
+
+static inline void
+desc_to_olflags_v(__m128i descs[4], __m128i mbuf_init, uint8_t vlan_flags,
+	struct rte_mbuf **rx_pkts)
+{
+	__m128i ptype0, ptype1, vtag0, vtag1, csum, vp;
+	__m128i rearm0, rearm1, rearm2, rearm3;
+
+	/* mask everything except rss type */
+	const __m128i rsstype_msk = _mm_set_epi16(0x0000, 0x0000, 0x0000, 0x0000,
+						  0x000F, 0x000F, 0x000F, 0x000F);
+
+	/* mask the lower byte of ol_flags */
+	const __m128i ol_flags_msk = _mm_set_epi16(0x0000, 0x0000, 0x0000, 0x0000,
+						   0x00FF, 0x00FF, 0x00FF, 0x00FF);
+
+	/* map rss type to rss hash flag */
+	const __m128i rss_flags = _mm_set_epi8(RTE_MBUF_F_RX_FDIR, 0, 0, 0,
+			0, 0, 0, RTE_MBUF_F_RX_RSS_HASH,
+			RTE_MBUF_F_RX_RSS_HASH, 0, RTE_MBUF_F_RX_RSS_HASH, 0,
+			RTE_MBUF_F_RX_RSS_HASH, RTE_MBUF_F_RX_RSS_HASH, RTE_MBUF_F_RX_RSS_HASH, 0);
+
+	/* mask everything except vlan present and l4/ip csum error */
+	const __m128i vlan_csum_msk =
+		_mm_set_epi16((NGBE_RXD_ERR_L4CS | NGBE_RXD_ERR_IPCS) >> 16,
+			      (NGBE_RXD_ERR_L4CS | NGBE_RXD_ERR_IPCS) >> 16,
+			      (NGBE_RXD_ERR_L4CS | NGBE_RXD_ERR_IPCS) >> 16,
+			      (NGBE_RXD_ERR_L4CS | NGBE_RXD_ERR_IPCS) >> 16,
+			      NGBE_RXD_STAT_VLAN, NGBE_RXD_STAT_VLAN,
+			      NGBE_RXD_STAT_VLAN, NGBE_RXD_STAT_VLAN);
+
+	/* map vlan present and l4/ip csum error to ol_flags */
+	const __m128i vlan_csum_map_lo = _mm_set_epi8(0, 0, 0, 0,
+		vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_BAD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+		vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_BAD,
+		vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_GOOD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+		vlan_flags | RTE_MBUF_F_RX_IP_CKSUM_GOOD,
+		0, 0, 0, 0,
+		RTE_MBUF_F_RX_IP_CKSUM_BAD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+		RTE_MBUF_F_RX_IP_CKSUM_BAD,
+		RTE_MBUF_F_RX_IP_CKSUM_GOOD | RTE_MBUF_F_RX_L4_CKSUM_BAD,
+		RTE_MBUF_F_RX_IP_CKSUM_GOOD);
+
+	const __m128i vlan_csum_map_hi = _mm_set_epi8(0, 0, 0, 0,
+		0, RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+		RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t),
+		0, 0, 0, 0,
+		0, RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t), 0,
+		RTE_MBUF_F_RX_L4_CKSUM_GOOD >> sizeof(uint8_t));
+
+	const __m128i vtag_msk = _mm_set_epi16(0x0000, 0x0000, 0x0000, 0x0000,
+					       0x000F, 0x000F, 0x000F, 0x000F);
+
+	ptype0 = _mm_unpacklo_epi16(descs[0], descs[1]);
+	ptype1 = _mm_unpacklo_epi16(descs[2], descs[3]);
+	vtag0 = _mm_unpackhi_epi16(descs[0], descs[1]);
+	vtag1 = _mm_unpackhi_epi16(descs[2], descs[3]);
+
+	ptype0 = _mm_unpacklo_epi32(ptype0, ptype1);
+	ptype0 = _mm_and_si128(ptype0, rsstype_msk);
+	ptype0 = _mm_shuffle_epi8(rss_flags, ptype0);
+
+	vtag1 = _mm_unpacklo_epi32(vtag0, vtag1);
+	vtag1 = _mm_and_si128(vtag1, vlan_csum_msk);
+
+	/* csum bits are in the most significant bits; to use shuffle we need
+	 * to shift them. Change mask from 0xc000 to 0x0003.
+	 */
+	csum = _mm_srli_epi16(vtag1, 14);
+
+	/* Change mask from 0x20 to 0x08. */
+	vp = _mm_srli_epi16(vtag1, 2);
+
+	/* now OR the most significant 64 bits containing the checksum
+	 * flags with the vlan present flags.
+	 */
+	csum = _mm_srli_si128(csum, 8);
+	vtag1 = _mm_or_si128(csum, vtag1);
+	vtag1 = _mm_or_si128(vtag1, vp);
+	vtag1 = _mm_and_si128(vtag1, vtag_msk);
+
+	/* convert STAT_VLAN, ERR_IPCS, ERR_L4CS to ol_flags */
+	vtag0 = _mm_shuffle_epi8(vlan_csum_map_hi, vtag1);
+	vtag0 = _mm_slli_epi16(vtag0, sizeof(uint8_t));
+
+	vtag1 = _mm_shuffle_epi8(vlan_csum_map_lo, vtag1);
+	vtag1 = _mm_and_si128(vtag1, ol_flags_msk);
+	vtag1 = _mm_or_si128(vtag0, vtag1);
+
+	vtag1 = _mm_or_si128(ptype0, vtag1);
+
+	/*
+	 * At this point, we have the 4 sets of flags in the low 64-bits
+	 * of vtag1 (4x16).
+	 * We want to extract these, and merge them with the mbuf init data
+	 * so we can do a single 16-byte write to the mbuf to set the flags
+	 * and all the other initialization fields. Extracting the
+	 * appropriate flags means that we have to do a shift and blend for
+	 * each mbuf before we do the write.
+	 */
+	rearm0 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vtag1, 8), 0x10);
+	rearm1 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vtag1, 6), 0x10);
+	rearm2 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vtag1, 4), 0x10);
+	rearm3 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vtag1, 2), 0x10);
+
+	/* write the rearm data and the olflags in one write */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+			offsetof(struct rte_mbuf, rearm_data) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+			RTE_ALIGN(offsetof(struct rte_mbuf, rearm_data), 16));
+	_mm_store_si128((__m128i *)&rx_pkts[0]->rearm_data, rearm0);
+	_mm_store_si128((__m128i *)&rx_pkts[1]->rearm_data, rearm1);
+	_mm_store_si128((__m128i *)&rx_pkts[2]->rearm_data, rearm2);
+	_mm_store_si128((__m128i *)&rx_pkts[3]->rearm_data, rearm3);
+}
+
+static inline void
+desc_to_ptype_v(__m128i descs[4], uint16_t pkt_type_mask,
+		struct rte_mbuf **rx_pkts)
+{
+	__m128i ptype_mask = _mm_set_epi32(pkt_type_mask, pkt_type_mask,
+					pkt_type_mask, pkt_type_mask);
+
+	__m128i ptype0 = _mm_unpacklo_epi32(descs[0], descs[2]);
+	__m128i ptype1 = _mm_unpacklo_epi32(descs[1], descs[3]);
+
+	/* interleave low 32 bits,
+	 * now we have 4 ptypes in an XMM register
+	 */
+	ptype0 = _mm_unpacklo_epi32(ptype0, ptype1);
+
+	/* shift right by NGBE_RXD_PTID_SHIFT, and apply ptype mask */
+	ptype0 = _mm_and_si128(_mm_srli_epi32(ptype0, NGBE_RXD_PTID_SHIFT),
+			       ptype_mask);
+
+	rx_pkts[0]->packet_type = ngbe_decode_ptype(_mm_extract_epi32(ptype0, 0));
+	rx_pkts[1]->packet_type = ngbe_decode_ptype(_mm_extract_epi32(ptype0, 1));
+	rx_pkts[2]->packet_type = ngbe_decode_ptype(_mm_extract_epi32(ptype0, 2));
+	rx_pkts[3]->packet_type = ngbe_decode_ptype(_mm_extract_epi32(ptype0, 3));
+}
+
+/*
+ * vPMD raw receive routine, only accepts nb_pkts >= RTE_NGBE_DESCS_PER_LOOP
+ *
+ * Notice:
+ * - if nb_pkts < RTE_NGBE_DESCS_PER_LOOP, no packets are returned
+ * - nb_pkts is floor-aligned to a multiple of RTE_NGBE_DESCS_PER_LOOP
+ */
+static inline uint16_t
+_recv_raw_pkts_vec(struct ngbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts, uint8_t *split_packet)
+{
+	volatile struct ngbe_rx_desc *rxdp;
+	struct ngbe_rx_entry *sw_ring;
+	uint16_t nb_pkts_recd;
+	int pos;
+	uint64_t var;
+	__m128i shuf_msk;
+	__m128i crc_adjust = _mm_set_epi16(0, 0, 0, /* ignore non-length fields */
+				-rxq->crc_len, /* sub crc on data_len */
+				0,             /* ignore high-16bits of pkt_len */
+				-rxq->crc_len, /* sub crc on pkt_len */
+				0, 0);         /* ignore pkt_type field */
+
+	/*
+	 * compile-time check that the above crc_adjust layout is correct.
+	 * NOTE: the first field (lowest address) is given last in the
+	 * set_epi16 call above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	__m128i dd_check, eop_check;
+	__m128i mbuf_init;
+	uint8_t vlan_flags;
+
+	/*
+	 * Under the circumstance that `rx_tail` wraps back to zero
+	 * and advances faster than `rxrearm_start` does,
+	 * `rx_tail` will catch up with `rxrearm_start` and surpass it.
+	 * This may cause some mbufs to be reused by the application.
+	 *
+	 * So we need to make some restrictions to ensure that
+	 * `rx_tail` will not exceed `rxrearm_start`.
+	 */
+	nb_pkts = RTE_MIN(nb_pkts, RTE_NGBE_RXQ_REARM_THRESH);
+
+	/* nb_pkts has to be floor-aligned to RTE_NGBE_DESCS_PER_LOOP */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_NGBE_DESCS_PER_LOOP);
+
+	/* Just the act of getting into the function from the application is
+	 * going to cost about 7 cycles
+	 */
+	rxdp = rxq->rx_ring + rxq->rx_tail;
+
+	rte_prefetch0(rxdp);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > RTE_NGBE_RXQ_REARM_THRESH)
+		ngbe_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->qw1.lo.status &
+				rte_cpu_to_le_32(NGBE_RXD_STAT_DD)))
+		return 0;
+
+	/* 4 packets DD mask */
+	dd_check = _mm_set_epi64x(0x0000000100000001LL, 0x0000000100000001LL);
+
+	/* 4 packets EOP mask */
+	eop_check = _mm_set_epi64x(0x0000000200000002LL, 0x0000000200000002LL);
+
+	/* mask to shuffle from desc. to mbuf */
+	shuf_msk = _mm_set_epi8(7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+		15, 14,      /* octet 14~15, low 16 bits vlan_macip */
+		13, 12,      /* octet 12~13, 16 bits data_len */
+		0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+		13, 12,      /* octet 12~13, low 16 bits pkt_len */
+		0xFF, 0xFF,  /* skip 32 bit pkt_type */
+		0xFF, 0xFF);
+	/*
+	 * Compile-time verify the shuffle mask.
+	 * NOTE: some field positions are already verified above, but duplicated
+	 * here for completeness in case of future modifications.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	mbuf_init = _mm_set_epi64x(0, rxq->mbuf_initializer);
+
+	/* Cache is empty -> need to scan the buffer rings, but first move
+	 * the next 'n' mbufs into the cache
+	 */
+	sw_ring = &rxq->sw_ring[rxq->rx_tail];
+
+	/* ensure these 2 flags are in the lower 8 bits */
+	RTE_BUILD_BUG_ON((RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED) > UINT8_MAX);
+	vlan_flags = rxq->vlan_flags & UINT8_MAX;
+
+	/* A. load 4 packets in one loop
+	 * [A*. mask out 4 unused dirty fields in desc]
+	 * B. copy 4 mbuf pointers from sw_ring to rx_pkts
+	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
+	 * D. fill info from desc to mbuf
+	 */
+	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
+			pos += RTE_NGBE_DESCS_PER_LOOP,
+			rxdp += RTE_NGBE_DESCS_PER_LOOP) {
+		__m128i descs[RTE_NGBE_DESCS_PER_LOOP];
+		__m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
+		__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
+		/* 2 64 bit or 4 32 bit mbuf pointers in one XMM reg. */
+		__m128i mbp1;
+#if defined(RTE_ARCH_X86_64)
+		__m128i mbp2;
+#endif
+
+		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf pointers */
+		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
+
+		/* Read desc statuses backwards to avoid race condition */
+		/* A.1 load desc[3] */
+		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
+		rte_compiler_barrier();
+
+		/* B.2 copy 2 64 bit or 4 32 bit mbuf pointers into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos], mbp1);
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.1 load 2 64 bit mbuf pointers */
+		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
+#endif
+
+		/* A.1 load desc[2-0] */
+		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
+		rte_compiler_barrier();
+		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
+		rte_compiler_barrier();
+		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.2 copy 2 mbuf pointers into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos + 2], mbp2);
+#endif
+
+		if (split_packet) {
+			rte_mbuf_prefetch_part2(rx_pkts[pos]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+		}
+
+		/* avoid compiler reorder optimization */
+		rte_compiler_barrier();
+
+		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
+		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
+		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
+
+		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
+		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
+
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]);
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp1 = _mm_unpackhi_epi32(descs[1], descs[0]);
+
+		/* set ol_flags with vlan packet type */
+		desc_to_olflags_v(descs, mbuf_init, vlan_flags, &rx_pkts[pos]);
+
+		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
+		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
+		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
+
+		/* C.2 get 4 pkts staterr value  */
+		zero = _mm_xor_si128(dd_check, dd_check);
+		staterr = _mm_unpacklo_epi32(sterr_tmp1, sterr_tmp2);
+
+		/* D.3 copy final 3,4 data to rx_pkts */
+		_mm_storeu_si128((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+				pkt_mb4);
+		_mm_storeu_si128((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+				pkt_mb3);
+
+		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
+		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
+
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			__m128i eop_shuf_mask =
+				_mm_set_epi8(0xFF, 0xFF, 0xFF, 0xFF,
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     0x04, 0x0C, 0x00, 0x08);
+
+			/* and with mask to extract bits, flipping 1-0 */
+			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
+			/* the staterr values are not in order, which does
+			 * not matter when counting the DD bits. However, for
+			 * end-of-packet tracking we do care, so shuffle. This
+			 * also compresses the 32-bit values to 8-bit
+			 */
+			eop_bits = _mm_shuffle_epi8(eop_bits, eop_shuf_mask);
+			/* store the resulting 32-bit value */
+			*(int *)split_packet = _mm_cvtsi128_si32(eop_bits);
+			split_packet += RTE_NGBE_DESCS_PER_LOOP;
+		}
+
+		/* C.3 calc available number of desc */
+		staterr = _mm_and_si128(staterr, dd_check);
+		staterr = _mm_packs_epi32(staterr, zero);
+
+		/* D.3 copy final 1,2 data to rx_pkts */
+		_mm_storeu_si128((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+				pkt_mb2);
+		_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
+				pkt_mb1);
+
+		desc_to_ptype_v(descs, NGBE_PTID_MASK, &rx_pkts[pos]);
+
+		/* C.4 calc available number of desc */
+		var = rte_popcount64(_mm_cvtsi128_si64(staterr));
+		nb_pkts_recd += var;
+		if (likely(var != RTE_NGBE_DESCS_PER_LOOP))
+			break;
+	}
+
+	/* Update our internal tail pointer */
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
+	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
+
+	return nb_pkts_recd;
+}
+
+/*
+ * vPMD receive routine, only accepts nb_pkts >= RTE_NGBE_DESCS_PER_LOOP
+ *
+ * Notice:
+ * - if nb_pkts < RTE_NGBE_DESCS_PER_LOOP, no packets are returned
+ * - nb_pkts is floor-aligned to a multiple of RTE_NGBE_DESCS_PER_LOOP
+ */
+uint16_t
+ngbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets
+ *
+ * Notice:
+ * - if nb_pkts < RTE_NGBE_DESCS_PER_LOOP, no packets are returned
+ * - nb_pkts is floor-aligned to a multiple of RTE_NGBE_DESCS_PER_LOOP
+ */
+static uint16_t
+ngbe_recv_scattered_burst_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			       uint16_t nb_pkts)
+{
+	struct ngbe_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[RTE_NGBE_MAX_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+			split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+	if (rxq->pkt_first_seg == NULL &&
+			split_fl64[0] == 0 && split_fl64[1] == 0 &&
+			split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly */
+	unsigned int i = 0;
+	if (rxq->pkt_first_seg == NULL) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+		rxq->pkt_first_seg = rx_pkts[i];
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+		&split_flags[i]);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets.
+ */
+uint16_t
+ngbe_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			     uint16_t nb_pkts)
+{
+	uint16_t retval = 0;
+
+	while (nb_pkts > RTE_NGBE_MAX_RX_BURST) {
+		uint16_t burst;
+
+		burst = ngbe_recv_scattered_burst_vec(rx_queue,
+						      rx_pkts + retval,
+						      RTE_NGBE_MAX_RX_BURST);
+		retval += burst;
+		nb_pkts -= burst;
+		if (burst < RTE_NGBE_MAX_RX_BURST)
+			return retval;
+	}
+
+	return retval + ngbe_recv_scattered_burst_vec(rx_queue,
+						      rx_pkts + retval,
+						      nb_pkts);
+}
+
+static inline void
+vtx1(volatile struct ngbe_tx_desc *txdp,
+		struct rte_mbuf *pkt, uint64_t flags)
+{
+	__m128i descriptor = _mm_set_epi64x((uint64_t)pkt->pkt_len << 45 |
+			flags | pkt->data_len,
+			pkt->buf_iova + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+vtx(volatile struct ngbe_tx_desc *txdp,
+		struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
+		vtx1(txdp, *pkt, flags);
+}
+
+uint16_t
+ngbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			  uint16_t nb_pkts)
+{
+	struct ngbe_tx_queue *txq = (struct ngbe_tx_queue *)tx_queue;
+	volatile struct ngbe_tx_desc *txdp;
+	struct ngbe_tx_entry_v *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = NGBE_TXD_FLAGS;
+	uint64_t rs = NGBE_TXD_FLAGS;
+	int i;
+
+	/* crossing the tx_free_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_free_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ngbe_tx_free_bufs(txq);
+
+	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring_v[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	nb_commit = nb_pkts;
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
+			vtx1(txdp, *tx_pkts, flags);
+
+		vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring_v[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+
+	txq->tx_tail = tx_id;
+
+	ngbe_set32(txq->tdt_reg_addr, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+static void __rte_cold
+ngbe_tx_queue_release_mbufs_vec(struct ngbe_tx_queue *txq)
+{
+	_ngbe_tx_queue_release_mbufs_vec(txq);
+}
+
+void __rte_cold
+ngbe_rx_queue_release_mbufs_vec(struct ngbe_rx_queue *rxq)
+{
+	_ngbe_rx_queue_release_mbufs_vec(rxq);
+}
+
+static void __rte_cold
+ngbe_tx_free_swring(struct ngbe_tx_queue *txq)
+{
+	_ngbe_tx_free_swring_vec(txq);
+}
+
+static void __rte_cold
+ngbe_reset_tx_queue(struct ngbe_tx_queue *txq)
+{
+	_ngbe_reset_tx_queue_vec(txq);
+}
+
+static const struct ngbe_txq_ops vec_txq_ops = {
+	.release_mbufs = ngbe_tx_queue_release_mbufs_vec,
+	.free_swring = ngbe_tx_free_swring,
+	.reset = ngbe_reset_tx_queue,
+};
+
+int __rte_cold
+ngbe_rxq_vec_setup(struct ngbe_rx_queue *rxq)
+{
+	return ngbe_rxq_vec_setup_default(rxq);
+}
+
+int __rte_cold
+ngbe_txq_vec_setup(struct ngbe_tx_queue *txq)
+{
+	return ngbe_txq_vec_setup_default(txq, &vec_txq_ops);
+}
+
+int __rte_cold
+ngbe_rx_vec_dev_conf_condition_check(struct rte_eth_dev *dev)
+{
+	return ngbe_rx_vec_dev_conf_condition_check_default(dev);
+}
-- 
2.27.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH 0/2] Wangxun support vector Rx/Tx
  2024-02-01  3:00 [PATCH 0/2] Wangxun support vector Rx/Tx Jiawen Wu
  2024-02-01  3:00 ` [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx Jiawen Wu
  2024-02-01  3:00 ` [PATCH 2/2] net/ngbe: " Jiawen Wu
@ 2024-02-06  1:50 ` Jiawen Wu
  2 siblings, 0 replies; 10+ messages in thread
From: Jiawen Wu @ 2024-02-06  1:50 UTC (permalink / raw)
  To: dev

Hi,

> -----Original Message-----
> From: Jiawen Wu <jiawenwu@trustnetic.com>
> Sent: Thursday, February 1, 2024 11:00 AM
> To: dev@dpdk.org
> Cc: Jiawen Wu <jiawenwu@trustnetic.com>
> Subject: [PATCH 0/2] Wangxun support vector Rx/Tx
> 
> Add SSE/NEON vector instructions for TXGBE and NGBE driver to process
> packets.
> 
> Jiawen Wu (2):
>   net/txgbe: add vectorized functions for Rx/Tx
>   net/ngbe: add vectorized functions for Rx/Tx
> 
>  drivers/net/ngbe/meson.build              |   6 +
>  drivers/net/ngbe/ngbe_ethdev.c            |   6 +
>  drivers/net/ngbe/ngbe_ethdev.h            |   1 +
>  drivers/net/ngbe/ngbe_rxtx.c              | 161 ++++-
>  drivers/net/ngbe/ngbe_rxtx.h              |  32 +-
>  drivers/net/ngbe/ngbe_rxtx_vec_common.h   | 296 +++++++++
>  drivers/net/ngbe/ngbe_rxtx_vec_neon.c     | 604 ++++++++++++++++++
>  drivers/net/ngbe/ngbe_rxtx_vec_sse.c      | 692 ++++++++++++++++++++
>  drivers/net/txgbe/meson.build             |   6 +
>  drivers/net/txgbe/txgbe_ethdev.c          |   6 +
>  drivers/net/txgbe/txgbe_ethdev.h          |   1 +
>  drivers/net/txgbe/txgbe_ethdev_vf.c       |   1 +
>  drivers/net/txgbe/txgbe_rxtx.c            | 150 ++++-
>  drivers/net/txgbe/txgbe_rxtx.h            |  18 +
>  drivers/net/txgbe/txgbe_rxtx_vec_common.h | 301 +++++++++
>  drivers/net/txgbe/txgbe_rxtx_vec_neon.c   | 604 ++++++++++++++++++
>  drivers/net/txgbe/txgbe_rxtx_vec_sse.c    | 736 ++++++++++++++++++++++
>  17 files changed, 3611 insertions(+), 10 deletions(-)
>  create mode 100644 drivers/net/ngbe/ngbe_rxtx_vec_common.h
>  create mode 100644 drivers/net/ngbe/ngbe_rxtx_vec_neon.c
>  create mode 100644 drivers/net/ngbe/ngbe_rxtx_vec_sse.c
>  create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_common.h
>  create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_neon.c
>  create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_sse.c
> 
> --
> 2.27.0
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx
  2024-02-01  3:00 ` [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx Jiawen Wu
@ 2024-02-07  3:13   ` Ferruh Yigit
  2024-03-05  8:10     ` Jiawen Wu
  0 siblings, 1 reply; 10+ messages in thread
From: Ferruh Yigit @ 2024-02-07  3:13 UTC (permalink / raw)
  To: Jiawen Wu, dev

On 2/1/2024 3:00 AM, Jiawen Wu wrote:
> To optimize Rx/Tx burst process, add SSE/NEON vector instructions on
> x86/arm architecture.
> 

Do you have any performance improvement numbers with the vector
implementation? If so, can you put them into the commit log for the record?

> Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
> ---
>  drivers/net/txgbe/meson.build             |   6 +
>  drivers/net/txgbe/txgbe_ethdev.c          |   6 +
>  drivers/net/txgbe/txgbe_ethdev.h          |   1 +
>  drivers/net/txgbe/txgbe_ethdev_vf.c       |   1 +
>  drivers/net/txgbe/txgbe_rxtx.c            | 150 ++++-
>  drivers/net/txgbe/txgbe_rxtx.h            |  18 +
>  drivers/net/txgbe/txgbe_rxtx_vec_common.h | 301 +++++++++
>  drivers/net/txgbe/txgbe_rxtx_vec_neon.c   | 604 ++++++++++++++++++
>  drivers/net/txgbe/txgbe_rxtx_vec_sse.c    | 736 ++++++++++++++++++++++
>  9 files changed, 1817 insertions(+), 6 deletions(-)
>  create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_common.h
>  create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_neon.c
>  create mode 100644 drivers/net/txgbe/txgbe_rxtx_vec_sse.c
> 
> diff --git a/drivers/net/txgbe/meson.build b/drivers/net/txgbe/meson.build
> index 14729a6cf3..ba7167a511 100644
> --- a/drivers/net/txgbe/meson.build
> +++ b/drivers/net/txgbe/meson.build
> @@ -24,6 +24,12 @@ sources = files(
>  
>  deps += ['hash', 'security']
>  
> +if arch_subdir == 'x86'
> +    sources += files('txgbe_rxtx_vec_sse.c')
> +elif arch_subdir == 'arm'
> +    sources += files('txgbe_rxtx_vec_neon.c')
> +endif
> +
>  includes += include_directories('base')
>  
>  install_headers('rte_pmd_txgbe.h')
> diff --git a/drivers/net/txgbe/txgbe_ethdev.c b/drivers/net/txgbe/txgbe_ethdev.c
> index 6bc231a130..2d5b935002 100644
> --- a/drivers/net/txgbe/txgbe_ethdev.c
> +++ b/drivers/net/txgbe/txgbe_ethdev.c
> @@ -1544,6 +1544,7 @@ txgbe_dev_configure(struct rte_eth_dev *dev)
>  	 * allocation Rx preconditions we will reset it.
>  	 */
>  	adapter->rx_bulk_alloc_allowed = true;
> +	adapter->rx_vec_allowed = true;
>  
>  	return 0;
>  }
> @@ -2735,6 +2736,11 @@ txgbe_dev_supported_ptypes_get(struct rte_eth_dev *dev)
>  	    dev->rx_pkt_burst == txgbe_recv_pkts_bulk_alloc)
>  		return txgbe_get_supported_ptypes();
>  
> +#if defined(RTE_ARCH_X86) || defined(__ARM_NEON)
> +	if (dev->rx_pkt_burst == txgbe_recv_pkts_vec ||
> +	    dev->rx_pkt_burst == txgbe_recv_scattered_pkts_vec)
> +		return txgbe_get_supported_ptypes();
> +#endif
>

Sometimes the packet parsing capability of the device changes based on
the vector Rx path used, but the code above calls the same function.
If there is no difference in ptype parsing capability, why not just add
the above checks to the previous if block?
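
Something like the following might work (a rough, untested sketch; the
placeholder comment stands for whatever burst functions the existing block
already checks):

	if (/* ... the existing non-vector checks ... */
	    dev->rx_pkt_burst == txgbe_recv_pkts_bulk_alloc
#if defined(RTE_ARCH_X86) || defined(__ARM_NEON)
	    || dev->rx_pkt_burst == txgbe_recv_pkts_vec
	    || dev->rx_pkt_burst == txgbe_recv_scattered_pkts_vec
#endif
	   )
		return txgbe_get_supported_ptypes();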


Btw, 'txgbe_get_supported_ptypes()' now takes a parameter, based on
changes in 'next-net'; can you please rebase the code on top of the latest next-net?

<...>

> @@ -2198,8 +2220,15 @@ txgbe_set_tx_function(struct rte_eth_dev *dev, struct txgbe_tx_queue *txq)
>  #endif
>  			txq->tx_free_thresh >= RTE_PMD_TXGBE_TX_MAX_BURST) {
>  		PMD_INIT_LOG(DEBUG, "Using simple tx code path");
> -		dev->tx_pkt_burst = txgbe_xmit_pkts_simple;
>  		dev->tx_pkt_prepare = NULL;
> +		if (txq->tx_free_thresh <= RTE_TXGBE_TX_MAX_FREE_BUF_SZ &&
> +				(rte_eal_process_type() != RTE_PROC_PRIMARY ||
>

Why is vector Tx enabled only for the secondary process?

<...>

> @@ -297,6 +299,12 @@ struct txgbe_rx_queue {
>  #ifdef RTE_LIB_SECURITY
>  	uint8_t            using_ipsec;
>  	/**< indicates that IPsec RX feature is in use */
> +#endif
> +	uint64_t	    mbuf_initializer; /**< value to init mbufs */
> +	uint8_t             rx_using_sse;
> +#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
>

RTE_ARCH_ARM, RTE_ARCH_ARM64 & __ARM_NEON seem to be used interchangeably;
what do you think about sticking to one?

Similarly with RTE_ARCH_X86_64 & RTE_ARCH_X86.
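
For example, a single guard could be defined once in a header and used
everywhere (purely illustrative, the macro name below is made up):

/* hypothetical helper, e.g. in txgbe_rxtx.h */
#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_ARM64)
#define TXGBE_VPMD_SUPPORTED 1
#else
#define TXGBE_VPMD_SUPPORTED 0
#endif

so the per-file checks become '#if TXGBE_VPMD_SUPPORTED' instead of mixing
the arch and NEON macros.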

<...>

> +++ b/drivers/net/txgbe/txgbe_rxtx_vec_neon.c
> @@ -0,0 +1,604 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2015-2024 Beijing WangXun Technology Co., Ltd.
> + * Copyright(c) 2010-2015 Intel Corporation
> + */
> +
> +#include <ethdev_driver.h>
> +#include <rte_malloc.h>
> +#include <rte_vect.h>
> +
> +#include "txgbe_ethdev.h"
> +#include "txgbe_rxtx.h"
> +#include "txgbe_rxtx_vec_common.h"
> +
> +#pragma GCC diagnostic ignored "-Wcast-qual"
> +

Is this pragma really required?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx
  2024-02-07  3:13   ` Ferruh Yigit
@ 2024-03-05  8:10     ` Jiawen Wu
  2024-03-21 16:21       ` Ferruh Yigit
  2024-03-21 16:27       ` Ferruh Yigit
  0 siblings, 2 replies; 10+ messages in thread
From: Jiawen Wu @ 2024-03-05  8:10 UTC (permalink / raw)
  To: Ferruh.Yigit, dev

On Wed, Feb 7, 2024 11:13 AM, Ferruh.Yigit@amd.com wrote:
> On 2/1/2024 3:00 AM, Jiawen Wu wrote:
> > To optimize Rx/Tx burst process, add SSE/NEON vector instructions on
> > x86/arm architecture.
> >
> 
> Do you have any performance improvement number with vector
> implementation, if so can you put it into commit log for record?

On our local x86 platforms, the performance was already at full speed
without using vectors, so we don't have a performance improvement number
for SSE yet. But I will add the test results for Arm.

> > @@ -2198,8 +2220,15 @@ txgbe_set_tx_function(struct rte_eth_dev *dev, struct txgbe_tx_queue *txq)
> >  #endif
> >  			txq->tx_free_thresh >= RTE_PMD_TXGBE_TX_MAX_BURST) {
> >  		PMD_INIT_LOG(DEBUG, "Using simple tx code path");
> > -		dev->tx_pkt_burst = txgbe_xmit_pkts_simple;
> >  		dev->tx_pkt_prepare = NULL;
> > +		if (txq->tx_free_thresh <= RTE_TXGBE_TX_MAX_FREE_BUF_SZ &&
> > +				(rte_eal_process_type() != RTE_PROC_PRIMARY ||
> >
> 
> Why vector Tx enable only for secondary process?

It is not only for secondary process. The constraint is

(rte_eal_process_type() != RTE_PROC_PRIMARY || txgbe_txq_vec_setup(txq) == 0)

This code references ixgbe, which explains:
"When using multiple processes, the TX function used in all processes
 should be the same, otherwise the secondary processes cannot transmit
 more than tx-ring-size - 1 packets.
 To achieve this, we extract out the code to select the ixgbe TX function
 to be used into a separate function inside the ixgbe driver, and call
 that from a secondary process when it is attaching to an
 already-configured NIC."
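
Roughly, the idea looks like this (a simplified sketch only, following the
ixgbe pattern with the names used in this patch; the actual dev_init code
may differ). The secondary process only re-runs the selection and never
calls txgbe_txq_vec_setup() again:

	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
		struct txgbe_tx_queue *txq;

		/* Tx queues were already configured by the primary process,
		 * so the secondary only re-selects the same burst function.
		 */
		if (eth_dev->data->tx_queues) {
			txq = eth_dev->data->tx_queues[eth_dev->data->nb_tx_queues - 1];
			txgbe_set_tx_function(eth_dev, txq);
		}
		return 0;
	}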

> > +++ b/drivers/net/txgbe/txgbe_rxtx_vec_neon.c
> > @@ -0,0 +1,604 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2015-2024 Beijing WangXun Technology Co., Ltd.
> > + * Copyright(c) 2010-2015 Intel Corporation
> > + */
> > +
> > +#include <ethdev_driver.h>
> > +#include <rte_malloc.h>
> > +#include <rte_vect.h>
> > +
> > +#include "txgbe_ethdev.h"
> > +#include "txgbe_rxtx.h"
> > +#include "txgbe_rxtx_vec_common.h"
> > +
> > +#pragma GCC diagnostic ignored "-Wcast-qual"
> > +
> 
> Is this pragma really required?

Yes. Otherwise, there are warnings in the compilation:

[1909/2921] Compiling C object drivers/libtmp_rte_net_txgbe.a.p/net_txgbe_txgbe_rxtx_vec_neon.c.o
../drivers/net/txgbe/txgbe_rxtx_vec_neon.c: In function ‘txgbe_rxq_rearm’:
../drivers/net/txgbe/txgbe_rxtx_vec_neon.c:37:15: warning: cast discards ‘volatile’ qualifier from pointer target type [-Wcast-qual]
     vst1q_u64((uint64_t *)&rxdp[i], zero);
               ^
../drivers/net/txgbe/txgbe_rxtx_vec_neon.c:60:13: warning: cast discards ‘volatile’ qualifier from pointer target type [-Wcast-qual]
   vst1q_u64((uint64_t *)rxdp++, dma_addr0);
             ^
../drivers/net/txgbe/txgbe_rxtx_vec_neon.c:65:13: warning: cast discards ‘volatile’ qualifier from pointer target type [-Wcast-qual]
   vst1q_u64((uint64_t *)rxdp++, dma_addr1);



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx
  2024-03-05  8:10     ` Jiawen Wu
@ 2024-03-21 16:21       ` Ferruh Yigit
  2024-04-07  8:32         ` Jiawen Wu
  2024-03-21 16:27       ` Ferruh Yigit
  1 sibling, 1 reply; 10+ messages in thread
From: Ferruh Yigit @ 2024-03-21 16:21 UTC (permalink / raw)
  To: Jiawen Wu, dev

On 3/5/2024 8:10 AM, Jiawen Wu wrote:
> On Wed, Feb 7, 2024 11:13 AM, Ferruh.Yigit@amd.com wrote:
>> On 2/1/2024 3:00 AM, Jiawen Wu wrote:
>>> To optimize Rx/Tx burst process, add SSE/NEON vector instructions on
>>> x86/arm architecture.
>>>
>>
>> Do you have any performance improvement number with vector
>> implementation, if so can you put it into commit log for record?
> 
> On our local x86 platforms, the performance was at full speed without
> using vector. So we don't have the performance improvement number
> with SSE yet. But I will add the test result for arm.
> 

Ack

>>> @@ -2198,8 +2220,15 @@ txgbe_set_tx_function(struct rte_eth_dev *dev, struct txgbe_tx_queue *txq)
>>>  #endif
>>>  			txq->tx_free_thresh >= RTE_PMD_TXGBE_TX_MAX_BURST) {
>>>  		PMD_INIT_LOG(DEBUG, "Using simple tx code path");
>>> -		dev->tx_pkt_burst = txgbe_xmit_pkts_simple;
>>>  		dev->tx_pkt_prepare = NULL;
>>> +		if (txq->tx_free_thresh <= RTE_TXGBE_TX_MAX_FREE_BUF_SZ &&
>>> +				(rte_eal_process_type() != RTE_PROC_PRIMARY ||
>>>
>>
>> Why vector Tx enable only for secondary process?
> 
> It is not only for secondary process. The constraint is
> 
> (rte_eal_process_type() != RTE_PROC_PRIMARY || txgbe_txq_vec_setup(txq) == 0)
> 
> This code references ixgbe, which explains:
> "When using multiple processes, the TX function used in all processes
>  should be the same, otherwise the secondary processes cannot transmit
>  more than tx-ring-size - 1 packets.
>  To achieve this, we extract out the code to select the ixgbe TX function
>  to be used into a separate function inside the ixgbe driver, and call
>  that from a secondary process when it is attaching to an
>  already-configured NIC."
> 

Got it,

1- Does txgbe have the constraint that the same Tx function should be used
for separate queues?
Tx functions are all in SW, right? The HW interface is the same, so HW doesn't
know or care whether vector Tx or simple Tx is used.
As primary and secondary processes manage different queues, I don't know
why this constraint exists.

2. I see the above logic prevents the secondary from calling
'txgbe_txq_vec_setup()' again. Perhaps unlikely, but technically, if
'txgbe_txq_vec_setup()' fails for the primary, 'txgbe_xmit_pkts_simple' is
set, while the secondary sets 'txgbe_xmit_pkts_vec', causing primary and
secondary to have different Tx functions; can you please check whether this
case is valid?


There are other comments not addressed; I assume they are accepted and
there will be a new version, but I want to highlight them in case they are
missed.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx
  2024-03-05  8:10     ` Jiawen Wu
  2024-03-21 16:21       ` Ferruh Yigit
@ 2024-03-21 16:27       ` Ferruh Yigit
  2024-03-21 17:55         ` Tyler Retzlaff
  1 sibling, 1 reply; 10+ messages in thread
From: Ferruh Yigit @ 2024-03-21 16:27 UTC (permalink / raw)
  To: Jiawen Wu, Honnappa Nagarahalli; +Cc: dev

On 3/5/2024 8:10 AM, Jiawen Wu wrote:

<...>

>>> +++ b/drivers/net/txgbe/txgbe_rxtx_vec_neon.c
>>> @@ -0,0 +1,604 @@
>>> +/* SPDX-License-Identifier: BSD-3-Clause
>>> + * Copyright(c) 2015-2024 Beijing WangXun Technology Co., Ltd.
>>> + * Copyright(c) 2010-2015 Intel Corporation
>>> + */
>>> +
>>> +#include <ethdev_driver.h>
>>> +#include <rte_malloc.h>
>>> +#include <rte_vect.h>
>>> +
>>> +#include "txgbe_ethdev.h"
>>> +#include "txgbe_rxtx.h"
>>> +#include "txgbe_rxtx_vec_common.h"
>>> +
>>> +#pragma GCC diagnostic ignored "-Wcast-qual"
>>> +
>>
>> Is this pragma really required?
> 
> Yes. Otherwise, there are warnings in the compilation:
> 
> [1909/2921] Compiling C object drivers/libtmp_rte_net_txgbe.a.p/net_txgbe_txgbe_rxtx_vec_neon.c.o
> ../drivers/net/txgbe/txgbe_rxtx_vec_neon.c: In function ‘txgbe_rxq_rearm’:
> ../drivers/net/txgbe/txgbe_rxtx_vec_neon.c:37:15: warning: cast discards ‘volatile’ qualifier from pointer target type [-Wcast-qual]
>      vst1q_u64((uint64_t *)&rxdp[i], zero);
>                ^
> ../drivers/net/txgbe/txgbe_rxtx_vec_neon.c:60:13: warning: cast discards ‘volatile’ qualifier from pointer target type [-Wcast-qual]
>    vst1q_u64((uint64_t *)rxdp++, dma_addr0);
>              ^
> ../drivers/net/txgbe/txgbe_rxtx_vec_neon.c:65:13: warning: cast discards ‘volatile’ qualifier from pointer target type [-Wcast-qual]
>    vst1q_u64((uint64_t *)rxdp++, dma_addr1);
> 

Hi Honnappa,

There are multiple drivers that ignore "-Wcast-qual" for their neon implementation.

Is there a better, more proper way to address this warning?


Thanks,
ferruh

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx
  2024-03-21 16:27       ` Ferruh Yigit
@ 2024-03-21 17:55         ` Tyler Retzlaff
  0 siblings, 0 replies; 10+ messages in thread
From: Tyler Retzlaff @ 2024-03-21 17:55 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: Jiawen Wu, Honnappa Nagarahalli, dev

On Thu, Mar 21, 2024 at 04:27:26PM +0000, Ferruh Yigit wrote:
> On 3/5/2024 8:10 AM, Jiawen Wu wrote:
> 
> <...>
> 
> >>> +++ b/drivers/net/txgbe/txgbe_rxtx_vec_neon.c
> >>> @@ -0,0 +1,604 @@
> >>> +/* SPDX-License-Identifier: BSD-3-Clause
> >>> + * Copyright(c) 2015-2024 Beijing WangXun Technology Co., Ltd.
> >>> + * Copyright(c) 2010-2015 Intel Corporation
> >>> + */
> >>> +
> >>> +#include <ethdev_driver.h>
> >>> +#include <rte_malloc.h>
> >>> +#include <rte_vect.h>
> >>> +
> >>> +#include "txgbe_ethdev.h"
> >>> +#include "txgbe_rxtx.h"
> >>> +#include "txgbe_rxtx_vec_common.h"
> >>> +
> >>> +#pragma GCC diagnostic ignored "-Wcast-qual"
> >>> +
> >>
> >> Is this pragma really required?
> > 
> > Yes. Otherwise, there are warnings in the compilation:
> > 
> > [1909/2921] Compiling C object drivers/libtmp_rte_net_txgbe.a.p/net_txgbe_txgbe_rxtx_vec_neon.c.o
> > ../drivers/net/txgbe/txgbe_rxtx_vec_neon.c: In function ‘txgbe_rxq_rearm’:
> > ../drivers/net/txgbe/txgbe_rxtx_vec_neon.c:37:15: warning: cast discards ‘volatile’ qualifier from pointer target type [-Wcast-qual]
> >      vst1q_u64((uint64_t *)&rxdp[i], zero);
> >                ^
> > ../drivers/net/txgbe/txgbe_rxtx_vec_neon.c:60:13: warning: cast discards ‘volatile’ qualifier from pointer target type [-Wcast-qual]
> >    vst1q_u64((uint64_t *)rxdp++, dma_addr0);
> >              ^
> > ../drivers/net/txgbe/txgbe_rxtx_vec_neon.c:65:13: warning: cast discards ‘volatile’ qualifier from pointer target type [-Wcast-qual]
> >    vst1q_u64((uint64_t *)rxdp++, dma_addr1);
> > 
> 
> Hi Honnappa,
> 
> There are multiple drivers ignores "-Wcast-qual" for neon implementation.
> 
> Is there a better, more proper way to address this warning?

Rather than suppress the warning, you could just cast away the volatile
qualifier; the common approach is to cast through uintptr_t to discard it.

volatile uint64_t *vp;
uint64_t *p = (uint64_t *)(uintptr_t)vp;

Perhaps better than broad suppression of warnings, but it does require the
code to be reviewed to be certain it is okay to have the qualifier removed,
which I suspect it is in functions that are inline assembly or intrinsics.
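
e.g. for the three call sites in the warning above, the (untested) change
would be:

vst1q_u64((uint64_t *)(uintptr_t)&rxdp[i], zero);    /* in txgbe_rxq_rearm() */
vst1q_u64((uint64_t *)(uintptr_t)rxdp++, dma_addr0);
vst1q_u64((uint64_t *)(uintptr_t)rxdp++, dma_addr1);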

> 
> 
> Thanks,
> ferruh

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx
  2024-03-21 16:21       ` Ferruh Yigit
@ 2024-04-07  8:32         ` Jiawen Wu
  0 siblings, 0 replies; 10+ messages in thread
From: Jiawen Wu @ 2024-04-07  8:32 UTC (permalink / raw)
  To: Ferruh.Yigit, dev

> >>> @@ -2198,8 +2220,15 @@ txgbe_set_tx_function(struct rte_eth_dev *dev, struct txgbe_tx_queue *txq)
> >>>  #endif
> >>>  			txq->tx_free_thresh >= RTE_PMD_TXGBE_TX_MAX_BURST) {
> >>>  		PMD_INIT_LOG(DEBUG, "Using simple tx code path");
> >>> -		dev->tx_pkt_burst = txgbe_xmit_pkts_simple;
> >>>  		dev->tx_pkt_prepare = NULL;
> >>> +		if (txq->tx_free_thresh <= RTE_TXGBE_TX_MAX_FREE_BUF_SZ &&
> >>> +				(rte_eal_process_type() != RTE_PROC_PRIMARY ||
> >>>
> >>
> >> Why vector Tx enable only for secondary process?
> >
> > It is not only for secondary process. The constraint is
> >
> > (rte_eal_process_type() != RTE_PROC_PRIMARY || txgbe_txq_vec_setup(txq) == 0)
> >
> > This code references ixgbe, which explains:
> > "When using multiple processes, the TX function used in all processes
> >  should be the same, otherwise the secondary processes cannot transmit
> >  more than tx-ring-size - 1 packets.
> >  To achieve this, we extract out the code to select the ixgbe TX function
> >  to be used into a separate function inside the ixgbe driver, and call
> >  that from a secondary process when it is attaching to an
> >  already-configured NIC."
> >
> 
> Got it,
> 
> 1- Is txgbe has the constraint that same Tx function should be used
> separate queues?
> Tx functions is all in SW, right? HW interface is same, so HW doesn't
> know or care vector Tx or simple Tx is used.
> As primary and secondary processes manage different queues, I don't know
> why this constraint exists.

In theory, the same Tx function needs to be used for different queues,
because some hardware configurations, such as MTU, are not per-queue.

> 2. I see above logic prevents secondary to call 'txgbe_txq_vec_setup()'
> again. Perhaps unlikely but technically, if 'txgbe_txq_vec_setup()'
> fails for primary 'txgbe_xmit_pkts_simple' is set and for secondary
> 'txgbe_xmit_pkts_vec' is set, causing both primary and secondary have
> different Tx functions, can you please check if this option is valid.

I wonder when 'txgbe_txq_vec_setup()' would fail. It should only be when there
is a memory allocation error, and then the application will fail to initialize
anyway, won't it?

> There are other comments not addressed, I assume they are accepted and
> there will be a new version, but I want to highlight in case they are
> missed.

Yes, other issues will be fixed in the next version.

I am sorry that I have been busy with other work these past months. I will
send the next version in the next couple of days.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-04-07  8:32 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-01  3:00 [PATCH 0/2] Wangxun support vector Rx/Tx Jiawen Wu
2024-02-01  3:00 ` [PATCH 1/2] net/txgbe: add vectorized functions for Rx/Tx Jiawen Wu
2024-02-07  3:13   ` Ferruh Yigit
2024-03-05  8:10     ` Jiawen Wu
2024-03-21 16:21       ` Ferruh Yigit
2024-04-07  8:32         ` Jiawen Wu
2024-03-21 16:27       ` Ferruh Yigit
2024-03-21 17:55         ` Tyler Retzlaff
2024-02-01  3:00 ` [PATCH 2/2] net/ngbe: " Jiawen Wu
2024-02-06  1:50 ` [PATCH 0/2] Wangxun support vector Rx/Tx Jiawen Wu
