DPDK patches and discussions
* [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
@ 2014-07-10 20:32 John W. Linville
  2014-07-11 13:11 ` Stephen Hemminger
                   ` (5 more replies)
  0 siblings, 6 replies; 76+ messages in thread
From: John W. Linville @ 2014-07-10 20:32 UTC
  To: dev

This is a Linux-specific virtual PMD driver backed by an AF_PACKET
socket.  This implementation uses mmap'ed ring buffers to limit copying
and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
AF_PACKET is used for frame reception.  In the current implementation,
Tx and Rx queues are always paired, and therefore are always equal
in number -- changing this would be a Simple Matter Of Programming.

Interfaces of this type are created with a command line option like
"--vdev=eth_packet0,iface=...".  There are a number of options available
as arguments:

 - Interface is chosen by "iface" (required)
 - Number of queue pairs set by "qpairs" (optional, default: 16)
 - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
 - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
 - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)

Signed-off-by: John W. Linville <linville@tuxdriver.com>
---
This PMD is intended to provide a means for using DPDK on a broad
range of hardware without hardware-specific PMDs and (hopefully)
with better performance than what PCAP offers in Linux.  This might
be useful as a development platform for DPDK applications when
DPDK-supported hardware is expensive or unavailable.
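
As a rough illustration of how the MMAP options relate to one another,
here is a minimal sketch of the ring sizing arithmetic the driver
performs.  The struct and function names are hypothetical and are not
part of the patch; only the arithmetic (block count derived as
framecnt / (blocksz / framesz), and one Rx ring plus one Tx ring mapped
per queue pair) follows the code below.

```c
#include <stddef.h>

/* Hypothetical container for the derived ring layout (not in the patch). */
struct ring_geometry {
	unsigned int block_count; /* value passed as tp_block_nr */
	size_t ring_bytes;        /* size of a single Rx or Tx ring */
	size_t map_bytes;         /* size of the mmap'ed region: Rx + Tx */
};

/*
 * Hypothetical helper mirroring the patch's parameter checks:
 * the frame size may not exceed the block size, and the frame count
 * must fill at least one whole block.  Returns 0 on success, -1 on
 * invalid parameters.
 */
static int
ring_geometry_compute(unsigned int block_size, unsigned int frame_size,
		      unsigned int frame_count, struct ring_geometry *out)
{
	unsigned int frames_per_block;

	if (block_size == 0 || frame_size == 0 || frame_size > block_size)
		return -1;

	frames_per_block = block_size / frame_size;
	out->block_count = frame_count / frames_per_block;
	if (out->block_count == 0)
		return -1;

	out->ring_bytes = (size_t)block_size * out->block_count;
	/* The Rx and Tx rings share one mapping, Rx first. */
	out->map_bytes = 2 * out->ring_bytes;
	return 0;
}
```

With the defaults (blocksz=4096, framesz=2048, framecnt=512) this works
out to 256 blocks, a 1 MiB ring in each direction, and a 2 MiB mapping
per queue pair.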

 config/common_bsdapp                   |   5 +
 config/common_linuxapp                 |   5 +
 lib/Makefile                           |   1 +
 lib/librte_eal/linuxapp/eal/Makefile   |   1 +
 lib/librte_pmd_packet/Makefile         |  60 +++
 lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
 lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
 mk/rte.app.mk                          |   4 +
 8 files changed, 957 insertions(+)
 create mode 100644 lib/librte_pmd_packet/Makefile
 create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
 create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h

diff --git a/config/common_bsdapp b/config/common_bsdapp
index 943dce8f1ede..c317f031278e 100644
--- a/config/common_bsdapp
+++ b/config/common_bsdapp
@@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
 CONFIG_RTE_LIBRTE_PMD_BOND=y
 
 #
+# Compile software PMD backed by AF_PACKET sockets (Linux only)
+#
+CONFIG_RTE_LIBRTE_PMD_PACKET=n
+
+#
 # Do prefetch of packet data within PMD driver receive function
 #
 CONFIG_RTE_PMD_PACKET_PREFETCH=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 7bf5d80d4e26..f9e7bc3015ec 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
 CONFIG_RTE_LIBRTE_PMD_BOND=y
 
 #
+# Compile software PMD backed by AF_PACKET sockets (Linux only)
+#
+CONFIG_RTE_LIBRTE_PMD_PACKET=y
+
+#
 # Compile Xen PMD
 #
 CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
diff --git a/lib/Makefile b/lib/Makefile
index 10c5bb3045bc..930fadf29898 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
 DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
 DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 756d6b0c9301..feed24a63272 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
 CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
+CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
 CFLAGS += $(WERROR_FLAGS) -O3
 
diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
new file mode 100644
index 000000000000..e1266fb992cd
--- /dev/null
+++ b/lib/librte_pmd_packet/Makefile
@@ -0,0 +1,60 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
+#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+#   Copyright(c) 2014 6WIND S.A.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_pmd_packet.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
+
+#
+# Export include files
+#
+SYMLINK-y-include += rte_eth_packet.h
+
+# this lib depends upon:
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
new file mode 100644
index 000000000000..fceb6258aad6
--- /dev/null
+++ b/lib/librte_pmd_packet/rte_eth_packet.c
@@ -0,0 +1,826 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
+ *
+ *   Originally based upon librte_pmd_pcap code:
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2014 6WIND S.A.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_mbuf.h>
+#include <rte_ethdev.h>
+#include <rte_malloc.h>
+#include <rte_kvargs.h>
+#include <rte_dev.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <poll.h>
+
+#include "rte_eth_packet.h"
+
+#define ETH_PACKET_IFACE_ARG		"iface"
+#define ETH_PACKET_NUM_Q_ARG		"qpairs"
+#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
+#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
+#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
+
+#define DFLT_BLOCK_SIZE		(1 << 12)
+#define DFLT_FRAME_SIZE		(1 << 11)
+#define DFLT_FRAME_COUNT	(1 << 9)
+
+struct pkt_rx_queue {
+	int sockfd;
+
+	struct iovec *rd;
+	uint8_t *map;
+	unsigned int framecount;
+	unsigned int framenum;
+
+	struct rte_mempool *mb_pool;
+
+	volatile unsigned long rx_pkts;
+	volatile unsigned long err_pkts;
+};
+
+struct pkt_tx_queue {
+	int sockfd;
+
+	struct iovec *rd;
+	uint8_t *map;
+	unsigned int framecount;
+	unsigned int framenum;
+
+	volatile unsigned long tx_pkts;
+	volatile unsigned long err_pkts;
+};
+
+struct pmd_internals {
+	unsigned nb_queues;
+
+	int if_index;
+	struct ether_addr eth_addr;
+
+	struct tpacket_req req;
+
+	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
+	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
+};
+
+static const char *valid_arguments[] = {
+	ETH_PACKET_IFACE_ARG,
+	ETH_PACKET_NUM_Q_ARG,
+	ETH_PACKET_BLOCKSIZE_ARG,
+	ETH_PACKET_FRAMESIZE_ARG,
+	ETH_PACKET_FRAMECOUNT_ARG,
+	NULL
+};
+
+static const char *drivername = "AF_PACKET PMD";
+
+static struct rte_eth_link pmd_link = {
+	.link_speed = 10000,
+	.link_duplex = ETH_LINK_FULL_DUPLEX,
+	.link_status = 0
+};
+
+static uint16_t
+eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	unsigned i;
+	struct tpacket2_hdr *ppd;
+	struct rte_mbuf *mbuf;
+	uint8_t *pbuf;
+	struct pkt_rx_queue *pkt_q = queue;
+	uint16_t num_rx = 0;
+	unsigned int framecount, framenum;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	/*
+	 * Reads the given number of packets from the AF_PACKET socket one by
+	 * one and copies the packet data into a newly allocated mbuf.
+	 */
+	framecount = pkt_q->framecount;
+	framenum = pkt_q->framenum;
+	for (i = 0; i < nb_pkts; i++) {
+		/* point at the next incoming frame */
+		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+		if ((ppd->tp_status & TP_STATUS_USER) == 0)
+			break;
+
+		/* allocate the next mbuf */
+		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
+		if (unlikely(mbuf == NULL))
+			break;
+
+		/* packet will fit in the mbuf, go ahead and receive it */
+		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
+		pbuf = (uint8_t *) ppd + ppd->tp_mac;
+		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
+
+		/* release incoming frame and advance ring buffer */
+		ppd->tp_status = TP_STATUS_KERNEL;
+		if (++framenum >= framecount)
+			framenum = 0;
+
+		/* account for the receive frame */
+		bufs[i] = mbuf;
+		num_rx++;
+	}
+	pkt_q->framenum = framenum;
+	pkt_q->rx_pkts += num_rx;
+	return num_rx;
+}
+
+/*
+ * Callback to handle sending packets through a real NIC.
+ */
+static uint16_t
+eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct tpacket2_hdr *ppd;
+	struct rte_mbuf *mbuf;
+	uint8_t *pbuf;
+	unsigned int framecount, framenum;
+	struct pollfd pfd;
+	struct pkt_tx_queue *pkt_q = queue;
+	uint16_t num_tx = 0;
+	int i;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	memset(&pfd, 0, sizeof(pfd));
+	pfd.fd = pkt_q->sockfd;
+	pfd.events = POLLOUT;
+	pfd.revents = 0;
+
+	framecount = pkt_q->framecount;
+	framenum = pkt_q->framenum;
+	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+	for (i = 0; i < nb_pkts; i++) {
+		/* wait until the next Tx frame is available */
+		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
+		    (poll(&pfd, 1, -1) < 0))
+				continue;
+
+		/* copy the tx frame data */
+		mbuf = bufs[num_tx];
+		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
+			sizeof(struct sockaddr_ll);
+		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
+		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
+
+		/* queue the frame for sending and advance ring buffer */
+		ppd->tp_status = TP_STATUS_SEND_REQUEST;
+		if (++framenum >= framecount)
+			framenum = 0;
+		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+
+		num_tx++;
+		rte_pktmbuf_free(mbuf);
+	}
+
+	/* kick-off transmits */
+	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+
+	pkt_q->framenum = framenum;
+	pkt_q->tx_pkts += num_tx;
+	pkt_q->err_pkts += nb_pkts - num_tx;
+	return num_tx;
+}
+
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+	dev->data->dev_link.link_status = 1;
+	return 0;
+}
+
+/*
+ * This function gets called when the current port gets stopped.
+ */
+static void
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+	unsigned i;
+	int sockfd;
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	for (i = 0; i < internals->nb_queues; i++) {
+		sockfd = internals->rx_queue[i].sockfd;
+		if(sockfd != -1)
+			close(sockfd);
+		sockfd = internals->tx_queue[i].sockfd;
+		if(sockfd != -1)
+			close(sockfd);
+	}
+
+	dev->data->dev_link.link_status = 0;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev_info->driver_name = drivername;
+	dev_info->if_index = internals->if_index;
+	dev_info->max_mac_addrs = 1;
+	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
+	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
+	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
+	dev_info->min_rx_bufsize = 0;
+	dev_info->pci_dev = NULL;
+}
+
+static void
+eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
+{
+	unsigned i, imax;
+	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
+	const struct pmd_internals *internal = dev->data->dev_private;
+
+	memset(igb_stats, 0, sizeof(*igb_stats));
+
+	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
+	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
+	for (i = 0; i < imax; i++) {
+		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
+		rx_total += igb_stats->q_ipackets[i];
+	}
+
+	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
+	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
+	for (i = 0; i < imax; i++) {
+		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
+		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
+		tx_total += igb_stats->q_opackets[i];
+		tx_err_total += igb_stats->q_errors[i];
+	}
+
+	igb_stats->ipackets = rx_total;
+	igb_stats->opackets = tx_total;
+	igb_stats->oerrors = tx_err_total;
+}
+
+static void
+eth_stats_reset(struct rte_eth_dev *dev)
+{
+	unsigned i;
+	struct pmd_internals *internal = dev->data->dev_private;
+
+	for (i = 0; i < internal->nb_queues; i++)
+		internal->rx_queue[i].rx_pkts = 0;
+
+	for (i = 0; i < internal->nb_queues; i++) {
+		internal->tx_queue[i].tx_pkts = 0;
+		internal->tx_queue[i].err_pkts = 0;
+	}
+}
+
+static void
+eth_dev_close(struct rte_eth_dev *dev __rte_unused)
+{
+}
+
+static void
+eth_queue_release(void *q __rte_unused)
+{
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev __rte_unused,
+                int wait_to_complete __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev,
+                   uint16_t rx_queue_id,
+                   uint16_t nb_rx_desc __rte_unused,
+                   unsigned int socket_id __rte_unused,
+                   const struct rte_eth_rxconf *rx_conf __rte_unused,
+                   struct rte_mempool *mb_pool)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
+	struct rte_pktmbuf_pool_private *mbp_priv;
+	uint16_t buf_size;
+
+	pkt_q->mb_pool = mb_pool;
+
+	/* Now get the space available for data in the mbuf */
+	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
+	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
+	                       RTE_PKTMBUF_HEADROOM);
+
+	if (ETH_FRAME_LEN > buf_size) {
+		RTE_LOG(ERR, PMD,
+			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
+			dev->data->name, ETH_FRAME_LEN, buf_size);
+		return -ENOMEM;
+	}
+
+	dev->data->rx_queues[rx_queue_id] = pkt_q;
+
+	return 0;
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev,
+                   uint16_t tx_queue_id,
+                   uint16_t nb_tx_desc __rte_unused,
+                   unsigned int socket_id __rte_unused,
+                   const struct rte_eth_txconf *tx_conf __rte_unused)
+{
+
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
+	return 0;
+}
+
+static struct eth_dev_ops ops = {
+	.dev_start = eth_dev_start,
+	.dev_stop = eth_dev_stop,
+	.dev_close = eth_dev_close,
+	.dev_configure = eth_dev_configure,
+	.dev_infos_get = eth_dev_info,
+	.rx_queue_setup = eth_rx_queue_setup,
+	.tx_queue_setup = eth_tx_queue_setup,
+	.rx_queue_release = eth_queue_release,
+	.tx_queue_release = eth_queue_release,
+	.link_update = eth_link_update,
+	.stats_get = eth_stats_get,
+	.stats_reset = eth_stats_reset,
+};
+
+/*
+ * Opens an AF_PACKET socket
+ */
+static int
+open_packet_iface(const char *key __rte_unused,
+                  const char *value __rte_unused,
+                  void *extra_args)
+{
+	int *sockfd = extra_args;
+
+	/* Open an AF_PACKET socket... */
+	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+	if (*sockfd == -1) {
+		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+rte_pmd_init_internals(const char *name,
+                       const int sockfd,
+                       const unsigned nb_queues,
+                       unsigned int blocksize,
+                       unsigned int blockcnt,
+                       unsigned int framesize,
+                       unsigned int framecnt,
+                       const unsigned numa_node,
+                       struct pmd_internals **internals,
+                       struct rte_eth_dev **eth_dev,
+                       struct rte_kvargs *kvlist)
+{
+	struct rte_eth_dev_data *data = NULL;
+	struct rte_pci_device *pci_dev = NULL;
+	struct rte_kvargs_pair *pair = NULL;
+	struct ifreq ifr;
+	size_t ifnamelen;
+	unsigned k_idx;
+	struct sockaddr_ll sockaddr;
+	struct tpacket_req *req;
+	struct pkt_rx_queue *rx_queue;
+	struct pkt_tx_queue *tx_queue;
+	int rc, tpver, discard, bypass;
+	unsigned int i, q, rdsize;
+	int qsockfd, fanout_arg;
+
+	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
+		pair = &kvlist->pairs[k_idx];
+		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
+			break;
+	}
+	if (pair == NULL) {
+		RTE_LOG(ERR, PMD,
+			"%s: no interface specified for AF_PACKET ethdev\n",
+		        name);
+		goto error;
+	}
+
+	RTE_LOG(INFO, PMD,
+		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
+		name, numa_node);
+
+	/*
+	 * now do all data allocation - for eth_dev structure, dummy pci driver
+	 * and internal (private) data
+	 */
+	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
+	if (data == NULL)
+		goto error;
+
+	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
+	if (pci_dev == NULL)
+		goto error;
+
+	*internals = rte_zmalloc_socket(name, sizeof(**internals),
+	                                0, numa_node);
+	if (*internals == NULL)
+		goto error;
+
+	req = &((*internals)->req);
+
+	req->tp_block_size = blocksize;
+	req->tp_block_nr = blockcnt;
+	req->tp_frame_size = framesize;
+	req->tp_frame_nr = framecnt;
+
+	ifnamelen = strlen(pair->value);
+	if (ifnamelen < sizeof(ifr.ifr_name)) {
+		memcpy(ifr.ifr_name, pair->value, ifnamelen);
+		ifr.ifr_name[ifnamelen]='\0';
+	} else {
+		RTE_LOG(ERR, PMD,
+			"%s: I/F name too long (%s)\n",
+			name, pair->value);
+		goto error;
+	}
+	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
+		RTE_LOG(ERR, PMD,
+			"%s: ioctl failed (SIOCGIFINDEX)\n",
+		        name);
+		goto error;
+	}
+	(*internals)->if_index = ifr.ifr_ifindex;
+
+	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
+		RTE_LOG(ERR, PMD,
+			"%s: ioctl failed (SIOCGIFHWADDR)\n",
+		        name);
+		goto error;
+	}
+	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
+
+	memset(&sockaddr, 0, sizeof(sockaddr));
+	sockaddr.sll_family = AF_PACKET;
+	sockaddr.sll_protocol = htons(ETH_P_ALL);
+	sockaddr.sll_ifindex = (*internals)->if_index;
+
+	fanout_arg = getpid() & 0xffff;
+	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
+	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
+
+	for (q = 0; q < nb_queues; q++) {
+		/* Open an AF_PACKET socket for this queue... */
+		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+		if (qsockfd == -1) {
+			RTE_LOG(ERR, PMD,
+			        "%s: could not open AF_PACKET socket\n",
+			        name);
+			return -1;
+		}
+
+		tpver = TPACKET_V2;
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
+				&tpver, sizeof(tpver));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_VERSION on AF_PACKET "
+				"socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		discard = 1;
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
+				&discard, sizeof(discard));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_LOSS on "
+			        "AF_PACKET socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		bypass = 1;
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
+				&bypass, sizeof(bypass));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_QDISC_BYPASS "
+			        "on AF_PACKET socket for %s\n", name,
+			        pair->value);
+			goto error;
+		}
+
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_RX_RING on AF_PACKET "
+				"socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_TX_RING on AF_PACKET "
+				"socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		rx_queue = &((*internals)->rx_queue[q]);
+		rx_queue->framecount = req->tp_frame_nr;
+
+		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
+				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
+				    qsockfd, 0);
+		if (rx_queue->map == MAP_FAILED) {
+			RTE_LOG(ERR, PMD,
+				"%s: call to mmap failed on AF_PACKET socket for %s\n",
+				name, pair->value);
+			goto error;
+		}
+
+		/* rdsize is same for both Tx and Rx */
+		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
+
+		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
+		for (i = 0; i < req->tp_frame_nr; ++i) {
+			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
+			rx_queue->rd[i].iov_len = req->tp_frame_size;
+		}
+		rx_queue->sockfd = qsockfd;
+
+		tx_queue = &((*internals)->tx_queue[q]);
+		tx_queue->framecount = req->tp_frame_nr;
+
+		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
+
+		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
+		for (i = 0; i < req->tp_frame_nr; ++i) {
+			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
+			tx_queue->rd[i].iov_len = req->tp_frame_size;
+		}
+		tx_queue->sockfd = qsockfd;
+
+		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not bind AF_PACKET socket to %s\n",
+			        name, pair->value);
+			goto error;
+		}
+
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
+				&fanout_arg, sizeof(fanout_arg));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
+				"for %s\n", name, pair->value);
+			goto error;
+		}
+	}
+
+	/* reserve an ethdev entry */
+	*eth_dev = rte_eth_dev_allocate(name);
+	if (*eth_dev == NULL)
+		goto error;
+
+	/*
+	 * now put it all together
+	 * - store queue data in internals,
+	 * - store numa_node info in pci_driver
+	 * - point eth_dev_data to internals and pci_driver
+	 * - and point eth_dev structure to new eth_dev_data structure
+	 */
+
+	(*internals)->nb_queues = nb_queues;
+
+	data->dev_private = *internals;
+	data->port_id = (*eth_dev)->data->port_id;
+	data->nb_rx_queues = (uint16_t)nb_queues;
+	data->nb_tx_queues = (uint16_t)nb_queues;
+	data->dev_link = pmd_link;
+	data->mac_addrs = &(*internals)->eth_addr;
+
+	pci_dev->numa_node = numa_node;
+
+	(*eth_dev)->data = data;
+	(*eth_dev)->dev_ops = &ops;
+	(*eth_dev)->pci_dev = pci_dev;
+
+	return 0;
+
+error:
+	if (data)
+		rte_free(data);
+	if (pci_dev)
+		rte_free(pci_dev);
+	for (q = 0; *internals != NULL && q < nb_queues; q++) {
+		if ((*internals)->rx_queue[q].rd)
+			rte_free((*internals)->rx_queue[q].rd);
+		if ((*internals)->tx_queue[q].rd)
+			rte_free((*internals)->tx_queue[q].rd);
+	}
+	if (*internals)
+		rte_free(*internals);
+	return -1;
+}
+
+static int
+rte_eth_from_packet(const char *name,
+                    int const *sockfd,
+                    const unsigned numa_node,
+                    struct rte_kvargs *kvlist)
+{
+	struct pmd_internals *internals = NULL;
+	struct rte_eth_dev *eth_dev = NULL;
+	struct rte_kvargs_pair *pair = NULL;
+	unsigned k_idx;
+	unsigned int blockcount;
+	unsigned int blocksize = DFLT_BLOCK_SIZE;
+	unsigned int framesize = DFLT_FRAME_SIZE;
+	unsigned int framecount = DFLT_FRAME_COUNT;
+	unsigned int qpairs = RTE_PMD_PACKET_MAX_RINGS;
+
+	/* do some parameter checking */
+	if (*sockfd < 0)
+		return -1;
+
+	/*
+	 * Walk arguments for configurable settings
+	 */
+	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
+		pair = &kvlist->pairs[k_idx];
+		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
+			qpairs = atoi(pair->value);
+			if (qpairs < 1 ||
+			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid qpairs value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
+			blocksize = atoi(pair->value);
+			if (!blocksize) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid blocksize value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
+			framesize = atoi(pair->value);
+			if (!framesize) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid framesize value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
+			framecount = atoi(pair->value);
+			if (!framecount) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid framecount value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+	}
+
+	if (framesize > blocksize) {
+		RTE_LOG(ERR, PMD,
+			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
+		        name);
+		return -1;
+	}
+
+	blockcount = framecount / (blocksize / framesize);
+	if (!blockcount) {
+		RTE_LOG(ERR, PMD,
+			"%s: invalid AF_PACKET MMAP parameters\n", name);
+		return -1;
+	}
+
+	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
+	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
+	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
+	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
+	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
+
+	if (rte_pmd_init_internals(name, *sockfd, qpairs,
+	                           blocksize, blockcount,
+	                           framesize, framecount,
+	                           numa_node, &internals, &eth_dev,
+	                           kvlist) < 0)
+		return -1;
+
+	eth_dev->rx_pkt_burst = eth_packet_rx;
+	eth_dev->tx_pkt_burst = eth_packet_tx;
+
+	return 0;
+}
+
+int
+rte_pmd_packet_devinit(const char *name, const char *params)
+{
+	unsigned numa_node;
+	int ret;
+	struct rte_kvargs *kvlist;
+	int sockfd = -1;
+
+	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
+
+	numa_node = rte_socket_id();
+
+	kvlist = rte_kvargs_parse(params, valid_arguments);
+	if (kvlist == NULL)
+		return -1;
+
+	/*
+	 * If iface argument is passed we open the NICs and use them for
+	 * reading / writing
+	 */
+	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
+
+		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
+		                         &open_packet_iface, &sockfd);
+		if (ret < 0)
+			return -1;
+	}
+
+	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
+	close(sockfd); /* no longer needed */
+
+	if (ret < 0)
+		return -1;
+
+	return 0;
+}
+
+static struct rte_driver pmd_packet_drv = {
+	.name = "eth_packet",
+	.type = PMD_VDEV,
+	.init = rte_pmd_packet_devinit,
+};
+
+PMD_REGISTER_DRIVER(pmd_packet_drv);
diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
new file mode 100644
index 000000000000..f685611da3e9
--- /dev/null
+++ b/lib/librte_pmd_packet/rte_eth_packet.h
@@ -0,0 +1,55 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ETH_PACKET_H_
+#define _RTE_ETH_PACKET_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
+
+#define RTE_PMD_PACKET_MAX_RINGS 16
+
+/**
+ * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
+ * configured on command line.
+ */
+int rte_pmd_packet_devinit(const char *name, const char *params);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 34dff2a02a05..a6994c4dbe93 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
 LDLIBS += -lrte_pmd_pcap -lpcap
 endif
 
+ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
+LDLIBS += -lrte_pmd_packet
+endif
+
 endif # plugins
 
 LDLIBS += $(EXECENV_LDLIBS)
-- 
1.9.3


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-10 20:32 [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices John W. Linville
@ 2014-07-11 13:11 ` Stephen Hemminger
  2014-07-11 14:49   ` John W. Linville
  2014-07-11 13:26 ` Thomas Monjalon
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 76+ messages in thread
From: Stephen Hemminger @ 2014-07-11 13:11 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

On Thu, 10 Jul 2014 16:32:49 -0400
"John W. Linville" <linville@tuxdriver.com> wrote:

> This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> socket.  This implementation uses mmap'ed ring buffers to limit copying
> and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> AF_PACKET is used for frame reception.  In the current implementation,
> Tx and Rx queues are always paired, and therefore are always equal
> in number -- changing this would be a Simple Matter Of Programming.
> 
> Interfaces of this type are created with a command line option like
> "--vdev=eth_packet0,iface=...".  There are a number of options available
> as arguments:
> 
>  - Interface is chosen by "iface" (required)
>  - Number of queue pairs set by "qpairs" (optional, default: 16)
>  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
>  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
>  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> 
> Signed-off-by: John W. Linville <linville@tuxdriver.com>
> ---
> This PMD is intended to provide a means for using DPDK on a broad
> range of hardware without hardware-specific PMDs and (hopefully)
> with better performance than what PCAP offers in Linux.  This might
> be useful as a development platform for DPDK applications when
> DPDK-supported hardware is expensive or unavailable.
> 
>  config/common_bsdapp                   |   5 +
>  config/common_linuxapp                 |   5 +
>  lib/Makefile                           |   1 +
>  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
>  lib/librte_pmd_packet/Makefile         |  60 +++
>  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
>  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
>  mk/rte.app.mk                          |   4 +
>  8 files changed, 957 insertions(+)
>  create mode 100644 lib/librte_pmd_packet/Makefile
>  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
>  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> 
> diff --git a/config/common_bsdapp b/config/common_bsdapp
> index 943dce8f1ede..c317f031278e 100644
> --- a/config/common_bsdapp
> +++ b/config/common_bsdapp
> @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
>  CONFIG_RTE_LIBRTE_PMD_BOND=y
>  
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> +#
> +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> +
> +#
>  # Do prefetch of packet data within PMD driver receive function
>  #
>  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> diff --git a/config/common_linuxapp b/config/common_linuxapp
> index 7bf5d80d4e26..f9e7bc3015ec 100644
> --- a/config/common_linuxapp
> +++ b/config/common_linuxapp
> @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
>  CONFIG_RTE_LIBRTE_PMD_BOND=y
>  
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> +#
> +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> +
> +#
>  # Compile Xen PMD
>  #
>  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> diff --git a/lib/Makefile b/lib/Makefile
> index 10c5bb3045bc..930fadf29898 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
>  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
>  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> index 756d6b0c9301..feed24a63272 100644
> --- a/lib/librte_eal/linuxapp/eal/Makefile
> +++ b/lib/librte_eal/linuxapp/eal/Makefile
> @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
>  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
>  CFLAGS += $(WERROR_FLAGS) -O3
>  
> diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> new file mode 100644
> index 000000000000..e1266fb992cd
> --- /dev/null
> +++ b/lib/librte_pmd_packet/Makefile
> @@ -0,0 +1,60 @@
> +#   BSD LICENSE
> +#
> +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> +#   Copyright(c) 2014 6WIND S.A.
> +#   All rights reserved.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Intel Corporation nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +#
> +# library name
> +#
> +LIB = librte_pmd_packet.a
> +
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +
> +#
> +# all source are stored in SRCS-y
> +#
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> +
> +#
> +# Export include files
> +#
> +SYMLINK-y-include += rte_eth_packet.h
> +
> +# this lib depends upon:
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> new file mode 100644
> index 000000000000..fceb6258aad6
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> @@ -0,0 +1,826 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> + *
> + *   Originally based upon librte_pmd_pcap code:
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   Copyright(c) 2014 6WIND S.A.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <rte_mbuf.h>
> +#include <rte_ethdev.h>
> +#include <rte_malloc.h>
> +#include <rte_kvargs.h>
> +#include <rte_dev.h>
> +
> +#include <linux/if_ether.h>
> +#include <linux/if_packet.h>
> +#include <arpa/inet.h>
> +#include <net/if.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +#include <poll.h>
> +
> +#include "rte_eth_packet.h"
> +
> +#define ETH_PACKET_IFACE_ARG		"iface"
> +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> +
> +#define DFLT_BLOCK_SIZE		(1 << 12)
> +#define DFLT_FRAME_SIZE		(1 << 11)
> +#define DFLT_FRAME_COUNT	(1 << 9)
> +
> +struct pkt_rx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	struct rte_mempool *mb_pool;
> +
> +	volatile unsigned long rx_pkts;
> +	volatile unsigned long err_pkts;

Use of volatile will generate slow code. I don't think it is
necessary, especially since only one CPU can use a queue at a time.
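
As a rough sketch of the suggestion (not code from the patch; the names
rxq_stats, rxq_account_burst and rxq_read_rx_pkts are hypothetical
stand-ins), the counters could be plain fields that the owning core
updates once per burst. This assumes each queue is polled by a single
core and that a stats reader on another core can tolerate a slightly
stale value:

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the counters in the patch's pkt_rx_queue.
 * The fields are plain (not volatile), so the compiler is free to keep
 * intermediate values in registers inside the burst loop.
 */
struct rxq_stats {
	unsigned long rx_pkts;
	unsigned long err_pkts;
};

/*
 * Hypothetical per-burst accounting: the caller counts packets in
 * local variables while walking the ring and stores the totals once,
 * after the loop has finished.
 */
static inline uint16_t
rxq_account_burst(struct rxq_stats *q, uint16_t nb_ok, uint16_t nb_err)
{
	q->rx_pkts += nb_ok;
	q->err_pkts += nb_err;
	return nb_ok;
}

/*
 * Reader side, for example a stats_get callback. Assumption: an
 * approximate, possibly stale count is acceptable here.
 */
static inline unsigned long
rxq_read_rx_pkts(const struct rxq_stats *q)
{
	return q->rx_pkts;
}
```

The same shape would apply to tx_pkts and err_pkts in pkt_tx_queue. If
exact cross-core visibility were ever required, that would call for
atomics rather than volatile.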

> +};
> +
> +struct pkt_tx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	volatile unsigned long tx_pkts;
> +	volatile unsigned long err_pkts;
> +};
> +
> +struct pmd_internals {
> +	unsigned nb_queues;
> +
> +	int if_index;
> +	struct ether_addr eth_addr;
> +
> +	struct tpacket_req req;
> +
> +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +};
> +
> +static const char *valid_arguments[] = {
> +	ETH_PACKET_IFACE_ARG,
> +	ETH_PACKET_NUM_Q_ARG,
> +	ETH_PACKET_BLOCKSIZE_ARG,
> +	ETH_PACKET_FRAMESIZE_ARG,
> +	ETH_PACKET_FRAMECOUNT_ARG,
> +	NULL
> +};
> +
> +static const char *drivername = "AF_PACKET PMD";
> +
> +static struct rte_eth_link pmd_link = {
> +	.link_speed = 10000,
> +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> +	.link_status = 0
> +};
> +
> +static uint16_t
> +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	unsigned i;
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	struct pkt_rx_queue *pkt_q = queue;
> +	uint16_t num_rx = 0;
> +	unsigned int framecount, framenum;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	/*
> +	 * Reads the given number of packets from the AF_PACKET socket one by
> +	 * one and copies the packet data into a newly allocated mbuf.
> +	 */
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* point at the next incoming frame */
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> +			break;
> +
> +		/* allocate the next mbuf */
> +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> +		if (unlikely(mbuf == NULL))
> +			break;
> +
> +		/* packet will fit in the mbuf, go ahead and receive it */
> +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> +
> +		/* release incoming frame and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_KERNEL;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +
> +		/* account for the receive frame */
> +		bufs[i] = mbuf;
> +		num_rx++;
> +	}
> +	pkt_q->framenum = framenum;
> +	pkt_q->rx_pkts += num_rx;
> +	return num_rx;
> +}
> +
> +/*
> + * Callback to handle sending packets through a real NIC.
> + */
> +static uint16_t
> +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	unsigned int framecount, framenum;
> +	struct pollfd pfd;
> +	struct pkt_tx_queue *pkt_q = queue;
> +	uint16_t num_tx = 0;
> +	int i;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	memset(&pfd, 0, sizeof(pfd));
> +	pfd.fd = pkt_q->sockfd;
> +	pfd.events = POLLOUT;
> +	pfd.revents = 0;
> +
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* point at the next incoming frame */
> +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> +		    (poll(&pfd, 1, -1) < 0))
> +				continue;
> +
> +		/* copy the tx frame data */
> +		mbuf = bufs[num_tx];
> +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> +			sizeof(struct sockaddr_ll);
> +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> +
> +		/* release incoming frame and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +
> +		num_tx++;
> +		rte_pktmbuf_free(mbuf);
> +	}
> +
> +	/* kick-off transmits */
> +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> +
> +	pkt_q->framenum = framenum;
> +	pkt_q->tx_pkts += num_tx;
> +	pkt_q->err_pkts += nb_pkts - num_tx;
> +	return num_tx;
> +}
> +
> +static int
> +eth_dev_start(struct rte_eth_dev *dev)
> +{
> +	dev->data->dev_link.link_status = 1;
> +	return 0;
> +}
> +
> +/*
> + * This function gets called when the current port gets stopped.
> + */
> +static void
> +eth_dev_stop(struct rte_eth_dev *dev)
> +{
> +	unsigned i;
> +	int sockfd;
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	for (i = 0; i < internals->nb_queues; i++) {
> +		sockfd = internals->rx_queue[i].sockfd;
> +		if(sockfd != -1)
> +			close(sockfd);
> +		sockfd = internals->tx_queue[i].sockfd;
> +		if(sockfd != -1)
> +			close(sockfd);
> +	}
> +
> +	dev->data->dev_link.link_status = 0;
> +}
> +
> +static int
> +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> +{
> +	return 0;
> +}
> +
> +static void
> +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> +{
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev_info->driver_name = drivername;
> +	dev_info->if_index = internals->if_index;
> +	dev_info->max_mac_addrs = 1;
> +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->min_rx_bufsize = 0;
> +	dev_info->pci_dev = NULL;
> +}
> +
> +static void
> +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> +{
> +	unsigned i, imax;
> +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> +	const struct pmd_internals *internal = dev->data->dev_private;
> +
> +	memset(igb_stats, 0, sizeof(*igb_stats));
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> +		rx_total += igb_stats->q_ipackets[i];
> +	}
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> +		tx_total += igb_stats->q_opackets[i];
> +		tx_err_total += igb_stats->q_errors[i];
> +	}
> +
> +	igb_stats->ipackets = rx_total;
> +	igb_stats->opackets = tx_total;
> +	igb_stats->oerrors = tx_err_total;
> +}
> +
> +static void
> +eth_stats_reset(struct rte_eth_dev *dev)
> +{
> +	unsigned i;
> +	struct pmd_internals *internal = dev->data->dev_private;
> +
> +	for (i = 0; i < internal->nb_queues; i++)
> +		internal->rx_queue[i].rx_pkts = 0;
> +
> +	for (i = 0; i < internal->nb_queues; i++) {
> +		internal->tx_queue[i].tx_pkts = 0;
> +		internal->tx_queue[i].err_pkts = 0;
> +	}
> +}
> +
> +static void
> +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> +{
> +}
> +
> +static void
> +eth_queue_release(void *q __rte_unused)
> +{
> +}
> +
> +static int
> +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> +                int wait_to_complete __rte_unused)
> +{
> +	return 0;
> +}
> +
> +static int
> +eth_rx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t rx_queue_id,
> +                   uint16_t nb_rx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> +                   struct rte_mempool *mb_pool)
> +{
> +	struct pmd_internals *internals = dev->data->dev_private;
> +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> +	struct rte_pktmbuf_pool_private *mbp_priv;
> +	uint16_t buf_size;
> +
> +	pkt_q->mb_pool = mb_pool;
> +
> +	/* Now get the space available for data in the mbuf */
> +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> +	                       RTE_PKTMBUF_HEADROOM);
> +
> +	if (ETH_FRAME_LEN > buf_size) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> +			dev->data->name, ETH_FRAME_LEN, buf_size);
> +		return -ENOMEM;
> +	}
> +
> +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> +
> +	return 0;
> +}
> +
> +static int
> +eth_tx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t tx_queue_id,
> +                   uint16_t nb_tx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> +{
> +
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> +	return 0;
> +}
> +
> +static struct eth_dev_ops ops = {
> +	.dev_start = eth_dev_start,
> +	.dev_stop = eth_dev_stop,
> +	.dev_close = eth_dev_close,
> +	.dev_configure = eth_dev_configure,
> +	.dev_infos_get = eth_dev_info,
> +	.rx_queue_setup = eth_rx_queue_setup,
> +	.tx_queue_setup = eth_tx_queue_setup,
> +	.rx_queue_release = eth_queue_release,
> +	.tx_queue_release = eth_queue_release,
> +	.link_update = eth_link_update,
> +	.stats_get = eth_stats_get,
> +	.stats_reset = eth_stats_reset,
> +};
> +
> +/*
> + * Opens an AF_PACKET socket
> + */
> +static int
> +open_packet_iface(const char *key __rte_unused,
> +                  const char *value __rte_unused,
> +                  void *extra_args)
> +{
> +	int *sockfd = extra_args;
> +
> +	/* Open an AF_PACKET socket... */
> +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +	if (*sockfd == -1) {
> +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +rte_pmd_init_internals(const char *name,
> +                       const int sockfd,
> +                       const unsigned nb_queues,
> +                       unsigned int blocksize,
> +                       unsigned int blockcnt,
> +                       unsigned int framesize,
> +                       unsigned int framecnt,
> +                       const unsigned numa_node,
> +                       struct pmd_internals **internals,
> +                       struct rte_eth_dev **eth_dev,
> +                       struct rte_kvargs *kvlist)
> +{
> +	struct rte_eth_dev_data *data = NULL;
> +	struct rte_pci_device *pci_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	struct ifreq ifr;
> +	size_t ifnamelen;
> +	unsigned k_idx;
> +	struct sockaddr_ll sockaddr;
> +	struct tpacket_req *req;
> +	struct pkt_rx_queue *rx_queue;
> +	struct pkt_tx_queue *tx_queue;
> +	int rc, tpver, discard, bypass;
> +	unsigned int i, q, rdsize;
> +	int qsockfd, fanout_arg;
> +
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> +			break;
> +	}
> +	if (pair == NULL) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: no interface specified for AF_PACKET ethdev\n",
> +		        name);
> +		goto error;
> +	}
> +
> +	RTE_LOG(INFO, PMD,
> +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> +		name, numa_node);
> +
> +	/*
> +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> +	 * and internal (private) data
> +	 */
> +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> +	if (data == NULL)
> +		goto error;
> +
> +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> +	if (pci_dev == NULL)
> +		goto error;
> +
> +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> +	                                0, numa_node);
> +	if (*internals == NULL)
> +		goto error;
> +
> +	req = &((*internals)->req);
> +
> +	req->tp_block_size = blocksize;
> +	req->tp_block_nr = blockcnt;
> +	req->tp_frame_size = framesize;
> +	req->tp_frame_nr = framecnt;
> +
> +	ifnamelen = strlen(pair->value);
> +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> +		ifr.ifr_name[ifnamelen]='\0';
> +	} else {
> +		RTE_LOG(ERR, PMD,
> +			"%s: I/F name too long (%s)\n",
> +			name, pair->value);
> +		goto error;
> +	}
> +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> +		        name);
> +		goto error;
> +	}
> +	(*internals)->if_index = ifr.ifr_ifindex;
> +
> +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> +		        name);
> +		goto error;
> +	}
> +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> +
> +	memset(&sockaddr, 0, sizeof(sockaddr));
> +	sockaddr.sll_family = AF_PACKET;
> +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> +	sockaddr.sll_ifindex = (*internals)->if_index;
> +
> +	fanout_arg = getpid() & 0xffff;
> +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> +
> +	for (q = 0; q < nb_queues; q++) {
> +		/* Open an AF_PACKET socket for this queue... */
> +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +		if (qsockfd == -1) {
> +			RTE_LOG(ERR, PMD,
> +			        "%s: could not open AF_PACKET socket\n",
> +			        name);
> +			return -1;
> +		}
> +
> +		tpver = TPACKET_V2;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> +				&tpver, sizeof(tpver));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_VERSION on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		discard = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> +				&discard, sizeof(discard));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_LOSS on "
> +			        "AF_PACKET socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		bypass = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> +				&bypass, sizeof(bypass));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_QDISC_BYPASS "
> +			        "on AF_PACKET socket for %s\n", name,
> +			        pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rx_queue = &((*internals)->rx_queue[q]);
> +		rx_queue->framecount = req->tp_frame_nr;
> +
> +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> +				    qsockfd, 0);
> +		if (rx_queue->map == MAP_FAILED) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> +				name, pair->value);
> +			goto error;
> +		}
> +
> +		/* rdsize is same for both Tx and Rx */
> +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> +
> +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		rx_queue->sockfd = qsockfd;
> +
> +		tx_queue = &((*internals)->tx_queue[q]);
> +		tx_queue->framecount = req->tp_frame_nr;
> +
> +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> +
> +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		tx_queue->sockfd = qsockfd;
> +
> +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not bind AF_PACKET socket to %s\n",
> +			        name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> +				&fanout_arg, sizeof(fanout_arg));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> +				"for %s\n", name, pair->value);
> +			goto error;
> +		}
> +	}
> +
> +	/* reserve an ethdev entry */
> +	*eth_dev = rte_eth_dev_allocate(name);
> +	if (*eth_dev == NULL)
> +		goto error;
> +
> +	/*
> +	 * now put it all together
> +	 * - store queue data in internals,
> +	 * - store numa_node info in pci_driver
> +	 * - point eth_dev_data to internals and pci_driver
> +	 * - and point eth_dev structure to new eth_dev_data structure
> +	 */
> +
> +	(*internals)->nb_queues = nb_queues;
> +
> +	data->dev_private = *internals;
> +	data->port_id = (*eth_dev)->data->port_id;
> +	data->nb_rx_queues = (uint16_t)nb_queues;
> +	data->nb_tx_queues = (uint16_t)nb_queues;
> +	data->dev_link = pmd_link;
> +	data->mac_addrs = &(*internals)->eth_addr;
> +
> +	pci_dev->numa_node = numa_node;
> +
> +	(*eth_dev)->data = data;
> +	(*eth_dev)->dev_ops = &ops;
> +	(*eth_dev)->pci_dev = pci_dev;
> +
> +	return 0;
> +
> +error:
> +	if (data)
> +		rte_free(data);
> +	if (pci_dev)
> +		rte_free(pci_dev);
> +	for (q = 0; q < nb_queues; q++) {
> +		if ((*internals)->rx_queue[q].rd)
> +			rte_free((*internals)->rx_queue[q].rd);
> +		if ((*internals)->tx_queue[q].rd)
> +			rte_free((*internals)->tx_queue[q].rd);
> +	}
> +	if (*internals)
> +		rte_free(*internals);
> +	return -1;
> +}
> +
> +static int
> +rte_eth_from_packet(const char *name,
> +                    int const *sockfd,
> +                    const unsigned numa_node,
> +                    struct rte_kvargs *kvlist)
> +{
> +	struct pmd_internals *internals = NULL;
> +	struct rte_eth_dev *eth_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	unsigned k_idx;
> +	unsigned int blockcount;
> +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> +	unsigned int framesize = DFLT_FRAME_SIZE;
> +	unsigned int framecount = DFLT_FRAME_COUNT;
> +	unsigned int qpairs = RTE_PMD_PACKET_MAX_RINGS;
> +
> +	/* do some parameter checking */
> +	if (*sockfd < 0)
> +		return -1;
> +
> +	/*
> +	 * Walk arguments for configurable settings
> +	 */
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> +			qpairs = atoi(pair->value);
> +			if (qpairs < 1 ||
> +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid qpairs value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> +			blocksize = atoi(pair->value);
> +			if (!blocksize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid blocksize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> +			framesize = atoi(pair->value);
> +			if (!framesize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framesize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> +			framecount = atoi(pair->value);
> +			if (!framecount) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framecount value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +	}
> +
> +	if (framesize > blocksize) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> +		        name);
> +		return -1;
> +	}
> +
> +	blockcount = framecount / (blocksize / framesize);
> +	if (!blockcount) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> +		return -1;
> +	}
> +
> +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> +
> +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> +	                           blocksize, blockcount,
> +	                           framesize, framecount,
> +	                           numa_node, &internals, &eth_dev,
> +	                           kvlist) < 0)
> +		return -1;
> +
> +	eth_dev->rx_pkt_burst = eth_packet_rx;
> +	eth_dev->tx_pkt_burst = eth_packet_tx;
> +
> +	return 0;
> +}
> +
> +int
> +rte_pmd_packet_devinit(const char *name, const char *params)
> +{
> +	unsigned numa_node;
> +	int ret;
> +	struct rte_kvargs *kvlist;
> +	int sockfd = -1;
> +
> +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> +
> +	numa_node = rte_socket_id();
> +
> +	kvlist = rte_kvargs_parse(params, valid_arguments);
> +	if (kvlist == NULL)
> +		return -1;
> +
> +	/*
> +	 * If iface argument is passed we open the NICs and use them for
> +	 * reading / writing
> +	 */
> +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> +
> +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> +		                         &open_packet_iface, &sockfd);
> +		if (ret < 0)
> +			return -1;
> +	}
> +
> +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> +	close(sockfd); /* no longer needed */
> +
> +	if (ret < 0)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static struct rte_driver pmd_packet_drv = {
> +	.name = "eth_packet",
> +	.type = PMD_VDEV,
> +	.init = rte_pmd_packet_devinit,
> +};
> +
> +PMD_REGISTER_DRIVER(pmd_packet_drv);
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> new file mode 100644
> index 000000000000..f685611da3e9
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> @@ -0,0 +1,55 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_ETH_PACKET_H_
> +#define _RTE_ETH_PACKET_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> +
> +#define RTE_PMD_PACKET_MAX_RINGS 16
> +
> +/**
> + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> + * configured on command line.
> + */
> +int rte_pmd_packet_devinit(const char *name, const char *params);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 34dff2a02a05..a6994c4dbe93 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
>  LDLIBS += -lrte_pmd_pcap -lpcap
>  endif
>  
> +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> +LDLIBS += -lrte_pmd_packet
> +endif
> +
>  endif # plugins
>  
>  LDLIBS += $(EXECENV_LDLIBS)
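
For readers following the argument handling quoted above: a minimal sketch, not the patch's code, of how the ring geometry could be validated. It assumes stricter parsing than atoi() (strtol() with a trailing-junk and sign check, so a value such as "-8192" cannot reach the division) and it assumes the frame count should be a whole multiple of the frames per block. The helper names parse_positive() and ring_block_count() are hypothetical.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a strictly positive decimal integer.
 * Returns 0 on success, -1 on empty input, trailing junk, overflow
 * or a value <= 0. */
static int
parse_positive(const char *s, unsigned int *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(s, &end, 10);
	if (end == s || *end != '\0' || errno == ERANGE ||
	    v <= 0 || v > INT_MAX)
		return -1;
	*out = (unsigned int)v;
	return 0;
}

/* Hypothetical helper mirroring the arithmetic in the quoted code:
 * blockcount = framecount / (blocksize / framesize).
 * The multiple-of check is an added assumption, not in the patch. */
static int
ring_block_count(unsigned int blocksize, unsigned int framesize,
		 unsigned int framecount, unsigned int *blockcount)
{
	unsigned int frames_per_block;

	if (framesize == 0 || framesize > blocksize)
		return -1;
	frames_per_block = blocksize / framesize; /* at least 1 here */
	if (framecount % frames_per_block != 0)
		return -1;
	*blockcount = framecount / frames_per_block;
	return *blockcount ? 0 : -1;
}
```

With the defaults from the commit message (block size 4096, frame size 2048, frame count 512) this gives 2 frames per block and 256 blocks.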

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-10 20:32 [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices John W. Linville
  2014-07-11 13:11 ` Stephen Hemminger
@ 2014-07-11 13:26 ` Thomas Monjalon
  2014-07-11 14:51   ` John W. Linville
       [not found] ` <D0158A423229094DA7ABF71CF2FA0DA3117D3A23@shsmsx102.ccr.corp.intel.com>
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 76+ messages in thread
From: Thomas Monjalon @ 2014-07-11 13:26 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

2014-07-10 16:32, John W. Linville:
> This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> socket.  This implementation uses mmap'ed ring buffers to limit copying
> and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> AF_PACKET is used for frame reception.  In the current implementation,
> Tx and Rx queues are always paired, and therefore are always equal
> in number -- changing this would be a Simple Matter Of Programming.
> 
> Interfaces of this type are created with a command line option like
> "--vdev=eth_packet0,iface=...".  There are a number of options available
> as arguments:
> 
>  - Interface is chosen by "iface" (required)
>  - Number of queue pairs set by "qpairs" (optional, default: 16)
>  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
>  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
>  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> 
> Signed-off-by: John W. Linville <linville@tuxdriver.com>
> ---
> This PMD is intended to provide a means for using DPDK on a broad
> range of hardware without hardware-specific PMDs and (hopefully)
> with better performance than what PCAP offers in Linux.  This might
> be useful as a development platform for DPDK applications when
> DPDK-supported hardware is expensive or unavailable.

Thank you for this nice work.

I think this PMD would be well suited to being hosted as an external one, so
that it also works with DPDK 1.7.0.

-- 
Thomas
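
For reference, the option list quoted above can be turned into a full EAL argument. A minimal sketch, assuming the comma-separated key=value form shown in the commit message; the helper name format_vdev_arg() and the interface name "eth0" in the usage note are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: build a --vdev argument from the option names
 * given in the commit message (iface, qpairs, blocksz, framesz,
 * framecnt). Returns 0 on success, -1 if buf is too small. */
static int
format_vdev_arg(char *buf, size_t len, unsigned int index,
		const char *iface, unsigned int qpairs,
		unsigned int blocksz, unsigned int framesz,
		unsigned int framecnt)
{
	int n = snprintf(buf, len,
			 "--vdev=eth_packet%u,iface=%s,qpairs=%u,"
			 "blocksz=%u,framesz=%u,framecnt=%u",
			 index, iface, qpairs, blocksz, framesz, framecnt);

	/* snprintf() reports the length it wanted; treat truncation
	 * as an error rather than passing a cut-off argument on. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Called with index 0, "eth0" and the documented defaults (16, 4096, 2048, 512), this spells out every optional argument explicitly; only "iface" is required.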

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 13:11 ` Stephen Hemminger
@ 2014-07-11 14:49   ` John W. Linville
  2014-07-11 15:06     ` Richardson, Bruce
  0 siblings, 1 reply; 76+ messages in thread
From: John W. Linville @ 2014-07-11 14:49 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Fri, Jul 11, 2014 at 06:11:47AM -0700, Stephen Hemminger wrote:
> On Thu, 10 Jul 2014 16:32:49 -0400
> "John W. Linville" <linville@tuxdriver.com> wrote:
> 
> > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > AF_PACKET is used for frame reception.  In the current implementation,
> > Tx and Rx queues are always paired, and therefore are always equal
> > in number -- changing this would be a Simple Matter Of Programming.
> > 
> > Interfaces of this type are created with a command line option like
> > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > as arguments:
> > 
> >  - Interface is chosen by "iface" (required)
> >  - Number of queue pairs set by "qpairs" (optional, default: 16)
> >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > 
> > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > ---
> > This PMD is intended to provide a means for using DPDK on a broad
> > range of hardware without hardware-specific PMDs and (hopefully)
> > with better performance than what PCAP offers in Linux.  This might
> > be useful as a development platform for DPDK applications when
> > DPDK-supported hardware is expensive or unavailable.
> > 
> >  config/common_bsdapp                   |   5 +
> >  config/common_linuxapp                 |   5 +
> >  lib/Makefile                           |   1 +
> >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> >  lib/librte_pmd_packet/Makefile         |  60 +++
> >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> >  mk/rte.app.mk                          |   4 +
> >  8 files changed, 957 insertions(+)
> >  create mode 100644 lib/librte_pmd_packet/Makefile
> >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > 
> > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > index 943dce8f1ede..c317f031278e 100644
> > --- a/config/common_bsdapp
> > +++ b/config/common_bsdapp
> > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> >  
> >  #
> > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > +#
> > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > +
> > +#
> >  # Do prefetch of packet data within PMD driver receive function
> >  #
> >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > --- a/config/common_linuxapp
> > +++ b/config/common_linuxapp
> > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> >  
> >  #
> > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > +#
> > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > +
> > +#
> >  # Compile Xen PMD
> >  #
> >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > diff --git a/lib/Makefile b/lib/Makefile
> > index 10c5bb3045bc..930fadf29898 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > index 756d6b0c9301..feed24a63272 100644
> > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> >  CFLAGS += $(WERROR_FLAGS) -O3
> >  
> > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > new file mode 100644
> > index 000000000000..e1266fb992cd
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/Makefile
> > @@ -0,0 +1,60 @@
> > +#   BSD LICENSE
> > +#
> > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > +#   Copyright(c) 2014 6WIND S.A.
> > +#   All rights reserved.
> > +#
> > +#   Redistribution and use in source and binary forms, with or without
> > +#   modification, are permitted provided that the following conditions
> > +#   are met:
> > +#
> > +#     * Redistributions of source code must retain the above copyright
> > +#       notice, this list of conditions and the following disclaimer.
> > +#     * Redistributions in binary form must reproduce the above copyright
> > +#       notice, this list of conditions and the following disclaimer in
> > +#       the documentation and/or other materials provided with the
> > +#       distribution.
> > +#     * Neither the name of Intel Corporation nor the names of its
> > +#       contributors may be used to endorse or promote products derived
> > +#       from this software without specific prior written permission.
> > +#
> > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > +
> > +include $(RTE_SDK)/mk/rte.vars.mk
> > +
> > +#
> > +# library name
> > +#
> > +LIB = librte_pmd_packet.a
> > +
> > +CFLAGS += -O3
> > +CFLAGS += $(WERROR_FLAGS)
> > +
> > +#
> > +# all source are stored in SRCS-y
> > +#
> > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > +
> > +#
> > +# Export include files
> > +#
> > +SYMLINK-y-include += rte_eth_packet.h
> > +
> > +# this lib depends upon:
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > +
> > +include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > new file mode 100644
> > index 000000000000..fceb6258aad6
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > @@ -0,0 +1,826 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > + *
> > + *   Originally based upon librte_pmd_pcap code:
> > + *
> > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > + *   Copyright(c) 2014 6WIND S.A.
> > + *   All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#include <rte_mbuf.h>
> > +#include <rte_ethdev.h>
> > +#include <rte_malloc.h>
> > +#include <rte_kvargs.h>
> > +#include <rte_dev.h>
> > +
> > +#include <linux/if_ether.h>
> > +#include <linux/if_packet.h>
> > +#include <arpa/inet.h>
> > +#include <net/if.h>
> > +#include <sys/types.h>
> > +#include <sys/socket.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +#include <poll.h>
> > +
> > +#include "rte_eth_packet.h"
> > +
> > +#define ETH_PACKET_IFACE_ARG		"iface"
> > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > +
> > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > +#define DFLT_FRAME_SIZE		(1 << 11)
> > +#define DFLT_FRAME_COUNT	(1 << 9)
> > +
> > +struct pkt_rx_queue {
> > +	int sockfd;
> > +
> > +	struct iovec *rd;
> > +	uint8_t *map;
> > +	unsigned int framecount;
> > +	unsigned int framenum;
> > +
> > +	struct rte_mempool *mb_pool;
> > +
> > +	volatile unsigned long rx_pkts;
> > +	volatile unsigned long err_pkts;
> 
> Use of volatile will generate slow code, don't think
> it is necessary, especially when only one CPU can use a queue
> at a time.

That is a good point, worth checking out.  FWIW, those lines are
boilerplate originally copied from the pcap PMD. :-)

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.
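
On the volatile point discussed above: a minimal sketch, not the patch's code, of per-queue counters kept as plain fields. It assumes, as Stephen notes, that only one CPU updates a given queue at a time. The struct rx_queue_stats and count_burst() names are hypothetical; only the field names rx_pkts and err_pkts come from the patch.

```c
/* Hypothetical, simplified stand-in for the statistics fields of
 * struct pkt_rx_queue, declared without the volatile qualifier. */
struct rx_queue_stats {
	unsigned long rx_pkts;
	unsigned long err_pkts;
};

/* Account for one burst of n frames. With plain fields the compiler
 * is free to keep the running totals in registers across the loop;
 * with volatile fields every increment must be a load and a store. */
static void
count_burst(struct rx_queue_stats *q, const int *frame_ok, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		if (frame_ok[i])
			q->rx_pkts++;
		else
			q->err_pkts++;
	}
}
```

A statistics reader running on another core may then see slightly stale values, which is usually acceptable for counters of this kind.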

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 13:26 ` Thomas Monjalon
@ 2014-07-11 14:51   ` John W. Linville
  2014-07-11 15:04     ` Thomas Monjalon
  0 siblings, 1 reply; 76+ messages in thread
From: John W. Linville @ 2014-07-11 14:51 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Fri, Jul 11, 2014 at 03:26:39PM +0200, Thomas Monjalon wrote:
> 2014-07-10 16:32, John W. Linville:
> > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > AF_PACKET is used for frame reception.  In the current implementation,
> > Tx and Rx queues are always paired, and therefore are always equal
> > in number -- changing this would be a Simple Matter Of Programming.
> > 
> > Interfaces of this type are created with a command line option like
> > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > as arguments:
> > 
> >  - Interface is chosen by "iface" (required)
> >  - Number of queue pairs set by "qpairs" (optional, default: 16)
> >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > 
> > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > ---
> > This PMD is intended to provide a means for using DPDK on a broad
> > range of hardware without hardware-specific PMDs and (hopefully)
> > with better performance than what PCAP offers in Linux.  This might
> > be useful as a development platform for DPDK applications when
> > DPDK-supported hardware is expensive or unavailable.
> 
> Thank you for this nice work.
> 
> > I think this PMD would be well suited to being hosted as an external one, so
> > that it also works with DPDK 1.7.0.

I'm not sure I understand the suggestion -- you don't want to merge
the driver for 1.8?  Or you just want to host this patch somewhere,
so people can still use it w/ 1.7?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 14:51   ` John W. Linville
@ 2014-07-11 15:04     ` Thomas Monjalon
  2014-07-11 15:30       ` John W. Linville
  0 siblings, 1 reply; 76+ messages in thread
From: Thomas Monjalon @ 2014-07-11 15:04 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

2014-07-11 10:51, John W. Linville:
> On Fri, Jul 11, 2014 at 03:26:39PM +0200, Thomas Monjalon wrote:
> > 2014-07-10 16:32, John W. Linville:
> > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > AF_PACKET is used for frame reception.  In the current implementation,
> > > Tx and Rx queues are always paired, and therefore are always equal
> > > in number -- changing this would be a Simple Matter Of Programming.
> > > 
> > > Interfaces of this type are created with a command line option like
> > > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > > as arguments:
> > > 
> > >  - Interface is chosen by "iface" (required)
> > >  - Number of queue pairs set by "qpairs" (optional, default: 16)
> > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > > 
> > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > ---
> > > This PMD is intended to provide a means for using DPDK on a broad
> > > range of hardware without hardware-specific PMDs and (hopefully)
> > > with better performance than what PCAP offers in Linux.  This might
> > > be useful as a development platform for DPDK applications when
> > > DPDK-supported hardware is expensive or unavailable.
> > 
> > Thank you for this nice work.
> > 
> > > I think this PMD would be well suited to being hosted as an external one,
> > > so that it also works with DPDK 1.7.0.
> 
> I'm not sure I understand the suggestion -- you don't want to merge
> the driver for 1.8?  Or you just want to host this patch somewhere,
> so people can still use it w/ 1.7?

I suggest having a separate repository here:
	http://dpdk.org/browse/

-- 
Thomas

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 14:49   ` John W. Linville
@ 2014-07-11 15:06     ` Richardson, Bruce
  2014-07-11 15:16       ` Stephen Hemminger
  2014-07-11 15:29       ` Venkatesan, Venky
  0 siblings, 2 replies; 76+ messages in thread
From: Richardson, Bruce @ 2014-07-11 15:06 UTC (permalink / raw)
  To: John W. Linville, Stephen Hemminger; +Cc: dev

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> Sent: Friday, July 11, 2014 7:49 AM
> To: Stephen Hemminger
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> 
> On Fri, Jul 11, 2014 at 06:11:47AM -0700, Stephen Hemminger wrote:
> > On Thu, 10 Jul 2014 16:32:49 -0400
> > "John W. Linville" <linville@tuxdriver.com> wrote:
> >
> > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > AF_PACKET is used for frame reception.  In the current implementation,
> > > Tx and Rx queues are always paired, and therefore are always equal
> > > in number -- changing this would be a Simple Matter Of Programming.
> > >
> > > Interfaces of this type are created with a command line option like
> > > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > > as arguments:
> > >
> > >  - Interface is chosen by "iface" (required)
> > >  - Number of queue pairs set by "qpairs" (optional, default: 16)
> > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > >
> > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > ---
> > > This PMD is intended to provide a means for using DPDK on a broad
> > > range of hardware without hardware-specific PMDs and (hopefully)
> > > with better performance than what PCAP offers in Linux.  This might
> > > be useful as a development platform for DPDK applications when
> > > DPDK-supported hardware is expensive or unavailable.
> > >
> > >  config/common_bsdapp                   |   5 +
> > >  config/common_linuxapp                 |   5 +
> > >  lib/Makefile                           |   1 +
> > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > >  mk/rte.app.mk                          |   4 +
> > >  8 files changed, 957 insertions(+)
> > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > >
> > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > index 943dce8f1ede..c317f031278e 100644
> > > --- a/config/common_bsdapp
> > > +++ b/config/common_bsdapp
> > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > >
> > >  #
> > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > +#
> > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > +
> > > +#
> > >  # Do prefetch of packet data within PMD driver receive function
> > >  #
> > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > --- a/config/common_linuxapp
> > > +++ b/config/common_linuxapp
> > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > >
> > >  #
> > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > +#
> > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > +
> > > +#
> > >  # Compile Xen PMD
> > >  #
> > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > diff --git a/lib/Makefile b/lib/Makefile
> > > index 10c5bb3045bc..930fadf29898 100644
> > > --- a/lib/Makefile
> > > +++ b/lib/Makefile
> > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > index 756d6b0c9301..feed24a63272 100644
> > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > >  CFLAGS += $(WERROR_FLAGS) -O3
> > >
> > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > new file mode 100644
> > > index 000000000000..e1266fb992cd
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/Makefile
> > > @@ -0,0 +1,60 @@
> > > +#   BSD LICENSE
> > > +#
> > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > +#   Copyright(c) 2014 6WIND S.A.
> > > +#   All rights reserved.
> > > +#
> > > +#   Redistribution and use in source and binary forms, with or without
> > > +#   modification, are permitted provided that the following conditions
> > > +#   are met:
> > > +#
> > > +#     * Redistributions of source code must retain the above copyright
> > > +#       notice, this list of conditions and the following disclaimer.
> > > +#     * Redistributions in binary form must reproduce the above copyright
> > > +#       notice, this list of conditions and the following disclaimer in
> > > +#       the documentation and/or other materials provided with the
> > > +#       distribution.
> > > +#     * Neither the name of Intel Corporation nor the names of its
> > > +#       contributors may be used to endorse or promote products derived
> > > +#       from this software without specific prior written permission.
> > > +#
> > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > +
> > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > +
> > > +#
> > > +# library name
> > > +#
> > > +LIB = librte_pmd_packet.a
> > > +
> > > +CFLAGS += -O3
> > > +CFLAGS += $(WERROR_FLAGS)
> > > +
> > > +#
> > > +# all source are stored in SRCS-y
> > > +#
> > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > +
> > > +#
> > > +# Export include files
> > > +#
> > > +SYMLINK-y-include += rte_eth_packet.h
> > > +
> > > +# this lib depends upon:
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > +
> > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > new file mode 100644
> > > index 000000000000..fceb6258aad6
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > @@ -0,0 +1,826 @@
> > > +/*-
> > > + *   BSD LICENSE
> > > + *
> > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > + *
> > > + *   Originally based upon librte_pmd_pcap code:
> > > + *
> > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > + *   Copyright(c) 2014 6WIND S.A.
> > > + *   All rights reserved.
> > > + *
> > > + *   Redistribution and use in source and binary forms, with or without
> > > + *   modification, are permitted provided that the following conditions
> > > + *   are met:
> > > + *
> > > + *     * Redistributions of source code must retain the above copyright
> > > + *       notice, this list of conditions and the following disclaimer.
> > > + *     * Redistributions in binary form must reproduce the above copyright
> > > + *       notice, this list of conditions and the following disclaimer in
> > > + *       the documentation and/or other materials provided with the
> > > + *       distribution.
> > > + *     * Neither the name of Intel Corporation nor the names of its
> > > + *       contributors may be used to endorse or promote products derived
> > > + *       from this software without specific prior written permission.
> > > + *
> > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> CONTRIBUTORS
> > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT
> NOT
> > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
> FITNESS FOR
> > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> COPYRIGHT
> > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
> INCIDENTAL,
> > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
> BUT NOT
> > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
> LOSS OF USE,
> > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
> AND ON ANY
> > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
> TORT
> > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
> OF THE USE
> > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> DAMAGE.
> > > + */
> > > +
> > > +#include <rte_mbuf.h>
> > > +#include <rte_ethdev.h>
> > > +#include <rte_malloc.h>
> > > +#include <rte_kvargs.h>
> > > +#include <rte_dev.h>
> > > +
> > > +#include <linux/if_ether.h>
> > > +#include <linux/if_packet.h>
> > > +#include <arpa/inet.h>
> > > +#include <net/if.h>
> > > +#include <sys/types.h>
> > > +#include <sys/socket.h>
> > > +#include <sys/ioctl.h>
> > > +#include <sys/mman.h>
> > > +#include <unistd.h>
> > > +#include <poll.h>
> > > +
> > > +#include "rte_eth_packet.h"
> > > +
> > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > +
> > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > +
> > > +struct pkt_rx_queue {
> > > +	int sockfd;
> > > +
> > > +	struct iovec *rd;
> > > +	uint8_t *map;
> > > +	unsigned int framecount;
> > > +	unsigned int framenum;
> > > +
> > > +	struct rte_mempool *mb_pool;
> > > +
> > > +	volatile unsigned long rx_pkts;
> > > +	volatile unsigned long err_pkts;
> >
> > Use of volatile will generate slow code, don't think
> > it is necessary, especially when only one CPU can use a queue
> > at a time.
> 
> That is a good point, worth checking out.  FWIW, those lines are
> boilerplate originally copied from the pcap PMD. :-)
> 


Yes, I agree it's worth checking out if there is a performance impact, but if we assume that the stats for RX/TX are possibly going to be read by another core, they really should be volatile for correctness.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 15:06     ` Richardson, Bruce
@ 2014-07-11 15:16       ` Stephen Hemminger
  2014-07-11 16:07         ` Richardson, Bruce
  2014-07-11 15:29       ` Venkatesan, Venky
  1 sibling, 1 reply; 76+ messages in thread
From: Stephen Hemminger @ 2014-07-11 15:16 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

On Fri, 11 Jul 2014 15:06:25 +0000
"Richardson, Bruce" <bruce.richardson@intel.com> wrote:

> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > Sent: Friday, July 11, 2014 7:49 AM
> > To: Stephen Hemminger
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-
> > based virtual devices
> > 
> > On Fri, Jul 11, 2014 at 06:11:47AM -0700, Stephen Hemminger wrote:
> > > On Thu, 10 Jul 2014 16:32:49 -0400
> > > "John W. Linville" <linville@tuxdriver.com> wrote:
> > >

[snip]

> > > > +struct pkt_rx_queue {
> > > > +	int sockfd;
> > > > +
> > > > +	struct iovec *rd;
> > > > +	uint8_t *map;
> > > > +	unsigned int framecount;
> > > > +	unsigned int framenum;
> > > > +
> > > > +	struct rte_mempool *mb_pool;
> > > > +
> > > > +	volatile unsigned long rx_pkts;
> > > > +	volatile unsigned long err_pkts;
> > >
> > > Use of volatile will generate slow code, don't think
> > > it is necessary, especially when only one CPU can use a queue
> > > at a time.
> > 
> > That is a good point, worth checking out.  FWIW, those lines are
> > boilerplate originally copied from the pcap PMD. :-)
> > 
> 
> 
> Yes, I agree it's worth checking out if there is a performance impact, but if we assume that the stats for RX/TX are possibly going to be read by another core, they really should be volatile for correctness.

Since only one core does the update, that is not necessary. The add will
generate a valid value, and the reader will read a valid value.
Only if two CPUs were using the same queue would it be possible for two adds
to collide, but the DPDK queue documentation specifically says queues are not MP safe.
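
To make the single-writer argument concrete, here is a minimal sketch. It is not code from the patch. The struct and function names are hypothetical, and it uses C11 relaxed atomics in place of either a plain or a volatile field, which is an assumption about how one might write this portably. The owning core updates the counter with a separate relaxed load and store (not an atomic read-modify-write), which is only valid because no other core ever writes it.

```c
#include <stdatomic.h>

/* Hypothetical per-queue stats, not the struct from the patch. */
struct queue_stats {
	_Atomic unsigned long rx_pkts;
};

/*
 * Called only by the single core that owns the queue.  A separate relaxed
 * load and store is enough here because there is exactly one writer; with
 * two writers, increments could be lost.
 */
static inline void
stats_add_rx(struct queue_stats *s, unsigned long n)
{
	unsigned long cur;

	cur = atomic_load_explicit(&s->rx_pkts, memory_order_relaxed);
	atomic_store_explicit(&s->rx_pkts, cur + n, memory_order_relaxed);
}

/* May be called from any core; returns a whole (untorn) value. */
static inline unsigned long
stats_read_rx(struct queue_stats *s)
{
	return atomic_load_explicit(&s->rx_pkts, memory_order_relaxed);
}
```

A reader on another core may see a slightly stale count, but never a torn one. On x86-64 these relaxed accesses typically compile to ordinary loads and stores, so the cost should be close to that of a plain field.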

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 15:06     ` Richardson, Bruce
  2014-07-11 15:16       ` Stephen Hemminger
@ 2014-07-11 15:29       ` Venkatesan, Venky
  2014-07-11 15:33         ` John W. Linville
  1 sibling, 1 reply; 76+ messages in thread
From: Venkatesan, Venky @ 2014-07-11 15:29 UTC (permalink / raw)
  To: Richardson, Bruce, John W. Linville, Stephen Hemminger; +Cc: dev

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> Sent: Friday, July 11, 2014 7:49 AM
> To: Stephen Hemminger
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for 
> AF_PACKET- based virtual devices
> 
> On Fri, Jul 11, 2014 at 06:11:47AM -0700, Stephen Hemminger wrote:
> > On Thu, 10 Jul 2014 16:32:49 -0400
> > "John W. Linville" <linville@tuxdriver.com> wrote:
> >

[snip]

> > > +struct pkt_rx_queue {
> > > +	int sockfd;
> > > +
> > > +	struct iovec *rd;
> > > +	uint8_t *map;
> > > +	unsigned int framecount;
> > > +	unsigned int framenum;
> > > +
> > > +	struct rte_mempool *mb_pool;
> > > +
> > > +	volatile unsigned long rx_pkts;
> > > +	volatile unsigned long err_pkts;
> >
> > Use of volatile will generate slow code, don't think it is 
> > necessary, especially when only one CPU can use a queue at a time.
> 
> That is a good point, worth checking out.  FWIW, those lines are 
> boilerplate originally copied from the pcap PMD. :-)
> 

> Yes, I agree it's worth checking out if there is a performance impact, but if we assume that the stats for RX/TX are possibly going to be read by another core, they really should be volatile for correctness

Accessing the rx_queue structure directly for stats is unlikely to happen from a second core; we should probably change the PCAP PMD as well (thanks for pointing that out, John).

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 15:04     ` Thomas Monjalon
@ 2014-07-11 15:30       ` John W. Linville
  2014-07-11 16:47         ` Thomas Monjalon
  0 siblings, 1 reply; 76+ messages in thread
From: John W. Linville @ 2014-07-11 15:30 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Fri, Jul 11, 2014 at 05:04:04PM +0200, Thomas Monjalon wrote:
> 2014-07-11 10:51, John W. Linville:
> > On Fri, Jul 11, 2014 at 03:26:39PM +0200, Thomas Monjalon wrote:
> > > 2014-07-10 16:32, John W. Linville:
> > > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > > AF_PACKET is used for frame reception.  In the current implementation,
> > > > Tx and Rx queues are always paired, and therefore are always equal
> > > > in number -- changing this would be a Simple Matter Of Programming.
> > > > 
> > > > Interfaces of this type are created with a command line option like
> > > > "--vdev=eth_packet0,iface=...".  There are a number of options availabe
> > > > 
> > > > as arguments:
> > > >  - Interface is chosen by "iface" (required)
> > > >  - Number of queue pairs set by "qpairs" (optional, default: 16)
> > > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > > > 
> > > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > > ---
> > > > This PMD is intended to provide a means for using DPDK on a broad
> > > > range of hardware without hardware-specific PMDs and (hopefully)
> > > > with better performance than what PCAP offers in Linux.  This might
> > > > be useful as a development platform for DPDK applications when
> > > > DPDK-supported hardware is expensive or unavailable.
> > > 
> > > Thank you for this nice work.
> > > 
> > > I think it would be well suited to host this PMD as an external one in
> > > order to make it work also with DPDK 1.7.0.
> > 
> > I'm not sure I understand the suggestion -- you don't want to merge
> > the driver for 1.8?  Or you just want to host this patch somewhere,
> > so people can still use it w/ 1.7?
> 
> I suggest to have a separated repository here:
> 	http://dpdk.org/browse/

I really don't see any reason not to merge it.  It was already delayed
by me waiting for all the PMD init changes to settle out in the 1.6
release, and I still had to do a few touch-ups for it to compile on
1.7.  I definitely do not want to have to do that over and over again.

Why wouldn't you just merge it?  If someone wants to use it on 1.7,
they can just apply the patch.

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 15:29       ` Venkatesan, Venky
@ 2014-07-11 15:33         ` John W. Linville
  2014-07-11 16:29           ` Venkatesan, Venky
  0 siblings, 1 reply; 76+ messages in thread
From: John W. Linville @ 2014-07-11 15:33 UTC (permalink / raw)
  To: Venkatesan, Venky; +Cc: dev

On Fri, Jul 11, 2014 at 03:29:17PM +0000, Venkatesan, Venky wrote:
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > Sent: Friday, July 11, 2014 7:49 AM
> > To: Stephen Hemminger
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for 
> > AF_PACKET- based virtual devices
> > 
> > On Fri, Jul 11, 2014 at 06:11:47AM -0700, Stephen Hemminger wrote:
> > > On Thu, 10 Jul 2014 16:32:49 -0400
> > > "John W. Linville" <linville@tuxdriver.com> wrote:
> > >
> > > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET 

<snip>

> > > > +struct pkt_rx_queue {
> > > > +	int sockfd;
> > > > +
> > > > +	struct iovec *rd;
> > > > +	uint8_t *map;
> > > > +	unsigned int framecount;
> > > > +	unsigned int framenum;
> > > > +
> > > > +	struct rte_mempool *mb_pool;
> > > > +
> > > > +	volatile unsigned long rx_pkts;
> > > > +	volatile unsigned long err_pkts;
> > >
> > > Use of volatile will generate slow code, don't think it is 
> > > necessary, especially when only one CPU can use a queue at a time.
> > 
> > That is a good point, worth checking out.  FWIW, those lines are 
> > boilerplate originally copied from the pcap PMD. :-)
> > 
> 
> > Yes, I agree it's worth checking out if there is a performance impact, but if we assume that the stats for RX/TX are possibly going to be read by another core, they really should be volatile for correctness
> 
> Accessing the rx_queue structure directly for stats is unlikely to happen from a second core; we should probably change the PCAP PMD as well (thanks for pointing that out John). 

"Unlikely" doesn't sound completely safe... :-)

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 15:16       ` Stephen Hemminger
@ 2014-07-11 16:07         ` Richardson, Bruce
  0 siblings, 0 replies; 76+ messages in thread
From: Richardson, Bruce @ 2014-07-11 16:07 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, July 11, 2014 8:16 AM
> To: Richardson, Bruce
> Cc: John W. Linville; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-
> based virtual devices
> 
> On Fri, 11 Jul 2014 15:06:25 +0000
> "Richardson, Bruce" <bruce.richardson@intel.com> wrote:
> 
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > > Sent: Friday, July 11, 2014 7:49 AM
> > > To: Stephen Hemminger
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> AF_PACKET-
> > > based virtual devices
> > >
> > > On Fri, Jul 11, 2014 at 06:11:47AM -0700, Stephen Hemminger wrote:
> > > > On Thu, 10 Jul 2014 16:32:49 -0400
> > > > "John W. Linville" <linville@tuxdriver.com> wrote:
> > > >
> > > > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > > > AF_PACKET is used for frame reception.  In the current implementation,
> > > > > Tx and Rx queues are always paired, and therefore are always equal
> > > > > in number -- changing this would be a Simple Matter Of Programming.
> > > > >
> > > > > Interfaces of this type are created with a command line option like
> > > > > "--vdev=eth_packet0,iface=...".  There are a number of options availabe
> > > > > as arguments:
> > > > >
> > > > >  - Interface is chosen by "iface" (required)
> > > > >  - Number of queue pairs set by "qpairs" (optional, default: 16)
> > > > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > > > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > > > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default:
> 512)
> > > > >
> > > > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > > > ---
> > > > > This PMD is intended to provide a means for using DPDK on a broad
> > > > > range of hardware without hardware-specific PMDs and (hopefully)
> > > > > with better performance than what PCAP offers in Linux.  This might
> > > > > be useful as a development platform for DPDK applications when
> > > > > DPDK-supported hardware is expensive or unavailable.
> > > > >
> > > > >  config/common_bsdapp                   |   5 +
> > > > >  config/common_linuxapp                 |   5 +
> > > > >  lib/Makefile                           |   1 +
> > > > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > > > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > > > >  lib/librte_pmd_packet/rte_eth_packet.c | 826
> > > +++++++++++++++++++++++++++++++++
> > > > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > > > >  mk/rte.app.mk                          |   4 +
> > > > >  8 files changed, 957 insertions(+)
> > > > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > > > >
> > > > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > > > index 943dce8f1ede..c317f031278e 100644
> > > > > --- a/config/common_bsdapp
> > > > > +++ b/config/common_bsdapp
> > > > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > >
> > > > >  #
> > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > +#
> > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > > > +
> > > > > +#
> > > > >  # Do prefetch of packet data within PMD driver receive function
> > > > >  #
> > > > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > > > --- a/config/common_linuxapp
> > > > > +++ b/config/common_linuxapp
> > > > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > >
> > > > >  #
> > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > +#
> > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > > > +
> > > > > +#
> > > > >  # Compile Xen PMD
> > > > >  #
> > > > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > > index 10c5bb3045bc..930fadf29898 100644
> > > > > --- a/lib/Makefile
> > > > > +++ b/lib/Makefile
> > > > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) +=
> > > librte_pmd_i40e
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile
> > > b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > index 756d6b0c9301..feed24a63272 100644
> > > > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > > > >  CFLAGS += $(WERROR_FLAGS) -O3
> > > > >
> > > > > diff --git a/lib/librte_pmd_packet/Makefile
> > > b/lib/librte_pmd_packet/Makefile
> > > > > new file mode 100644
> > > > > index 000000000000..e1266fb992cd
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_pmd_packet/Makefile
> > > > > @@ -0,0 +1,60 @@
> > > > > +#   BSD LICENSE
> > > > > +#
> > > > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > +#   Copyright(c) 2014 6WIND S.A.
> > > > > +#   All rights reserved.
> > > > > +#
> > > > > +#   Redistribution and use in source and binary forms, with or without
> > > > > +#   modification, are permitted provided that the following conditions
> > > > > +#   are met:
> > > > > +#
> > > > > +#     * Redistributions of source code must retain the above copyright
> > > > > +#       notice, this list of conditions and the following disclaimer.
> > > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > > +#       notice, this list of conditions and the following disclaimer in
> > > > > +#       the documentation and/or other materials provided with the
> > > > > +#       distribution.
> > > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > > +#       contributors may be used to endorse or promote products derived
> > > > > +#       from this software without specific prior written permission.
> > > > > +#
> > > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> > > CONTRIBUTORS
> > > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
> BUT
> > > NOT
> > > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
> > > FITNESS FOR
> > > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> > > COPYRIGHT
> > > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
> > > INCIDENTAL,
> > > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
> BUT
> > > NOT
> > > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
> > > LOSS OF USE,
> > > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
> CAUSED
> > > AND ON ANY
> > > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
> OR
> > > TORT
> > > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> OUT
> > > OF THE USE
> > > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> > > DAMAGE.
> > > > > +
> > > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > > +
> > > > > +#
> > > > > +# library name
> > > > > +#
> > > > > +LIB = librte_pmd_packet.a
> > > > > +
> > > > > +CFLAGS += -O3
> > > > > +CFLAGS += $(WERROR_FLAGS)
> > > > > +
> > > > > +#
> > > > > +# all source are stored in SRCS-y
> > > > > +#
> > > > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > > > +
> > > > > +#
> > > > > +# Export include files
> > > > > +#
> > > > > +SYMLINK-y-include += rte_eth_packet.h
> > > > > +
> > > > > +# this lib depends upon:
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > > > +
> > > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c
> > > b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > new file mode 100644
> > > > > index 000000000000..fceb6258aad6
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > @@ -0,0 +1,826 @@
> > > > > +/*-
> > > > > + *   BSD LICENSE
> > > > > + *
> > > > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > > > + *
> > > > > + *   Originally based upon librte_pmd_pcap code:
> > > > > + *
> > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > + *   Copyright(c) 2014 6WIND S.A.
> > > > > + *   All rights reserved.
> > > > > + *
> > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > + *   modification, are permitted provided that the following conditions
> > > > > + *   are met:
> > > > > + *
> > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > + *     * Redistributions in binary form must reproduce the above
> copyright
> > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > + *       the documentation and/or other materials provided with the
> > > > > + *       distribution.
> > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > + *       contributors may be used to endorse or promote products derived
> > > > > + *       from this software without specific prior written permission.
> > > > > + *
> > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> > > CONTRIBUTORS
> > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
> BUT
> > > NOT
> > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
> > > FITNESS FOR
> > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> > > COPYRIGHT
> > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
> > > INCIDENTAL,
> > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
> > > BUT NOT
> > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
> > > LOSS OF USE,
> > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
> CAUSED
> > > AND ON ANY
> > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
> OR
> > > TORT
> > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> OUT
> > > OF THE USE
> > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> > > DAMAGE.
> > > > > + */
> > > > > +
> > > > > +#include <rte_mbuf.h>
> > > > > +#include <rte_ethdev.h>
> > > > > +#include <rte_malloc.h>
> > > > > +#include <rte_kvargs.h>
> > > > > +#include <rte_dev.h>
> > > > > +
> > > > > +#include <linux/if_ether.h>
> > > > > +#include <linux/if_packet.h>
> > > > > +#include <arpa/inet.h>
> > > > > +#include <net/if.h>
> > > > > +#include <sys/types.h>
> > > > > +#include <sys/socket.h>
> > > > > +#include <sys/ioctl.h>
> > > > > +#include <sys/mman.h>
> > > > > +#include <unistd.h>
> > > > > +#include <poll.h>
> > > > > +
> > > > > +#include "rte_eth_packet.h"
> > > > > +
> > > > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > > > +
> > > > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > > > +
> > > > > +struct pkt_rx_queue {
> > > > > +	int sockfd;
> > > > > +
> > > > > +	struct iovec *rd;
> > > > > +	uint8_t *map;
> > > > > +	unsigned int framecount;
> > > > > +	unsigned int framenum;
> > > > > +
> > > > > +	struct rte_mempool *mb_pool;
> > > > > +
> > > > > +	volatile unsigned long rx_pkts;
> > > > > +	volatile unsigned long err_pkts;
> > > >
> > > > Use of volatile will generate slow code, don't think
> > > > it is necessary, especially when only one CPU can use a queue
> > > > at a time.
> > >
> > > That is a good point, worth checking out.  FWIW, those lines are
> > > boilerplate originally copied from the pcap PMD. :-)
> > >
> >
> >
> > Yes, I agree it's worth checking out if there is a performance impact, but if we
> assume that the stats for RX/TX are possibly going to be read by another core,
> they really should be volatile for correctness.
> 
> Since only one core does update, that is not necessary. add will generate
> valid value. and reader will read a valid value.
> Only if two cpu's are using same queue would it be possible to for two add's
> to collide; but DPDK queue documentation specifically says queue's are not MP
> safe.

AFAIK colliding adds can occur whether or not the counters are volatile, unless atomic operations are explicitly used. The volatile would just be a sanity check to ensure the value isn't cached in registers on either the reader or the writer core, so it's strictly necessary, but I would also expect it to have minimal to no performance impact. The value should be written to memory anyway even without volatile (though there is no guarantee of this), and I would hope the additional compiler ordering constraints imposed by volatile don't affect things much.
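
As a rough illustration of the distinction drawn above (volatile only constrains the compiler, while colliding updates need atomic operations), here is a minimal sketch using C11 atomics. It is not code from the patch: struct demo_queue_stats, demo_count_rx() and demo_read_rx() are hypothetical names, and it assumes one writer core per queue and a toolchain that provides <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue counters; not the structure from the patch. */
struct demo_queue_stats {
	_Atomic uint64_t rx_pkts;
	_Atomic uint64_t err_pkts;
};

/* Datapath side: assumes exactly one core updates a given queue. */
static inline void
demo_count_rx(struct demo_queue_stats *s, uint64_t n)
{
	assert(s != NULL);
	/*
	 * With a single writer, a relaxed load followed by a relaxed store
	 * is enough; no locked read-modify-write is needed. Two writers on
	 * the same queue could still lose updates with this pattern.
	 */
	uint64_t cur = atomic_load_explicit(&s->rx_pkts, memory_order_relaxed);
	atomic_store_explicit(&s->rx_pkts, cur + n, memory_order_relaxed);
}

/* Stats side: may run on another core; the value may be slightly stale. */
static inline uint64_t
demo_read_rx(struct demo_queue_stats *s)
{
	assert(s != NULL);
	return atomic_load_explicit(&s->rx_pkts, memory_order_relaxed);
}
```

Relaxed atomics, like volatile, keep the counter from living only in a register, and they also make the cross-core read well defined. If two cores ever wrote the same queue, demo_count_rx() would need atomic_fetch_add_explicit() instead.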

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 15:33         ` John W. Linville
@ 2014-07-11 16:29           ` Venkatesan, Venky
  0 siblings, 0 replies; 76+ messages in thread
From: Venkatesan, Venky @ 2014-07-11 16:29 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

On Fri, Jul 11, 2014 at 03:29:17PM +0000, Venkatesan, Venky wrote:
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. 
> > Linville
> > Sent: Friday, July 11, 2014 7:49 AM
> > To: Stephen Hemminger
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > AF_PACKET- based virtual devices
> > 
> > On Fri, Jul 11, 2014 at 06:11:47AM -0700, Stephen Hemminger wrote:
> > > On Thu, 10 Jul 2014 16:32:49 -0400 "John W. Linville" 
> > > <linville@tuxdriver.com> wrote:
> > >
> > > > This is a Linux-specific virtual PMD driver backed by an 
> > > > AF_PACKET

<snip>

> > > > +struct pkt_rx_queue {
> > > > +	int sockfd;
> > > > +
> > > > +	struct iovec *rd;
> > > > +	uint8_t *map;
> > > > +	unsigned int framecount;
> > > > +	unsigned int framenum;
> > > > +
> > > > +	struct rte_mempool *mb_pool;
> > > > +
> > > > +	volatile unsigned long rx_pkts;
> > > > +	volatile unsigned long err_pkts;
> > >
> > > Use of volatile will generate slow code, don't think it is 
> > > necessary, especially when only one CPU can use a queue at a time.
> > 
> > That is a good point, worth checking out.  FWIW, those lines are 
> > boilerplate originally copied from the pcap PMD. :-)
> > 
> 
> > Yes, I agree it's worth checking out if there is a performance 
> > impact, but if we assume that the stats for RX/TX are possibly going 
> > to be read by another core, they really should be volatile for 
> > correctness
> 
> Accessing the rx_queue structure directly for stats is unlikely to happen from a second core; we should probably change the PCAP PMD as well (thanks for pointing that out John). 

> "Unlikely" doesn't sound completely safe... :-)

LOL. :-) This is an internal data structure, and the DPDK docs specifically mention that such structures are not multi-process safe/accessible. The "unlikely" was for people who don't read the docs... ;)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 15:30       ` John W. Linville
@ 2014-07-11 16:47         ` Thomas Monjalon
  2014-07-11 17:38           ` Richardson, Bruce
  2014-07-12 11:48           ` Neil Horman
  0 siblings, 2 replies; 76+ messages in thread
From: Thomas Monjalon @ 2014-07-11 16:47 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

2014-07-11 11:30, John W. Linville:
> On Fri, Jul 11, 2014 at 05:04:04PM +0200, Thomas Monjalon wrote:
> > 2014-07-11 10:51, John W. Linville:
> > > On Fri, Jul 11, 2014 at 03:26:39PM +0200, Thomas Monjalon wrote:
> > > > Thank you for this nice work.
> > > > 
> > > > I think it would be well suited to host this PMD as an external one in
> > > > order to make it work also with DPDK 1.7.0.
> > > 
> > > I'm not sure I understand the suggestion -- you don't want to merge
> > > the driver for 1.8?  Or you just want to host this patch somewhere,
> > > so people can still use it w/ 1.7?
> > 
> > I suggest to have a separated repository here:
> > 	http://dpdk.org/browse/
> 
> I really don't see any reason not to merge it.  It was already delayed
> by me waiting for all the PMD init changes to settle out in the 1.6
> release, and I still had to do a few touch-ups for it to compile on
> 1.7.  I definitely do not want to have to do that over and over again.

It's a pity that we didn't synchronize our efforts to get it integrated 
during the 1.7.0 cycle.

> Why wouldn't you just merge it?  If someone wants to use it on 1.7,
> they can just apply the patch.

I'm OK with merging it. I was only suggesting hosting your PMD externally, as we 
did for virtio-net-pmd, vmxnet3-usermap and memnic.
We had the same discussion about the vmxnet3 PMD that Stephen submitted.

I'm starting to think that nobody wants PMDs to be external. So we may merge this 
one into dpdk.git and start discussing what to do with the other ones:
	- move memnic into dpdk.git?
	- move virtio-net-pmd and vmxnet3-usermap to where their uio counterparts sit?
	- merge Brocade's vmxnet3 as a new one or as a replacement for vmxnet3-uio?

-- 
Thomas

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
       [not found] ` <D0158A423229094DA7ABF71CF2FA0DA3117D3A23@shsmsx102.ccr.corp.intel.com>
@ 2014-07-11 17:20   ` Zhou, Danny
  2014-07-11 17:40     ` John W. Linville
  0 siblings, 1 reply; 76+ messages in thread
From: Zhou, Danny @ 2014-07-11 17:20 UTC (permalink / raw)
  To: John W. Linville, dev

It looks like you used a fairly new kernel with socket options that older kernels, like my 3.12, do not support. When I tried this patch, it failed to build, with the compiler errors below. Which Linux distributions does this patch work on? How can we ensure it works with older kernels?

/home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c: In function rte_pmd_init_internals:
/home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:524:17: error: PACKET_FANOUT_FLAG_ROLLOVER undeclared (first use in this function)
/home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:524:17: note: each undeclared identifier is reported only once for each function it appears in
/home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:557:33: error: PACKET_QDISC_BYPASS undeclared (first use in this function)
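
One possible way to keep the build working against older kernel headers is to guard the two newer options with preprocessor checks. The following is only a sketch under that assumption, not the fix adopted for the patch: demo_fanout_arg() and demo_enable_qdisc_bypass() are hypothetical helpers, and the fallback behaviour (no rollover flag, qdisc bypass silently skipped) is a design choice made here for illustration.

```c
#include <assert.h>
#include <linux/if_packet.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: build the integer argument for the PACKET_FANOUT
 * socket option. The low 16 bits carry the group id, the high 16 bits
 * carry the fanout mode and any flags.
 */
static inline int
demo_fanout_arg(int group_id)
{
	assert(group_id >= 0 && group_id <= 0xffff);
	int arg = group_id | (PACKET_FANOUT_HASH << 16);
#if defined(PACKET_FANOUT_FLAG_ROLLOVER)
	/* Only request rollover when the headers know about the flag. */
	arg |= PACKET_FANOUT_FLAG_ROLLOVER << 16;
#endif
	return arg;
}

/*
 * Hypothetical helper: enable qdisc bypass where available. Returns the
 * setsockopt() result, or 0 when the option is missing at build time.
 */
static inline int
demo_enable_qdisc_bypass(int sockfd)
{
#if defined(PACKET_QDISC_BYPASS)
	int on = 1;
	return setsockopt(sockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
			  &on, sizeof(on));
#else
	(void)sockfd;	/* option not defined by these headers: skip it */
	return 0;
#endif
}
```

A compile-time guard only covers the headers used for the build. A binary built against newer headers could still see setsockopt() fail on an older running kernel, so the caller would have to decide whether such a failure is fatal.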
> 
> 
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> Sent: Thursday, July 10, 2014 1:33 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based
> virtual devices
> 
> This is a Linux-specific virtual PMD driver backed by an AF_PACKET socket.  This
> implementation uses mmap'ed ring buffers to limit copying and user/kernel
> transitions.  The PACKET_FANOUT_HASH behavior of AF_PACKET is used for
> frame reception.  In the current implementation, Tx and Rx queues are always paired,
> and therefore are always equal in number -- changing this would be a Simple Matter
> Of Programming.
> 
> Interfaces of this type are created with a command line option like
> "--vdev=eth_packet0,iface=...".  There are a number of options availabe as
> arguments:
> 
>  - Interface is chosen by "iface" (required)
>  - Number of queue pairs set by "qpairs" (optional, default: 16)
>  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
>  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
>  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> 
> Signed-off-by: John W. Linville <linville@tuxdriver.com>
> ---
> This PMD is intended to provide a means for using DPDK on a broad range of
> hardware without hardware-specific PMDs and (hopefully) with better performance
> than what PCAP offers in Linux.  This might be useful as a development platform for
> DPDK applications when DPDK-supported hardware is expensive or unavailable.
> 
>  config/common_bsdapp                   |   5 +
>  config/common_linuxapp                 |   5 +
>  lib/Makefile                           |   1 +
>  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
>  lib/librte_pmd_packet/Makefile         |  60 +++
>  lib/librte_pmd_packet/rte_eth_packet.c | 826
> +++++++++++++++++++++++++++++++++
> lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
>  mk/rte.app.mk                          |   4 +
>  8 files changed, 957 insertions(+)
>  create mode 100644 lib/librte_pmd_packet/Makefile  create mode 100644
> lib/librte_pmd_packet/rte_eth_packet.c
>  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> 
> diff --git a/config/common_bsdapp b/config/common_bsdapp index
> 943dce8f1ede..c317f031278e 100644
> --- a/config/common_bsdapp
> +++ b/config/common_bsdapp
> @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> CONFIG_RTE_LIBRTE_PMD_BOND=y
> 
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only) #
> +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> +
> +#
>  # Do prefetch of packet data within PMD driver receive function  #
> CONFIG_RTE_PMD_PACKET_PREFETCH=y diff --git a/config/common_linuxapp
> b/config/common_linuxapp index 7bf5d80d4e26..f9e7bc3015ec 100644
> --- a/config/common_linuxapp
> +++ b/config/common_linuxapp
> @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> CONFIG_RTE_LIBRTE_PMD_BOND=y
> 
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only) #
> +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> +
> +#
>  # Compile Xen PMD
>  #
>  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> diff --git a/lib/Makefile b/lib/Makefile index 10c5bb3045bc..930fadf29898 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) +=
> librte_pmd_i40e
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
>  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
>  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt diff --git
> a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> index 756d6b0c9301..feed24a63272 100644
> --- a/lib/librte_eal/linuxapp/eal/Makefile
> +++ b/lib/librte_eal/linuxapp/eal/Makefile
> @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether  CFLAGS +=
> -I$(RTE_SDK)/lib/librte_ivshmem  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
>  CFLAGS += $(WERROR_FLAGS) -O3
> 
> diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile new file
> mode 100644 index 000000000000..e1266fb992cd
> --- /dev/null
> +++ b/lib/librte_pmd_packet/Makefile
> @@ -0,0 +1,60 @@
> +#   BSD LICENSE
> +#
> +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> +#   Copyright(c) 2014 6WIND S.A.
> +#   All rights reserved.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Intel Corporation nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT
> NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
> FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
> INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
> NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
> LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
> AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
> THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> DAMAGE.
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +#
> +# library name
> +#
> +LIB = librte_pmd_packet.a
> +
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +
> +#
> +# all source are stored in SRCS-y
> +#
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> +
> +#
> +# Export include files
> +#
> +SYMLINK-y-include += rte_eth_packet.h
> +
> +# this lib depends upon:
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.c
> b/lib/librte_pmd_packet/rte_eth_packet.c
> new file mode 100644
> index 000000000000..fceb6258aad6
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> @@ -0,0 +1,826 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> + *
> + *   Originally based upon librte_pmd_pcap code:
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   Copyright(c) 2014 6WIND S.A.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT
> NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
> FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
> INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
> NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
> LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
> AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
> TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
> THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> DAMAGE.
> + */
> +
> +#include <rte_mbuf.h>
> +#include <rte_ethdev.h>
> +#include <rte_malloc.h>
> +#include <rte_kvargs.h>
> +#include <rte_dev.h>
> +
> +#include <linux/if_ether.h>
> +#include <linux/if_packet.h>
> +#include <arpa/inet.h>
> +#include <net/if.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +#include <poll.h>
> +
> +#include "rte_eth_packet.h"
> +
> +#define ETH_PACKET_IFACE_ARG		"iface"
> +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> +
> +#define DFLT_BLOCK_SIZE		(1 << 12)
> +#define DFLT_FRAME_SIZE		(1 << 11)
> +#define DFLT_FRAME_COUNT	(1 << 9)
> +
> +struct pkt_rx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	struct rte_mempool *mb_pool;
> +
> +	volatile unsigned long rx_pkts;
> +	volatile unsigned long err_pkts;
> +};
> +
> +struct pkt_tx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	volatile unsigned long tx_pkts;
> +	volatile unsigned long err_pkts;
> +};
> +
> +struct pmd_internals {
> +	unsigned nb_queues;
> +
> +	int if_index;
> +	struct ether_addr eth_addr;
> +
> +	struct tpacket_req req;
> +
> +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +};
> +
> +static const char *valid_arguments[] = {
> +	ETH_PACKET_IFACE_ARG,
> +	ETH_PACKET_NUM_Q_ARG,
> +	ETH_PACKET_BLOCKSIZE_ARG,
> +	ETH_PACKET_FRAMESIZE_ARG,
> +	ETH_PACKET_FRAMECOUNT_ARG,
> +	NULL
> +};
> +
> +static const char *drivername = "AF_PACKET PMD";
> +
> +static struct rte_eth_link pmd_link = {
> +	.link_speed = 10000,
> +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> +	.link_status = 0
> +};
> +
> +static uint16_t
> +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) {
> +	unsigned i;
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	struct pkt_rx_queue *pkt_q = queue;
> +	uint16_t num_rx = 0;
> +	unsigned int framecount, framenum;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	/*
> +	 * Reads the given number of packets from the AF_PACKET socket one by
> +	 * one and copies the packet data into a newly allocated mbuf.
> +	 */
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* point at the next incoming frame */
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> +			break;
> +
> +		/* allocate the next mbuf */
> +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> +		if (unlikely(mbuf == NULL))
> +			break;
> +
> +		/* packet will fit in the mbuf, go ahead and receive it */
> +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> +
> +		/* release incoming frame and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_KERNEL;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +
> +		/* account for the receive frame */
> +		bufs[i] = mbuf;
> +		num_rx++;
> +	}
> +	pkt_q->framenum = framenum;
> +	pkt_q->rx_pkts += num_rx;
> +	return num_rx;
> +}
> +
> +/*
> + * Callback to handle sending packets through a real NIC.
> + */
> +static uint16_t
> +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) {
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	unsigned int framecount, framenum;
> +	struct pollfd pfd;
> +	struct pkt_tx_queue *pkt_q = queue;
> +	uint16_t num_tx = 0;
> +	int i;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	memset(&pfd, 0, sizeof(pfd));
> +	pfd.fd = pkt_q->sockfd;
> +	pfd.events = POLLOUT;
> +	pfd.revents = 0;
> +
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* wait until the next Tx frame is available */
> +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> +		    (poll(&pfd, 1, -1) < 0))
> +				continue;
> +
> +		/* copy the tx frame data */
> +		mbuf = bufs[num_tx];
> +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> +			sizeof(struct sockaddr_ll);
> +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> +
> +		/* hand the frame to the kernel and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +
> +		num_tx++;
> +		rte_pktmbuf_free(mbuf);
> +	}
> +
> +	/* kick-off transmits */
> +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> +
> +	pkt_q->framenum = framenum;
> +	pkt_q->tx_pkts += num_tx;
> +	pkt_q->err_pkts += nb_pkts - num_tx;
> +	return num_tx;
> +}
> +
> +static int
> +eth_dev_start(struct rte_eth_dev *dev)
> +{
> +	dev->data->dev_link.link_status = 1;
> +	return 0;
> +}
> +
> +/*
> + * This function gets called when the current port gets stopped.
> + */
> +static void
> +eth_dev_stop(struct rte_eth_dev *dev)
> +{
> +	unsigned i;
> +	int sockfd;
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	for (i = 0; i < internals->nb_queues; i++) {
> +		sockfd = internals->rx_queue[i].sockfd;
> +		if(sockfd != -1)
> +			close(sockfd);
> +		sockfd = internals->tx_queue[i].sockfd;
> +		if(sockfd != -1)
> +			close(sockfd);
> +	}
> +
> +	dev->data->dev_link.link_status = 0;
> +}
> +
> +static int
> +eth_dev_configure(struct rte_eth_dev *dev __rte_unused) {
> +	return 0;
> +}
> +
> +static void
> +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) {
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev_info->driver_name = drivername;
> +	dev_info->if_index = internals->if_index;
> +	dev_info->max_mac_addrs = 1;
> +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->min_rx_bufsize = 0;
> +	dev_info->pci_dev = NULL;
> +}
> +
> +static void
> +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> +{
> +	unsigned i, imax;
> +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> +	const struct pmd_internals *internal = dev->data->dev_private;
> +
> +	memset(igb_stats, 0, sizeof(*igb_stats));
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> +		rx_total += igb_stats->q_ipackets[i];
> +	}
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> +		tx_total += igb_stats->q_opackets[i];
> +		tx_err_total += igb_stats->q_errors[i];
> +	}
> +
> +	igb_stats->ipackets = rx_total;
> +	igb_stats->opackets = tx_total;
> +	igb_stats->oerrors = tx_err_total;
> +}
> +
> +static void
> +eth_stats_reset(struct rte_eth_dev *dev) {
> +	unsigned i;
> +	struct pmd_internals *internal = dev->data->dev_private;
> +
> +	for (i = 0; i < internal->nb_queues; i++)
> +		internal->rx_queue[i].rx_pkts = 0;
> +
> +	for (i = 0; i < internal->nb_queues; i++) {
> +		internal->tx_queue[i].tx_pkts = 0;
> +		internal->tx_queue[i].err_pkts = 0;
> +	}
> +}
> +
> +static void
> +eth_dev_close(struct rte_eth_dev *dev __rte_unused) { }
> +
> +static void
> +eth_queue_release(void *q __rte_unused) { }
> +
> +static int
> +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> +                int wait_to_complete __rte_unused) {
> +	return 0;
> +}
> +
> +static int
> +eth_rx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t rx_queue_id,
> +                   uint16_t nb_rx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> +                   struct rte_mempool *mb_pool) {
> +	struct pmd_internals *internals = dev->data->dev_private;
> +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> +	struct rte_pktmbuf_pool_private *mbp_priv;
> +	uint16_t buf_size;
> +
> +	pkt_q->mb_pool = mb_pool;
> +
> +	/* Now get the space available for data in the mbuf */
> +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> +	                       RTE_PKTMBUF_HEADROOM);
> +
> +	if (ETH_FRAME_LEN > buf_size) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> +			dev->data->name, ETH_FRAME_LEN, buf_size);
> +		return -ENOMEM;
> +	}
> +
> +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> +
> +	return 0;
> +}
> +
> +static int
> +eth_tx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t tx_queue_id,
> +                   uint16_t nb_tx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_txconf *tx_conf __rte_unused) {
> +
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> +	return 0;
> +}
> +
> +static struct eth_dev_ops ops = {
> +	.dev_start = eth_dev_start,
> +	.dev_stop = eth_dev_stop,
> +	.dev_close = eth_dev_close,
> +	.dev_configure = eth_dev_configure,
> +	.dev_infos_get = eth_dev_info,
> +	.rx_queue_setup = eth_rx_queue_setup,
> +	.tx_queue_setup = eth_tx_queue_setup,
> +	.rx_queue_release = eth_queue_release,
> +	.tx_queue_release = eth_queue_release,
> +	.link_update = eth_link_update,
> +	.stats_get = eth_stats_get,
> +	.stats_reset = eth_stats_reset,
> +};
> +
> +/*
> + * Opens an AF_PACKET socket
> + */
> +static int
> +open_packet_iface(const char *key __rte_unused,
> +                  const char *value __rte_unused,
> +                  void *extra_args)
> +{
> +	int *sockfd = extra_args;
> +
> +	/* Open an AF_PACKET socket... */
> +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +	if (*sockfd == -1) {
> +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +rte_pmd_init_internals(const char *name,
> +                       const int sockfd,
> +                       const unsigned nb_queues,
> +                       unsigned int blocksize,
> +                       unsigned int blockcnt,
> +                       unsigned int framesize,
> +                       unsigned int framecnt,
> +                       const unsigned numa_node,
> +                       struct pmd_internals **internals,
> +                       struct rte_eth_dev **eth_dev,
> +                       struct rte_kvargs *kvlist) {
> +	struct rte_eth_dev_data *data = NULL;
> +	struct rte_pci_device *pci_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	struct ifreq ifr;
> +	size_t ifnamelen;
> +	unsigned k_idx;
> +	struct sockaddr_ll sockaddr;
> +	struct tpacket_req *req;
> +	struct pkt_rx_queue *rx_queue;
> +	struct pkt_tx_queue *tx_queue;
> +	int rc, tpver, discard, bypass;
> +	unsigned int i, q, rdsize;
> +	int qsockfd, fanout_arg;
> +
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> +			break;
> +	}
> +	if (pair == NULL) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: no interface specified for AF_PACKET ethdev\n",
> +		        name);
> +		goto error;
> +	}
> +
> +	RTE_LOG(INFO, PMD,
> +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> +		name, numa_node);
> +
> +	/*
> +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> +	 * and internal (private) data
> +	 */
> +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> +	if (data == NULL)
> +		goto error;
> +
> +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> +	if (pci_dev == NULL)
> +		goto error;
> +
> +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> +	                                0, numa_node);
> +	if (*internals == NULL)
> +		goto error;
> +
> +	req = &((*internals)->req);
> +
> +	req->tp_block_size = blocksize;
> +	req->tp_block_nr = blockcnt;
> +	req->tp_frame_size = framesize;
> +	req->tp_frame_nr = framecnt;
> +
> +	ifnamelen = strlen(pair->value);
> +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> +		ifr.ifr_name[ifnamelen]='\0';
> +	} else {
> +		RTE_LOG(ERR, PMD,
> +			"%s: I/F name too long (%s)\n",
> +			name, pair->value);
> +		goto error;
> +	}
> +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> +		        name);
> +		goto error;
> +	}
> +	(*internals)->if_index = ifr.ifr_ifindex;
> +
> +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> +		        name);
> +		goto error;
> +	}
> +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> +
> +	memset(&sockaddr, 0, sizeof(sockaddr));
> +	sockaddr.sll_family = AF_PACKET;
> +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> +	sockaddr.sll_ifindex = (*internals)->if_index;
> +
> +	fanout_arg = getpid() & 0xffff;
> +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> +
> +	for (q = 0; q < nb_queues; q++) {
> +		/* Open an AF_PACKET socket for this queue... */
> +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +		if (qsockfd == -1) {
> +			RTE_LOG(ERR, PMD,
> +			        "%s: could not open AF_PACKET socket\n",
> +			        name);
> +			goto error;
> +		}
> +
> +		tpver = TPACKET_V2;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> +				&tpver, sizeof(tpver));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_VERSION on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		discard = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> +				&discard, sizeof(discard));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_LOSS on "
> +			        "AF_PACKET socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		bypass = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> +				&bypass, sizeof(bypass));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_QDISC_BYPASS "
> +			        "on AF_PACKET socket for %s\n", name,
> +			        pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rx_queue = &((*internals)->rx_queue[q]);
> +		rx_queue->framecount = req->tp_frame_nr;
> +
> +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> +				    qsockfd, 0);
> +		if (rx_queue->map == MAP_FAILED) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> +				name, pair->value);
> +			goto error;
> +		}
> +
> +		/* rdsize is same for both Tx and Rx */
> +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> +
> +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		if (rx_queue->rd == NULL)
> +			goto error;
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		rx_queue->sockfd = qsockfd;
> +
> +		tx_queue = &((*internals)->tx_queue[q]);
> +		tx_queue->framecount = req->tp_frame_nr;
> +
> +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> +
> +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		if (tx_queue->rd == NULL)
> +			goto error;
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		tx_queue->sockfd = qsockfd;
> +
> +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not bind AF_PACKET socket to %s\n",
> +			        name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> +				&fanout_arg, sizeof(fanout_arg));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> +				"for %s\n", name, pair->value);
> +			goto error;
> +		}
> +	}
> +
> +	/* reserve an ethdev entry */
> +	*eth_dev = rte_eth_dev_allocate(name);
> +	if (*eth_dev == NULL)
> +		goto error;
> +
> +	/*
> +	 * now put it all together
> +	 * - store queue data in internals,
> +	 * - store numa_node info in pci_driver
> +	 * - point eth_dev_data to internals and pci_driver
> +	 * - and point eth_dev structure to new eth_dev_data structure
> +	 */
> +
> +	(*internals)->nb_queues = nb_queues;
> +
> +	data->dev_private = *internals;
> +	data->port_id = (*eth_dev)->data->port_id;
> +	data->nb_rx_queues = (uint16_t)nb_queues;
> +	data->nb_tx_queues = (uint16_t)nb_queues;
> +	data->dev_link = pmd_link;
> +	data->mac_addrs = &(*internals)->eth_addr;
> +
> +	pci_dev->numa_node = numa_node;
> +
> +	(*eth_dev)->data = data;
> +	(*eth_dev)->dev_ops = &ops;
> +	(*eth_dev)->pci_dev = pci_dev;
> +
> +	return 0;
> +
> +error:
> +	if (data)
> +		rte_free(data);
> +	if (pci_dev)
> +		rte_free(pci_dev);
> +	/* *internals is still NULL if an early allocation failed */
> +	if (*internals) {
> +		for (q = 0; q < nb_queues; q++) {
> +			if ((*internals)->rx_queue[q].rd)
> +				rte_free((*internals)->rx_queue[q].rd);
> +			if ((*internals)->tx_queue[q].rd)
> +				rte_free((*internals)->tx_queue[q].rd);
> +		}
> +		rte_free(*internals);
> +	}
> +	return -1;
> +}
> +
> +static int
> +rte_eth_from_packet(const char *name,
> +                    int const *sockfd,
> +                    const unsigned numa_node,
> +                    struct rte_kvargs *kvlist) {
> +	struct pmd_internals *internals = NULL;
> +	struct rte_eth_dev *eth_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	unsigned k_idx;
> +	unsigned int blockcount;
> +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> +	unsigned int framesize = DFLT_FRAME_SIZE;
> +	unsigned int framecount = DFLT_FRAME_COUNT;
> +	unsigned int qpairs = RTE_PMD_PACKET_MAX_RINGS;
> +
> +	/* do some parameter checking */
> +	if (*sockfd < 0)
> +		return -1;
> +
> +	/*
> +	 * Walk arguments for configurable settings
> +	 */
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> +			qpairs = atoi(pair->value);
> +			if (qpairs < 1 ||
> +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid qpairs value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> +			blocksize = atoi(pair->value);
> +			if (!blocksize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid blocksize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> +			framesize = atoi(pair->value);
> +			if (!framesize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framesize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> +			framecount = atoi(pair->value);
> +			if (!framecount) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framecount value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +	}
> +
> +	if (framesize > blocksize) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> +		        name);
> +		return -1;
> +	}
> +
> +	blockcount = framecount / (blocksize / framesize);
> +	if (!blockcount) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> +		return -1;
> +	}
> +
> +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> +
> +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> +	                           blocksize, blockcount,
> +	                           framesize, framecount,
> +	                           numa_node, &internals, &eth_dev,
> +	                           kvlist) < 0)
> +		return -1;
> +
> +	eth_dev->rx_pkt_burst = eth_packet_rx;
> +	eth_dev->tx_pkt_burst = eth_packet_tx;
> +
> +	return 0;
> +}
> +
> +int
> +rte_pmd_packet_devinit(const char *name, const char *params) {
> +	unsigned numa_node;
> +	int ret;
> +	struct rte_kvargs *kvlist;
> +	int sockfd = -1;
> +
> +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> +
> +	numa_node = rte_socket_id();
> +
> +	kvlist = rte_kvargs_parse(params, valid_arguments);
> +	if (kvlist == NULL)
> +		return -1;
> +
> +	/*
> +	 * If iface argument is passed we open the NICs and use them for
> +	 * reading / writing
> +	 */
> +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> +
> +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> +		                         &open_packet_iface, &sockfd);
> +		if (ret < 0)
> +			return -1;
> +	}
> +
> +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> +	close(sockfd); /* no longer needed */
> +
> +	if (ret < 0)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static struct rte_driver pmd_packet_drv = {
> +	.name = "eth_packet",
> +	.type = PMD_VDEV,
> +	.init = rte_pmd_packet_devinit,
> +};
> +
> +PMD_REGISTER_DRIVER(pmd_packet_drv);
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> new file mode 100644
> index 000000000000..f685611da3e9
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> @@ -0,0 +1,55 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_ETH_PACKET_H_
> +#define _RTE_ETH_PACKET_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> +
> +#define RTE_PMD_PACKET_MAX_RINGS 16
> +
> +/**
> + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> + * configured on command line.
> + */
> +int rte_pmd_packet_devinit(const char *name, const char *params);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 34dff2a02a05..a6994c4dbe93 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
>  LDLIBS += -lrte_pmd_pcap -lpcap
>  endif
> 
> +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> +LDLIBS += -lrte_pmd_packet
> +endif
> +
>  endif # plugins
> 
>  LDLIBS += $(EXECENV_LDLIBS)
> --
> 1.9.3
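
For reference, the ring sizing done in rte_eth_from_packet() and the mmap() length used in rte_pmd_init_internals() above can be checked with a small standalone sketch. This is an illustration only and not part of the patch: the struct and function names (ring_geometry, ring_geometry_init) are hypothetical, and the arithmetic simply mirrors the quoted code, where one mapping covers the Rx ring followed by the Tx ring.

```c
#include <stddef.h>

/* Hypothetical container for the AF_PACKET ring parameters (not in the patch). */
struct ring_geometry {
	unsigned int block_size;   /* tp_block_size */
	unsigned int block_count;  /* tp_block_nr   */
	unsigned int frame_size;   /* tp_frame_size */
	unsigned int frame_count;  /* tp_frame_nr   */
	size_t map_len;            /* bytes mapped per queue pair (Rx + Tx) */
};

/*
 * Mirrors the checks in the quoted rte_eth_from_packet():
 * returns 0 on success, -1 if the parameters would be rejected.
 */
static int
ring_geometry_init(struct ring_geometry *g, unsigned int block_size,
		   unsigned int frame_size, unsigned int frame_count)
{
	unsigned int frames_per_block, block_count;

	if (block_size == 0 || frame_size == 0 || frame_count == 0)
		return -1;
	if (frame_size > block_size)
		return -1;

	frames_per_block = block_size / frame_size;
	block_count = frame_count / frames_per_block;
	if (block_count == 0)
		return -1;

	g->block_size = block_size;
	g->block_count = block_count;
	g->frame_size = frame_size;
	g->frame_count = frame_count;
	/* the Rx ring and the Tx ring share one mapping, hence the factor 2 */
	g->map_len = (size_t)2 * block_size * block_count;
	return 0;
}
```

With the defaults from the patch (blocksz 4096, framesz 2048, framecnt 512) this gives 2 frames per block, 256 blocks, and a 2 MiB mapping per queue pair. Because the division truncates, a framecnt that is not a multiple of the frames-per-block value would leave the block count and frame count inconsistent; as far as I know the kernel rejects such a request, so it may be worth checking for that case as well.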


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 16:47         ` Thomas Monjalon
@ 2014-07-11 17:38           ` Richardson, Bruce
  2014-07-11 17:41             ` John W. Linville
  2014-07-12 11:48           ` Neil Horman
  1 sibling, 1 reply; 76+ messages in thread
From: Richardson, Bruce @ 2014-07-11 17:38 UTC (permalink / raw)
  To: Thomas Monjalon, John W. Linville; +Cc: dev

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> Sent: Friday, July 11, 2014 9:48 AM
> To: John W. Linville
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-
> based virtual devices
> 
> 2014-07-11 11:30, John W. Linville:
> > On Fri, Jul 11, 2014 at 05:04:04PM +0200, Thomas Monjalon wrote:
> > > 2014-07-11 10:51, John W. Linville:
> > > > On Fri, Jul 11, 2014 at 03:26:39PM +0200, Thomas Monjalon wrote:
> > > > > Thank you for this nice work.
> > > > >
> > > > > I think it would be well suited to host this PMD as an external one in
> > > > > order to make it work also with DPDK 1.7.0.
> > > >
> > > > I'm not sure I understand the suggestion -- you don't want to merge
> > > > the driver for 1.8?  Or you just want to host this patch somewhere,
> > > > so people can still use it w/ 1.7?
> > >
> > > I suggest to have a separated repository here:
> > > 	http://dpdk.org/browse/
> >
> > I really don't see any reason not to merge it.  It was already delayed
> > by me waiting for all the PMD init changes to settle out in the 1.6
> > release, and I still had to do a few touch-ups for it to compile on
> > 1.7.  I definitely do not want to have to do that over and over again.
> 
> It's a pity that we didn't synchronize our efforts to make it integrated
> during 1.7.0 cycle.
> 
> > Why wouldn't you just merge it?  If someone wants to use it on 1.7,
> > they can just apply the patch.
> 
> I'm OK to merge it. I was only suggesting to host your PMD externally like we
> did for virtio-net-pmd, vmxnet3-usermap and memnic.
> It was the same discussion for the vmxnet3 PMD that Stephen submitted.
> 
> I start thinking that nobody wants PMD to be external. So we may merge this
> one in dpdk.git and start talking what to do for the other ones:
> 	- move memnic in dpdk.git?

Yes, I would agree with this. Having drivers in external git repos makes it hard for us to take them into account when planning on making changes to the core libs.

> 	- move virtio-net-pmd and vmxnet3-usermap where sits their uio
> counterparts?
> 	- merge Brocade's vmxnet3 as new one or as a replacement for
> vmxnet3-uio?

For these we really should try and converge on a single solution. Having multiple vmxnet3 and virtio drivers duplicates effort and is just plain messy! Of course, that's easier to say than to agree on...

/Bruce


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 17:20   ` Zhou, Danny
@ 2014-07-11 17:40     ` John W. Linville
  2014-07-11 18:01       ` Zhou, Danny
                         ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: John W. Linville @ 2014-07-11 17:40 UTC (permalink / raw)
  To: Zhou, Danny; +Cc: dev

On Fri, Jul 11, 2014 at 05:20:42PM +0000, Zhou, Danny wrote:
> It looks like you used a fairly new kernel with socket options that older kernels, such as my 3.12, do not support. When I tried this patch it failed to build, with the compiler errors shown below. Which Linux distribution does this patch work on? How can we ensure it also works on older kernels?
> 
> /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c: In function rte_pmd_init_internals:
> /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:524:17: error: PACKET_FANOUT_FLAG_ROLLOVER undeclared (first use in this function)
> /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:524:17: note: each undeclared identifier is reported only once for each function it appears in
> /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:557:33: error: PACKET_QDISC_BYPASS undeclared (first use in this function)

Both of them are isolated, so to play with it you could just
comment those out.  It looks like PACKET_FANOUT_FLAG_ROLLOVER should
have been in 3.10, while PACKET_QDISC_BYPASS didn't show up until
3.14...

/home/linville/git/linux
[linville-x1.hq.tuxdriver.com]:> git annotate include/uapi/linux/if_packet.h | grep PACKET_FANOUT_FLAG_ROLLOVER
77f65ebdca506	(Willem de Bruijn	2013-03-19 10:18:11 +0000	64)#define PACKET_FANOUT_FLAG_ROLLOVER	0x1000

/home/linville/git/linux
[linville-x1.hq.tuxdriver.com]:> git show -s --format=short 77f65ebdca506
commit 77f65ebdca506870d99bfabe52bde222511022ec
Author: Willem de Bruijn <willemb@google.com>

    packet: packet fanout rollover during socket overload

/home/linville/git/linux
[linville-x1.hq.tuxdriver.com]:> git describe --contains 77f65ebdca506
v3.10-rc1~66^2~423

/home/linville/git/linux
[linville-x1.hq.tuxdriver.com]:> git annotate include/uapi/linux/if_packet.h | grep PACKET_QDISC_BYPASS
d346a3fae3ff1	(Daniel Borkmann	2013-12-06 11:36:17 +0100	56)#define PACKET_QDISC_BYPASS		20

/home/linville/git/linux
[linville-x1.hq.tuxdriver.com]:> git show -s --format=short d346a3fae3ff1
commit d346a3fae3ff1d99f5d0c819bf86edf9094a26a1
Author: Daniel Borkmann <dborkman@redhat.com>

    packet: introduce PACKET_QDISC_BYPASS socket option

/home/linville/git/linux
[linville-x1.hq.tuxdriver.com]:> git describe --contains d346a3fae3ff1
v3.14-rc1~94^2~564

Is there an example of code in DPDK that requires specific kernel
versions?  What is the preferred method for coding such dependencies?
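
One possibility, offered only as a sketch and not as established DPDK practice, would be to test for the macros themselves rather than for a kernel version, so that the build follows whatever <linux/if_packet.h> provides.  The helper names below (fanout_arg_for, set_qdisc_bypass) are hypothetical and do not appear in the patch.  Note that this only checks the headers at build time; a binary built against newer headers could still see setsockopt() fail on an older running kernel, so the return value still needs handling.

```c
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: build the PACKET_FANOUT argument using only the
 * flags that the build-time headers define.  The low 16 bits carry the
 * fanout group id, the high 16 bits carry the mode and flags.
 */
static unsigned int
fanout_arg_for(int group_id)
{
	unsigned int flags = PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG;

#ifdef PACKET_FANOUT_FLAG_ROLLOVER
	flags |= PACKET_FANOUT_FLAG_ROLLOVER;
#endif
	return ((unsigned int)group_id & 0xffff) | (flags << 16);
}

/*
 * Hypothetical helper: request qdisc bypass when the headers know about
 * it, otherwise do nothing and report success, since the option is an
 * optimization rather than a requirement.
 */
static int
set_qdisc_bypass(int sockfd)
{
#ifdef PACKET_QDISC_BYPASS
	int bypass = 1;

	return setsockopt(sockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
			  &bypass, sizeof(bypass));
#else
	(void)sockfd;
	return 0;
#endif
}
```

The unsigned type is deliberate: PACKET_FANOUT_FLAG_DEFRAG shifted left by 16 lands in the top bit, which is not representable in a signed int.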

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 17:38           ` Richardson, Bruce
@ 2014-07-11 17:41             ` John W. Linville
  0 siblings, 0 replies; 76+ messages in thread
From: John W. Linville @ 2014-07-11 17:41 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

On Fri, Jul 11, 2014 at 05:38:17PM +0000, Richardson, Bruce wrote:
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > Sent: Friday, July 11, 2014 9:48 AM
> > To: John W. Linville
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-
> > based virtual devices
> > 
> > 2014-07-11 11:30, John W. Linville:
> > > On Fri, Jul 11, 2014 at 05:04:04PM +0200, Thomas Monjalon wrote:
> > > > 2014-07-11 10:51, John W. Linville:
> > > > > On Fri, Jul 11, 2014 at 03:26:39PM +0200, Thomas Monjalon wrote:
> > > > > > Thank you for this nice work.
> > > > > >
> > > > > > I think it would be well suited to host this PMD as an external one in
> > > > > > order to make it work also with DPDK 1.7.0.
> > > > >
> > > > > I'm not sure I understand the suggestion -- you don't want to merge
> > > > > the driver for 1.8?  Or you just want to host this patch somewhere,
> > > > > so people can still use it w/ 1.7?
> > > >
> > > > I suggest to have a separated repository here:
> > > > 	http://dpdk.org/browse/
> > >
> > > I really don't see any reason not to merge it.  It was already delayed
> > > by me waiting for all the PMD init changes to settle out in the 1.6
> > > release, and I still had to do a few touch-ups for it to compile on
> > > 1.7.  I definitely do not want to have to do that over and over again.
> > 
> > It's a pity that we didn't synchronize our efforts to make it integrated
> > during 1.7.0 cycle.
> > 
> > > Why wouldn't you just merge it?  If someone wants to use it on 1.7,
> > > they can just apply the patch.
> > 
> > I'm OK to merge it. I was only suggesting to host your PMD externally like we
> > did for virtio-net-pmd, vmxnet3-usermap and memnic.
> > It was the same discussion for the vmxnet3 PMD that Stephen submitted.
> > 
> > I start thinking that nobody wants PMD to be external. So we may merge this
> > one in dpdk.git and start talking what to do for the other ones:
> > 	- move memnic in dpdk.git?
> 
> Yes, I would agree with this. Having drivers in external git repos makes it hard for us to take them into account when planning on making changes to the core libs.
> 
> > 	- move virtio-net-pmd and vmxnet3-usermap where sits their uio
> > counterparts?
> > 	- merge Brocade's vmxnet3 as new one or as a replacement for
> > vmxnet3-uio?
> 
> For these we really should try and converge on a single solution. Having multiple vmxnet3 and virtio drivers duplicates effort and is just plain messy! Of course, that's easier to say than to agree on...

+1

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 17:40     ` John W. Linville
@ 2014-07-11 18:01       ` Zhou, Danny
  2014-07-11 18:46         ` John W. Linville
  2014-07-11 19:04       ` Zhou, Danny
  2014-07-11 22:34       ` Thomas Monjalon
  2 siblings, 1 reply; 76+ messages in thread
From: Zhou, Danny @ 2014-07-11 18:01 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

Tried on 3.12, both of them are undefined. Anyway, will comment them out and see what performance it could achieve.

> -----Original Message-----
> From: John W. Linville [mailto:linville@tuxdriver.com]
> Sent: Saturday, July 12, 2014 1:41 AM
> To: Zhou, Danny
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> AF_PACKET-based virtual devices
> 
> On Fri, Jul 11, 2014 at 05:20:42PM +0000, Zhou, Danny wrote:
> > Looks like you used a pretty new kernel version with new socket options that old
> > kernel like my 3.12 does not support. When I tried this patch, it just cannot build, and
> > compiler complains like below. Which Linux distribution does this patch work for?
> > How to ensure it works for old kernels?
> >
> > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c: In function rte_pmd_init_internals:
> > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:524:17: error: PACKET_FANOUT_FLAG_ROLLOVER undeclared (first use in this function)
> > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:524:17: note: each undeclared identifier is reported only once for each function it appears in
> > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:557:33: error: PACKET_QDISC_BYPASS undeclared (first use in this function)
> 
> Both of them are isolated, so for playing with it you could just comment those out.
> It looks like PACKET_FANOUT_FLAG_ROLLOVER should have been in 3.10, while
> PACKET_QDISC_BYPASS didn't show-up until 3.14...
> 
> /home/linville/git/linux
> [linville-x1.hq.tuxdriver.com]:> git annotate include/uapi/linux/if_packet.h | grep PACKET_FANOUT_FLAG_ROLLOVER
> 77f65ebdca506	(Willem de Bruijn	2013-03-19 10:18:11 +0000	64)#define PACKET_FANOUT_FLAG_ROLLOVER	0x1000
> 
> /home/linville/git/linux
> [linville-x1.hq.tuxdriver.com]:> git show -s --format=short 77f65ebdca506
> commit 77f65ebdca506870d99bfabe52bde222511022ec
> Author: Willem de Bruijn <willemb@google.com>
> 
>     packet: packet fanout rollover during socket overload
> 
> /home/linville/git/linux
> [linville-x1.hq.tuxdriver.com]:> git describe --contains 77f65ebdca506
> v3.10-rc1~66^2~423
> 
> /home/linville/git/linux
> [linville-x1.hq.tuxdriver.com]:> git annotate include/uapi/linux/if_packet.h | grep PACKET_QDISC_BYPASS
> d346a3fae3ff1	(Daniel Borkmann	2013-12-06 11:36:17 +0100	56)#define PACKET_QDISC_BYPASS		20
> 
> /home/linville/git/linux
> [linville-x1.hq.tuxdriver.com]:> git show -s --format=short d346a3fae3ff1
> commit d346a3fae3ff1d99f5d0c819bf86edf9094a26a1
> Author: Daniel Borkmann <dborkman@redhat.com>
> 
>     packet: introduce PACKET_QDISC_BYPASS socket option
> 
> /home/linville/git/linux
> [linville-x1.hq.tuxdriver.com]:> git describe --contains d346a3fae3ff1
> v3.14-rc1~94^2~564
> 
> Is there an example of code in DPDK that requires specific kernel versions?  What is
> the preferred method for coding such dependencies?
> 
> John
> --
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 18:01       ` Zhou, Danny
@ 2014-07-11 18:46         ` John W. Linville
  2014-07-12  0:42           ` Zhou, Danny
  0 siblings, 1 reply; 76+ messages in thread
From: John W. Linville @ 2014-07-11 18:46 UTC (permalink / raw)
  To: Zhou, Danny; +Cc: dev

Not sure what the issue might be, PACKET_FANOUT_FLAG_ROLLOVER is
defined in include/uapi/linux/if_packet.h in the v3.12 tree.

On Fri, Jul 11, 2014 at 06:01:27PM +0000, Zhou, Danny wrote:
> Tried on 3.12, both of them are undefined. Anyway, will comment them out and see what performance it could achieve.

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 17:40     ` John W. Linville
  2014-07-11 18:01       ` Zhou, Danny
@ 2014-07-11 19:04       ` Zhou, Danny
  2014-07-11 19:31         ` John W. Linville
  2014-07-11 22:34       ` Thomas Monjalon
  2 siblings, 1 reply; 76+ messages in thread
From: Zhou, Danny @ 2014-07-11 19:04 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

Does it support specifying multiple NIC interfaces in a single command line option like "--vdev=eth_packet0,iface=..."? Say, "iface=eth0,eth1,eth2..."; I tried that, but it doesn't work.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 19:04       ` Zhou, Danny
@ 2014-07-11 19:31         ` John W. Linville
  2014-07-11 20:27           ` Zhou, Danny
  0 siblings, 1 reply; 76+ messages in thread
From: John W. Linville @ 2014-07-11 19:31 UTC (permalink / raw)
  To: Zhou, Danny; +Cc: dev

I'm not sure that would make any sense -- the AF_PACKET sockets are
mapped to specific interfaces.

What are you trying to do with a syntax like that?

John

On Fri, Jul 11, 2014 at 07:04:19PM +0000, Zhou, Danny wrote:
> Does it support specifying multiple NIC interfaces using command line option like "--vdev=eth_packet0,iface=..."? Say "iface=eth0,eth1,eth2...", tried but it doesn't work.

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 19:31         ` John W. Linville
@ 2014-07-11 20:27           ` Zhou, Danny
  2014-07-11 20:31             ` Shaw, Jeffrey B
  0 siblings, 1 reply; 76+ messages in thread
From: Zhou, Danny @ 2014-07-11 20:27 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

I want to run a common DPDK L2 or L3 forwarding benchmark with bidirectional traffic, so at least two ports are required. It is just like measuring Linux bridge or OVS performance, where you need to add at least two ports to a bridge.

> -----Original Message-----
> From: John W. Linville [mailto:linville@tuxdriver.com]
> Sent: Saturday, July 12, 2014 3:32 AM
> To: Zhou, Danny
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> AF_PACKET-based virtual devices
> 
> I'm not sure that would make any sense -- the AF_PACKET sockets are mapped to
> specific interfaces.
> 
> What are you trying to do with a syntax like that?
> 
> John
> 
> On Fri, Jul 11, 2014 at 07:04:19PM +0000, Zhou, Danny wrote:
> > Does it support specifying multiple NIC interfaces using command line option like
> "--vdev=eth_packet0,iface=..."? Say "iface=eth0,eth1,eth2...", tried but it doesn't
> work.
> >
> > > -----Original Message-----
> > > From: Zhou, Danny
> > > Sent: Saturday, July 12, 2014 2:01 AM
> > > To: 'John W. Linville'
> > > Cc: dev@dpdk.org
> > > Subject: RE: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > > AF_PACKET-based virtual devices
> > >
> > > Tried on 3.12, both of them are undefined. Anyway, will comment them
> > > out and see what performance it could achieve.
> > >
> > > > -----Original Message-----
> > > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > > Sent: Saturday, July 12, 2014 1:41 AM
> > > > To: Zhou, Danny
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > > > AF_PACKET-based virtual devices
> > > >
> > > > On Fri, Jul 11, 2014 at 05:20:42PM +0000, Zhou, Danny wrote:
> > > > > Looks like you used a pretty new kernel version with new socket
> > > > > options that old
> > > > kernel like my 3.12 does not support. When I tried this patch, it
> > > > just cannot build, and compiler complains like below. Which Linux
> > > > distribution does this
> > > patch work for?
> > > > How to ensure it works for old kernels?
> > > > >
> > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > : In function
> > > > rte_pmd_init_internals:
> > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > :524
> > > > > :1
> > > > > 7: error: PACKET_FANOUT_FLAG_ROLLOVER undeclared (first use in
> > > > > this
> > > > > function)
> > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > :524
> > > > > :1
> > > > > 7: note: each undeclared identifier is reported only once for
> > > > > each function it appears in
> > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > :557
> > > > > :3
> > > > > 3: error: PACKET_QDISC_BYPASS undeclared (first use in this
> > > > > function)
> > > >
> > > > Both of them are isolated, so for playing with it you could just comment those
> out.
> > > > It looks like PACKET_FANOUT_FLAG_ROLLOVER should have been in
> > > > 3.10, while PACKET_QDISC_BYPASS didn't show-up until 3.14...
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git annotate
> > > > include/uapi/linux/if_packet.h | grep PACKET_FANOUT_FLAG_ROLLOVER
> > > > 77f65ebdca506	(Willem de Bruijn	2013-03-19 10:18:11 +0000	64)#define
> > > > PACKET_FANOUT_FLAG_ROLLOVER	0x1000
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git show -s --format=short
> > > > 77f65ebdca506 commit 77f65ebdca506870d99bfabe52bde222511022ec
> > > > Author: Willem de Bruijn <willemb@google.com>
> > > >
> > > >     packet: packet fanout rollover during socket overload
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git describe --contains
> > > > 77f65ebdca506
> > > > v3.10-rc1~66^2~423
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git annotate
> > > > include/uapi/linux/if_packet.h | grep PACKET_QDISC_BYPASS
> > > > d346a3fae3ff1	(Daniel Borkmann	2013-12-06 11:36:17 +0100	56)#define
> > > > PACKET_QDISC_BYPASS		20
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git show -s --format=short
> > > > d346a3fae3ff1 commit
> > > > d346a3fae3ff1d99f5d0c819bf86edf9094a26a1
> > > > Author: Daniel Borkmann <dborkman@redhat.com>
> > > >
> > > >     packet: introduce PACKET_QDISC_BYPASS socket option
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git describe --contains
> > > > d346a3fae3ff1
> > > > v3.14-rc1~94^2~564
> > > >
> > > > Is there an example of code in DPDK that requires specific kernel
> > > > versions?  What is the preferred method for coding such dependencies?
> > > >
> > > > John
> > > > --
> > > > John W. Linville		Someday the world will need a hero, and you
> > > > linville@tuxdriver.com			might be all we have.  Be ready.
> >
> 
> --
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 20:27           ` Zhou, Danny
@ 2014-07-11 20:31             ` Shaw, Jeffrey B
  2014-07-11 20:35               ` Zhou, Danny
  0 siblings, 1 reply; 76+ messages in thread
From: Shaw, Jeffrey B @ 2014-07-11 20:31 UTC (permalink / raw)
  To: Zhou, Danny, John W. Linville; +Cc: dev

Danny, can you specify multiple --vdev parameters?
"--vdev=eth_packet0,iface=eth0 --vdev=eth_packet1,iface=eth1"


-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhou, Danny
Sent: Friday, July 11, 2014 1:27 PM
To: John W. Linville
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices

I want to run a common DPDK L2 or L3 forwarding benchmark with bidirectional traffic, so at least two ports are required. It is just like measuring Linux bridge or OVS performance, where you need to add at least two ports to a bridge.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 20:31             ` Shaw, Jeffrey B
@ 2014-07-11 20:35               ` Zhou, Danny
  2014-07-11 20:40                 ` John W. Linville
  0 siblings, 1 reply; 76+ messages in thread
From: Zhou, Danny @ 2014-07-11 20:35 UTC (permalink / raw)
  To: Shaw, Jeffrey B, John W. Linville; +Cc: dev

Thanks Jeff, it works as expected, like below command line:

./l2fwd/build/l2fwd -c 0x3 -n 4 --vdev=eth_packet0,iface=p786p1 --vdev=eth_packet1,iface=p786p2 -- -p 0x3

> -----Original Message-----
> From: Shaw, Jeffrey B
> Sent: Saturday, July 12, 2014 4:32 AM
> To: Zhou, Danny; John W. Linville
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> AF_PACKET-based virtual devices
> 
> Danny, can you specify multiple --vdev parameters?
> "--vdev=eth_packet0,iface=eth0 --vdev=eth_packet1,iface=eth1"
> 
> 
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhou, Danny
> Sent: Friday, July 11, 2014 1:27 PM
> To: John W. Linville
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> AF_PACKET-based virtual devices
> 
> I want to run a common DPDK L2 or L3 forward benchmark for bi-direction traffics,
> so at least two ports are required. Just like how to measure Linux bridge or OVS
> performance, you need add at least two ports into a bridge.
> 
> > -----Original Message-----
> > From: John W. Linville [mailto:linville@tuxdriver.com]
> > Sent: Saturday, July 12, 2014 3:32 AM
> > To: Zhou, Danny
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > AF_PACKET-based virtual devices
> >
> > I'm not sure that would make any sense -- the AF_PACKET sockets are
> > mapped to specific interfaces.
> >
> > What are you trying to do with a syntax like that?
> >
> > John
> >
> > On Fri, Jul 11, 2014 at 07:04:19PM +0000, Zhou, Danny wrote:
> > > Does it support specifying multiple NIC interfaces using command
> > > line option like
> > "--vdev=eth_packet0,iface=..."? Say "iface=eth0,eth1,eth2...", tried
> > but it doesn't work.
> > >
> > > > -----Original Message-----
> > > > From: Zhou, Danny
> > > > Sent: Saturday, July 12, 2014 2:01 AM
> > > > To: 'John W. Linville'
> > > > Cc: dev@dpdk.org
> > > > Subject: RE: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > > > AF_PACKET-based virtual devices
> > > >
> > > > Tried on 3.12, both of them are undefined. Anyway, will comment
> > > > them out and see what performance it could achieve.
> > > >
> > > > > -----Original Message-----
> > > > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > > > Sent: Saturday, July 12, 2014 1:41 AM
> > > > > To: Zhou, Danny
> > > > > Cc: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > > > > AF_PACKET-based virtual devices
> > > > >
> > > > > On Fri, Jul 11, 2014 at 05:20:42PM +0000, Zhou, Danny wrote:
> > > > > > Looks like you used a pretty new kernel version with new
> > > > > > socket options that old
> > > > > kernel like my 3.12 does not support. When I tried this patch,
> > > > > it just cannot build, and compiler complains like below. Which
> > > > > Linux distribution does this
> > > > patch work for?
> > > > > How to ensure it works for old kernels?
> > > > > >
> > > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet
> > > > > > .c
> > > > > > : In function
> > > > > rte_pmd_init_internals:
> > > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet
> > > > > > .c
> > > > > > :524
> > > > > > :1
> > > > > > 7: error: PACKET_FANOUT_FLAG_ROLLOVER undeclared (first use in
> > > > > > this
> > > > > > function)
> > > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet
> > > > > > .c
> > > > > > :524
> > > > > > :1
> > > > > > 7: note: each undeclared identifier is reported only once for
> > > > > > each function it appears in
> > > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet
> > > > > > .c
> > > > > > :557
> > > > > > :3
> > > > > > 3: error: PACKET_QDISC_BYPASS undeclared (first use in this
> > > > > > function)
> > > > >
> > > > > Both of them are isolated, so for playing with it you could just
> > > > > comment those
> > out.
> > > > > It looks like PACKET_FANOUT_FLAG_ROLLOVER should have been in
> > > > > 3.10, while PACKET_QDISC_BYPASS didn't show-up until 3.14...
> > > > >
> > > > > /home/linville/git/linux
> > > > > [linville-x1.hq.tuxdriver.com]:> git annotate
> > > > > include/uapi/linux/if_packet.h | grep PACKET_FANOUT_FLAG_ROLLOVER
> > > > > 77f65ebdca506	(Willem de Bruijn	2013-03-19 10:18:11 +0000
> 	64)#define
> > > > > PACKET_FANOUT_FLAG_ROLLOVER	0x1000
> > > > >
> > > > > /home/linville/git/linux
> > > > > [linville-x1.hq.tuxdriver.com]:> git show -s --format=short
> > > > > 77f65ebdca506 commit 77f65ebdca506870d99bfabe52bde222511022ec
> > > > > Author: Willem de Bruijn <willemb@google.com>
> > > > >
> > > > >     packet: packet fanout rollover during socket overload
> > > > >
> > > > > /home/linville/git/linux
> > > > > [linville-x1.hq.tuxdriver.com]:> git describe --contains
> > > > > 77f65ebdca506
> > > > > v3.10-rc1~66^2~423
> > > > >
> > > > > /home/linville/git/linux
> > > > > [linville-x1.hq.tuxdriver.com]:> git annotate
> > > > > include/uapi/linux/if_packet.h | grep PACKET_QDISC_BYPASS
> > > > > d346a3fae3ff1	(Daniel Borkmann	2013-12-06 11:36:17 +0100	56)#define
> > > > > PACKET_QDISC_BYPASS		20
> > > > >
> > > > > /home/linville/git/linux
> > > > > [linville-x1.hq.tuxdriver.com]:> git show -s --format=short
> > > > > d346a3fae3ff1 commit
> > > > > d346a3fae3ff1d99f5d0c819bf86edf9094a26a1
> > > > > Author: Daniel Borkmann <dborkman@redhat.com>
> > > > >
> > > > >     packet: introduce PACKET_QDISC_BYPASS socket option
> > > > >
> > > > > /home/linville/git/linux
> > > > > [linville-x1.hq.tuxdriver.com]:> git describe --contains
> > > > > d346a3fae3ff1
> > > > > v3.14-rc1~94^2~564
> > > > >
> > > > > Is there an example of code in DPDK that requires specific
> > > > > kernel versions?  What is the preferred method for coding such
> dependencies?
> > > > >
> > > > > John
> > > > > --
> > > > > John W. Linville		Someday the world will need a hero, and you
> > > > > linville@tuxdriver.com			might be all we have.  Be ready.
> > >
> >
> > --
> > John W. Linville		Someday the world will need a hero, and you
> > linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 20:35               ` Zhou, Danny
@ 2014-07-11 20:40                 ` John W. Linville
  0 siblings, 0 replies; 76+ messages in thread
From: John W. Linville @ 2014-07-11 20:40 UTC (permalink / raw)
  To: Zhou, Danny; +Cc: dev

Ah, yes...sorry, I misunderstood what you wanted to do.  The syntax
below is what I would expect to use.

John

On Fri, Jul 11, 2014 at 08:35:01PM +0000, Zhou, Danny wrote:
> Thanks Jeff, it works as expected with the command line below:
> 
> ./l2fwd/build/l2fwd -c 0x3 -n 4 --vdev=eth_packet0,iface=p786p1 --vdev=eth_packet1,iface=p786p2 -- -p 0x3
> 
> > -----Original Message-----
> > From: Shaw, Jeffrey B
> > Sent: Saturday, July 12, 2014 4:32 AM
> > To: Zhou, Danny; John W. Linville
> > Cc: dev@dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > AF_PACKET-based virtual devices
> > 
> > Danny, can you specify multiple --vdev parameters?
> > "--vdev=eth_packet0,iface=eth0 --vdev=eth_packet1,iface=eth1"
> > 
> > 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhou, Danny
> > Sent: Friday, July 11, 2014 1:27 PM
> > To: John W. Linville
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > AF_PACKET-based virtual devices
> > 
> > I want to run a common DPDK L2 or L3 forward benchmark for bi-direction traffics,
> > so at least two ports are required. Just like how to measure Linux bridge or OVS
> > performance, you need add at least two ports into a bridge.
> > 
> > > -----Original Message-----
> > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > Sent: Saturday, July 12, 2014 3:32 AM
> > > To: Zhou, Danny
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > > AF_PACKET-based virtual devices
> > >
> > > I'm not sure that would make any sense -- the AF_PACKET sockets are
> > > mapped to specific interfaces.
> > >
> > > What are you trying to do with a syntax like that?
> > >
> > > John
> > >
> > > On Fri, Jul 11, 2014 at 07:04:19PM +0000, Zhou, Danny wrote:
> > > > Does it support specifying multiple NIC interfaces using command
> > > > line option like
> > > "--vdev=eth_packet0,iface=..."? Say "iface=eth0,eth1,eth2...", tried
> > > but it doesn't work.
> > > >
> > > > > -----Original Message-----
> > > > > From: Zhou, Danny
> > > > > Sent: Saturday, July 12, 2014 2:01 AM
> > > > > To: 'John W. Linville'
> > > > > Cc: dev@dpdk.org
> > > > > Subject: RE: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > > > > AF_PACKET-based virtual devices
> > > > >
> > > > > Tried on 3.12, both of them are undefined. Anyway, will comment
> > > > > them out and see what performance it could achieve.
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > > > > Sent: Saturday, July 12, 2014 1:41 AM
> > > > > > To: Zhou, Danny
> > > > > > Cc: dev@dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > > > > > AF_PACKET-based virtual devices
> > > > > >
> > > > > > On Fri, Jul 11, 2014 at 05:20:42PM +0000, Zhou, Danny wrote:
> > > > > > > Looks like you used a pretty new kernel version with new
> > > > > > > socket options that old
> > > > > > kernel like my 3.12 does not support. When I tried this patch,
> > > > > > it just cannot build, and compiler complains like below. Which
> > > > > > Linux distribution does this
> > > > > patch work for?
> > > > > > How to ensure it works for old kernels?
> > > > > > >
> > > > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet
> > > > > > > .c
> > > > > > > : In function
> > > > > > rte_pmd_init_internals:
> > > > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet
> > > > > > > .c
> > > > > > > :524
> > > > > > > :1
> > > > > > > 7: error: PACKET_FANOUT_FLAG_ROLLOVER undeclared (first use in
> > > > > > > this
> > > > > > > function)
> > > > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet
> > > > > > > .c
> > > > > > > :524
> > > > > > > :1
> > > > > > > 7: note: each undeclared identifier is reported only once for
> > > > > > > each function it appears in
> > > > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet
> > > > > > > .c
> > > > > > > :557
> > > > > > > :3
> > > > > > > 3: error: PACKET_QDISC_BYPASS undeclared (first use in this
> > > > > > > function)
> > > > > >
> > > > > > Both of them are isolated, so for playing with it you could just
> > > > > > comment those
> > > out.
> > > > > > It looks like PACKET_FANOUT_FLAG_ROLLOVER should have been in
> > > > > > 3.10, while PACKET_QDISC_BYPASS didn't show-up until 3.14...
> > > > > >
> > > > > > /home/linville/git/linux
> > > > > > [linville-x1.hq.tuxdriver.com]:> git annotate
> > > > > > include/uapi/linux/if_packet.h | grep PACKET_FANOUT_FLAG_ROLLOVER
> > > > > > 77f65ebdca506	(Willem de Bruijn	2013-03-19 10:18:11 +0000
> > 	64)#define
> > > > > > PACKET_FANOUT_FLAG_ROLLOVER	0x1000
> > > > > >
> > > > > > /home/linville/git/linux
> > > > > > [linville-x1.hq.tuxdriver.com]:> git show -s --format=short
> > > > > > 77f65ebdca506 commit 77f65ebdca506870d99bfabe52bde222511022ec
> > > > > > Author: Willem de Bruijn <willemb@google.com>
> > > > > >
> > > > > >     packet: packet fanout rollover during socket overload
> > > > > >
> > > > > > /home/linville/git/linux
> > > > > > [linville-x1.hq.tuxdriver.com]:> git describe --contains
> > > > > > 77f65ebdca506
> > > > > > v3.10-rc1~66^2~423
> > > > > >
> > > > > > /home/linville/git/linux
> > > > > > [linville-x1.hq.tuxdriver.com]:> git annotate
> > > > > > include/uapi/linux/if_packet.h | grep PACKET_QDISC_BYPASS
> > > > > > d346a3fae3ff1	(Daniel Borkmann	2013-12-06 11:36:17 +0100	56)#define
> > > > > > PACKET_QDISC_BYPASS		20
> > > > > >
> > > > > > /home/linville/git/linux
> > > > > > [linville-x1.hq.tuxdriver.com]:> git show -s --format=short
> > > > > > d346a3fae3ff1 commit
> > > > > > d346a3fae3ff1d99f5d0c819bf86edf9094a26a1
> > > > > > Author: Daniel Borkmann <dborkman@redhat.com>
> > > > > >
> > > > > >     packet: introduce PACKET_QDISC_BYPASS socket option
> > > > > >
> > > > > > /home/linville/git/linux
> > > > > > [linville-x1.hq.tuxdriver.com]:> git describe --contains
> > > > > > d346a3fae3ff1
> > > > > > v3.14-rc1~94^2~564
> > > > > >
> > > > > > Is there an example of code in DPDK that requires specific
> > > > > > kernel versions?  What is the preferred method for coding such
> > dependencies?
> > > > > >
> > > > > > John
> > > > > > --
> > > > > > John W. Linville		Someday the world will need a hero, and you
> > > > > > linville@tuxdriver.com			might be all we have.  Be ready.
> > > >
> > >
> > > --
> > > John W. Linville		Someday the world will need a hero, and you
> > > linville@tuxdriver.com			might be all we have.  Be ready.
> 

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-10 20:32 [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices John W. Linville
                   ` (2 preceding siblings ...)
       [not found] ` <D0158A423229094DA7ABF71CF2FA0DA3117D3A23@shsmsx102.ccr.corp.intel.com>
@ 2014-07-11 22:30 ` Thomas Monjalon
  2014-07-14 17:53   ` John W. Linville
  2014-07-11 22:51 ` Bruce Richardson
  2014-07-14 18:24 ` [dpdk-dev] [PATCH v2] " John W. Linville
  5 siblings, 1 reply; 76+ messages in thread
From: Thomas Monjalon @ 2014-07-11 22:30 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

About the form of the patch, I have 2 comments:

1) A doc explaining the design, the dependencies and how it can be used would 
be a great help. Could you write it in rst format?

2) checkpatch.pl returns these errors:

ERROR:SPACING: space required before the open parenthesis '('
#468: FILE: lib/librte_pmd_packet/rte_eth_packet.c:250:
+               if(sockfd != -1)

ERROR:SPACING: space required before the open parenthesis '('
#471: FILE: lib/librte_pmd_packet/rte_eth_packet.c:253:
+               if(sockfd != -1)

ERROR:SPACING: spaces required around that '=' (ctx:VxV)
#712: FILE: lib/librte_pmd_packet/rte_eth_packet.c:494:
+               ifr.ifr_name[ifnamelen]='\0';

Thanks
-- 
Thomas

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 17:40     ` John W. Linville
  2014-07-11 18:01       ` Zhou, Danny
  2014-07-11 19:04       ` Zhou, Danny
@ 2014-07-11 22:34       ` Thomas Monjalon
  2014-07-14 13:46         ` John W. Linville
  2 siblings, 1 reply; 76+ messages in thread
From: Thomas Monjalon @ 2014-07-11 22:34 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

2014-07-11 13:40, John W. Linville:
> Is there an example of code in DPDK that requires specific kernel
> versions?  What is the preferred method for coding such dependencies?

No, there is no userspace code in DPDK that checks the kernel version.
Feel free to use whatever method you think is best.
Please keep in mind that checking version numbers is a maintenance nightmare 
because of backports (like Red Hat does ;).
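
One way to avoid version checks, sketched here as an assumption rather than
as what the patch does, is to test for the option macro at build time and
to tolerate a setsockopt() failure at run time. try_enable_qdisc_bypass is
a hypothetical helper name.

```c
#include <sys/socket.h>
#ifdef __linux__
#include <linux/if_packet.h>
#endif

#ifndef SOL_PACKET
#define SOL_PACKET 263 /* Linux value; fallback for headers lacking it */
#endif

/*
 * Hypothetical helper: try to enable PACKET_QDISC_BYPASS on sockfd.
 * Returns 1 if the option was applied, 0 if it is unavailable, either
 * because the build headers lack the macro or because the call failed
 * (for example on a kernel without the option). No version number is
 * inspected, so backported kernels are handled naturally.
 */
static int try_enable_qdisc_bypass(int sockfd)
{
#ifdef PACKET_QDISC_BYPASS
    int on = 1;

    if (setsockopt(sockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
                   &on, sizeof(on)) == 0)
        return 1;
    return 0;
#else
    (void)sockfd; /* option unknown at build time: skip it */
    return 0;
#endif
}
```

A caller could treat a 0 return as "continue without the bypass" instead of
failing device creation.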

-- 
Thomas

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-10 20:32 [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices John W. Linville
                   ` (3 preceding siblings ...)
  2014-07-11 22:30 ` Thomas Monjalon
@ 2014-07-11 22:51 ` Bruce Richardson
  2014-07-14 13:48   ` John W. Linville
  2014-07-14 18:24 ` [dpdk-dev] [PATCH v2] " John W. Linville
  5 siblings, 1 reply; 76+ messages in thread
From: Bruce Richardson @ 2014-07-11 22:51 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

On Thu, Jul 10, 2014 at 04:32:49PM -0400, John W. Linville wrote:
> This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> socket.  This implementation uses mmap'ed ring buffers to limit copying
> and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> AF_PACKET is used for frame reception.  In the current implementation,
> Tx and Rx queues are always paired, and therefore are always equal
> in number -- changing this would be a Simple Matter Of Programming.
> 
> Interfaces of this type are created with a command line option like
> "--vdev=eth_packet0,iface=...".  There are a number of options availabe
> as arguments:
> 
>  - Interface is chosen by "iface" (required)
>  - Number of queue pairs set by "qpairs" (optional, default: 16)
>  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
>  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
>  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> 
> Signed-off-by: John W. Linville <linville@tuxdriver.com>
> ---
> This PMD is intended to provide a means for using DPDK on a broad
> range of hardware without hardware-specific PMDs and (hopefully)
> with better performance than what PCAP offers in Linux.  This might
> be useful as a development platform for DPDK applications when
> DPDK-supported hardware is expensive or unavailable.
> 
Hi John,

I'm just trying this out now on a Fedora 20 machine, using kernel 3.14.9-200.fc20.x86_64. However, while the first packet PMD port initializes correctly, the subsequent ones do not. Please see output from my test run below. All four ports are of the same type.

Regards,
/Bruce

bruce@silpixa00372841:dpdk.org$ sudo ./x86_64-native-linuxapp-gcc/app/testpmd -c 600 -n 4 --vdev=eth_packet0,iface=eth0,qpairs=1 --vdev=eth_packet1,iface=eth1,qpairs=1 --vdev=eth_packet2,iface=p802p1,qpairs=1 --vdev=eth_packet3,iface=p9p3,qpairs=1 -- --mbcache=250 --burst=32 --total-num-mbufs=65536
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 1 on socket 0
EAL: Detected lcore 2 as core 2 on socket 0
EAL: Detected lcore 3 as core 3 on socket 0
EAL: Detected lcore 4 as core 4 on socket 0
EAL: Detected lcore 5 as core 5 on socket 0
EAL: Detected lcore 6 as core 6 on socket 0
EAL: Detected lcore 7 as core 7 on socket 0
EAL: Detected lcore 8 as core 0 on socket 1
EAL: Detected lcore 9 as core 1 on socket 1
EAL: Detected lcore 10 as core 2 on socket 1
EAL: Detected lcore 11 as core 3 on socket 1
EAL: Detected lcore 12 as core 4 on socket 1
EAL: Detected lcore 13 as core 5 on socket 1
EAL: Detected lcore 14 as core 6 on socket 1
EAL: Detected lcore 15 as core 7 on socket 1
EAL: Detected lcore 16 as core 0 on socket 0
EAL: Detected lcore 17 as core 1 on socket 0
EAL: Detected lcore 18 as core 2 on socket 0
EAL: Detected lcore 19 as core 3 on socket 0
EAL: Detected lcore 20 as core 4 on socket 0
EAL: Detected lcore 21 as core 5 on socket 0
EAL: Detected lcore 22 as core 6 on socket 0
EAL: Detected lcore 23 as core 7 on socket 0
EAL: Detected lcore 24 as core 0 on socket 1
EAL: Detected lcore 25 as core 1 on socket 1
EAL: Detected lcore 26 as core 2 on socket 1
EAL: Detected lcore 27 as core 3 on socket 1
EAL: Detected lcore 28 as core 4 on socket 1
EAL: Detected lcore 29 as core 5 on socket 1
EAL: Detected lcore 30 as core 6 on socket 1
EAL: Detected lcore 31 as core 7 on socket 1
EAL: Support maximum 64 logical core(s) by configuration.
EAL: Detected 32 lcore(s)
EAL: No free hugepages reported in hugepages-2048kB
EAL:   unsupported IOMMU type!
EAL: VFIO support could not be initialized
EAL: Setting up memory...
EAL: Ask a virtual area of 0x80000000 bytes
EAL: Virtual area found at 0x7f9fc0000000 (size = 0x80000000)
EAL: Ask a virtual area of 0x80000000 bytes
EAL: Virtual area found at 0x7f9f00000000 (size = 0x80000000)
EAL: Requesting 2 pages of size 1024MB from socket 0
EAL: Requesting 2 pages of size 1024MB from socket 1
EAL: TSC frequency is ~2693512 KHz
EAL: Master core 9 is ready (tid=f4511880)
init (0) eth_packet0
PMD: Initializing pmd_packet for eth_packet0
PMD: eth_packet0: AF_PACKET MMAP parameters:
PMD: eth_packet0:	block size 4096
PMD: eth_packet0:	block count 256
PMD: eth_packet0:	frame size 2048
PMD: eth_packet0:	frame count 512
PMD: eth_packet0: creating AF_PACKET-backed ethdev on numa socket 1
init (0) eth_packet1
PMD: Initializing pmd_packet for eth_packet1
PMD: eth_packet1: AF_PACKET MMAP parameters:
PMD: eth_packet1:	block size 4096
PMD: eth_packet1:	block count 256
PMD: eth_packet1:	frame size 2048
PMD: eth_packet1:	frame count 512
PMD: eth_packet1: creating AF_PACKET-backed ethdev on numa socket 1
PMD: eth_packet1: could not set PACKET_FANOUT on AF_PACKET socket for eth1
init (0) eth_packet2
PMD: Initializing pmd_packet for eth_packet2
PMD: eth_packet2: AF_PACKET MMAP parameters:
PMD: eth_packet2:	block size 4096
PMD: eth_packet2:	block count 256
PMD: eth_packet2:	frame size 2048
PMD: eth_packet2:	frame count 512
PMD: eth_packet2: creating AF_PACKET-backed ethdev on numa socket 1
PMD: eth_packet2: could not set PACKET_FANOUT on AF_PACKET socket for p802p1
init (0) eth_packet3
PMD: Initializing pmd_packet for eth_packet3
PMD: eth_packet3: AF_PACKET MMAP parameters:
PMD: eth_packet3:	block size 4096
PMD: eth_packet3:	block count 256
PMD: eth_packet3:	frame size 2048
PMD: eth_packet3:	frame count 512
PMD: eth_packet3: creating AF_PACKET-backed ethdev on numa socket 1
PMD: eth_packet3: could not set PACKET_FANOUT on AF_PACKET socket for p9p3
EAL: Core 10 is ready (tid=f34d9700)
EAL: PCI device 0000:04:00.0 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL:   0000:04:00.0 not managed by UIO driver, skipping
EAL: PCI device 0000:04:00.1 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL:   0000:04:00.1 not managed by UIO driver, skipping
EAL: PCI device 0000:04:00.2 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL:   0000:04:00.2 not managed by UIO driver, skipping
EAL: PCI device 0000:04:00.3 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL:   0000:04:00.3 not managed by UIO driver, skipping
EAL: PCI device 0000:0c:00.0 on NUMA socket 0
EAL:   probe driver: 8086:10fb rte_ixgbe_pmd
EAL:   0000:0c:00.0 not managed by UIO driver, skipping
EAL: PCI device 0000:0c:00.1 on NUMA socket 0
EAL:   probe driver: 8086:10fb rte_ixgbe_pmd
EAL:   0000:0c:00.1 not managed by UIO driver, skipping
EAL: PCI device 0000:84:00.0 on NUMA socket 1
EAL:   probe driver: 8086:154a rte_ixgbe_pmd
EAL:   0000:84:00.0 not managed by UIO driver, skipping
EAL: PCI device 0000:84:00.1 on NUMA socket 1
EAL:   probe driver: 8086:154a rte_ixgbe_pmd
EAL:   0000:84:00.1 not managed by UIO driver, skipping
EAL: PCI device 0000:87:00.0 on NUMA socket 1
EAL:   probe driver: 8086:154a rte_ixgbe_pmd
EAL:   0000:87:00.0 not managed by UIO driver, skipping
EAL: PCI device 0000:87:00.1 on NUMA socket 1
EAL:   probe driver: 8086:154a rte_ixgbe_pmd
EAL:   0000:87:00.1 not managed by UIO driver, skipping
EAL: PCI device 0000:8b:00.0 on NUMA socket 1
EAL:   probe driver: 8086:154a rte_ixgbe_pmd
EAL:   0000:8b:00.0 not managed by UIO driver, skipping
EAL: PCI device 0000:8b:00.1 on NUMA socket 1
EAL:   probe driver: 8086:154a rte_ixgbe_pmd
EAL:   0000:8b:00.1 not managed by UIO driver, skipping
EAL: PCI device 0000:8e:00.0 on NUMA socket 1
EAL:   probe driver: 8086:154a rte_ixgbe_pmd
EAL:   0000:8e:00.0 not managed by UIO driver, skipping
EAL: PCI device 0000:8e:00.1 on NUMA socket 1
EAL:   probe driver: 8086:154a rte_ixgbe_pmd
EAL:   0000:8e:00.1 not managed by UIO driver, skipping
Configuring Port 0 (socket 0)
Port 0: 68:05:CA:19:F0:50
Checking link statuses...
Port 0 Link Up - speed 10000 Mbps - full-duplex
Done
No commandline core given, start packet forwarding

Warning! Cannot handle an odd number of ports with the current port topology. Configuration must be changed to have an even number of ports, or relaunch application with --port-topology=chained

  io packet forwarding - CRC stripping disabled - packets/burst=32
  nb forwarding cores=1 - nb forwarding ports=1
  RX queues=1 - RX desc=128 - RX free threshold=0
  RX threshold registers: pthresh=8 hthresh=8 wthresh=0
  TX queues=1 - TX desc=512 - TX free threshold=0
  TX threshold registers: pthresh=32 hthresh=0 wthresh=0
  TX RS bit threshold=0 - TXQ flags=0x0
Press enter to exit

 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 18:46         ` John W. Linville
@ 2014-07-12  0:42           ` Zhou, Danny
  2014-07-14 13:45             ` John W. Linville
  0 siblings, 1 reply; 76+ messages in thread
From: Zhou, Danny @ 2014-07-12  0:42 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

I just upgraded my kernel to 3.15.5 and, as a workaround, hardcoded the definitions below (copied from include/uapi/linux/if_packet.h) into librte_pmd_packet.c. I can now receive and transmit packets. Commenting out PACKET_FANOUT_FLAG_ROLLOVER results in no packets being received.

#define PACKET_QDISC_BYPASS             20
#define PACKET_FANOUT_FLAG_ROLLOVER     0x1000
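
A slightly safer form of the same workaround, as a sketch only, guards each
definition so it applies only when the installed headers lack it. The
values are the ones quoted above. Defining a constant does not make an
older running kernel support the option, so the corresponding setsockopt()
call can still fail there.

```c
#ifdef __linux__
#include <linux/if_packet.h>
#endif

/* Fallbacks for build hosts whose if_packet.h predates these options. */
#ifndef PACKET_QDISC_BYPASS
#define PACKET_QDISC_BYPASS 20
#endif

#ifndef PACKET_FANOUT_FLAG_ROLLOVER
#define PACKET_FANOUT_FLAG_ROLLOVER 0x1000
#endif
```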

> -----Original Message-----
> From: John W. Linville [mailto:linville@tuxdriver.com]
> Sent: Saturday, July 12, 2014 2:47 AM
> To: Zhou, Danny
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> AF_PACKET-based virtual devices
> 
> Not sure what the issue might be, PACKET_FANOUT_FLAG_ROLLOVER is defined
> in include/uapi/linux/if_packet.h in the v3.12 tree.
> 
> On Fri, Jul 11, 2014 at 06:01:27PM +0000, Zhou, Danny wrote:
> > Tried on 3.12, both of them are undefined. Anyway, will comment them out and see
> what performance it could achieve.
> >
> > > -----Original Message-----
> > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > Sent: Saturday, July 12, 2014 1:41 AM
> > > To: Zhou, Danny
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > > AF_PACKET-based virtual devices
> > >
> > > On Fri, Jul 11, 2014 at 05:20:42PM +0000, Zhou, Danny wrote:
> > > > Looks like you used a pretty new kernel version with new socket
> > > > options that old
> > > kernel like my 3.12 does not support. When I tried this patch, it
> > > just cannot build, and compiler complains like below. Which Linux distribution
> does this patch work for?
> > > How to ensure it works for old kernels?
> > > >
> > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:
> > > > In function
> > > rte_pmd_init_internals:
> > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:5
> > > > 24:1
> > > > 7: error: PACKET_FANOUT_FLAG_ROLLOVER undeclared (first use in
> > > > this
> > > > function)
> > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:5
> > > > 24:1
> > > > 7: note: each undeclared identifier is reported only once for each
> > > > function it appears in
> > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:5
> > > > 57:3
> > > > 3: error: PACKET_QDISC_BYPASS undeclared (first use in this
> > > > function)
> > >
> > > Both of them are isolated, so for playing with it you could just comment those
> out.
> > > It looks like PACKET_FANOUT_FLAG_ROLLOVER should have been in 3.10,
> > > while PACKET_QDISC_BYPASS didn't show-up until 3.14...
> > >
> > > /home/linville/git/linux
> > > [linville-x1.hq.tuxdriver.com]:> git annotate
> > > include/uapi/linux/if_packet.h | grep PACKET_FANOUT_FLAG_ROLLOVER
> > > 77f65ebdca506	(Willem de Bruijn	2013-03-19 10:18:11 +0000	64)#define
> > > PACKET_FANOUT_FLAG_ROLLOVER	0x1000
> > >
> > > /home/linville/git/linux
> > > [linville-x1.hq.tuxdriver.com]:> git show -s --format=short
> > > 77f65ebdca506 commit 77f65ebdca506870d99bfabe52bde222511022ec
> > > Author: Willem de Bruijn <willemb@google.com>
> > >
> > >     packet: packet fanout rollover during socket overload
> > >
> > > /home/linville/git/linux
> > > [linville-x1.hq.tuxdriver.com]:> git describe --contains
> > > 77f65ebdca506
> > > v3.10-rc1~66^2~423
> > >
> > > /home/linville/git/linux
> > > [linville-x1.hq.tuxdriver.com]:> git annotate
> > > include/uapi/linux/if_packet.h | grep PACKET_QDISC_BYPASS
> > > d346a3fae3ff1	(Daniel Borkmann	2013-12-06 11:36:17 +0100	56)#define
> > > PACKET_QDISC_BYPASS		20
> > >
> > > /home/linville/git/linux
> > > [linville-x1.hq.tuxdriver.com]:> git show -s --format=short
> > > d346a3fae3ff1 commit
> > > d346a3fae3ff1d99f5d0c819bf86edf9094a26a1
> > > Author: Daniel Borkmann <dborkman@redhat.com>
> > >
> > >     packet: introduce PACKET_QDISC_BYPASS socket option
> > >
> > > /home/linville/git/linux
> > > [linville-x1.hq.tuxdriver.com]:> git describe --contains
> > > d346a3fae3ff1
> > > v3.14-rc1~94^2~564
> > >
> > > Is there an example of code in DPDK that requires specific kernel
> > > versions?  What is the preferred method for coding such dependencies?
> > >
> > > John
> > > --
> > > John W. Linville		Someday the world will need a hero, and you
> > > linville@tuxdriver.com			might be all we have.  Be ready.
> >
> 
> --
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 16:47         ` Thomas Monjalon
  2014-07-11 17:38           ` Richardson, Bruce
@ 2014-07-12 11:48           ` Neil Horman
  1 sibling, 0 replies; 76+ messages in thread
From: Neil Horman @ 2014-07-12 11:48 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Fri, Jul 11, 2014 at 06:47:47PM +0200, Thomas Monjalon wrote:
> 2014-07-11 11:30, John W. Linville:
> > On Fri, Jul 11, 2014 at 05:04:04PM +0200, Thomas Monjalon wrote:
> > > 2014-07-11 10:51, John W. Linville:
> > > > On Fri, Jul 11, 2014 at 03:26:39PM +0200, Thomas Monjalon wrote:
> > > > > Thank you for this nice work.
> > > > > 
> > > > > I think it would be well suited to host this PMD as an external one in
> > > > > order to make it work also with DPDK 1.7.0.
> > > > 
> > > > I'm not sure I understand the suggestion -- you don't want to merge
> > > > the driver for 1.8?  Or you just want to host this patch somewhere,
> > > > so people can still use it w/ 1.7?
> > > 
> > > I suggest to have a separated repository here:
> > > 	http://dpdk.org/browse/
> > 
> > I really don't see any reason not to merge it.  It was already delayed
> > by me waiting for all the PMD init changes to settle out in the 1.6
> > release, and I still had to do a few touch-ups for it to compile on
> > 1.7.  I definitely do not want to have to do that over and over again.
> 
> It's a pity that we didn't synchronize our efforts to make it integrated 
> during 1.7.0 cycle.
> 
> > Why wouldn't you just merge it?  If someone wants to use it on 1.7,
> > they can just apply the patch.
> 
> I'm OK to merge it. I was only suggesting to host your PMD externally like we 
> did for virtio-net-pmd, vmxnet3-usermap and memnic.
> It was the same discussion for the vmxnet3 PMD that Stephen submitted.
> 
> I start thinking that nobody wants PMD to be external. So we may merge this 
> one in dpdk.git and start talking what to do for the other ones:
> 	- move memnic in dpdk.git?
> 	- move virtio-net-pmd and vmxnet3-usermap where sits their uio 
> counterparts?
> 	- merge Brocade's vmxnet3 as new one or as a replacement for vmxnet3-uio?
> 
Yes!  Please.  This is what I suggested a few months ago.  Managing PMDs in
separate trees just leads to more difficult driver updates when core changes are
made that affect all PMDs.  Please merge them all into the dpdk tree.

Neil

> -- 
> Thomas
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-12  0:42           ` Zhou, Danny
@ 2014-07-14 13:45             ` John W. Linville
  0 siblings, 0 replies; 76+ messages in thread
From: John W. Linville @ 2014-07-14 13:45 UTC (permalink / raw)
  To: Zhou, Danny; +Cc: dev

On Sat, Jul 12, 2014 at 12:42:04AM +0000, Zhou, Danny wrote:
> I just upgraded my kernel to 3.15.5 and, as a workaround, hardcoded the definitions below (copied from include/uapi/linux/if_packet.h) into librte_pmd_packet.c. I can now receive and transmit packets. Commenting out PACKET_FANOUT_FLAG_ROLLOVER results in no packets being received.
> 
> #define PACKET_QDISC_BYPASS             20
> #define PACKET_FANOUT_FLAG_ROLLOVER     0x1000

You shouldn't need PACKET_FANOUT_FLAG_ROLLOVER if all the queues are
being used.  Does the application you are running make use of all
the queues?  If not, you probably should use the qpairs option to
limit the number of queues created by the eth_packet PMD.
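
To make the role of the flag concrete, here is a sketch of how a
PACKET_FANOUT option value is composed, following the layout described in
packet(7): group ID in the low 16 bits, mode and flags in the high 16 bits.
make_fanout_arg is a hypothetical helper and is not taken from the patch.

```c
#include <stdint.h>
#ifdef __linux__
#include <linux/if_packet.h>
#endif

/* Fallbacks so the sketch also builds against older headers. */
#ifndef PACKET_FANOUT_HASH
#define PACKET_FANOUT_HASH 0
#endif
#ifndef PACKET_FANOUT_FLAG_ROLLOVER
#define PACKET_FANOUT_FLAG_ROLLOVER 0x1000
#endif

/*
 * Hypothetical helper: build the integer passed to
 * setsockopt(fd, SOL_PACKET, PACKET_FANOUT, ...).
 * With use_rollover set, a packet hashed to a socket whose ring is
 * full may be handed to another socket in the group instead.
 */
static uint32_t make_fanout_arg(uint16_t group_id, int use_rollover)
{
    uint32_t type = PACKET_FANOUT_HASH;

    if (use_rollover)
        type |= PACKET_FANOUT_FLAG_ROLLOVER;

    return (uint32_t)group_id | (type << 16);
}
```

Without rollover, flows hashed to a queue that the application never polls
would simply be lost once that queue's ring fills, which is why matching
qpairs to the number of queues actually serviced matters.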

John

> 
> > -----Original Message-----
> > From: John W. Linville [mailto:linville@tuxdriver.com]
> > Sent: Saturday, July 12, 2014 2:47 AM
> > To: Zhou, Danny
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > AF_PACKET-based virtual devices
> > 
> > Not sure what the issue might be, PACKET_FANOUT_FLAG_ROLLOVER is defined
> > in include/uapi/linux/if_packet.h in the v3.12 tree.
> > 
> > On Fri, Jul 11, 2014 at 06:01:27PM +0000, Zhou, Danny wrote:
> > > Tried on 3.12, both of them are undefined. Anyway, will comment them out and see
> > > what performance it could achieve.
> > >
> > > > -----Original Message-----
> > > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > > Sent: Saturday, July 12, 2014 1:41 AM
> > > > To: Zhou, Danny
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for
> > > > AF_PACKET-based virtual devices
> > > >
> > > > On Fri, Jul 11, 2014 at 05:20:42PM +0000, Zhou, Danny wrote:
> > > > > Looks like you used a pretty new kernel version with new socket options that old
> > > > > kernel like my 3.12 does not support. When I tried this patch, it just cannot build,
> > > > > and compiler complains like below. Which Linux distribution does this patch work for?
> > > > > How to ensure it works for old kernels?
> > > > >
> > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c: In function rte_pmd_init_internals:
> > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:524:17: error: PACKET_FANOUT_FLAG_ROLLOVER undeclared (first use in this function)
> > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:524:17: note: each undeclared identifier is reported only once for each function it appears in
> > > > > /home/danny/dpdk.org/dpdk/lib/librte_pmd_packet/rte_eth_packet.c:557:33: error: PACKET_QDISC_BYPASS undeclared (first use in this function)
> > > >
> > > > Both of them are isolated, so for playing with it you could just comment those
> > > > out.
> > > > It looks like PACKET_FANOUT_FLAG_ROLLOVER should have been in 3.10,
> > > > while PACKET_QDISC_BYPASS didn't show-up until 3.14...
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git annotate include/uapi/linux/if_packet.h | grep PACKET_FANOUT_FLAG_ROLLOVER
> > > > 77f65ebdca506	(Willem de Bruijn	2013-03-19 10:18:11 +0000	64)#define PACKET_FANOUT_FLAG_ROLLOVER	0x1000
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git show -s --format=short 77f65ebdca506
> > > > commit 77f65ebdca506870d99bfabe52bde222511022ec
> > > > Author: Willem de Bruijn <willemb@google.com>
> > > >
> > > >     packet: packet fanout rollover during socket overload
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git describe --contains 77f65ebdca506
> > > > v3.10-rc1~66^2~423
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git annotate include/uapi/linux/if_packet.h | grep PACKET_QDISC_BYPASS
> > > > d346a3fae3ff1	(Daniel Borkmann	2013-12-06 11:36:17 +0100	56)#define PACKET_QDISC_BYPASS		20
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git show -s --format=short d346a3fae3ff1
> > > > commit d346a3fae3ff1d99f5d0c819bf86edf9094a26a1
> > > > Author: Daniel Borkmann <dborkman@redhat.com>
> > > >
> > > >     packet: introduce PACKET_QDISC_BYPASS socket option
> > > >
> > > > /home/linville/git/linux
> > > > [linville-x1.hq.tuxdriver.com]:> git describe --contains d346a3fae3ff1
> > > > v3.14-rc1~94^2~564
> > > >
> > > > Is there an example of code in DPDK that requires specific kernel
> > > > versions?  What is the preferred method for coding such dependencies?
> > > >
> > > > John
> > > > --
> > > > John W. Linville		Someday the world will need a hero, and you
> > > > linville@tuxdriver.com			might be all we have.  Be ready.
> > >
> > 
> > --
> > John W. Linville		Someday the world will need a hero, and you
> > linville@tuxdriver.com			might be all we have.  Be ready.
> 

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 22:34       ` Thomas Monjalon
@ 2014-07-14 13:46         ` John W. Linville
  2014-07-15 21:27           ` Thomas Monjalon
  0 siblings, 1 reply; 76+ messages in thread
From: John W. Linville @ 2014-07-14 13:46 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Sat, Jul 12, 2014 at 12:34:46AM +0200, Thomas Monjalon wrote:
> 2014-07-11 13:40, John W. Linville:
> > Is there an example of code in DPDK that requires specific kernel
> > versions?  What is the preferred method for coding such dependencies?
> 
> No there is no userspace code checking kernel version in DPDK.
> Feel free to use what you think the best method.
> Please keep in mind that checking version number is a maintenance nightmare 
> because of backports (like RedHat do ;).

I suppose that it could be a configuration option?
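
An alternative to both version checks and a configuration option would be
to test for the socket options themselves, since both are preprocessor
macros in linux/if_packet.h.  The sketch below only illustrates that idea:
try_qdisc_bypass() is a hypothetical helper that is not in the patch, and
the ENOPROTOOPT handling assumes that is what an older running kernel
reports for an option it does not know about.

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper (not part of the patch): ask for qdisc bypass only
 * when the build-time headers define the option.
 *
 * Returns 1 if the option was enabled, 0 if it is unavailable (missing
 * from the headers or rejected by the running kernel), -1 on other errors.
 */
static int
try_qdisc_bypass(int sockfd)
{
#if defined(PACKET_QDISC_BYPASS)
	int on = 1;

	if (setsockopt(sockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
		       &on, sizeof(on)) == 0)
		return 1;

	/* Assumption: headers are newer than the running kernel. */
	if (errno == ENOPROTOOPT)
		return 0;

	return -1;
#else
	/* Built against headers that predate the option. */
	(void)sockfd;
	return 0;
#endif
}
```

The same #if defined() guard could wrap the use of
PACKET_FANOUT_FLAG_ROLLOVER.  Because the test is on the macro rather than
on a version number, it should also behave sensibly on distribution kernels
that backport the feature.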

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 22:51 ` Bruce Richardson
@ 2014-07-14 13:48   ` John W. Linville
  2014-07-14 17:35     ` John W. Linville
  0 siblings, 1 reply; 76+ messages in thread
From: John W. Linville @ 2014-07-14 13:48 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Fri, Jul 11, 2014 at 11:51:08PM +0100, Bruce Richardson wrote:
> On Thu, Jul 10, 2014 at 04:32:49PM -0400, John W. Linville wrote:
> > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > AF_PACKET is used for frame reception.  In the current implementation,
> > Tx and Rx queues are always paired, and therefore are always equal
> > in number -- changing this would be a Simple Matter Of Programming.
> > 
> > Interfaces of this type are created with a command line option like
> > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > as arguments:
> > 
> >  - Interface is chosen by "iface" (required)
> >  - Number of queue pairs set by "qpairs" (optional, default: 16)
> >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > 
> > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > ---
> > This PMD is intended to provide a means for using DPDK on a broad
> > range of hardware without hardware-specific PMDs and (hopefully)
> > with better performance than what PCAP offers in Linux.  This might
> > be useful as a development platform for DPDK applications when
> > DPDK-supported hardware is expensive or unavailable.
> > 
> Hi John,
> 
> I'm just trying this out now on a Fedora 20 machine, using kernel 3.14.9-200.fc20.x86_64. However, while the first packet PMD port initializes correctly, the subsequent ones do not. Please see output from my test run below. All four ports are of the same type.

Thanks I'll check into it.  I'm not sure why you would only be able
to set the fanout on the first port...

> 
> Regards,
> /Bruce
> 
> bruce@silpixa00372841:dpdk.org$ sudo ./x86_64-native-linuxapp-gcc/app/testpmd -c 600 -n 4 --vdev=eth_packet0,iface=eth0,qpairs=1 --vdev=eth_packet1,iface=eth1,qpairs=1 --vdev=eth_packet2,iface=p802p1,qpairs=1 --vdev=eth_packet3,iface=p9p3,qpairs=1 -- --mbcache=250 --burst=32 --total-num-mbufs=65536
> EAL: Detected lcore 0 as core 0 on socket 0
> EAL: Detected lcore 1 as core 1 on socket 0
> EAL: Detected lcore 2 as core 2 on socket 0
> EAL: Detected lcore 3 as core 3 on socket 0
> EAL: Detected lcore 4 as core 4 on socket 0
> EAL: Detected lcore 5 as core 5 on socket 0
> EAL: Detected lcore 6 as core 6 on socket 0
> EAL: Detected lcore 7 as core 7 on socket 0
> EAL: Detected lcore 8 as core 0 on socket 1
> EAL: Detected lcore 9 as core 1 on socket 1
> EAL: Detected lcore 10 as core 2 on socket 1
> EAL: Detected lcore 11 as core 3 on socket 1
> EAL: Detected lcore 12 as core 4 on socket 1
> EAL: Detected lcore 13 as core 5 on socket 1
> EAL: Detected lcore 14 as core 6 on socket 1
> EAL: Detected lcore 15 as core 7 on socket 1
> EAL: Detected lcore 16 as core 0 on socket 0
> EAL: Detected lcore 17 as core 1 on socket 0
> EAL: Detected lcore 18 as core 2 on socket 0
> EAL: Detected lcore 19 as core 3 on socket 0
> EAL: Detected lcore 20 as core 4 on socket 0
> EAL: Detected lcore 21 as core 5 on socket 0
> EAL: Detected lcore 22 as core 6 on socket 0
> EAL: Detected lcore 23 as core 7 on socket 0
> EAL: Detected lcore 24 as core 0 on socket 1
> EAL: Detected lcore 25 as core 1 on socket 1
> EAL: Detected lcore 26 as core 2 on socket 1
> EAL: Detected lcore 27 as core 3 on socket 1
> EAL: Detected lcore 28 as core 4 on socket 1
> EAL: Detected lcore 29 as core 5 on socket 1
> EAL: Detected lcore 30 as core 6 on socket 1
> EAL: Detected lcore 31 as core 7 on socket 1
> EAL: Support maximum 64 logical core(s) by configuration.
> EAL: Detected 32 lcore(s)
> EAL: No free hugepages reported in hugepages-2048kB
> EAL:   unsupported IOMMU type!
> EAL: VFIO support could not be initialized
> EAL: Setting up memory...
> EAL: Ask a virtual area of 0x80000000 bytes
> EAL: Virtual area found at 0x7f9fc0000000 (size = 0x80000000)
> EAL: Ask a virtual area of 0x80000000 bytes
> EAL: Virtual area found at 0x7f9f00000000 (size = 0x80000000)
> EAL: Requesting 2 pages of size 1024MB from socket 0
> EAL: Requesting 2 pages of size 1024MB from socket 1
> EAL: TSC frequency is ~2693512 KHz
> EAL: Master core 9 is ready (tid=f4511880)
> init (0) eth_packet0
> PMD: Initializing pmd_packet for eth_packet0
> PMD: eth_packet0: AF_PACKET MMAP parameters:
> PMD: eth_packet0:	block size 4096
> PMD: eth_packet0:	block count 256
> PMD: eth_packet0:	frame size 2048
> PMD: eth_packet0:	frame count 512
> PMD: eth_packet0: creating AF_PACKET-backed ethdev on numa socket 1
> init (0) eth_packet1
> PMD: Initializing pmd_packet for eth_packet1
> PMD: eth_packet1: AF_PACKET MMAP parameters:
> PMD: eth_packet1:	block size 4096
> PMD: eth_packet1:	block count 256
> PMD: eth_packet1:	frame size 2048
> PMD: eth_packet1:	frame count 512
> PMD: eth_packet1: creating AF_PACKET-backed ethdev on numa socket 1
> PMD: eth_packet1: could not set PACKET_FANOUT on AF_PACKET socket for eth1
> init (0) eth_packet2
> PMD: Initializing pmd_packet for eth_packet2
> PMD: eth_packet2: AF_PACKET MMAP parameters:
> PMD: eth_packet2:	block size 4096
> PMD: eth_packet2:	block count 256
> PMD: eth_packet2:	frame size 2048
> PMD: eth_packet2:	frame count 512
> PMD: eth_packet2: creating AF_PACKET-backed ethdev on numa socket 1
> PMD: eth_packet2: could not set PACKET_FANOUT on AF_PACKET socket for p802p1
> init (0) eth_packet3
> PMD: Initializing pmd_packet for eth_packet3
> PMD: eth_packet3: AF_PACKET MMAP parameters:
> PMD: eth_packet3:	block size 4096
> PMD: eth_packet3:	block count 256
> PMD: eth_packet3:	frame size 2048
> PMD: eth_packet3:	frame count 512
> PMD: eth_packet3: creating AF_PACKET-backed ethdev on numa socket 1
> PMD: eth_packet3: could not set PACKET_FANOUT on AF_PACKET socket for p9p3
> EAL: Core 10 is ready (tid=f34d9700)
> EAL: PCI device 0000:04:00.0 on NUMA socket 0
> EAL:   probe driver: 8086:1521 rte_igb_pmd
> EAL:   0000:04:00.0 not managed by UIO driver, skipping
> EAL: PCI device 0000:04:00.1 on NUMA socket 0
> EAL:   probe driver: 8086:1521 rte_igb_pmd
> EAL:   0000:04:00.1 not managed by UIO driver, skipping
> EAL: PCI device 0000:04:00.2 on NUMA socket 0
> EAL:   probe driver: 8086:1521 rte_igb_pmd
> EAL:   0000:04:00.2 not managed by UIO driver, skipping
> EAL: PCI device 0000:04:00.3 on NUMA socket 0
> EAL:   probe driver: 8086:1521 rte_igb_pmd
> EAL:   0000:04:00.3 not managed by UIO driver, skipping
> EAL: PCI device 0000:0c:00.0 on NUMA socket 0
> EAL:   probe driver: 8086:10fb rte_ixgbe_pmd
> EAL:   0000:0c:00.0 not managed by UIO driver, skipping
> EAL: PCI device 0000:0c:00.1 on NUMA socket 0
> EAL:   probe driver: 8086:10fb rte_ixgbe_pmd
> EAL:   0000:0c:00.1 not managed by UIO driver, skipping
> EAL: PCI device 0000:84:00.0 on NUMA socket 1
> EAL:   probe driver: 8086:154a rte_ixgbe_pmd
> EAL:   0000:84:00.0 not managed by UIO driver, skipping
> EAL: PCI device 0000:84:00.1 on NUMA socket 1
> EAL:   probe driver: 8086:154a rte_ixgbe_pmd
> EAL:   0000:84:00.1 not managed by UIO driver, skipping
> EAL: PCI device 0000:87:00.0 on NUMA socket 1
> EAL:   probe driver: 8086:154a rte_ixgbe_pmd
> EAL:   0000:87:00.0 not managed by UIO driver, skipping
> EAL: PCI device 0000:87:00.1 on NUMA socket 1
> EAL:   probe driver: 8086:154a rte_ixgbe_pmd
> EAL:   0000:87:00.1 not managed by UIO driver, skipping
> EAL: PCI device 0000:8b:00.0 on NUMA socket 1
> EAL:   probe driver: 8086:154a rte_ixgbe_pmd
> EAL:   0000:8b:00.0 not managed by UIO driver, skipping
> EAL: PCI device 0000:8b:00.1 on NUMA socket 1
> EAL:   probe driver: 8086:154a rte_ixgbe_pmd
> EAL:   0000:8b:00.1 not managed by UIO driver, skipping
> EAL: PCI device 0000:8e:00.0 on NUMA socket 1
> EAL:   probe driver: 8086:154a rte_ixgbe_pmd
> EAL:   0000:8e:00.0 not managed by UIO driver, skipping
> EAL: PCI device 0000:8e:00.1 on NUMA socket 1
> EAL:   probe driver: 8086:154a rte_ixgbe_pmd
> EAL:   0000:8e:00.1 not managed by UIO driver, skipping
> Configuring Port 0 (socket 0)
> Port 0: 68:05:CA:19:F0:50
> Checking link statuses...
> Port 0 Link Up - speed 10000 Mbps - full-duplex
> Done
> No commandline core given, start packet forwarding
> 
> Warning! Cannot handle an odd number of ports with the current port topology. Configuration must be changed to have an even number of ports, or relaunch application with --port-topology=chained
> 
>   io packet forwarding - CRC stripping disabled - packets/burst=32
>   nb forwarding cores=1 - nb forwarding ports=1
>   RX queues=1 - RX desc=128 - RX free threshold=0
>   RX threshold registers: pthresh=8 hthresh=8 wthresh=0
>   TX queues=1 - TX desc=512 - TX free threshold=0
>   TX threshold registers: pthresh=32 hthresh=0 wthresh=0
>   TX RS bit threshold=0 - TXQ flags=0x0
> Press enter to exit
> 
>  
> 

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-14 13:48   ` John W. Linville
@ 2014-07-14 17:35     ` John W. Linville
  0 siblings, 0 replies; 76+ messages in thread
From: John W. Linville @ 2014-07-14 17:35 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Mon, Jul 14, 2014 at 09:48:33AM -0400, John W. Linville wrote:
> On Fri, Jul 11, 2014 at 11:51:08PM +0100, Bruce Richardson wrote:
> > On Thu, Jul 10, 2014 at 04:32:49PM -0400, John W. Linville wrote:
> > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET

> > I'm just trying this out now on a Fedora 20 machine, using kernel
> > 3.14.9-200.fc20.x86_64. However, while the first packet PMD port
> > initializes correctly, the subsequent ones do not. Please see output
> > from my test run below. All four ports are of the same type.
> 
> Thanks I'll check into it.  I'm not sure why you would only be able
> to set the fanout on the first port...

It looks like I lost a patch in the prep for posting.  As a result, I
was using the same fanout group ID between different interfaces... :-(
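
For anyone following along, here is a rough sketch of how a per-interface
group ID can be folded into the PACKET_FANOUT option value.  As described in
packet(7), the low 16 bits of the integer carry the group ID and the high 16
bits carry the fanout type and flags.  Both function names are hypothetical,
and mixing the process ID with if_index by XOR is only one possible choice,
not necessarily what V2 will do.

```c
#include <stdint.h>

/*
 * Hypothetical helper: derive a 16-bit fanout group ID that differs
 * between interfaces opened by the same process.
 */
static uint16_t
fanout_group_for(int pid, int if_index)
{
	return (uint16_t)((pid ^ if_index) & 0xffff);
}

/*
 * Hypothetical helper: build the integer passed to
 * setsockopt(fd, SOL_PACKET, PACKET_FANOUT, ...).
 * Low 16 bits: group ID.  High 16 bits: fanout type and flags.
 */
static uint32_t
make_fanout_arg(uint16_t group_id, uint16_t type_and_flags)
{
	return (uint32_t)group_id | ((uint32_t)type_and_flags << 16);
}
```

A caller would pass something like PACKET_FANOUT_HASH, optionally ORed with
PACKET_FANOUT_FLAG_ROLLOVER, as type_and_flags.  With one group ID reused
for every interface, the sockets on the second and later ports end up asking
to join a group that already belongs to the first port, which would explain
the "could not set PACKET_FANOUT" messages in Bruce's log.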

I'll post a V2 in a bit...thanks!

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-11 22:30 ` Thomas Monjalon
@ 2014-07-14 17:53   ` John W. Linville
  0 siblings, 0 replies; 76+ messages in thread
From: John W. Linville @ 2014-07-14 17:53 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Sat, Jul 12, 2014 at 12:30:34AM +0200, Thomas Monjalon wrote:
> About the form of the patch, I have 2 comments:
> 
> 1) A doc explaining the design, the dependencies and how it can be used would 
> be a great help. Could you write it in rst format?

What is rst format?  Are there other examples in the repository?

> 2) checkpatch.pl returns these errors:
> 
> ERROR:SPACING: space required before the open parenthesis '('
> #468: FILE: lib/librte_pmd_packet/rte_eth_packet.c:250:
> +               if(sockfd != -1)
> 
> ERROR:SPACING: space required before the open parenthesis '('
> #471: FILE: lib/librte_pmd_packet/rte_eth_packet.c:253:
> +               if(sockfd != -1)
> 
> ERROR:SPACING: spaces required around that '=' (ctx:VxV)
> #712: FILE: lib/librte_pmd_packet/rte_eth_packet.c:494:
> +               ifr.ifr_name[ifnamelen]='\0';

OK.  FWIW, at least the first two are slightly changed from what was
copied from the PCAP driver.  The other probably was a cut-n-paste
error from another source.

I'll post a V2 shortly...

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-10 20:32 [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices John W. Linville
                   ` (4 preceding siblings ...)
  2014-07-11 22:51 ` Bruce Richardson
@ 2014-07-14 18:24 ` John W. Linville
  2014-07-15  0:15   ` Zhou, Danny
  2014-09-12 18:05   ` John W. Linville
  5 siblings, 2 replies; 76+ messages in thread
From: John W. Linville @ 2014-07-14 18:24 UTC (permalink / raw)
  To: dev

This is a Linux-specific virtual PMD driver backed by an AF_PACKET
socket.  This implementation uses mmap'ed ring buffers to limit copying
and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
AF_PACKET is used for frame reception.  In the current implementation,
Tx and Rx queues are always paired, and therefore are always equal
in number -- changing this would be a Simple Matter Of Programming.

Interfaces of this type are created with a command line option like
"--vdev=eth_packet0,iface=...".  There are a number of options available
as arguments:

 - Interface is chosen by "iface" (required)
 - Number of queue pairs set by "qpairs" (optional, default: 1)
 - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
 - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
 - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)

Signed-off-by: John W. Linville <linville@tuxdriver.com>
---
This PMD is intended to provide a means for using DPDK on a broad
range of hardware without hardware-specific PMDs and (hopefully)
with better performance than what PCAP offers in Linux.  This might
be useful as a development platform for DPDK applications when
DPDK-supported hardware is expensive or unavailable.

New in v2:

-- fixup some style issues found by checkpatch
-- use if_index as part of fanout group ID
-- set default number of queue pairs to 1
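
As a note on how the ring options relate: with the defaults, the PMD
reports a block count of 256, i.e. 512 frames at two 2048-byte frames per
4096-byte block.  The sketch below shows that arithmetic filling a
struct tpacket_req.  fill_ring_req() is a hypothetical name, and the
validity checks are a simplification of what the kernel enforces (it also
has page-size and alignment requirements that are not checked here).

```c
#include <linux/if_packet.h>

/*
 * Hypothetical helper: derive the AF_PACKET ring geometry from the
 * blocksz/framesz/framecnt options.  Returns 0 on success, -1 if the
 * values cannot describe a whole number of blocks.
 */
static int
fill_ring_req(struct tpacket_req *req, unsigned int blocksz,
	      unsigned int framesz, unsigned int framecnt)
{
	unsigned int frames_per_block;

	/* A block must hold at least one whole frame. */
	if (framesz == 0 || blocksz < framesz)
		return -1;

	frames_per_block = blocksz / framesz;

	/* The frame count must fill an integral number of blocks. */
	if (framecnt == 0 || framecnt % frames_per_block != 0)
		return -1;

	req->tp_block_size = blocksz;
	req->tp_frame_size = framesz;
	req->tp_frame_nr = framecnt;
	req->tp_block_nr = framecnt / frames_per_block;
	return 0;
}
```

With blocksz=4096, framesz=2048 and framecnt=512 this gives tp_block_nr =
256, matching the "block count 256" line that the PMD prints at
initialization.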

 config/common_bsdapp                   |   5 +
 config/common_linuxapp                 |   5 +
 lib/Makefile                           |   1 +
 lib/librte_eal/linuxapp/eal/Makefile   |   1 +
 lib/librte_pmd_packet/Makefile         |  60 +++
 lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
 lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
 mk/rte.app.mk                          |   4 +
 8 files changed, 957 insertions(+)
 create mode 100644 lib/librte_pmd_packet/Makefile
 create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
 create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h

diff --git a/config/common_bsdapp b/config/common_bsdapp
index 943dce8f1ede..c317f031278e 100644
--- a/config/common_bsdapp
+++ b/config/common_bsdapp
@@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
 CONFIG_RTE_LIBRTE_PMD_BOND=y
 
 #
+# Compile software PMD backed by AF_PACKET sockets (Linux only)
+#
+CONFIG_RTE_LIBRTE_PMD_PACKET=n
+
+#
 # Do prefetch of packet data within PMD driver receive function
 #
 CONFIG_RTE_PMD_PACKET_PREFETCH=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 7bf5d80d4e26..f9e7bc3015ec 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
 CONFIG_RTE_LIBRTE_PMD_BOND=y
 
 #
+# Compile software PMD backed by AF_PACKET sockets (Linux only)
+#
+CONFIG_RTE_LIBRTE_PMD_PACKET=y
+
+#
 # Compile Xen PMD
 #
 CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
diff --git a/lib/Makefile b/lib/Makefile
index 10c5bb3045bc..930fadf29898 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
 DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
 DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 756d6b0c9301..feed24a63272 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
 CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
+CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
 CFLAGS += $(WERROR_FLAGS) -O3
 
diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
new file mode 100644
index 000000000000..e1266fb992cd
--- /dev/null
+++ b/lib/librte_pmd_packet/Makefile
@@ -0,0 +1,60 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
+#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+#   Copyright(c) 2014 6WIND S.A.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_pmd_packet.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
+
+#
+# Export include files
+#
+SYMLINK-y-include += rte_eth_packet.h
+
+# this lib depends upon:
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
new file mode 100644
index 000000000000..9c82d16e730f
--- /dev/null
+++ b/lib/librte_pmd_packet/rte_eth_packet.c
@@ -0,0 +1,826 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
+ *
+ *   Originally based upon librte_pmd_pcap code:
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2014 6WIND S.A.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_mbuf.h>
+#include <rte_ethdev.h>
+#include <rte_malloc.h>
+#include <rte_kvargs.h>
+#include <rte_dev.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <poll.h>
+
+#include "rte_eth_packet.h"
+
+#define ETH_PACKET_IFACE_ARG		"iface"
+#define ETH_PACKET_NUM_Q_ARG		"qpairs"
+#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
+#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
+#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
+
+#define DFLT_BLOCK_SIZE		(1 << 12)
+#define DFLT_FRAME_SIZE		(1 << 11)
+#define DFLT_FRAME_COUNT	(1 << 9)
+
+struct pkt_rx_queue {
+	int sockfd;
+
+	struct iovec *rd;
+	uint8_t *map;
+	unsigned int framecount;
+	unsigned int framenum;
+
+	struct rte_mempool *mb_pool;
+
+	volatile unsigned long rx_pkts;
+	volatile unsigned long err_pkts;
+};
+
+struct pkt_tx_queue {
+	int sockfd;
+
+	struct iovec *rd;
+	uint8_t *map;
+	unsigned int framecount;
+	unsigned int framenum;
+
+	volatile unsigned long tx_pkts;
+	volatile unsigned long err_pkts;
+};
+
+struct pmd_internals {
+	unsigned nb_queues;
+
+	int if_index;
+	struct ether_addr eth_addr;
+
+	struct tpacket_req req;
+
+	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
+	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
+};
+
+static const char *valid_arguments[] = {
+	ETH_PACKET_IFACE_ARG,
+	ETH_PACKET_NUM_Q_ARG,
+	ETH_PACKET_BLOCKSIZE_ARG,
+	ETH_PACKET_FRAMESIZE_ARG,
+	ETH_PACKET_FRAMECOUNT_ARG,
+	NULL
+};
+
+static const char *drivername = "AF_PACKET PMD";
+
+static struct rte_eth_link pmd_link = {
+	.link_speed = 10000,
+	.link_duplex = ETH_LINK_FULL_DUPLEX,
+	.link_status = 0
+};
+
+static uint16_t
+eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	unsigned i;
+	struct tpacket2_hdr *ppd;
+	struct rte_mbuf *mbuf;
+	uint8_t *pbuf;
+	struct pkt_rx_queue *pkt_q = queue;
+	uint16_t num_rx = 0;
+	unsigned int framecount, framenum;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	/*
+	 * Reads the given number of packets from the AF_PACKET socket one by
+	 * one and copies the packet data into a newly allocated mbuf.
+	 */
+	framecount = pkt_q->framecount;
+	framenum = pkt_q->framenum;
+	for (i = 0; i < nb_pkts; i++) {
+		/* point at the next incoming frame */
+		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+		if ((ppd->tp_status & TP_STATUS_USER) == 0)
+			break;
+
+		/* allocate the next mbuf */
+		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
+		if (unlikely(mbuf == NULL))
+			break;
+
+		/* copy the frame into the mbuf (assumes tp_snaplen fits) */
+		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
+		pbuf = (uint8_t *) ppd + ppd->tp_mac;
+		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
+
+		/* release incoming frame and advance ring buffer */
+		ppd->tp_status = TP_STATUS_KERNEL;
+		if (++framenum >= framecount)
+			framenum = 0;
+
+		/* account for the receive frame */
+		bufs[i] = mbuf;
+		num_rx++;
+	}
+	pkt_q->framenum = framenum;
+	pkt_q->rx_pkts += num_rx;
+	return num_rx;
+}
+
+/*
+ * Callback to handle sending packets through the AF_PACKET Tx ring.
+ */
+static uint16_t
+eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct tpacket2_hdr *ppd;
+	struct rte_mbuf *mbuf;
+	uint8_t *pbuf;
+	unsigned int framecount, framenum;
+	struct pollfd pfd;
+	struct pkt_tx_queue *pkt_q = queue;
+	uint16_t num_tx = 0;
+	int i;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	memset(&pfd, 0, sizeof(pfd));
+	pfd.fd = pkt_q->sockfd;
+	pfd.events = POLLOUT;
+	pfd.revents = 0;
+
+	framecount = pkt_q->framecount;
+	framenum = pkt_q->framenum;
+	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+	for (i = 0; i < nb_pkts; i++) {
+		/* wait until the next tx frame is available */
+		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
+		    (poll(&pfd, 1, -1) < 0))
+			continue;
+
+		/* copy the tx frame data */
+		mbuf = bufs[num_tx];
+		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
+			sizeof(struct sockaddr_ll);
+		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
+		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
+
+		/* hand the frame to the kernel and advance ring buffer */
+		ppd->tp_status = TP_STATUS_SEND_REQUEST;
+		if (++framenum >= framecount)
+			framenum = 0;
+		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+
+		num_tx++;
+		rte_pktmbuf_free(mbuf);
+	}
+
+	/* kick-off transmits */
+	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+
+	pkt_q->framenum = framenum;
+	pkt_q->tx_pkts += num_tx;
+	pkt_q->err_pkts += nb_pkts - num_tx;
+	return num_tx;
+}
+
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+	dev->data->dev_link.link_status = 1;
+	return 0;
+}
+
+/*
+ * This function gets called when the current port gets stopped.
+ */
+static void
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+	unsigned i;
+	int sockfd;
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	for (i = 0; i < internals->nb_queues; i++) {
+		sockfd = internals->rx_queue[i].sockfd;
+		if (sockfd != -1)
+			close(sockfd);
+		sockfd = internals->tx_queue[i].sockfd;
+		if (sockfd != -1)
+			close(sockfd);
+	}
+
+	dev->data->dev_link.link_status = 0;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev_info->driver_name = drivername;
+	dev_info->if_index = internals->if_index;
+	dev_info->max_mac_addrs = 1;
+	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
+	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
+	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
+	dev_info->min_rx_bufsize = 0;
+	dev_info->pci_dev = NULL;
+}
+
+static void
+eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
+{
+	unsigned i, imax;
+	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
+	const struct pmd_internals *internal = dev->data->dev_private;
+
+	memset(igb_stats, 0, sizeof(*igb_stats));
+
+	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
+	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
+	for (i = 0; i < imax; i++) {
+		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
+		rx_total += igb_stats->q_ipackets[i];
+	}
+
+	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
+	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
+	for (i = 0; i < imax; i++) {
+		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
+		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
+		tx_total += igb_stats->q_opackets[i];
+		tx_err_total += igb_stats->q_errors[i];
+	}
+
+	igb_stats->ipackets = rx_total;
+	igb_stats->opackets = tx_total;
+	igb_stats->oerrors = tx_err_total;
+}
+
+static void
+eth_stats_reset(struct rte_eth_dev *dev)
+{
+	unsigned i;
+	struct pmd_internals *internal = dev->data->dev_private;
+
+	for (i = 0; i < internal->nb_queues; i++)
+		internal->rx_queue[i].rx_pkts = 0;
+
+	for (i = 0; i < internal->nb_queues; i++) {
+		internal->tx_queue[i].tx_pkts = 0;
+		internal->tx_queue[i].err_pkts = 0;
+	}
+}
+
+static void
+eth_dev_close(struct rte_eth_dev *dev __rte_unused)
+{
+}
+
+static void
+eth_queue_release(void *q __rte_unused)
+{
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev __rte_unused,
+                int wait_to_complete __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev,
+                   uint16_t rx_queue_id,
+                   uint16_t nb_rx_desc __rte_unused,
+                   unsigned int socket_id __rte_unused,
+                   const struct rte_eth_rxconf *rx_conf __rte_unused,
+                   struct rte_mempool *mb_pool)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
+	struct rte_pktmbuf_pool_private *mbp_priv;
+	uint16_t buf_size;
+
+	pkt_q->mb_pool = mb_pool;
+
+	/* Now get the space available for data in the mbuf */
+	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
+	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
+	                       RTE_PKTMBUF_HEADROOM);
+
+	if (ETH_FRAME_LEN > buf_size) {
+		RTE_LOG(ERR, PMD,
+			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
+			dev->data->name, ETH_FRAME_LEN, buf_size);
+		return -ENOMEM;
+	}
+
+	dev->data->rx_queues[rx_queue_id] = pkt_q;
+
+	return 0;
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev,
+                   uint16_t tx_queue_id,
+                   uint16_t nb_tx_desc __rte_unused,
+                   unsigned int socket_id __rte_unused,
+                   const struct rte_eth_txconf *tx_conf __rte_unused)
+{
+
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
+	return 0;
+}
+
+static struct eth_dev_ops ops = {
+	.dev_start = eth_dev_start,
+	.dev_stop = eth_dev_stop,
+	.dev_close = eth_dev_close,
+	.dev_configure = eth_dev_configure,
+	.dev_infos_get = eth_dev_info,
+	.rx_queue_setup = eth_rx_queue_setup,
+	.tx_queue_setup = eth_tx_queue_setup,
+	.rx_queue_release = eth_queue_release,
+	.tx_queue_release = eth_queue_release,
+	.link_update = eth_link_update,
+	.stats_get = eth_stats_get,
+	.stats_reset = eth_stats_reset,
+};
+
+/*
+ * Opens an AF_PACKET socket
+ */
+static int
+open_packet_iface(const char *key __rte_unused,
+                  const char *value __rte_unused,
+                  void *extra_args)
+{
+	int *sockfd = extra_args;
+
+	/* Open an AF_PACKET socket... */
+	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+	if (*sockfd == -1) {
+		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+rte_pmd_init_internals(const char *name,
+                       const int sockfd,
+                       const unsigned nb_queues,
+                       unsigned int blocksize,
+                       unsigned int blockcnt,
+                       unsigned int framesize,
+                       unsigned int framecnt,
+                       const unsigned numa_node,
+                       struct pmd_internals **internals,
+                       struct rte_eth_dev **eth_dev,
+                       struct rte_kvargs *kvlist)
+{
+	struct rte_eth_dev_data *data = NULL;
+	struct rte_pci_device *pci_dev = NULL;
+	struct rte_kvargs_pair *pair = NULL;
+	struct ifreq ifr;
+	size_t ifnamelen;
+	unsigned k_idx;
+	struct sockaddr_ll sockaddr;
+	struct tpacket_req *req;
+	struct pkt_rx_queue *rx_queue;
+	struct pkt_tx_queue *tx_queue;
+	int rc, tpver, discard, bypass;
+	unsigned int i, q, rdsize;
+	int qsockfd, fanout_arg;
+
+	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
+		pair = &kvlist->pairs[k_idx];
+		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
+			break;
+	}
+	if (k_idx == kvlist->count) {
+		RTE_LOG(ERR, PMD,
+			"%s: no interface specified for AF_PACKET ethdev\n",
+		        name);
+		goto error;
+	}
+
+	RTE_LOG(INFO, PMD,
+		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
+		name, numa_node);
+
+	/*
+	 * now do all data allocation - for eth_dev structure, dummy pci driver
+	 * and internal (private) data
+	 */
+	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
+	if (data == NULL)
+		goto error;
+
+	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
+	if (pci_dev == NULL)
+		goto error;
+
+	*internals = rte_zmalloc_socket(name, sizeof(**internals),
+	                                0, numa_node);
+	if (*internals == NULL)
+		goto error;
+
+	req = &((*internals)->req);
+
+	req->tp_block_size = blocksize;
+	req->tp_block_nr = blockcnt;
+	req->tp_frame_size = framesize;
+	req->tp_frame_nr = framecnt;
+
+	ifnamelen = strlen(pair->value);
+	if (ifnamelen < sizeof(ifr.ifr_name)) {
+		memcpy(ifr.ifr_name, pair->value, ifnamelen);
+		ifr.ifr_name[ifnamelen] = '\0';
+	} else {
+		RTE_LOG(ERR, PMD,
+			"%s: I/F name too long (%s)\n",
+			name, pair->value);
+		goto error;
+	}
+	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
+		RTE_LOG(ERR, PMD,
+			"%s: ioctl failed (SIOCGIFINDEX)\n",
+		        name);
+		goto error;
+	}
+	(*internals)->if_index = ifr.ifr_ifindex;
+
+	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
+		RTE_LOG(ERR, PMD,
+			"%s: ioctl failed (SIOCGIFHWADDR)\n",
+		        name);
+		goto error;
+	}
+	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
+
+	memset(&sockaddr, 0, sizeof(sockaddr));
+	sockaddr.sll_family = AF_PACKET;
+	sockaddr.sll_protocol = htons(ETH_P_ALL);
+	sockaddr.sll_ifindex = (*internals)->if_index;
+
+	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
+	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
+	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
+
+	for (q = 0; q < nb_queues; q++) {
+		/* Open an AF_PACKET socket for this queue... */
+		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+		if (qsockfd == -1) {
+			RTE_LOG(ERR, PMD,
+			        "%s: could not open AF_PACKET socket\n",
+			        name);
+			goto error;
+		}
+
+		tpver = TPACKET_V2;
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
+				&tpver, sizeof(tpver));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_VERSION on AF_PACKET "
+				"socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		discard = 1;
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
+				&discard, sizeof(discard));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_LOSS on "
+			        "AF_PACKET socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		bypass = 1;
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
+				&bypass, sizeof(bypass));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_QDISC_BYPASS "
+			        "on AF_PACKET socket for %s\n", name,
+			        pair->value);
+			goto error;
+		}
+
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_RX_RING on AF_PACKET "
+				"socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_TX_RING on AF_PACKET "
+				"socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		rx_queue = &((*internals)->rx_queue[q]);
+		rx_queue->framecount = req->tp_frame_nr;
+
+		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
+				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
+				    qsockfd, 0);
+		if (rx_queue->map == MAP_FAILED) {
+			RTE_LOG(ERR, PMD,
+				"%s: call to mmap failed on AF_PACKET socket for %s\n",
+				name, pair->value);
+			goto error;
+		}
+
+		/* rdsize is same for both Tx and Rx */
+		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
+
+		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
+		for (i = 0; i < req->tp_frame_nr; ++i) {
+			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
+			rx_queue->rd[i].iov_len = req->tp_frame_size;
+		}
+		rx_queue->sockfd = qsockfd;
+
+		tx_queue = &((*internals)->tx_queue[q]);
+		tx_queue->framecount = req->tp_frame_nr;
+
+		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
+
+		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
+		for (i = 0; i < req->tp_frame_nr; ++i) {
+			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
+			tx_queue->rd[i].iov_len = req->tp_frame_size;
+		}
+		tx_queue->sockfd = qsockfd;
+
+		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not bind AF_PACKET socket to %s\n",
+			        name, pair->value);
+			goto error;
+		}
+
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
+				&fanout_arg, sizeof(fanout_arg));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
+				"for %s\n", name, pair->value);
+			goto error;
+		}
+	}
+
+	/* reserve an ethdev entry */
+	*eth_dev = rte_eth_dev_allocate(name);
+	if (*eth_dev == NULL)
+		goto error;
+
+	/*
+	 * now put it all together
+	 * - store queue data in internals,
+	 * - store numa_node info in pci_driver
+	 * - point eth_dev_data to internals and pci_driver
+	 * - and point eth_dev structure to new eth_dev_data structure
+	 */
+
+	(*internals)->nb_queues = nb_queues;
+
+	data->dev_private = *internals;
+	data->port_id = (*eth_dev)->data->port_id;
+	data->nb_rx_queues = (uint16_t)nb_queues;
+	data->nb_tx_queues = (uint16_t)nb_queues;
+	data->dev_link = pmd_link;
+	data->mac_addrs = &(*internals)->eth_addr;
+
+	pci_dev->numa_node = numa_node;
+
+	(*eth_dev)->data = data;
+	(*eth_dev)->dev_ops = &ops;
+	(*eth_dev)->pci_dev = pci_dev;
+
+	return 0;
+
+error:
+	if (data)
+		rte_free(data);
+	if (pci_dev)
+		rte_free(pci_dev);
+	if (*internals) {
+		/* rte_free() ignores NULL, so unset rd pointers are safe */
+		for (q = 0; q < nb_queues; q++) {
+			rte_free((*internals)->rx_queue[q].rd);
+			rte_free((*internals)->tx_queue[q].rd);
+		}
+		rte_free(*internals);
+	}
+	return -1;
+}
+
+static int
+rte_eth_from_packet(const char *name,
+                    int const *sockfd,
+                    const unsigned numa_node,
+                    struct rte_kvargs *kvlist)
+{
+	struct pmd_internals *internals = NULL;
+	struct rte_eth_dev *eth_dev = NULL;
+	struct rte_kvargs_pair *pair = NULL;
+	unsigned k_idx;
+	unsigned int blockcount;
+	unsigned int blocksize = DFLT_BLOCK_SIZE;
+	unsigned int framesize = DFLT_FRAME_SIZE;
+	unsigned int framecount = DFLT_FRAME_COUNT;
+	unsigned int qpairs = 1;
+
+	/* do some parameter checking */
+	if (*sockfd < 0)
+		return -1;
+
+	/*
+	 * Walk arguments for configurable settings
+	 */
+	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
+		pair = &kvlist->pairs[k_idx];
+		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
+			qpairs = atoi(pair->value);
+			if (qpairs < 1 ||
+			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid qpairs value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
+			blocksize = atoi(pair->value);
+			if (!blocksize) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid blocksize value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
+			framesize = atoi(pair->value);
+			if (!framesize) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid framesize value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
+			framecount = atoi(pair->value);
+			if (!framecount) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid framecount value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+	}
+
+	if (framesize > blocksize) {
+		RTE_LOG(ERR, PMD,
+			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
+		        name);
+		return -1;
+	}
+
+	blockcount = framecount / (blocksize / framesize);
+	if (!blockcount) {
+		RTE_LOG(ERR, PMD,
+			"%s: invalid AF_PACKET MMAP parameters\n", name);
+		return -1;
+	}
+
+	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
+	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
+	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
+	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
+	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
+
+	if (rte_pmd_init_internals(name, *sockfd, qpairs,
+	                           blocksize, blockcount,
+	                           framesize, framecount,
+	                           numa_node, &internals, &eth_dev,
+	                           kvlist) < 0)
+		return -1;
+
+	eth_dev->rx_pkt_burst = eth_packet_rx;
+	eth_dev->tx_pkt_burst = eth_packet_tx;
+
+	return 0;
+}
+
+int
+rte_pmd_packet_devinit(const char *name, const char *params)
+{
+	unsigned numa_node;
+	int ret;
+	struct rte_kvargs *kvlist;
+	int sockfd = -1;
+
+	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
+
+	numa_node = rte_socket_id();
+
+	kvlist = rte_kvargs_parse(params, valid_arguments);
+	if (kvlist == NULL)
+		return -1;
+
+	/*
+	 * If iface argument is passed we open the NICs and use them for
+	 * reading / writing
+	 */
+	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
+
+		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
+		                         &open_packet_iface, &sockfd);
+		if (ret < 0)
+			return -1;
+	}
+
+	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
+	close(sockfd); /* no longer needed */
+
+	if (ret < 0)
+		return -1;
+
+	return 0;
+}
+
+static struct rte_driver pmd_packet_drv = {
+	.name = "eth_packet",
+	.type = PMD_VDEV,
+	.init = rte_pmd_packet_devinit,
+};
+
+PMD_REGISTER_DRIVER(pmd_packet_drv);
diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
new file mode 100644
index 000000000000..f685611da3e9
--- /dev/null
+++ b/lib/librte_pmd_packet/rte_eth_packet.h
@@ -0,0 +1,55 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ETH_PACKET_H_
+#define _RTE_ETH_PACKET_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
+
+#define RTE_PMD_PACKET_MAX_RINGS 16
+
+/**
+ * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
+ * configured on command line.
+ */
+int rte_pmd_packet_devinit(const char *name, const char *params);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 34dff2a02a05..a6994c4dbe93 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
 LDLIBS += -lrte_pmd_pcap -lpcap
 endif
 
+ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
+LDLIBS += -lrte_pmd_packet
+endif
+
 endif # plugins
 
 LDLIBS += $(EXECENV_LDLIBS)
-- 
1.9.3
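
A note on the MMAP parameters: the block count handed to PACKET_RX_RING and PACKET_TX_RING is derived from the frame count and the number of frames that fit in one block. Below is a minimal standalone sketch of that arithmetic, mirroring the checks in rte_eth_from_packet() above. The helper names ring_block_count and ring_map_bytes are hypothetical and not part of the patch.

```c
#include <stddef.h>

/* Hypothetical helper: blocks needed for one ring, 0 if parameters are invalid. */
static unsigned int
ring_block_count(unsigned int blocksize, unsigned int framesize,
		 unsigned int framecount)
{
	if (blocksize == 0 || framesize == 0 || framecount == 0)
		return 0;
	if (framesize > blocksize)
		return 0;	/* a frame must fit inside one block */
	/* integer division: partial blocks are rounded down, as in the patch */
	return framecount / (blocksize / framesize);
}

/* Hypothetical helper: bytes mapped per queue pair (Rx ring followed by Tx ring). */
static size_t
ring_map_bytes(unsigned int blocksize, unsigned int blockcount)
{
	return (size_t)2 * blocksize * blockcount;
}
```

With the defaults (blocksz 4096, framesz 2048, framecnt 512) this gives 256 blocks and a 2 MiB mapping per queue pair. A framecnt smaller than the number of frames per block rounds down to zero blocks, which the patch rejects as invalid.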


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-14 18:24 ` [dpdk-dev] [PATCH v2] " John W. Linville
@ 2014-07-15  0:15   ` Zhou, Danny
  2014-07-15 12:17     ` Neil Horman
  2014-09-12 18:05   ` John W. Linville
  1 sibling, 1 reply; 76+ messages in thread
From: Zhou, Danny @ 2014-07-15  0:15 UTC (permalink / raw)
  To: John W. Linville, dev

According to my performance measurements with 64-byte packets, one queue performs better than 16 queues (1.35 Mpps vs. 0.93 Mpps). That makes sense to me: in the 16-queue case more CPU cycles are spent in kernel land (87% with 16 queues vs. 80% with one queue), because the NAPI-enabled ixgbe driver has to switch between polling and interrupt modes to service the per-queue rx interrupts, so more context-switch overhead is involved. Also, since the eth_packet_rx/eth_packet_tx routines involve two memory copies between the DPDK mbuf and pbuf for each packet, this PMD can hardly achieve high performance unless packets are DMA'd directly into mbufs, which would need support from the ixgbe driver.
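
To put the two rates above in per-packet terms, here is a rough sketch. The helper name ns_per_packet is made up for illustration; it only converts a packet rate into an average time budget and does not measure anything.

```c
/* Hypothetical helper: average time available per packet, in nanoseconds. */
static double
ns_per_packet(double packets_per_second)
{
	return 1e9 / packets_per_second;
}
```

For the figures quoted above, ns_per_packet(1.35e6) is about 741 ns and ns_per_packet(0.93e6) is about 1075 ns, i.e. roughly 335 ns more per packet in the 16-queue case.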

> -----Original Message-----
> From: John W. Linville [mailto:linville@tuxdriver.com]
> Sent: Tuesday, July 15, 2014 2:25 AM
> To: dev@dpdk.org
> Cc: Thomas Monjalon; Richardson, Bruce; Zhou, Danny
> Subject: [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual
> devices
> 
> This is a Linux-specific virtual PMD driver backed by an AF_PACKET socket.  This
> implementation uses mmap'ed ring buffers to limit copying and user/kernel
> transitions.  The PACKET_FANOUT_HASH behavior of AF_PACKET is used for
> frame reception.  In the current implementation, Tx and Rx queues are always paired,
> and therefore are always equal in number -- changing this would be a Simple Matter
> Of Programming.
> 
> Interfaces of this type are created with a command line option like
> "--vdev=eth_packet0,iface=...".  There are a number of options available as
> arguments:
> 
>  - Interface is chosen by "iface" (required)
>  - Number of queue pairs set by "qpairs" (optional, default: 1)
>  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
>  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
>  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> 
> Signed-off-by: John W. Linville <linville@tuxdriver.com>
> ---
> This PMD is intended to provide a means for using DPDK on a broad range of
> hardware without hardware-specific PMDs and (hopefully) with better performance
> than what PCAP offers in Linux.  This might be useful as a development platform for
> DPDK applications when DPDK-supported hardware is expensive or unavailable.
> 
> New in v2:
> 
> -- fixup some style issues found by check patch
> -- use if_index as part of fanout group ID
> -- set default number of queue pairs to 1
> 
>  config/common_bsdapp                   |   5 +
>  config/common_linuxapp                 |   5 +
>  lib/Makefile                           |   1 +
>  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
>  lib/librte_pmd_packet/Makefile         |  60 +++
>  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
>  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
>  mk/rte.app.mk                          |   4 +
>  8 files changed, 957 insertions(+)
>  create mode 100644 lib/librte_pmd_packet/Makefile
>  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
>  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> 
> diff --git a/config/common_bsdapp b/config/common_bsdapp
> index 943dce8f1ede..c317f031278e 100644
> --- a/config/common_bsdapp
> +++ b/config/common_bsdapp
> @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
>  CONFIG_RTE_LIBRTE_PMD_BOND=y
> 
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> +#
> +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> +
> +#
>  # Do prefetch of packet data within PMD driver receive function
>  #
>  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> diff --git a/config/common_linuxapp b/config/common_linuxapp
> index 7bf5d80d4e26..f9e7bc3015ec 100644
> --- a/config/common_linuxapp
> +++ b/config/common_linuxapp
> @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
>  CONFIG_RTE_LIBRTE_PMD_BOND=y
> 
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> +#
> +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> +
> +#
>  # Compile Xen PMD
>  #
>  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> diff --git a/lib/Makefile b/lib/Makefile
> index 10c5bb3045bc..930fadf29898 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
>  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
>  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> index 756d6b0c9301..feed24a63272 100644
> --- a/lib/librte_eal/linuxapp/eal/Makefile
> +++ b/lib/librte_eal/linuxapp/eal/Makefile
> @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
>  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
>  CFLAGS += $(WERROR_FLAGS) -O3
> 
> diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> new file mode 100644
> index 000000000000..e1266fb992cd
> --- /dev/null
> +++ b/lib/librte_pmd_packet/Makefile
> @@ -0,0 +1,60 @@
> +#   BSD LICENSE
> +#
> +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> +#   Copyright(c) 2014 6WIND S.A.
> +#   All rights reserved.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Intel Corporation nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +#
> +# library name
> +#
> +LIB = librte_pmd_packet.a
> +
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +
> +#
> +# all source are stored in SRCS-y
> +#
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> +
> +#
> +# Export include files
> +#
> +SYMLINK-y-include += rte_eth_packet.h
> +
> +# this lib depends upon:
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.c
> b/lib/librte_pmd_packet/rte_eth_packet.c
> new file mode 100644
> index 000000000000..9c82d16e730f
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> @@ -0,0 +1,826 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> + *
> + *   Originally based upon librte_pmd_pcap code:
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   Copyright(c) 2014 6WIND S.A.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <rte_mbuf.h>
> +#include <rte_ethdev.h>
> +#include <rte_malloc.h>
> +#include <rte_kvargs.h>
> +#include <rte_dev.h>
> +
> +#include <linux/if_ether.h>
> +#include <linux/if_packet.h>
> +#include <arpa/inet.h>
> +#include <net/if.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +#include <poll.h>
> +
> +#include "rte_eth_packet.h"
> +
> +#define ETH_PACKET_IFACE_ARG		"iface"
> +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> +
> +#define DFLT_BLOCK_SIZE		(1 << 12)
> +#define DFLT_FRAME_SIZE		(1 << 11)
> +#define DFLT_FRAME_COUNT	(1 << 9)
> +
> +struct pkt_rx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	struct rte_mempool *mb_pool;
> +
> +	volatile unsigned long rx_pkts;
> +	volatile unsigned long err_pkts;
> +};
> +
> +struct pkt_tx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	volatile unsigned long tx_pkts;
> +	volatile unsigned long err_pkts;
> +};
> +
> +struct pmd_internals {
> +	unsigned nb_queues;
> +
> +	int if_index;
> +	struct ether_addr eth_addr;
> +
> +	struct tpacket_req req;
> +
> +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +};
> +
> +static const char *valid_arguments[] = {
> +	ETH_PACKET_IFACE_ARG,
> +	ETH_PACKET_NUM_Q_ARG,
> +	ETH_PACKET_BLOCKSIZE_ARG,
> +	ETH_PACKET_FRAMESIZE_ARG,
> +	ETH_PACKET_FRAMECOUNT_ARG,
> +	NULL
> +};
> +
> +static const char *drivername = "AF_PACKET PMD";
> +
> +static struct rte_eth_link pmd_link = {
> +	.link_speed = 10000,
> +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> +	.link_status = 0
> +};
> +
> +static uint16_t
> +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	unsigned i;
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	struct pkt_rx_queue *pkt_q = queue;
> +	uint16_t num_rx = 0;
> +	unsigned int framecount, framenum;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	/*
> +	 * Reads the given number of packets from the AF_PACKET socket one by
> +	 * one and copies the packet data into a newly allocated mbuf.
> +	 */
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* point at the next incoming frame */
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> +			break;
> +
> +		/* allocate the next mbuf */
> +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> +		if (unlikely(mbuf == NULL))
> +			break;
> +
> +		/* packet will fit in the mbuf, go ahead and receive it */
> +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> +
> +		/* release incoming frame and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_KERNEL;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +
> +		/* account for the receive frame */
> +		bufs[i] = mbuf;
> +		num_rx++;
> +	}
> +	pkt_q->framenum = framenum;
> +	pkt_q->rx_pkts += num_rx;
> +	return num_rx;
> +}
> +
> +/*
> + * Callback to handle sending packets through a real NIC.
> + */
> +static uint16_t
> +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) {
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	unsigned int framecount, framenum;
> +	struct pollfd pfd;
> +	struct pkt_tx_queue *pkt_q = queue;
> +	uint16_t num_tx = 0;
> +	int i;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	memset(&pfd, 0, sizeof(pfd));
> +	pfd.fd = pkt_q->sockfd;
> +	pfd.events = POLLOUT;
> +	pfd.revents = 0;
> +
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* point at the next incoming frame */
> +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> +		    (poll(&pfd, 1, -1) < 0))
> +				continue;
> +
> +		/* copy the tx frame data */
> +		mbuf = bufs[num_tx];
> +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> +			sizeof(struct sockaddr_ll);
> +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> +
> +		/* release incoming frame and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +
> +		num_tx++;
> +		rte_pktmbuf_free(mbuf);
> +	}
> +
> +	/* kick-off transmits */
> +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> +
> +	pkt_q->framenum = framenum;
> +	pkt_q->tx_pkts += num_tx;
> +	pkt_q->err_pkts += nb_pkts - num_tx;
> +	return num_tx;
> +}
> +
> +static int
> +eth_dev_start(struct rte_eth_dev *dev)
> +{
> +	dev->data->dev_link.link_status = 1;
> +	return 0;
> +}
> +
> +/*
> + * This function gets called when the current port gets stopped.
> + */
> +static void
> +eth_dev_stop(struct rte_eth_dev *dev)
> +{
> +	unsigned i;
> +	int sockfd;
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	for (i = 0; i < internals->nb_queues; i++) {
> +		sockfd = internals->rx_queue[i].sockfd;
> +		if (sockfd != -1)
> +			close(sockfd);
> +		sockfd = internals->tx_queue[i].sockfd;
> +		if (sockfd != -1)
> +			close(sockfd);
> +	}
> +
> +	dev->data->dev_link.link_status = 0;
> +}
> +
> +static int
> +eth_dev_configure(struct rte_eth_dev *dev __rte_unused) {
> +	return 0;
> +}
> +
> +static void
> +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info
> +*dev_info) {
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev_info->driver_name = drivername;
> +	dev_info->if_index = internals->if_index;
> +	dev_info->max_mac_addrs = 1;
> +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->min_rx_bufsize = 0;
> +	dev_info->pci_dev = NULL;
> +}
> +
> +static void
> +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> +{
> +	unsigned i, imax;
> +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> +	const struct pmd_internals *internal = dev->data->dev_private;
> +
> +	memset(igb_stats, 0, sizeof(*igb_stats));
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> +		rx_total += igb_stats->q_ipackets[i];
> +	}
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> +		tx_total += igb_stats->q_opackets[i];
> +		tx_err_total += igb_stats->q_errors[i];
> +	}
> +
> +	igb_stats->ipackets = rx_total;
> +	igb_stats->opackets = tx_total;
> +	igb_stats->oerrors = tx_err_total;
> +}
> +
> +static void
> +eth_stats_reset(struct rte_eth_dev *dev) {
> +	unsigned i;
> +	struct pmd_internals *internal = dev->data->dev_private;
> +
> +	for (i = 0; i < internal->nb_queues; i++)
> +		internal->rx_queue[i].rx_pkts = 0;
> +
> +	for (i = 0; i < internal->nb_queues; i++) {
> +		internal->tx_queue[i].tx_pkts = 0;
> +		internal->tx_queue[i].err_pkts = 0;
> +	}
> +}
> +
> +static void
> +eth_dev_close(struct rte_eth_dev *dev __rte_unused) { }
> +
> +static void
> +eth_queue_release(void *q __rte_unused) { }
> +
> +static int
> +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> +                int wait_to_complete __rte_unused) {
> +	return 0;
> +}
> +
> +static int
> +eth_rx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t rx_queue_id,
> +                   uint16_t nb_rx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> +                   struct rte_mempool *mb_pool) {
> +	struct pmd_internals *internals = dev->data->dev_private;
> +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> +	struct rte_pktmbuf_pool_private *mbp_priv;
> +	uint16_t buf_size;
> +
> +	pkt_q->mb_pool = mb_pool;
> +
> +	/* Now get the space available for data in the mbuf */
> +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> +	                       RTE_PKTMBUF_HEADROOM);
> +
> +	if (ETH_FRAME_LEN > buf_size) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> +			dev->data->name, ETH_FRAME_LEN, buf_size);
> +		return -ENOMEM;
> +	}
> +
> +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> +
> +	return 0;
> +}
> +
> +static int
> +eth_tx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t tx_queue_id,
> +                   uint16_t nb_tx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_txconf *tx_conf __rte_unused) {
> +
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> +	return 0;
> +}
> +
> +static struct eth_dev_ops ops = {
> +	.dev_start = eth_dev_start,
> +	.dev_stop = eth_dev_stop,
> +	.dev_close = eth_dev_close,
> +	.dev_configure = eth_dev_configure,
> +	.dev_infos_get = eth_dev_info,
> +	.rx_queue_setup = eth_rx_queue_setup,
> +	.tx_queue_setup = eth_tx_queue_setup,
> +	.rx_queue_release = eth_queue_release,
> +	.tx_queue_release = eth_queue_release,
> +	.link_update = eth_link_update,
> +	.stats_get = eth_stats_get,
> +	.stats_reset = eth_stats_reset,
> +};
> +
> +/*
> + * Opens an AF_PACKET socket
> + */
> +static int
> +open_packet_iface(const char *key __rte_unused,
> +                  const char *value __rte_unused,
> +                  void *extra_args)
> +{
> +	int *sockfd = extra_args;
> +
> +	/* Open an AF_PACKET socket... */
> +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +	if (*sockfd == -1) {
> +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +rte_pmd_init_internals(const char *name,
> +                       const int sockfd,
> +                       const unsigned nb_queues,
> +                       unsigned int blocksize,
> +                       unsigned int blockcnt,
> +                       unsigned int framesize,
> +                       unsigned int framecnt,
> +                       const unsigned numa_node,
> +                       struct pmd_internals **internals,
> +                       struct rte_eth_dev **eth_dev,
> +                       struct rte_kvargs *kvlist) {
> +	struct rte_eth_dev_data *data = NULL;
> +	struct rte_pci_device *pci_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	struct ifreq ifr;
> +	size_t ifnamelen;
> +	unsigned k_idx;
> +	struct sockaddr_ll sockaddr;
> +	struct tpacket_req *req;
> +	struct pkt_rx_queue *rx_queue;
> +	struct pkt_tx_queue *tx_queue;
> +	int rc, tpver, discard, bypass;
> +	unsigned int i, q, rdsize;
> +	int qsockfd, fanout_arg;
> +
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> +			break;
> +	}
> +	if (pair == NULL) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: no interface specified for AF_PACKET ethdev\n",
> +		        name);
> +		goto error;
> +	}
> +
> +	RTE_LOG(INFO, PMD,
> +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> +		name, numa_node);
> +
> +	/*
> +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> +	 * and internal (private) data
> +	 */
> +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> +	if (data == NULL)
> +		goto error;
> +
> +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> +	if (pci_dev == NULL)
> +		goto error;
> +
> +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> +	                                0, numa_node);
> +	if (*internals == NULL)
> +		goto error;
> +
> +	req = &((*internals)->req);
> +
> +	req->tp_block_size = blocksize;
> +	req->tp_block_nr = blockcnt;
> +	req->tp_frame_size = framesize;
> +	req->tp_frame_nr = framecnt;
> +
> +	ifnamelen = strlen(pair->value);
> +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> +		ifr.ifr_name[ifnamelen] = '\0';
> +	} else {
> +		RTE_LOG(ERR, PMD,
> +			"%s: I/F name too long (%s)\n",
> +			name, pair->value);
> +		goto error;
> +	}
> +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> +		        name);
> +		goto error;
> +	}
> +	(*internals)->if_index = ifr.ifr_ifindex;
> +
> +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> +		        name);
> +		goto error;
> +	}
> +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> +
> +	memset(&sockaddr, 0, sizeof(sockaddr));
> +	sockaddr.sll_family = AF_PACKET;
> +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> +	sockaddr.sll_ifindex = (*internals)->if_index;
> +
> +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> +
> +	for (q = 0; q < nb_queues; q++) {
> +		/* Open an AF_PACKET socket for this queue... */
> +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +		if (qsockfd == -1) {
> +			RTE_LOG(ERR, PMD,
> +			        "%s: could not open AF_PACKET socket\n",
> +			        name);
> +			return -1;
> +		}
> +
> +		tpver = TPACKET_V2;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> +				&tpver, sizeof(tpver));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_VERSION on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		discard = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> +				&discard, sizeof(discard));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_LOSS on "
> +			        "AF_PACKET socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		bypass = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> +				&bypass, sizeof(bypass));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_QDISC_BYPASS "
> +			        "on AF_PACKET socket for %s\n", name,
> +			        pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req,
> +				sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req,
> +				sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rx_queue = &((*internals)->rx_queue[q]);
> +		rx_queue->framecount = req->tp_frame_nr;
> +
> +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> +				    qsockfd, 0);
> +		if (rx_queue->map == MAP_FAILED) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> +				name, pair->value);
> +			goto error;
> +		}
> +
> +		/* rdsize is same for both Tx and Rx */
> +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> +
> +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		rx_queue->sockfd = qsockfd;
> +
> +		tx_queue = &((*internals)->tx_queue[q]);
> +		tx_queue->framecount = req->tp_frame_nr;
> +
> +		tx_queue->map = rx_queue->map + req->tp_block_size *
> +			req->tp_block_nr;
> +
> +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		tx_queue->sockfd = qsockfd;
> +
> +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not bind AF_PACKET socket to %s\n",
> +			        name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> +				&fanout_arg, sizeof(fanout_arg));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> +				"for %s\n", name, pair->value);
> +			goto error;
> +		}
> +	}
> +
> +	/* reserve an ethdev entry */
> +	*eth_dev = rte_eth_dev_allocate(name);
> +	if (*eth_dev == NULL)
> +		goto error;
> +
> +	/*
> +	 * now put it all together
> +	 * - store queue data in internals,
> +	 * - store numa_node info in pci_driver
> +	 * - point eth_dev_data to internals and pci_driver
> +	 * - and point eth_dev structure to new eth_dev_data structure
> +	 */
> +
> +	(*internals)->nb_queues = nb_queues;
> +
> +	data->dev_private = *internals;
> +	data->port_id = (*eth_dev)->data->port_id;
> +	data->nb_rx_queues = (uint16_t)nb_queues;
> +	data->nb_tx_queues = (uint16_t)nb_queues;
> +	data->dev_link = pmd_link;
> +	data->mac_addrs = &(*internals)->eth_addr;
> +
> +	pci_dev->numa_node = numa_node;
> +
> +	(*eth_dev)->data = data;
> +	(*eth_dev)->dev_ops = &ops;
> +	(*eth_dev)->pci_dev = pci_dev;
> +
> +	return 0;
> +
> +error:
> +	if (data)
> +		rte_free(data);
> +	if (pci_dev)
> +		rte_free(pci_dev);
> +	for (q = 0; q < nb_queues; q++) {
> +		if ((*internals)->rx_queue[q].rd)
> +			rte_free((*internals)->rx_queue[q].rd);
> +		if ((*internals)->tx_queue[q].rd)
> +			rte_free((*internals)->tx_queue[q].rd);
> +	}
> +	if (*internals)
> +		rte_free(*internals);
> +	return -1;
> +}
> +
> +static int
> +rte_eth_from_packet(const char *name,
> +                    int const *sockfd,
> +                    const unsigned numa_node,
> +                    struct rte_kvargs *kvlist) {
> +	struct pmd_internals *internals = NULL;
> +	struct rte_eth_dev *eth_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	unsigned k_idx;
> +	unsigned int blockcount;
> +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> +	unsigned int framesize = DFLT_FRAME_SIZE;
> +	unsigned int framecount = DFLT_FRAME_COUNT;
> +	unsigned int qpairs = 1;
> +
> +	/* do some parameter checking */
> +	if (*sockfd < 0)
> +		return -1;
> +
> +	/*
> +	 * Walk arguments for configurable settings
> +	 */
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> +			qpairs = atoi(pair->value);
> +			if (qpairs < 1 ||
> +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid qpairs value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> +			blocksize = atoi(pair->value);
> +			if (!blocksize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid blocksize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> +			framesize = atoi(pair->value);
> +			if (!framesize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framesize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> +			framecount = atoi(pair->value);
> +			if (!framecount) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framecount value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +	}
> +
> +	if (framesize > blocksize) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> +		        name);
> +		return -1;
> +	}
> +
> +	blockcount = framecount / (blocksize / framesize);
> +	if (!blockcount) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> +		return -1;
> +	}
> +
> +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> +
> +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> +	                           blocksize, blockcount,
> +	                           framesize, framecount,
> +	                           numa_node, &internals, &eth_dev,
> +	                           kvlist) < 0)
> +		return -1;
> +
> +	eth_dev->rx_pkt_burst = eth_packet_rx;
> +	eth_dev->tx_pkt_burst = eth_packet_tx;
> +
> +	return 0;
> +}
> +
> +int
> +rte_pmd_packet_devinit(const char *name, const char *params) {
> +	unsigned numa_node;
> +	int ret;
> +	struct rte_kvargs *kvlist;
> +	int sockfd = -1;
> +
> +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> +
> +	numa_node = rte_socket_id();
> +
> +	kvlist = rte_kvargs_parse(params, valid_arguments);
> +	if (kvlist == NULL)
> +		return -1;
> +
> +	/*
> +	 * If iface argument is passed we open the NICs and use them for
> +	 * reading / writing
> +	 */
> +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> +
> +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> +		                         &open_packet_iface, &sockfd);
> +		if (ret < 0)
> +			return -1;
> +	}
> +
> +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> +	close(sockfd); /* no longer needed */
> +
> +	if (ret < 0)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static struct rte_driver pmd_packet_drv = {
> +	.name = "eth_packet",
> +	.type = PMD_VDEV,
> +	.init = rte_pmd_packet_devinit,
> +};
> +
> +PMD_REGISTER_DRIVER(pmd_packet_drv);
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> new file mode 100644
> index 000000000000..f685611da3e9
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> @@ -0,0 +1,55 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_ETH_PACKET_H_
> +#define _RTE_ETH_PACKET_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> +
> +#define RTE_PMD_PACKET_MAX_RINGS 16
> +
> +/**
> + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> + * configured on command line.
> + */
> +int rte_pmd_packet_devinit(const char *name, const char *params);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 34dff2a02a05..a6994c4dbe93 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
>  LDLIBS += -lrte_pmd_pcap -lpcap
>  endif
> 
> +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> +LDLIBS += -lrte_pmd_packet
> +endif
> +
>  endif # plugins
> 
>  LDLIBS += $(EXECENV_LDLIBS)
> --
> 1.9.3

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-15  0:15   ` Zhou, Danny
@ 2014-07-15 12:17     ` Neil Horman
  2014-07-15 14:01       ` John W. Linville
  2014-07-15 15:34       ` Zhou, Danny
  0 siblings, 2 replies; 76+ messages in thread
From: Neil Horman @ 2014-07-15 12:17 UTC (permalink / raw)
  To: Zhou, Danny; +Cc: dev

On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> According to my performance measurements for 64B small packets, 1 queue performs better than 16 queues (1.35M pps vs. 0.93M pps). That makes sense to me: in the 16-queue case, more CPU cycles are needed in kernel land (87% for 16 queues vs. 80% for 1 queue) for the NAPI-enabled ixgbe driver to switch between polling and interrupt modes in order to service per-queue rx interrupts, so more context switch overhead is involved. Also, since the eth_packet_rx/eth_packet_tx routines involve two memory copies between the DPDK mbuf and pbuf for each packet, it can hardly achieve high performance unless packets are DMA'd directly to the mbuf, which needs support from the ixgbe driver.

I thought 16 queues would be spread out between as many cpus as you had though,
obviating the need for context switches, no?
Neil

> 
> > -----Original Message-----
> > From: John W. Linville [mailto:linville@tuxdriver.com]
> > Sent: Tuesday, July 15, 2014 2:25 AM
> > To: dev@dpdk.org
> > Cc: Thomas Monjalon; Richardson, Bruce; Zhou, Danny
> > Subject: [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual
> > devices
> > 
> > This is a Linux-specific virtual PMD driver backed by an AF_PACKET socket.  This
> > implementation uses mmap'ed ring buffers to limit copying and user/kernel
> > transitions.  The PACKET_FANOUT_HASH behavior of AF_PACKET is used for
> > frame reception.  In the current implementation, Tx and Rx queues are always paired,
> > and therefore are always equal in number -- changing this would be a Simple Matter
> > Of Programming.
> > 
> > Interfaces of this type are created with a command line option like
> >  "--vdev=eth_packet0,iface=...".  There are a number of options available as
> > arguments:
> > 
> >  - Interface is chosen by "iface" (required)
> >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
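
As a rough illustration of how these options interact, the sketch below reproduces the ring-size arithmetic that the patch performs in rte_eth_from_packet() and in the mmap() call of rte_pmd_init_internals(). The struct and function names (ring_geometry, ring_geometry_compute) are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, not part of the patch. */
struct ring_geometry {
	unsigned int blockcount; /* value used for tp_block_nr */
	size_t map_bytes;        /* Rx ring plus Tx ring in one mapping */
};

/*
 * Mirrors the patch's checks: the frame size must not exceed the block
 * size, and the frame count must fill at least one block.
 * Returns 0 on success, -1 on invalid parameters.
 */
static int
ring_geometry_compute(unsigned int blocksize, unsigned int framesize,
                      unsigned int framecount, struct ring_geometry *out)
{
	unsigned int frames_per_block;

	assert(out != NULL);

	if (framesize == 0 || framesize > blocksize)
		return -1;

	frames_per_block = blocksize / framesize;
	out->blockcount = framecount / frames_per_block;
	if (out->blockcount == 0)
		return -1;

	/* The patch maps both rings at once, hence the factor of 2. */
	out->map_bytes = 2 * (size_t)blocksize * out->blockcount;
	return 0;
}
```

With the defaults listed above (blocksz 4096, framesz 2048, framecnt 512) this gives 2 frames per block, 256 blocks, and a 2 MiB mapping per queue pair, since the Rx and Tx rings share one mmap() region.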
> > 
> > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > ---
> > This PMD is intended to provide a means for using DPDK on a broad range of
> > hardware without hardware-specific PMDs and (hopefully) with better performance
> > than what PCAP offers in Linux.  This might be useful as a development platform for
> > DPDK applications when DPDK-supported hardware is expensive or unavailable.
> > 
> > New in v2:
> > 
> > -- fix up some style issues found by checkpatch
> > -- use if_index as part of fanout group ID
> > -- set default number of queue pairs to 1
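
The fanout group ID change can be pictured with a minimal sketch of the PACKET_FANOUT argument as the patch builds it: the low 16 bits carry the group ID (PID XOR interface index), and the high 16 bits carry the fanout mode and flags. The constants repeat the values from <linux/if_packet.h> under hypothetical SKETCH_ names so the block builds without Linux headers, and fanout_arg_build is likewise a made-up name. Unlike the patch, the sketch uses unsigned arithmetic so that the shift into bit 31 is well defined.

```c
#include <stdint.h>

/* Same values as the PACKET_FANOUT_* macros in <linux/if_packet.h>. */
#define SKETCH_FANOUT_HASH          0x0000u
#define SKETCH_FANOUT_FLAG_ROLLOVER 0x1000u
#define SKETCH_FANOUT_FLAG_DEFRAG   0x8000u

/*
 * Build the 32-bit PACKET_FANOUT option value:
 *   bits  0..15  fanout group ID
 *   bits 16..31  fanout mode ORed with flags
 */
static uint32_t
fanout_arg_build(uint32_t pid, uint32_t if_index)
{
	uint32_t arg = (pid ^ if_index) & 0xffffu;

	arg |= (SKETCH_FANOUT_HASH | SKETCH_FANOUT_FLAG_DEFRAG |
	        SKETCH_FANOUT_FLAG_ROLLOVER) << 16;
	return arg;
}
```

Mixing if_index into the group ID means that two ports created by the same process on different interfaces get different fanout groups, while all queue sockets of one port share a group.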
> > 
> >  config/common_bsdapp                   |   5 +
> >  config/common_linuxapp                 |   5 +
> >  lib/Makefile                           |   1 +
> >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> >  lib/librte_pmd_packet/Makefile         |  60 +++
> >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> >  mk/rte.app.mk                          |   4 +
> >  8 files changed, 957 insertions(+)
> >  create mode 100644 lib/librte_pmd_packet/Makefile
> >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > 
> > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > index 943dce8f1ede..c317f031278e 100644
> > --- a/config/common_bsdapp
> > +++ b/config/common_bsdapp
> > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > 
> >  #
> > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > +#
> > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > +
> > +#
> >  # Do prefetch of packet data within PMD driver receive function
> >  #
> >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > --- a/config/common_linuxapp
> > +++ b/config/common_linuxapp
> > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > 
> >  #
> > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > +#
> > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > +
> > +#
> >  # Compile Xen PMD
> >  #
> >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > diff --git a/lib/Makefile b/lib/Makefile
> > index 10c5bb3045bc..930fadf29898 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > index 756d6b0c9301..feed24a63272 100644
> > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> >  CFLAGS += $(WERROR_FLAGS) -O3
> > 
> > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > new file mode 100644
> > index 000000000000..e1266fb992cd
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/Makefile
> > @@ -0,0 +1,60 @@
> > +#   BSD LICENSE
> > +#
> > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > +#   Copyright(c) 2014 6WIND S.A.
> > +#   All rights reserved.
> > +#
> > +#   Redistribution and use in source and binary forms, with or without
> > +#   modification, are permitted provided that the following conditions
> > +#   are met:
> > +#
> > +#     * Redistributions of source code must retain the above copyright
> > +#       notice, this list of conditions and the following disclaimer.
> > +#     * Redistributions in binary form must reproduce the above copyright
> > +#       notice, this list of conditions and the following disclaimer in
> > +#       the documentation and/or other materials provided with the
> > +#       distribution.
> > +#     * Neither the name of Intel Corporation nor the names of its
> > +#       contributors may be used to endorse or promote products derived
> > +#       from this software without specific prior written permission.
> > +#
> > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > +
> > +include $(RTE_SDK)/mk/rte.vars.mk
> > +
> > +#
> > +# library name
> > +#
> > +LIB = librte_pmd_packet.a
> > +
> > +CFLAGS += -O3
> > +CFLAGS += $(WERROR_FLAGS)
> > +
> > +#
> > +# all source are stored in SRCS-y
> > +#
> > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > +
> > +#
> > +# Export include files
> > +#
> > +SYMLINK-y-include += rte_eth_packet.h
> > +
> > +# this lib depends upon:
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > +
> > +include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > new file mode 100644
> > index 000000000000..9c82d16e730f
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > @@ -0,0 +1,826 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > + *
> > + *   Originally based upon librte_pmd_pcap code:
> > + *
> > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > + *   Copyright(c) 2014 6WIND S.A.
> > + *   All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#include <rte_mbuf.h>
> > +#include <rte_ethdev.h>
> > +#include <rte_malloc.h>
> > +#include <rte_kvargs.h>
> > +#include <rte_dev.h>
> > +
> > +#include <linux/if_ether.h>
> > +#include <linux/if_packet.h>
> > +#include <arpa/inet.h>
> > +#include <net/if.h>
> > +#include <sys/types.h>
> > +#include <sys/socket.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +#include <poll.h>
> > +
> > +#include "rte_eth_packet.h"
> > +
> > +#define ETH_PACKET_IFACE_ARG		"iface"
> > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > +
> > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > +#define DFLT_FRAME_SIZE		(1 << 11)
> > +#define DFLT_FRAME_COUNT	(1 << 9)
> > +
> > +struct pkt_rx_queue {
> > +	int sockfd;
> > +
> > +	struct iovec *rd;
> > +	uint8_t *map;
> > +	unsigned int framecount;
> > +	unsigned int framenum;
> > +
> > +	struct rte_mempool *mb_pool;
> > +
> > +	volatile unsigned long rx_pkts;
> > +	volatile unsigned long err_pkts;
> > +};
> > +
> > +struct pkt_tx_queue {
> > +	int sockfd;
> > +
> > +	struct iovec *rd;
> > +	uint8_t *map;
> > +	unsigned int framecount;
> > +	unsigned int framenum;
> > +
> > +	volatile unsigned long tx_pkts;
> > +	volatile unsigned long err_pkts;
> > +};
> > +
> > +struct pmd_internals {
> > +	unsigned nb_queues;
> > +
> > +	int if_index;
> > +	struct ether_addr eth_addr;
> > +
> > +	struct tpacket_req req;
> > +
> > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > +};
> > +
> > +static const char *valid_arguments[] = {
> > +	ETH_PACKET_IFACE_ARG,
> > +	ETH_PACKET_NUM_Q_ARG,
> > +	ETH_PACKET_BLOCKSIZE_ARG,
> > +	ETH_PACKET_FRAMESIZE_ARG,
> > +	ETH_PACKET_FRAMECOUNT_ARG,
> > +	NULL
> > +};
> > +
> > +static const char *drivername = "AF_PACKET PMD";
> > +
> > +static struct rte_eth_link pmd_link = {
> > +	.link_speed = 10000,
> > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > +	.link_status = 0
> > +};
> > +
> > +static uint16_t
> > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) {
> > +	unsigned i;
> > +	struct tpacket2_hdr *ppd;
> > +	struct rte_mbuf *mbuf;
> > +	uint8_t *pbuf;
> > +	struct pkt_rx_queue *pkt_q = queue;
> > +	uint16_t num_rx = 0;
> > +	unsigned int framecount, framenum;
> > +
> > +	if (unlikely(nb_pkts == 0))
> > +		return 0;
> > +
> > +	/*
> > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > +	 * one and copies the packet data into a newly allocated mbuf.
> > +	 */
> > +	framecount = pkt_q->framecount;
> > +	framenum = pkt_q->framenum;
> > +	for (i = 0; i < nb_pkts; i++) {
> > +		/* point at the next incoming frame */
> > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > +			break;
> > +
> > +		/* allocate the next mbuf */
> > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > +		if (unlikely(mbuf == NULL))
> > +			break;
> > +
> > +		/* packet will fit in the mbuf, go ahead and receive it */
> > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > +
> > +		/* release incoming frame and advance ring buffer */
> > +		ppd->tp_status = TP_STATUS_KERNEL;
> > +		if (++framenum >= framecount)
> > +			framenum = 0;
> > +
> > +		/* account for the receive frame */
> > +		bufs[i] = mbuf;
> > +		num_rx++;
> > +	}
> > +	pkt_q->framenum = framenum;
> > +	pkt_q->rx_pkts += num_rx;
> > +	return num_rx;
> > +}
> > +
> > +/*
> > + * Callback to handle sending packets through a real NIC.
> > + */
> > +static uint16_t
> > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) {
> > +	struct tpacket2_hdr *ppd;
> > +	struct rte_mbuf *mbuf;
> > +	uint8_t *pbuf;
> > +	unsigned int framecount, framenum;
> > +	struct pollfd pfd;
> > +	struct pkt_tx_queue *pkt_q = queue;
> > +	uint16_t num_tx = 0;
> > +	int i;
> > +
> > +	if (unlikely(nb_pkts == 0))
> > +		return 0;
> > +
> > +	memset(&pfd, 0, sizeof(pfd));
> > +	pfd.fd = pkt_q->sockfd;
> > +	pfd.events = POLLOUT;
> > +	pfd.revents = 0;
> > +
> > +	framecount = pkt_q->framecount;
> > +	framenum = pkt_q->framenum;
> > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +	for (i = 0; i < nb_pkts; i++) {
> > +		/* point at the next incoming frame */
> > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > +		    (poll(&pfd, 1, -1) < 0))
> > +				continue;
> > +
> > +		/* copy the tx frame data */
> > +		mbuf = bufs[num_tx];
> > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > +			sizeof(struct sockaddr_ll);
> > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > +
> > +		/* release incoming frame and advance ring buffer */
> > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > +		if (++framenum >= framecount)
> > +			framenum = 0;
> > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +
> > +		num_tx++;
> > +		rte_pktmbuf_free(mbuf);
> > +	}
> > +
> > +	/* kick-off transmits */
> > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > +
> > +	pkt_q->framenum = framenum;
> > +	pkt_q->tx_pkts += num_tx;
> > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > +	return num_tx;
> > +}
> > +
> > +static int
> > +eth_dev_start(struct rte_eth_dev *dev)
> > +{
> > +	dev->data->dev_link.link_status = 1;
> > +	return 0;
> > +}
> > +
> > +/*
> > + * This function gets called when the current port gets stopped.
> > + */
> > +static void
> > +eth_dev_stop(struct rte_eth_dev *dev)
> > +{
> > +	unsigned i;
> > +	int sockfd;
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	for (i = 0; i < internals->nb_queues; i++) {
> > +		sockfd = internals->rx_queue[i].sockfd;
> > +		if (sockfd != -1)
> > +			close(sockfd);
> > +		sockfd = internals->tx_queue[i].sockfd;
> > +		if (sockfd != -1)
> > +			close(sockfd);
> > +	}
> > +
> > +	dev->data->dev_link.link_status = 0;
> > +}
> > +
> > +static int
> > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused) {
> > +	return 0;
> > +}
> > +
> > +static void
> > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) {
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	dev_info->driver_name = drivername;
> > +	dev_info->if_index = internals->if_index;
> > +	dev_info->max_mac_addrs = 1;
> > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > +	dev_info->min_rx_bufsize = 0;
> > +	dev_info->pci_dev = NULL;
> > +}
> > +
> > +static void
> > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > +{
> > +	unsigned i, imax;
> > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > +	const struct pmd_internals *internal = dev->data->dev_private;
> > +
> > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > +
> > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > +	for (i = 0; i < imax; i++) {
> > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > +		rx_total += igb_stats->q_ipackets[i];
> > +	}
> > +
> > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > +	for (i = 0; i < imax; i++) {
> > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > +		tx_total += igb_stats->q_opackets[i];
> > +		tx_err_total += igb_stats->q_errors[i];
> > +	}
> > +
> > +	igb_stats->ipackets = rx_total;
> > +	igb_stats->opackets = tx_total;
> > +	igb_stats->oerrors = tx_err_total;
> > +}
> > +
> > +static void
> > +eth_stats_reset(struct rte_eth_dev *dev) {
> > +	unsigned i;
> > +	struct pmd_internals *internal = dev->data->dev_private;
> > +
> > +	for (i = 0; i < internal->nb_queues; i++)
> > +		internal->rx_queue[i].rx_pkts = 0;
> > +
> > +	for (i = 0; i < internal->nb_queues; i++) {
> > +		internal->tx_queue[i].tx_pkts = 0;
> > +		internal->tx_queue[i].err_pkts = 0;
> > +	}
> > +}
> > +
> > +static void
> > +eth_dev_close(struct rte_eth_dev *dev __rte_unused) { }
> > +
> > +static void
> > +eth_queue_release(void *q __rte_unused) { }
> > +
> > +static int
> > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > +                int wait_to_complete __rte_unused) {
> > +	return 0;
> > +}
> > +
> > +static int
> > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > +                   uint16_t rx_queue_id,
> > +                   uint16_t nb_rx_desc __rte_unused,
> > +                   unsigned int socket_id __rte_unused,
> > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > +                   struct rte_mempool *mb_pool) {
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > +	uint16_t buf_size;
> > +
> > +	pkt_q->mb_pool = mb_pool;
> > +
> > +	/* Now get the space available for data in the mbuf */
> > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > +	                       RTE_PKTMBUF_HEADROOM);
> > +
> > +	if (ETH_FRAME_LEN > buf_size) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > +                   uint16_t tx_queue_id,
> > +                   uint16_t nb_tx_desc __rte_unused,
> > +                   unsigned int socket_id __rte_unused,
> > +                   const struct rte_eth_txconf *tx_conf __rte_unused) {
> > +
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > +	return 0;
> > +}
> > +
> > +static struct eth_dev_ops ops = {
> > +	.dev_start = eth_dev_start,
> > +	.dev_stop = eth_dev_stop,
> > +	.dev_close = eth_dev_close,
> > +	.dev_configure = eth_dev_configure,
> > +	.dev_infos_get = eth_dev_info,
> > +	.rx_queue_setup = eth_rx_queue_setup,
> > +	.tx_queue_setup = eth_tx_queue_setup,
> > +	.rx_queue_release = eth_queue_release,
> > +	.tx_queue_release = eth_queue_release,
> > +	.link_update = eth_link_update,
> > +	.stats_get = eth_stats_get,
> > +	.stats_reset = eth_stats_reset,
> > +};
> > +
> > +/*
> > + * Opens an AF_PACKET socket
> > + */
> > +static int
> > +open_packet_iface(const char *key __rte_unused,
> > +                  const char *value __rte_unused,
> > +                  void *extra_args)
> > +{
> > +	int *sockfd = extra_args;
> > +
> > +	/* Open an AF_PACKET socket... */
> > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > +	if (*sockfd == -1) {
> > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > +		return -1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +rte_pmd_init_internals(const char *name,
> > +                       const int sockfd,
> > +                       const unsigned nb_queues,
> > +                       unsigned int blocksize,
> > +                       unsigned int blockcnt,
> > +                       unsigned int framesize,
> > +                       unsigned int framecnt,
> > +                       const unsigned numa_node,
> > +                       struct pmd_internals **internals,
> > +                       struct rte_eth_dev **eth_dev,
> > +                       struct rte_kvargs *kvlist) {
> > +	struct rte_eth_dev_data *data = NULL;
> > +	struct rte_pci_device *pci_dev = NULL;
> > +	struct rte_kvargs_pair *pair = NULL;
> > +	struct ifreq ifr;
> > +	size_t ifnamelen;
> > +	unsigned k_idx;
> > +	struct sockaddr_ll sockaddr;
> > +	struct tpacket_req *req;
> > +	struct pkt_rx_queue *rx_queue;
> > +	struct pkt_tx_queue *tx_queue;
> > +	int rc, tpver, discard, bypass;
> > +	unsigned int i, q, rdsize;
> > +	int qsockfd, fanout_arg;
> > +
> > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > +		pair = &kvlist->pairs[k_idx];
> > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > +			break;
> > +	}
> > +	if (pair == NULL) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +
> > +	RTE_LOG(INFO, PMD,
> > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > +		name, numa_node);
> > +
> > +	/*
> > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > +	 * and internal (private) data
> > +	 */
> > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > +	if (data == NULL)
> > +		goto error;
> > +
> > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > +	if (pci_dev == NULL)
> > +		goto error;
> > +
> > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > +	                                0, numa_node);
> > +	if (*internals == NULL)
> > +		goto error;
> > +
> > +	req = &((*internals)->req);
> > +
> > +	req->tp_block_size = blocksize;
> > +	req->tp_block_nr = blockcnt;
> > +	req->tp_frame_size = framesize;
> > +	req->tp_frame_nr = framecnt;
> > +
> > +	ifnamelen = strlen(pair->value);
> > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > +		ifr.ifr_name[ifnamelen] = '\0';
> > +	} else {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: I/F name too long (%s)\n",
> > +			name, pair->value);
> > +		goto error;
> > +	}
> > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +	(*internals)->if_index = ifr.ifr_ifindex;
> > +
> > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > +
> > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > +	sockaddr.sll_family = AF_PACKET;
> > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > +
> > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > +
> > +	for (q = 0; q < nb_queues; q++) {
> > +		/* Open an AF_PACKET socket for this queue... */
> > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > +		if (qsockfd == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +			        "%s: could not open AF_PACKET socket\n",
> > +			        name);
> > +			return -1;
> > +		}
> > +
> > +		tpver = TPACKET_V2;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > +				&tpver, sizeof(tpver));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		discard = 1;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > +				&discard, sizeof(discard));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_LOSS on "
> > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		bypass = 1;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > +				&bypass, sizeof(bypass));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_QDISC_BYPASS "
> > +			        "on AF_PACKET socket for %s\n", name,
> > +			        pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rx_queue = &((*internals)->rx_queue[q]);
> > +		rx_queue->framecount = req->tp_frame_nr;
> > +
> > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > +				    qsockfd, 0);
> > +		if (rx_queue->map == MAP_FAILED) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > +				name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		/* rdsize is same for both Tx and Rx */
> > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > +
> > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > +		}
> > +		rx_queue->sockfd = qsockfd;
> > +
> > +		tx_queue = &((*internals)->tx_queue[q]);
> > +		tx_queue->framecount = req->tp_frame_nr;
> > +
> > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > +
> > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > +		}
> > +		tx_queue->sockfd = qsockfd;
> > +
> > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not bind AF_PACKET socket to %s\n",
> > +			        name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > +				&fanout_arg, sizeof(fanout_arg));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > +				"for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +	}
> > +
> > +	/* reserve an ethdev entry */
> > +	*eth_dev = rte_eth_dev_allocate(name);
> > +	if (*eth_dev == NULL)
> > +		goto error;
> > +
> > +	/*
> > +	 * now put it all together
> > +	 * - store queue data in internals,
> > +	 * - store numa_node info in pci_driver
> > +	 * - point eth_dev_data to internals and pci_driver
> > +	 * - and point eth_dev structure to new eth_dev_data structure
> > +	 */
> > +
> > +	(*internals)->nb_queues = nb_queues;
> > +
> > +	data->dev_private = *internals;
> > +	data->port_id = (*eth_dev)->data->port_id;
> > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > +	data->dev_link = pmd_link;
> > +	data->mac_addrs = &(*internals)->eth_addr;
> > +
> > +	pci_dev->numa_node = numa_node;
> > +
> > +	(*eth_dev)->data = data;
> > +	(*eth_dev)->dev_ops = &ops;
> > +	(*eth_dev)->pci_dev = pci_dev;
> > +
> > +	return 0;
> > +
> > +error:
> > +	if (data)
> > +		rte_free(data);
> > +	if (pci_dev)
> > +		rte_free(pci_dev);
> > +	for (q = 0; q < nb_queues; q++) {
> > +		if ((*internals)->rx_queue[q].rd)
> > +			rte_free((*internals)->rx_queue[q].rd);
> > +		if ((*internals)->tx_queue[q].rd)
> > +			rte_free((*internals)->tx_queue[q].rd);
> > +	}
> > +	if (*internals)
> > +		rte_free(*internals);
> > +	return -1;
> > +}
> > +
> > +static int
> > +rte_eth_from_packet(const char *name,
> > +                    int const *sockfd,
> > +                    const unsigned numa_node,
> > +                    struct rte_kvargs *kvlist) {
> > +	struct pmd_internals *internals = NULL;
> > +	struct rte_eth_dev *eth_dev = NULL;
> > +	struct rte_kvargs_pair *pair = NULL;
> > +	unsigned k_idx;
> > +	unsigned int blockcount;
> > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > +	unsigned int qpairs = 1;
> > +
> > +	/* do some parameter checking */
> > +	if (*sockfd < 0)
> > +		return -1;
> > +
> > +	/*
> > +	 * Walk arguments for configurable settings
> > +	 */
> > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > +		pair = &kvlist->pairs[k_idx];
> > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > +			qpairs = atoi(pair->value);
> > +			if (qpairs < 1 ||
> > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid qpairs value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > +			blocksize = atoi(pair->value);
> > +			if (!blocksize) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid blocksize value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > +			framesize = atoi(pair->value);
> > +			if (!framesize) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid framesize value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > +			framecount = atoi(pair->value);
> > +			if (!framecount) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid framecount value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +	}
> > +
> > +	if (framesize > blocksize) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > +		        name);
> > +		return -1;
> > +	}
> > +
> > +	blockcount = framecount / (blocksize / framesize);
> > +	if (!blockcount) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > +		return -1;
> > +	}
> > +
> > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > +
> > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > +	                           blocksize, blockcount,
> > +	                           framesize, framecount,
> > +	                           numa_node, &internals, &eth_dev,
> > +	                           kvlist) < 0)
> > +		return -1;
> > +
> > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > +
> > +	return 0;
> > +}
> > +
> > +int
> > +rte_pmd_packet_devinit(const char *name, const char *params) {
> > +	unsigned numa_node;
> > +	int ret;
> > +	struct rte_kvargs *kvlist;
> > +	int sockfd = -1;
> > +
> > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > +
> > +	numa_node = rte_socket_id();
> > +
> > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > +	if (kvlist == NULL)
> > +		return -1;
> > +
> > +	/*
> > +	 * If iface argument is passed we open the NICs and use them for
> > +	 * reading / writing
> > +	 */
> > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > +
> > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > +		                         &open_packet_iface, &sockfd);
> > +		if (ret < 0)
> > +			return -1;
> > +	}
> > +
> > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > +	close(sockfd); /* no longer needed */
> > +
> > +	if (ret < 0)
> > +		return -1;
> > +
> > +	return 0;
> > +}
> > +
> > +static struct rte_driver pmd_packet_drv = {
> > +	.name = "eth_packet",
> > +	.type = PMD_VDEV,
> > +	.init = rte_pmd_packet_devinit,
> > +};
> > +
> > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > new file mode 100644
> > index 000000000000..f685611da3e9
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > @@ -0,0 +1,55 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > + *   All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#ifndef _RTE_ETH_PACKET_H_
> > +#define _RTE_ETH_PACKET_H_
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > +
> > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > +
> > +/**
> > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > + * configured on command line.
> > + */
> > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> > +
> > +#endif
> > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > index 34dff2a02a05..a6994c4dbe93 100644
> > --- a/mk/rte.app.mk
> > +++ b/mk/rte.app.mk
> > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> >  LDLIBS += -lrte_pmd_pcap -lpcap
> >  endif
> > 
> > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > +LDLIBS += -lrte_pmd_packet
> > +endif
> > +
> >  endif # plugins
> > 
> >  LDLIBS += $(EXECENV_LDLIBS)
> > --
> > 1.9.3
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-15 12:17     ` Neil Horman
@ 2014-07-15 14:01       ` John W. Linville
  2014-07-15 15:40         ` Zhou, Danny
  2014-07-15 20:31         ` Neil Horman
  2014-07-15 15:34       ` Zhou, Danny
  1 sibling, 2 replies; 76+ messages in thread
From: John W. Linville @ 2014-07-15 14:01 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Tue, Jul 15, 2014 at 08:17:44AM -0400, Neil Horman wrote:
> On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > According to my performance measurement results for 64B small
> > packet, 1 queue perf. is better than 16 queues (1.35M pps vs. 0.93M
> > pps) which make sense to me as for 16 queues case more CPU cycles (16
> > queues' 87% vs. 1 queue' 80%) in kernel land needed for NAPI-enabled
> > ixgbe driver to switch between polling and interrupt modes in order
> > to service per-queue rx interrupts, so more context switch overhead
> > involved. Also, since the eth_packet_rx/eth_packet_tx routines involves
> > in two memory copies between DPDK mbuf and pbuf for each packet,
> > it can hardly achieve high performance unless packet are directly
> > DMA to mbuf which needs ixgbe driver to support.
> 
> I thought 16 queues would be spread out between as many cpus as you had though,
> obviating the need for context switches, no?

I think Danny is testing the single CPU case.  Having more queues
than CPUs probably does not provide any benefit.
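
As a rough illustration of that point (not something in the patch), the
qpairs value could be capped at the number of online CPUs.  The helper
names below are made up; RTE_PMD_PACKET_MAX_RINGS is the limit from
rte_eth_packet.h, and sysconf(_SC_NPROCESSORS_ONLN) is assumed to be
available, as it is with glibc on Linux:

```c
#include <assert.h>
#include <unistd.h>

#define RTE_PMD_PACKET_MAX_RINGS 16	/* value from rte_eth_packet.h in the patch */

/*
 * Hypothetical helper: cap a requested queue-pair count at both the
 * CPU count and the PMD's ring limit.  A non-positive ncpus means
 * "unknown" and leaves only the ring limit in effect.
 */
static unsigned int
clamp_qpairs(unsigned int requested, long ncpus)
{
	unsigned int limit = RTE_PMD_PACKET_MAX_RINGS;
	unsigned int result;

	if (ncpus > 0 && (unsigned long)ncpus < limit)
		limit = (unsigned int)ncpus;

	if (requested < 1)
		result = 1;
	else
		result = requested < limit ? requested : limit;

	/* postcondition: always a usable queue-pair count */
	assert(result >= 1 && result <= RTE_PMD_PACKET_MAX_RINGS);
	return result;
}

/* Convenience wrapper using the number of CPUs currently online. */
static unsigned int
clamp_qpairs_online(unsigned int requested)
{
	return clamp_qpairs(requested, sysconf(_SC_NPROCESSORS_ONLN));
}
```

With qpairs=16 on a 4-CPU box this would yield 4 queue pairs.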

It would be cool to hack the DPDK memory management to work directly
out of the mmap'ed AF_PACKET buffers.  But at this point I don't
have enough knowledge of DPDK internals to know if that is at all
reasonable...
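
For reference, the layout of those mmap'ed buffers as the patch sets
them up can be sketched as below; any zero-copy scheme would presumably
have to hand out mbuf data at these offsets.  The struct and function
names are invented for illustration, and the frame offset is the same
i * framesize shortcut the patch uses, which assumes the block size is
a multiple of the frame size:

```c
#include <assert.h>
#include <stddef.h>

struct ring_geometry {
	unsigned int blockcount;	/* what ends up in tp_block_nr */
	size_t ring_bytes;		/* bytes in one ring (Rx or Tx) */
	size_t map_bytes;		/* bytes mmap'ed per socket: Rx ring + Tx ring */
};

/* Returns 0 on success, -1 if the parameters would be rejected. */
static int
ring_geometry_init(struct ring_geometry *g, unsigned int blocksize,
		   unsigned int framesize, unsigned int framecount)
{
	assert(g != NULL);

	if (framesize == 0 || framesize > blocksize)
		return -1;

	/* same arithmetic as rte_eth_from_packet() in the patch */
	g->blockcount = framecount / (blocksize / framesize);
	if (g->blockcount == 0)
		return -1;

	g->ring_bytes = (size_t)blocksize * g->blockcount;
	g->map_bytes = 2 * g->ring_bytes;
	return 0;
}

/* Byte offset of frame i within the mapping; the Tx ring follows the Rx ring. */
static size_t
frame_offset(const struct ring_geometry *g, unsigned int framesize,
	     unsigned int i, int is_tx)
{
	return (is_tx ? g->ring_bytes : 0) + (size_t)i * framesize;
}
```

With the defaults (blocksz 4096, framesz 2048, framecnt 512) that is
256 blocks, 1 MiB per ring and 2 MiB mapped per queue pair.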

John

P.S.  Danny, have you run any performance tests on the PCAP driver?

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-15 12:17     ` Neil Horman
  2014-07-15 14:01       ` John W. Linville
@ 2014-07-15 15:34       ` Zhou, Danny
  1 sibling, 0 replies; 76+ messages in thread
From: Zhou, Danny @ 2014-07-15 15:34 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev


> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Tuesday, July 15, 2014 8:18 PM
> To: Zhou, Danny
> Cc: John W. Linville; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for
> AF_PACKET-based virtual devices
> 
> On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > According to my performance measurement results for 64B small
> > packet, 1 queue perf. is better than 16 queues (1.35M pps vs. 0.93M
> > pps) which make sense to me as for 16 queues case more CPU cycles (16
> > queues' 87% vs. 1 queue' 80%) in kernel land needed for NAPI-enabled
> > ixgbe driver to switch between polling and interrupt modes in order
> > to service per-queue rx interrupts, so more context switch overhead
> > involved. Also, since the eth_packet_rx/eth_packet_tx routines involves
> > in two memory copies between DPDK mbuf and pbuf for each packet,
> > it can hardly achieve high performance unless packet are directly
> > DMA to mbuf which needs ixgbe driver to support.
> 
> I thought 16 queues would be spread out between as many cpus as you had though,
> obviating the need for context switches, no?
> Neil
> 

If you set the per-queue MSI-X interrupt affinities to different CPUs, then
performance would be much better and linear scaling is expected. But to make
an apples-to-apples comparison against the 1-queue case on a single core, I
left the default in place, where all interrupts are handled by one core, say
core0. So I think the large number of context switches on that core is what
hurts performance.
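
For anyone who wants to try the multi-CPU case, here is a minimal sketch of
pinning one IRQ to one CPU through the standard Linux procfs interface
(/proc/irq/<irq>/smp_affinity takes a hex CPU bitmask). The helper names
format_affinity_mask() and set_irq_affinity() are made up for illustration
and are not part of the patch or of DPDK. The sketch assumes CPU numbers
below 32, and the per-queue IRQ numbers would have to be looked up
separately (for example in /proc/interrupts). Writing the file needs root.

```c
#include <stdio.h>
#include <string.h>

/*
 * Format the hex bitmask that selects a single CPU.
 * Sketch only: handles cpu < 32 (one 32-bit mask word).
 * Returns 0 on success, -1 if cpu is out of range or buf is too small.
 */
int
format_affinity_mask(unsigned int cpu, char *buf, size_t len)
{
	int n;

	if (cpu >= 32)
		return -1;
	n = snprintf(buf, len, "%x", 1u << cpu);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: pin one IRQ to one CPU by writing the mask to
 * /proc/irq/<irq>/smp_affinity.  Returns 0 on success, -1 on any error.
 */
int
set_irq_affinity(unsigned int irq, unsigned int cpu)
{
	char path[64], mask[16];
	FILE *f;
	int rc;

	if (format_affinity_mask(cpu, mask, sizeof(mask)) < 0)
		return -1;
	snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);
	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	rc = (fputs(mask, f) == EOF) ? -1 : 0;
	if (fclose(f) == EOF)
		rc = -1;
	return rc;
}
```

Calling set_irq_affinity(irq_of_queue_n, n) for each rx queue would spread
the per-queue interrupts across cores. Whether that actually gives linear
scaling with this PMD is untested here.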

> >
> > > -----Original Message-----
> > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > Sent: Tuesday, July 15, 2014 2:25 AM
> > > To: dev@dpdk.org
> > > Cc: Thomas Monjalon; Richardson, Bruce; Zhou, Danny
> > > Subject: [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based
> > > virtual devices
> > >
> > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > socket.  This implementation uses mmap'ed ring buffers to limit
> > > copying and user/kernel transitions.  The PACKET_FANOUT_HASH
> > > behavior of AF_PACKET is used for frame reception.  In the current
> > > implementation, Tx and Rx queues are always paired, and therefore
> > > are always equal in number -- changing this would be a Simple Matter Of
> Programming.
> > >
> > > Interfaces of this type are created with a command line option like
> > > "--vdev=eth_packet0,iface=...".  There are a number of options
> > > > available as arguments:
> > >
> > >  - Interface is chosen by "iface" (required)
> > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default:
> > > 4096)
> > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default:
> > > 2048)
> > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default:
> > > 512)
> > >
> > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > ---
> > > This PMD is intended to provide a means for using DPDK on a broad
> > > range of hardware without hardware-specific PMDs and (hopefully)
> > > with better performance than what PCAP offers in Linux.  This might
> > > be useful as a development platform for DPDK applications when
> DPDK-supported hardware is expensive or unavailable.
> > >
> > > New in v2:
> > >
> > > -- fixup some style issues found by check patch
> > > -- use if_index as part of fanout group ID
> > > -- set default number of queue pairs to 1
> > >
> > >  config/common_bsdapp                   |   5 +
> > >  config/common_linuxapp                 |   5 +
> > >  lib/Makefile                           |   1 +
> > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > >  lib/librte_pmd_packet/rte_eth_packet.c | 826
> > > +++++++++++++++++++++++++++++++++
> > > lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > >  mk/rte.app.mk                          |   4 +
> > >  8 files changed, 957 insertions(+)
> > >  create mode 100644 lib/librte_pmd_packet/Makefile  create mode
> > > 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > >
> > > diff --git a/config/common_bsdapp b/config/common_bsdapp index
> > > 943dce8f1ede..c317f031278e 100644
> > > --- a/config/common_bsdapp
> > > +++ b/config/common_bsdapp
> > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > > CONFIG_RTE_LIBRTE_PMD_BOND=y
> > >
> > >  #
> > > +# Compile software PMD backed by AF_PACKET sockets (Linux only) #
> > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > +
> > > +#
> > >  # Do prefetch of packet data within PMD driver receive function  #
> > > CONFIG_RTE_PMD_PACKET_PREFETCH=y diff --git
> a/config/common_linuxapp
> > > b/config/common_linuxapp index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > --- a/config/common_linuxapp
> > > +++ b/config/common_linuxapp
> > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > > CONFIG_RTE_LIBRTE_PMD_BOND=y
> > >
> > >  #
> > > +# Compile software PMD backed by AF_PACKET sockets (Linux only) #
> > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > +
> > > +#
> > >  # Compile Xen PMD
> > >  #
> > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > diff --git a/lib/Makefile b/lib/Makefile index
> > > 10c5bb3045bc..930fadf29898 100644
> > > --- a/lib/Makefile
> > > +++ b/lib/Makefile
> > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) +=
> > > librte_pmd_i40e
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt diff
> > > --git a/lib/librte_eal/linuxapp/eal/Makefile
> > > b/lib/librte_eal/linuxapp/eal/Makefile
> > > index 756d6b0c9301..feed24a63272 100644
> > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether  CFLAGS +=
> > > -I$(RTE_SDK)/lib/librte_ivshmem  CFLAGS +=
> > > -I$(RTE_SDK)/lib/librte_pmd_ring CFLAGS +=
> > > -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > >  CFLAGS += $(WERROR_FLAGS) -O3
> > >
> > > diff --git a/lib/librte_pmd_packet/Makefile
> > > b/lib/librte_pmd_packet/Makefile new file mode 100644 index
> > > 000000000000..e1266fb992cd
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/Makefile
> > > @@ -0,0 +1,60 @@
> > > +#   BSD LICENSE
> > > +#
> > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > +#   Copyright(c) 2014 6WIND S.A.
> > > +#   All rights reserved.
> > > +#
> > > +#   Redistribution and use in source and binary forms, with or without
> > > +#   modification, are permitted provided that the following conditions
> > > +#   are met:
> > > +#
> > > +#     * Redistributions of source code must retain the above copyright
> > > +#       notice, this list of conditions and the following disclaimer.
> > > +#     * Redistributions in binary form must reproduce the above copyright
> > > +#       notice, this list of conditions and the following disclaimer in
> > > +#       the documentation and/or other materials provided with the
> > > +#       distribution.
> > > +#     * Neither the name of Intel Corporation nor the names of its
> > > +#       contributors may be used to endorse or promote products derived
> > > +#       from this software without specific prior written permission.
> > > +#
> > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> > > CONTRIBUTORS
> > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
> BUT
> > > NOT
> > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
> > > FITNESS FOR
> > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> > > COPYRIGHT
> > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
> > > INCIDENTAL,
> > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
> BUT
> > > NOT
> > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
> > > LOSS OF USE,
> > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
> CAUSED
> > > AND ON ANY
> > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
> TORT
> > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
> OF
> > > THE USE
> > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> > > DAMAGE.
> > > +
> > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > +
> > > +#
> > > +# library name
> > > +#
> > > +LIB = librte_pmd_packet.a
> > > +
> > > +CFLAGS += -O3
> > > +CFLAGS += $(WERROR_FLAGS)
> > > +
> > > +#
> > > +# all source are stored in SRCS-y
> > > +#
> > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > +
> > > +#
> > > +# Export include files
> > > +#
> > > +SYMLINK-y-include += rte_eth_packet.h
> > > +
> > > +# this lib depends upon:
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > +
> > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c
> > > b/lib/librte_pmd_packet/rte_eth_packet.c
> > > new file mode 100644
> > > index 000000000000..9c82d16e730f
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > @@ -0,0 +1,826 @@
> > > +/*-
> > > + *   BSD LICENSE
> > > + *
> > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > + *
> > > + *   Originally based upon librte_pmd_pcap code:
> > > + *
> > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > + *   Copyright(c) 2014 6WIND S.A.
> > > + *   All rights reserved.
> > > + *
> > > + *   Redistribution and use in source and binary forms, with or without
> > > + *   modification, are permitted provided that the following conditions
> > > + *   are met:
> > > + *
> > > + *     * Redistributions of source code must retain the above copyright
> > > + *       notice, this list of conditions and the following disclaimer.
> > > + *     * Redistributions in binary form must reproduce the above copyright
> > > + *       notice, this list of conditions and the following disclaimer in
> > > + *       the documentation and/or other materials provided with the
> > > + *       distribution.
> > > + *     * Neither the name of Intel Corporation nor the names of its
> > > + *       contributors may be used to endorse or promote products derived
> > > + *       from this software without specific prior written permission.
> > > + *
> > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> > > CONTRIBUTORS
> > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
> BUT
> > > NOT
> > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
> > > FITNESS FOR
> > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> > > COPYRIGHT
> > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
> > > INCIDENTAL,
> > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
> BUT
> > > NOT
> > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
> > > LOSS OF USE,
> > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
> CAUSED
> > > AND ON ANY
> > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
> > > TORT
> > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
> OF
> > > THE USE
> > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> > > DAMAGE.
> > > + */
> > > +
> > > +#include <rte_mbuf.h>
> > > +#include <rte_ethdev.h>
> > > +#include <rte_malloc.h>
> > > +#include <rte_kvargs.h>
> > > +#include <rte_dev.h>
> > > +
> > > +#include <linux/if_ether.h>
> > > +#include <linux/if_packet.h>
> > > +#include <arpa/inet.h>
> > > +#include <net/if.h>
> > > +#include <sys/types.h>
> > > +#include <sys/socket.h>
> > > +#include <sys/ioctl.h>
> > > +#include <sys/mman.h>
> > > +#include <unistd.h>
> > > +#include <poll.h>
> > > +
> > > +#include "rte_eth_packet.h"
> > > +
> > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > +
> > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > +
> > > +struct pkt_rx_queue {
> > > +	int sockfd;
> > > +
> > > +	struct iovec *rd;
> > > +	uint8_t *map;
> > > +	unsigned int framecount;
> > > +	unsigned int framenum;
> > > +
> > > +	struct rte_mempool *mb_pool;
> > > +
> > > +	volatile unsigned long rx_pkts;
> > > +	volatile unsigned long err_pkts;
> > > +};
> > > +
> > > +struct pkt_tx_queue {
> > > +	int sockfd;
> > > +
> > > +	struct iovec *rd;
> > > +	uint8_t *map;
> > > +	unsigned int framecount;
> > > +	unsigned int framenum;
> > > +
> > > +	volatile unsigned long tx_pkts;
> > > +	volatile unsigned long err_pkts;
> > > +};
> > > +
> > > +struct pmd_internals {
> > > +	unsigned nb_queues;
> > > +
> > > +	int if_index;
> > > +	struct ether_addr eth_addr;
> > > +
> > > +	struct tpacket_req req;
> > > +
> > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > +};
> > > +
> > > +static const char *valid_arguments[] = {
> > > +	ETH_PACKET_IFACE_ARG,
> > > +	ETH_PACKET_NUM_Q_ARG,
> > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > +	NULL
> > > +};
> > > +
> > > +static const char *drivername = "AF_PACKET PMD";
> > > +
> > > +static struct rte_eth_link pmd_link = {
> > > +	.link_speed = 10000,
> > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > +	.link_status = 0
> > > +};
> > > +
> > > +static uint16_t
> > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) {
> > > +	unsigned i;
> > > +	struct tpacket2_hdr *ppd;
> > > +	struct rte_mbuf *mbuf;
> > > +	uint8_t *pbuf;
> > > +	struct pkt_rx_queue *pkt_q = queue;
> > > +	uint16_t num_rx = 0;
> > > +	unsigned int framecount, framenum;
> > > +
> > > +	if (unlikely(nb_pkts == 0))
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > +	 */
> > > +	framecount = pkt_q->framecount;
> > > +	framenum = pkt_q->framenum;
> > > +	for (i = 0; i < nb_pkts; i++) {
> > > +		/* point at the next incoming frame */
> > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > +			break;
> > > +
> > > +		/* allocate the next mbuf */
> > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > +		if (unlikely(mbuf == NULL))
> > > +			break;
> > > +
> > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > +
> > > +		/* release incoming frame and advance ring buffer */
> > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > +		if (++framenum >= framecount)
> > > +			framenum = 0;
> > > +
> > > +		/* account for the receive frame */
> > > +		bufs[i] = mbuf;
> > > +		num_rx++;
> > > +	}
> > > +	pkt_q->framenum = framenum;
> > > +	pkt_q->rx_pkts += num_rx;
> > > +	return num_rx;
> > > +}
> > > +
> > > +/*
> > > + * Callback to handle sending packets through a real NIC.
> > > + */
> > > +static uint16_t
> > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) {
> > > +	struct tpacket2_hdr *ppd;
> > > +	struct rte_mbuf *mbuf;
> > > +	uint8_t *pbuf;
> > > +	unsigned int framecount, framenum;
> > > +	struct pollfd pfd;
> > > +	struct pkt_tx_queue *pkt_q = queue;
> > > +	uint16_t num_tx = 0;
> > > +	int i;
> > > +
> > > +	if (unlikely(nb_pkts == 0))
> > > +		return 0;
> > > +
> > > +	memset(&pfd, 0, sizeof(pfd));
> > > +	pfd.fd = pkt_q->sockfd;
> > > +	pfd.events = POLLOUT;
> > > +	pfd.revents = 0;
> > > +
> > > +	framecount = pkt_q->framecount;
> > > +	framenum = pkt_q->framenum;
> > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +	for (i = 0; i < nb_pkts; i++) {
> > > +		/* point at the next incoming frame */
> > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > +		    (poll(&pfd, 1, -1) < 0))
> > > +				continue;
> > > +
> > > +		/* copy the tx frame data */
> > > +		mbuf = bufs[num_tx];
> > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > +			sizeof(struct sockaddr_ll);
> > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > +
> > > +		/* release incoming frame and advance ring buffer */
> > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > +		if (++framenum >= framecount)
> > > +			framenum = 0;
> > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +
> > > +		num_tx++;
> > > +		rte_pktmbuf_free(mbuf);
> > > +	}
> > > +
> > > +	/* kick-off transmits */
> > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > +
> > > +	pkt_q->framenum = framenum;
> > > +	pkt_q->tx_pkts += num_tx;
> > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > +	return num_tx;
> > > +}
> > > +
> > > +static int
> > > +eth_dev_start(struct rte_eth_dev *dev) {
> > > +	dev->data->dev_link.link_status = 1;
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * This function gets called when the current port gets stopped.
> > > + */
> > > +static void
> > > +eth_dev_stop(struct rte_eth_dev *dev) {
> > > +	unsigned i;
> > > +	int sockfd;
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > +		sockfd = internals->rx_queue[i].sockfd;
> > > +		if (sockfd != -1)
> > > +			close(sockfd);
> > > +		sockfd = internals->tx_queue[i].sockfd;
> > > +		if (sockfd != -1)
> > > +			close(sockfd);
> > > +	}
> > > +
> > > +	dev->data->dev_link.link_status = 0; }
> > > +
> > > +static int
> > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused) {
> > > +	return 0;
> > > +}
> > > +
> > > +static void
> > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info
> > > +*dev_info) {
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	dev_info->driver_name = drivername;
> > > +	dev_info->if_index = internals->if_index;
> > > +	dev_info->max_mac_addrs = 1;
> > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > +	dev_info->min_rx_bufsize = 0;
> > > +	dev_info->pci_dev = NULL;
> > > +}
> > > +
> > > +static void
> > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats
> > > +*igb_stats) {
> > > +	unsigned i, imax;
> > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > +
> > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > +
> > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > +	for (i = 0; i < imax; i++) {
> > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > +		rx_total += igb_stats->q_ipackets[i];
> > > +	}
> > > +
> > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > +	for (i = 0; i < imax; i++) {
> > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > +		tx_total += igb_stats->q_opackets[i];
> > > +		tx_err_total += igb_stats->q_errors[i];
> > > +	}
> > > +
> > > +	igb_stats->ipackets = rx_total;
> > > +	igb_stats->opackets = tx_total;
> > > +	igb_stats->oerrors = tx_err_total; }
> > > +
> > > +static void
> > > +eth_stats_reset(struct rte_eth_dev *dev) {
> > > +	unsigned i;
> > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > +
> > > +	for (i = 0; i < internal->nb_queues; i++)
> > > +		internal->rx_queue[i].rx_pkts = 0;
> > > +
> > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > +		internal->tx_queue[i].tx_pkts = 0;
> > > +		internal->tx_queue[i].err_pkts = 0;
> > > +	}
> > > +}
> > > +
> > > +static void
> > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused) { }
> > > +
> > > +static void
> > > +eth_queue_release(void *q __rte_unused) { }
> > > +
> > > +static int
> > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > +                int wait_to_complete __rte_unused) {
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > +                   uint16_t rx_queue_id,
> > > +                   uint16_t nb_rx_desc __rte_unused,
> > > +                   unsigned int socket_id __rte_unused,
> > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > +                   struct rte_mempool *mb_pool) {
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > +	uint16_t buf_size;
> > > +
> > > +	pkt_q->mb_pool = mb_pool;
> > > +
> > > +	/* Now get the space available for data in the mbuf */
> > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > +	                       RTE_PKTMBUF_HEADROOM);
> > > +
> > > +	if (ETH_FRAME_LEN > buf_size) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > +                   uint16_t tx_queue_id,
> > > +                   uint16_t nb_tx_desc __rte_unused,
> > > +                   unsigned int socket_id __rte_unused,
> > > +                   const struct rte_eth_txconf *tx_conf
> > > +__rte_unused) {
> > > +
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > +	return 0;
> > > +}
> > > +
> > > +static struct eth_dev_ops ops = {
> > > +	.dev_start = eth_dev_start,
> > > +	.dev_stop = eth_dev_stop,
> > > +	.dev_close = eth_dev_close,
> > > +	.dev_configure = eth_dev_configure,
> > > +	.dev_infos_get = eth_dev_info,
> > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > +	.rx_queue_release = eth_queue_release,
> > > +	.tx_queue_release = eth_queue_release,
> > > +	.link_update = eth_link_update,
> > > +	.stats_get = eth_stats_get,
> > > +	.stats_reset = eth_stats_reset,
> > > +};
> > > +
> > > +/*
> > > + * Opens an AF_PACKET socket
> > > + */
> > > +static int
> > > +open_packet_iface(const char *key __rte_unused,
> > > +                  const char *value __rte_unused,
> > > +                  void *extra_args) {
> > > +	int *sockfd = extra_args;
> > > +
> > > +	/* Open an AF_PACKET socket... */
> > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > +	if (*sockfd == -1) {
> > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > +		return -1;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +rte_pmd_init_internals(const char *name,
> > > +                       const int sockfd,
> > > +                       const unsigned nb_queues,
> > > +                       unsigned int blocksize,
> > > +                       unsigned int blockcnt,
> > > +                       unsigned int framesize,
> > > +                       unsigned int framecnt,
> > > +                       const unsigned numa_node,
> > > +                       struct pmd_internals **internals,
> > > +                       struct rte_eth_dev **eth_dev,
> > > +                       struct rte_kvargs *kvlist) {
> > > +	struct rte_eth_dev_data *data = NULL;
> > > +	struct rte_pci_device *pci_dev = NULL;
> > > +	struct rte_kvargs_pair *pair = NULL;
> > > +	struct ifreq ifr;
> > > +	size_t ifnamelen;
> > > +	unsigned k_idx;
> > > +	struct sockaddr_ll sockaddr;
> > > +	struct tpacket_req *req;
> > > +	struct pkt_rx_queue *rx_queue;
> > > +	struct pkt_tx_queue *tx_queue;
> > > +	int rc, tpver, discard, bypass;
> > > +	unsigned int i, q, rdsize;
> > > +	int qsockfd, fanout_arg;
> > > +
> > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > +		pair = &kvlist->pairs[k_idx];
> > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > +			break;
> > > +	}
> > > +	if (pair == NULL) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +
> > > +	RTE_LOG(INFO, PMD,
> > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > +		name, numa_node);
> > > +
> > > +	/*
> > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > +	 * and internal (private) data
> > > +	 */
> > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > +	if (data == NULL)
> > > +		goto error;
> > > +
> > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > +	if (pci_dev == NULL)
> > > +		goto error;
> > > +
> > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > +	                                0, numa_node);
> > > +	if (*internals == NULL)
> > > +		goto error;
> > > +
> > > +	req = &((*internals)->req);
> > > +
> > > +	req->tp_block_size = blocksize;
> > > +	req->tp_block_nr = blockcnt;
> > > +	req->tp_frame_size = framesize;
> > > +	req->tp_frame_nr = framecnt;
> > > +
> > > +	ifnamelen = strlen(pair->value);
> > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > +	} else {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: I/F name too long (%s)\n",
> > > +			name, pair->value);
> > > +		goto error;
> > > +	}
> > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > +
> > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > +
> > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > +	sockaddr.sll_family = AF_PACKET;
> > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > +
> > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > +	fanout_arg |= (PACKET_FANOUT_HASH |
> PACKET_FANOUT_FLAG_DEFRAG |
> > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > +
> > > +	for (q = 0; q < nb_queues; q++) {
> > > +		/* Open an AF_PACKET socket for this queue... */
> > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > +		if (qsockfd == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +			        "%s: could not open AF_PACKET socket\n",
> > > +			        name);
> > > +			return -1;
> > > +		}
> > > +
> > > +		tpver = TPACKET_V2;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > +				&tpver, sizeof(tpver));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		discard = 1;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > +				&discard, sizeof(discard));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_LOSS on "
> > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		bypass = 1;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > +				&bypass, sizeof(bypass));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > +			        "on AF_PACKET socket for %s\n", name,
> > > +			        pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req,
> > > sizeof(*req));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req,
> > > sizeof(*req));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > +		rx_queue->framecount = req->tp_frame_nr;
> > > +
> > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size *
> req->tp_block_nr,
> > > +				    PROT_READ | PROT_WRITE, MAP_SHARED |
> > > MAP_LOCKED,
> > > +				    qsockfd, 0);
> > > +		if (rx_queue->map == MAP_FAILED) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > +				name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		/* rdsize is same for both Tx and Rx */
> > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > +
> > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > +		}
> > > +		rx_queue->sockfd = qsockfd;
> > > +
> > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > +		tx_queue->framecount = req->tp_frame_nr;
> > > +
> > > +		tx_queue->map = rx_queue->map + req->tp_block_size *
> > > +req->tp_block_nr;
> > > +
> > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > +		}
> > > +		tx_queue->sockfd = qsockfd;
> > > +
> > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr,
> sizeof(sockaddr));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > +			        name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > +				&fanout_arg, sizeof(fanout_arg));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_FANOUT on AF_PACKET
> socket "
> > > +				"for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +	}
> > > +
> > > +	/* reserve an ethdev entry */
> > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > +	if (*eth_dev == NULL)
> > > +		goto error;
> > > +
> > > +	/*
> > > +	 * now put it all together
> > > +	 * - store queue data in internals,
> > > +	 * - store numa_node info in pci_driver
> > > +	 * - point eth_dev_data to internals and pci_driver
> > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > +	 */
> > > +
> > > +	(*internals)->nb_queues = nb_queues;
> > > +
> > > +	data->dev_private = *internals;
> > > +	data->port_id = (*eth_dev)->data->port_id;
> > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > +	data->dev_link = pmd_link;
> > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > +
> > > +	pci_dev->numa_node = numa_node;
> > > +
> > > +	(*eth_dev)->data = data;
> > > +	(*eth_dev)->dev_ops = &ops;
> > > +	(*eth_dev)->pci_dev = pci_dev;
> > > +
> > > +	return 0;
> > > +
> > > +error:
> > > +	if (data)
> > > +		rte_free(data);
> > > +	if (pci_dev)
> > > +		rte_free(pci_dev);
> > > +	for (q = 0; q < nb_queues; q++) {
> > > +		if ((*internals)->rx_queue[q].rd)
> > > +			rte_free((*internals)->rx_queue[q].rd);
> > > +		if ((*internals)->tx_queue[q].rd)
> > > +			rte_free((*internals)->tx_queue[q].rd);
> > > +	}
> > > +	if (*internals)
> > > +		rte_free(*internals);
> > > +	return -1;
> > > +}
> > > +
> > > +static int
> > > +rte_eth_from_packet(const char *name,
> > > +                    int const *sockfd,
> > > +                    const unsigned numa_node,
> > > +                    struct rte_kvargs *kvlist) {
> > > +	struct pmd_internals *internals = NULL;
> > > +	struct rte_eth_dev *eth_dev = NULL;
> > > +	struct rte_kvargs_pair *pair = NULL;
> > > +	unsigned k_idx;
> > > +	unsigned int blockcount;
> > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > +	unsigned int qpairs = 1;
> > > +
> > > +	/* do some parameter checking */
> > > +	if (*sockfd < 0)
> > > +		return -1;
> > > +
> > > +	/*
> > > +	 * Walk arguments for configurable settings
> > > +	 */
> > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > +		pair = &kvlist->pairs[k_idx];
> > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > +			qpairs = atoi(pair->value);
> > > +			if (qpairs < 1 ||
> > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid qpairs value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > +			blocksize = atoi(pair->value);
> > > +			if (!blocksize) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid blocksize value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > +			framesize = atoi(pair->value);
> > > +			if (!framesize) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid framesize value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > +			framecount = atoi(pair->value);
> > > +			if (!framecount) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid framecount value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +	}
> > > +
> > > +	if (framesize > blocksize) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > +		        name);
> > > +		return -1;
> > > +	}
> > > +
> > > +	blockcount = framecount / (blocksize / framesize);
> > > +	if (!blockcount) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > +		return -1;
> > > +	}
> > > +
> > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > +
> > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > +	                           blocksize, blockcount,
> > > +	                           framesize, framecount,
> > > +	                           numa_node, &internals, &eth_dev,
> > > +	                           kvlist) < 0)
> > > +		return -1;
> > > +
> > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int
> > > +rte_pmd_packet_devinit(const char *name, const char *params) {
> > > +	unsigned numa_node;
> > > +	int ret;
> > > +	struct rte_kvargs *kvlist;
> > > +	int sockfd = -1;
> > > +
> > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > +
> > > +	numa_node = rte_socket_id();
> > > +
> > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > +	if (kvlist == NULL)
> > > +		return -1;
> > > +
> > > +	/*
> > > +	 * If iface argument is passed we open the NICs and use them for
> > > +	 * reading / writing
> > > +	 */
> > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > +
> > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > +		                         &open_packet_iface, &sockfd);
> > > +		if (ret < 0)
> > > +			return -1;
> > > +	}
> > > +
> > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > +	close(sockfd); /* no longer needed */
> > > +
> > > +	if (ret < 0)
> > > +		return -1;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static struct rte_driver pmd_packet_drv = {
> > > +	.name = "eth_packet",
> > > +	.type = PMD_VDEV,
> > > +	.init = rte_pmd_packet_devinit,
> > > +};
> > > +
> > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > new file mode 100644
> > > index 000000000000..f685611da3e9
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > @@ -0,0 +1,55 @@
> > > +/*-
> > > + *   BSD LICENSE
> > > + *
> > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > + *   All rights reserved.
> > > + *
> > > + *   Redistribution and use in source and binary forms, with or without
> > > + *   modification, are permitted provided that the following conditions
> > > + *   are met:
> > > + *
> > > + *     * Redistributions of source code must retain the above copyright
> > > + *       notice, this list of conditions and the following disclaimer.
> > > + *     * Redistributions in binary form must reproduce the above copyright
> > > + *       notice, this list of conditions and the following disclaimer in
> > > + *       the documentation and/or other materials provided with the
> > > + *       distribution.
> > > + *     * Neither the name of Intel Corporation nor the names of its
> > > + *       contributors may be used to endorse or promote products derived
> > > + *       from this software without specific prior written permission.
> > > + *
> > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > + */
> > > +
> > > +#ifndef _RTE_ETH_PACKET_H_
> > > +#define _RTE_ETH_PACKET_H_
> > > +
> > > +#ifdef __cplusplus
> > > +extern "C" {
> > > +#endif
> > > +
> > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > +
> > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > +
> > > +/**
> > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > + * configured on command line.
> > > + */
> > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > +
> > > +#ifdef __cplusplus
> > > +}
> > > +#endif
> > > +
> > > +#endif
> > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > index 34dff2a02a05..a6994c4dbe93 100644
> > > --- a/mk/rte.app.mk
> > > +++ b/mk/rte.app.mk
> > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > >  endif
> > >
> > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > +LDLIBS += -lrte_pmd_packet
> > > +endif
> > > +
> > >  endif # plugins
> > >
> > >  LDLIBS += $(EXECENV_LDLIBS)
> > > --
> > > 1.9.3
> >
> >


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-15 14:01       ` John W. Linville
@ 2014-07-15 15:40         ` Zhou, Danny
  2014-07-15 19:08           ` John W. Linville
  2014-07-15 20:31         ` Neil Horman
  1 sibling, 1 reply; 76+ messages in thread
From: Zhou, Danny @ 2014-07-15 15:40 UTC (permalink / raw)
  To: John W. Linville, Neil Horman; +Cc: dev


> -----Original Message-----
> From: John W. Linville [mailto:linville@tuxdriver.com]
> Sent: Tuesday, July 15, 2014 10:01 PM
> To: Neil Horman
> Cc: Zhou, Danny; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for
> AF_PACKET-based virtual devices
> 
> On Tue, Jul 15, 2014 at 08:17:44AM -0400, Neil Horman wrote:
> > On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > > According to my performance measurement results for 64B small
> > > packet, 1 queue perf. is better than 16 queues (1.35M pps vs. 0.93M
> > > pps) which make sense to me as for 16 queues case more CPU cycles
> > > (16 queues' 87% vs. 1 queue' 80%) in kernel land needed for
> > > NAPI-enabled ixgbe driver to switch between polling and interrupt
> > > modes in order to service per-queue rx interrupts, so more context
> > > switch overhead involved. Also, since the
> > > eth_packet_rx/eth_packet_tx routines involves in two memory copies
> > > between DPDK mbuf and pbuf for each packet, it can hardly achieve
> > > high performance unless packet are directly DMA to mbuf which needs ixgbe
> > > driver to support.
> >
> > I thought 16 queues would be spread out between as many cpus as you
> > had though, obviating the need for context switches, no?
> 
> I think Danny is testing the single CPU case.  Having more queues than CPUs
> probably does not provide any benefit.
> 
> It would be cool to hack the DPDK memory management to work directly out of the
> mmap'ed AF_PACKET buffers.  But at this point I don't have enough knowledge of
> DPDK internals to know if that is at all reasonable...
> 
> John
> 
> P.S.  Danny, have you run any performance tests on the PCAP driver?

No, I do not have PCAP driver performance results on hand, but I remember it being
less than 1M pps for 64B packets.

> 
> --
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-15 15:40         ` Zhou, Danny
@ 2014-07-15 19:08           ` John W. Linville
  0 siblings, 0 replies; 76+ messages in thread
From: John W. Linville @ 2014-07-15 19:08 UTC (permalink / raw)
  To: Zhou, Danny; +Cc: dev

On Tue, Jul 15, 2014 at 03:40:56PM +0000, Zhou, Danny wrote:
> 
> > -----Original Message-----
> > From: John W. Linville [mailto:linville@tuxdriver.com]
> > Sent: Tuesday, July 15, 2014 10:01 PM
> > To: Neil Horman
> > Cc: Zhou, Danny; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for
> > AF_PACKET-based virtual devices
> > 
> > On Tue, Jul 15, 2014 at 08:17:44AM -0400, Neil Horman wrote:
> > > On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > > > According to my performance measurement results for 64B small
> > > > packet, 1 queue perf. is better than 16 queues (1.35M pps vs. 0.93M
> > > > pps) which make sense to me as for 16 queues case more CPU cycles
> > > > (16 queues' 87% vs. 1 queue' 80%) in kernel land needed for
> > > > NAPI-enabled ixgbe driver to switch between polling and interrupt
> > > > modes in order to service per-queue rx interrupts, so more context
> > > > switch overhead involved. Also, since the
> > > > eth_packet_rx/eth_packet_tx routines involves in two memory copies
> > > > between DPDK mbuf and pbuf for each packet, it can hardly achieve
> > > > high performance unless packet are directly DMA to mbuf which needs ixgbe
> > > > driver to support.
> > >
> > > I thought 16 queues would be spread out between as many cpus as you
> > > had though, obviating the need for context switches, no?
> > 
> > I think Danny is testing the single CPU case.  Having more queues than CPUs
> > probably does not provide any benefit.
> > 
> > It would be cool to hack the DPDK memory management to work directly out of the
> > mmap'ed AF_PACKET buffers.  But at this point I don't have enough knowledge of
> > DPDK internals to know if that is at all reasonable...
> > 
> > John
> > 
> > P.S.  Danny, have you run any performance tests on the PCAP driver?
> 
> No, I do not have PCAP driver performance results in hand. But I remember it is less than
> 1M pps for 64B.

Cool, good info...thanks!

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-15 14:01       ` John W. Linville
  2014-07-15 15:40         ` Zhou, Danny
@ 2014-07-15 20:31         ` Neil Horman
  2014-07-15 20:41           ` Zhou, Danny
  1 sibling, 1 reply; 76+ messages in thread
From: Neil Horman @ 2014-07-15 20:31 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

On Tue, Jul 15, 2014 at 10:01:11AM -0400, John W. Linville wrote:
> On Tue, Jul 15, 2014 at 08:17:44AM -0400, Neil Horman wrote:
> > On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > > According to my performance measurement results for 64B small
> > > packet, 1 queue perf. is better than 16 queues (1.35M pps vs. 0.93M
> > > pps) which make sense to me as for 16 queues case more CPU cycles (16
> > > queues' 87% vs. 1 queue' 80%) in kernel land needed for NAPI-enabled
> > > ixgbe driver to switch between polling and interrupt modes in order
> > > to service per-queue rx interrupts, so more context switch overhead
> > > involved. Also, since the eth_packet_rx/eth_packet_tx routines involves
> > > in two memory copies between DPDK mbuf and pbuf for each packet,
> > > it can hardly achieve high performance unless packet are directly
> > > DMA to mbuf which needs ixgbe driver to support.
> > 
> > I thought 16 queues would be spread out between as many cpus as you had though,
> > obviating the need for context switches, no?
> 
> I think Danny is testing the single CPU case.  Having more queues
> than CPUs probably does not provide any benefit.
> 
Ah, yes, generally speaking, you never want nr_cpus < nr_queues.  Otherwise
you'll just be fighting yourself.
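
A minimal sketch of that rule, assuming the number of requested queue pairs
and the number of online CPUs are already known. The helper name
clamp_queues_to_cpus is hypothetical and is not part of the patch; it only
caps the queue count so that nr_cpus >= nr_queues holds.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the patch): cap the requested number of
 * queue pairs at the number of online CPUs, so nr_cpus >= nr_queues.
 */
static unsigned int
clamp_queues_to_cpus(unsigned int requested, long online_cpus)
{
	assert(requested >= 1);

	/* Treat an unknown or failed CPU count as a single CPU. */
	if (online_cpus < 1)
		online_cpus = 1;

	if ((unsigned long)requested > (unsigned long)online_cpus)
		return (unsigned int)online_cpus;
	return requested;
}
```

On Linux a caller could pass sysconf(_SC_NPROCESSORS_ONLN) as online_cpus;
with 16 requested queue pairs and 4 online CPUs the helper returns 4.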

> It would be cool to hack the DPDK memory management to work directly
> out of the mmap'ed AF_PACKET buffers.  But at this point I don't
> have enough knowledge of DPDK internals to know if that is at all
> reasonable...
> 
> John
> 
> P.S.  Danny, have you run any performance tests on the PCAP driver?
> 
> -- 
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.
> 


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-15 20:31         ` Neil Horman
@ 2014-07-15 20:41           ` Zhou, Danny
  0 siblings, 0 replies; 76+ messages in thread
From: Zhou, Danny @ 2014-07-15 20:41 UTC (permalink / raw)
  To: Neil Horman, John W. Linville; +Cc: dev


> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, July 16, 2014 4:31 AM
> To: John W. Linville
> Cc: Zhou, Danny; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for
> AF_PACKET-based virtual devices
> 
> On Tue, Jul 15, 2014 at 10:01:11AM -0400, John W. Linville wrote:
> > On Tue, Jul 15, 2014 at 08:17:44AM -0400, Neil Horman wrote:
> > > On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > > > According to my performance measurement results for 64B small
> > > > packet, 1 queue perf. is better than 16 queues (1.35M pps vs.
> > > > 0.93M
> > > > pps) which make sense to me as for 16 queues case more CPU cycles
> > > > (16 queues' 87% vs. 1 queue' 80%) in kernel land needed for
> > > > NAPI-enabled ixgbe driver to switch between polling and interrupt
> > > > modes in order to service per-queue rx interrupts, so more context
> > > > switch overhead involved. Also, since the
> > > > eth_packet_rx/eth_packet_tx routines involves in two memory copies
> > > > between DPDK mbuf and pbuf for each packet, it can hardly achieve
> > > > high performance unless packet are directly DMA to mbuf which needs ixgbe
> > > > driver to support.
> > >
> > > I thought 16 queues would be spread out between as many cpus as you
> > > had though, obviating the need for context switches, no?
> >
> > I think Danny is testing the single CPU case.  Having more queues than
> > CPUs probably does not provide any benefit.
> >
> Ah, yes, generally speaking, you never want nr_cpus < nr_queues.  Otherwise you'll
> just be fighting yourself.
> 

That is true for interrupt-based NIC drivers and for this AF_PACKET-based PMD, because it
depends on the kernel NIC driver. But with a poll-mode DPDK native NIC driver, you can have a
thread pinned to a core polling multiple queues on one NIC, or queues on different NICs, at the
cost of more power consumption and CPU cycles wasted busy-waiting for packets.
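
A rough sketch of the poll-mode pattern described above, written without any
DPDK dependency so that it stands alone. The names rx_burst_fn, polled_queue,
fake_rx_burst and poll_queues_once are hypothetical; in a real DPDK
application the receive callback would presumably wrap rte_eth_rx_burst().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical burst-receive callback. In a DPDK application this role
 * would be played by a wrapper around rte_eth_rx_burst().
 */
typedef uint16_t (*rx_burst_fn)(uint16_t port, uint16_t queue,
				void **pkts, uint16_t max_pkts);

/* One (port, queue) pair serviced by the polling core. */
struct polled_queue {
	uint16_t port;
	uint16_t queue;
};

/*
 * Stand-in receive function for illustration only: it pretends that
 * queue N always has N packets waiting, capped at max_pkts.
 */
static uint16_t
fake_rx_burst(uint16_t port, uint16_t queue, void **pkts, uint16_t max_pkts)
{
	(void)port;
	(void)pkts;
	return queue < max_pkts ? queue : max_pkts;
}

/*
 * One polling pass: visit every assigned queue in turn, whether or not it
 * has traffic, and return the total number of packets received.
 */
static unsigned int
poll_queues_once(const struct polled_queue *qs, size_t nb_qs,
		 rx_burst_fn rx, void **pkts, uint16_t burst)
{
	unsigned int total = 0;
	size_t i;

	assert(rx != NULL);
	for (i = 0; i < nb_qs; i++)
		total += rx(qs[i].port, qs[i].queue, pkts, burst);
	return total;
}
```

The core calls poll_queues_once() in a tight loop, which is where the wasted
cycles come from: every queue is visited on every pass even when it is empty.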

> > It would be cool to hack the DPDK memory management to work directly
> > out of the mmap'ed AF_PACKET buffers.  But at this point I don't have
> > enough knowledge of DPDK internals to know if that is at all
> > reasonable...
> >
> > John
> >
> > P.S.  Danny, have you run any performance tests on the PCAP driver?
> >
> > --
> > John W. Linville		Someday the world will need a hero, and you
> > linville@tuxdriver.com			might be all we have.  Be ready.
> >


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-14 13:46         ` John W. Linville
@ 2014-07-15 21:27           ` Thomas Monjalon
  2014-07-16 12:35             ` Neil Horman
  2014-07-16 14:07             ` John W. Linville
  0 siblings, 2 replies; 76+ messages in thread
From: Thomas Monjalon @ 2014-07-15 21:27 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

2014-07-14 09:46, John W. Linville:
> On Sat, Jul 12, 2014 at 12:34:46AM +0200, Thomas Monjalon wrote:
> > 2014-07-11 13:40, John W. Linville:
> > > Is there an example of code in DPDK that requires specific kernel
> > > versions?  What is the preferred method for coding such dependencies?
> > 
> > No there is no userspace code checking kernel version in DPDK.
> > Feel free to use what you think the best method.
> > Please keep in mind that checking version number is a maintenance
> > nightmare
> > because of backports (like RedHat do ;).
> 
> I suppose that it could be a configuration option?

If there is no other way to configure kernel-dependent features, we can add 
options. But I feel that relying on a macro (#ifdef) would be better if such a
macro exists.
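
As an illustration of the macro approach (a sketch only, not code from the
patch): PACKET_FANOUT and PACKET_FANOUT_HASH are preprocessor definitions in
<linux/if_packet.h>, so their presence can be tested at build time instead of
comparing kernel version numbers. The function name and the -1 fallback are
hypothetical.

```c
#include <assert.h>
#include <linux/if_packet.h>

/*
 * Build a PACKET_FANOUT option value, or return -1 when the kernel headers
 * used at build time do not define the fanout macros.
 * group_id is a hypothetical caller-chosen 16-bit fanout group identifier.
 */
static int
fanout_arg_or_unsupported(unsigned int group_id)
{
#if defined(PACKET_FANOUT) && defined(PACKET_FANOUT_HASH)
	assert(group_id <= 0xffff);
	/* Low 16 bits: group id; high 16 bits: fanout mode. */
	return (int)(group_id | ((unsigned int)PACKET_FANOUT_HASH << 16));
#else
	(void)group_id;
	return -1;
#endif
}
```

This only reflects the headers present at build time; a kernel that lacks the
feature at run time would still have to be detected from the setsockopt()
return value.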

-- 
Thomas


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-15 21:27           ` Thomas Monjalon
@ 2014-07-16 12:35             ` Neil Horman
  2014-07-16 13:37               ` Thomas Monjalon
  2014-07-16 14:07             ` John W. Linville
  1 sibling, 1 reply; 76+ messages in thread
From: Neil Horman @ 2014-07-16 12:35 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Tue, Jul 15, 2014 at 11:27:45PM +0200, Thomas Monjalon wrote:
> 2014-07-14 09:46, John W. Linville:
> > On Sat, Jul 12, 2014 at 12:34:46AM +0200, Thomas Monjalon wrote:
> > > 2014-07-11 13:40, John W. Linville:
> > > > Is there an example of code in DPDK that requires specific kernel
> > > > versions?  What is the preferred method for coding such dependencies?
> > > 
> > > No there is no userspace code checking kernel version in DPDK.
> > > Feel free to use what you think the best method.
> > > Please keep in mind that checking version number is a maintenance
> > > nightmare
> > > because of backports (like RedHat do ;).
> > 
Actually, I feel the need to correct this (I know you're being humorous, but
just the same).  You don't have a maintenance nightmare on your hands because
RedHat backports kernel features, you have a maintenance nightmare on your
hands because the DPDK uses kernel features that were never meant to be
directly accessed outside of kernel space.
Neil

> > I suppose that it could be a configuration option?
> 
> If there is no other way to configure kernel-dependent features, we can add 
> options. But I feel that relying on a macro (#ifdef) would be better if such 
> macro exist.
> 
> -- 
> Thomas
> 


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-16 12:35             ` Neil Horman
@ 2014-07-16 13:37               ` Thomas Monjalon
  0 siblings, 0 replies; 76+ messages in thread
From: Thomas Monjalon @ 2014-07-16 13:37 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

2014-07-16 08:35, Neil Horman:
> On Tue, Jul 15, 2014 at 11:27:45PM +0200, Thomas Monjalon wrote:
> > 2014-07-14 09:46, John W. Linville:
> > > On Sat, Jul 12, 2014 at 12:34:46AM +0200, Thomas Monjalon wrote:
> > > > 2014-07-11 13:40, John W. Linville:
> > > > > Is there an example of code in DPDK that requires specific kernel
> > > > > versions?  What is the preferred method for coding such
> > > > > dependencies?
> > > > 
> > > > No there is no userspace code checking kernel version in DPDK.
> > > > Feel free to use what you think the best method.
> > > > Please keep in mind that checking version number is a maintenance
> > > > nightmare
> > > > because of backports (like RedHat do ;).
> 
> Actually, I feel the need to correct this (I know you're being humorous, but
> just the same).  You don't have a maintenence nightmare on your hands
> because RedHat backports kernel features, you have a nightmare maintenece
> situation on your hands because the DPDK uses kernel features that were
> never meant to be directly accessed outside of kernel space.
> Neil

You're right. Removing kernel modules from DPDK is a nice goal.
But here we were speaking about a userland library (AF_PACKET PMD) which relies
on kernel features.

-- 
Thomas


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-15 21:27           ` Thomas Monjalon
  2014-07-16 12:35             ` Neil Horman
@ 2014-07-16 14:07             ` John W. Linville
  2014-07-16 14:26               ` Thomas Monjalon
  1 sibling, 1 reply; 76+ messages in thread
From: John W. Linville @ 2014-07-16 14:07 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Tue, Jul 15, 2014 at 11:27:45PM +0200, Thomas Monjalon wrote:
> 2014-07-14 09:46, John W. Linville:
> > On Sat, Jul 12, 2014 at 12:34:46AM +0200, Thomas Monjalon wrote:
> > > 2014-07-11 13:40, John W. Linville:
> > > > Is there an example of code in DPDK that requires specific kernel
> > > > versions?  What is the preferred method for coding such dependencies?
> > > 
> > > No there is no userspace code checking kernel version in DPDK.
> > > Feel free to use what you think the best method.
> > > Please keep in mind that checking version number is a maintenance
> > > nightmare
> > > because of backports (like RedHat do ;).
> > 
> > I suppose that it could be a configuration option?
> 
> If there is no other way to configure kernel-dependent features, we can add 
> options. But I feel that relying on a macro (#ifdef) would be better if such 
> macro exist.

I can add #ifdef or #if defined() for the newer definitions.  Is there
a minimum kernel version supported today?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-16 14:07             ` John W. Linville
@ 2014-07-16 14:26               ` Thomas Monjalon
  2014-07-16 15:59                 ` Shaw, Jeffrey B
  0 siblings, 1 reply; 76+ messages in thread
From: Thomas Monjalon @ 2014-07-16 14:26 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

2014-07-16 10:07, John W. Linville:
> On Tue, Jul 15, 2014 at 11:27:45PM +0200, Thomas Monjalon wrote:
> > 2014-07-14 09:46, John W. Linville:
> > > On Sat, Jul 12, 2014 at 12:34:46AM +0200, Thomas Monjalon wrote:
> > > > 2014-07-11 13:40, John W. Linville:
> > > > > Is there an example of code in DPDK that requires specific kernel
> > > > > versions?  What is the preferred method for coding such
> > > > > dependencies?
> > > > 
> > > > No there is no userspace code checking kernel version in DPDK.
> > > > Feel free to use what you think the best method.
> > > > Please keep in mind that checking version number is a maintenance
> > > > nightmare
> > > > because of backports (like RedHat do ;).
> > > 
> > > I suppose that it could be a configuration option?
> > 
> > If there is no other way to configure kernel-dependent features, we can
> > add
> > options. But I feel that relying on a macro (#ifdef) would be better if
> > such macro exist.
> 
> I can add #ifdef or #if defined() for the newer definitions.  Is there
> a minimum kernel version supported today?

2.6.32 is the minimum version.
But it's known to be easily usable since Linux 2.6.34.

-- 
Thomas


* Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-16 14:26               ` Thomas Monjalon
@ 2014-07-16 15:59                 ` Shaw, Jeffrey B
  0 siblings, 0 replies; 76+ messages in thread
From: Shaw, Jeffrey B @ 2014-07-16 15:59 UTC (permalink / raw)
  To: Thomas Monjalon, John W. Linville; +Cc: dev

2.6.32 is the minimum, but I believe it still needs patches to fix hugetlbfs issues.
I think the first kernel that has all the features we need, and doesn't require patches, is 2.6.33.6.
Jeff

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
Sent: Wednesday, July 16, 2014 7:27 AM
To: John W. Linville
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices

2014-07-16 10:07, John W. Linville:
> On Tue, Jul 15, 2014 at 11:27:45PM +0200, Thomas Monjalon wrote:
> > 2014-07-14 09:46, John W. Linville:
> > > On Sat, Jul 12, 2014 at 12:34:46AM +0200, Thomas Monjalon wrote:
> > > > 2014-07-11 13:40, John W. Linville:
> > > > > Is there an example of code in DPDK that requires specific 
> > > > > kernel versions?  What is the preferred method for coding such 
> > > > > dependencies?
> > > > 
> > > > No there is no userspace code checking kernel version in DPDK.
> > > > Feel free to use what you think the best method.
> > > > Please keep in mind that checking version number is a 
> > > > maintenance nightmare because of backports (like RedHat do ;).
> > > 
> > > I suppose that it could be a configuration option?
> > 
> > If there is no other way to configure kernel-dependent features, we 
> > can add options. But I feel that relying on a macro (#ifdef) would 
> > be better if such macro exist.
> 
> I can add #ifdef or #if defined() for the newer definitions.  Is there 
> a minimum kernel version supported today?

2.6.32 is the minimum version.
But it's known to be easily usable since Linux 2.6.34.

--
Thomas


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-07-14 18:24 ` [dpdk-dev] [PATCH v2] " John W. Linville
  2014-07-15  0:15   ` Zhou, Danny
@ 2014-09-12 18:05   ` John W. Linville
  2014-09-12 18:31     ` Zhou, Danny
  2014-09-16 20:16     ` Neil Horman
  1 sibling, 2 replies; 76+ messages in thread
From: John W. Linville @ 2014-09-12 18:05 UTC (permalink / raw)
  To: dev

Ping?  Are there objections to this patch from mid-July?

John

On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> socket.  This implementation uses mmap'ed ring buffers to limit copying
> and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> AF_PACKET is used for frame reception.  In the current implementation,
> Tx and Rx queues are always paired, and therefore are always equal
> in number -- changing this would be a Simple Matter Of Programming.
> 
> Interfaces of this type are created with a command line option like
> "--vdev=eth_packet0,iface=...".  There are a number of options available
> as arguments:
> 
>  - Interface is chosen by "iface" (required)
>  - Number of queue pairs set by "qpairs" (optional, default: 1)
>  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
>  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
>  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> 
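
For reference, the block count follows from the three ring options listed
above. The sketch below mirrors the arithmetic in the patch
(blockcount = framecount / (blocksize / framesize)); the function name
ring_block_count is hypothetical, and returning 0 stands in for the patch's
error paths.

```c
#include <assert.h>

/*
 * Derive the number of AF_PACKET mmap ring blocks from the block size,
 * frame size and frame count. Returns 0 when the parameters are unusable.
 * Hypothetical helper name; the arithmetic follows the patch.
 */
static unsigned int
ring_block_count(unsigned int blocksize, unsigned int framesize,
		 unsigned int framecount)
{
	/* Zero sizes are assumed to have been rejected during parsing. */
	assert(framesize > 0);

	/* A frame must fit inside a single block. */
	if (framesize > blocksize)
		return 0;

	/* Frames per block, then blocks needed for the requested frames. */
	return framecount / (blocksize / framesize);
}
```

With the defaults (blocksz 4096, framesz 2048, framecnt 512) each block holds
2 frames, so the ring uses 256 blocks.
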
> Signed-off-by: John W. Linville <linville@tuxdriver.com>
> ---
> This PMD is intended to provide a means for using DPDK on a broad
> range of hardware without hardware-specific PMDs and (hopefully)
> with better performance than what PCAP offers in Linux.  This might
> be useful as a development platform for DPDK applications when
> DPDK-supported hardware is expensive or unavailable.
> 
> New in v2:
> 
> -- fixup some style issues found by check patch
> -- use if_index as part of fanout group ID
> -- set default number of queue pairs to 1
> 
>  config/common_bsdapp                   |   5 +
>  config/common_linuxapp                 |   5 +
>  lib/Makefile                           |   1 +
>  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
>  lib/librte_pmd_packet/Makefile         |  60 +++
>  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
>  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
>  mk/rte.app.mk                          |   4 +
>  8 files changed, 957 insertions(+)
>  create mode 100644 lib/librte_pmd_packet/Makefile
>  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
>  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> 
> diff --git a/config/common_bsdapp b/config/common_bsdapp
> index 943dce8f1ede..c317f031278e 100644
> --- a/config/common_bsdapp
> +++ b/config/common_bsdapp
> @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
>  CONFIG_RTE_LIBRTE_PMD_BOND=y
>  
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> +#
> +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> +
> +#
>  # Do prefetch of packet data within PMD driver receive function
>  #
>  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> diff --git a/config/common_linuxapp b/config/common_linuxapp
> index 7bf5d80d4e26..f9e7bc3015ec 100644
> --- a/config/common_linuxapp
> +++ b/config/common_linuxapp
> @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
>  CONFIG_RTE_LIBRTE_PMD_BOND=y
>  
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> +#
> +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> +
> +#
>  # Compile Xen PMD
>  #
>  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> diff --git a/lib/Makefile b/lib/Makefile
> index 10c5bb3045bc..930fadf29898 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
>  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
>  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> index 756d6b0c9301..feed24a63272 100644
> --- a/lib/librte_eal/linuxapp/eal/Makefile
> +++ b/lib/librte_eal/linuxapp/eal/Makefile
> @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
>  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
>  CFLAGS += $(WERROR_FLAGS) -O3
>  
> diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> new file mode 100644
> index 000000000000..e1266fb992cd
> --- /dev/null
> +++ b/lib/librte_pmd_packet/Makefile
> @@ -0,0 +1,60 @@
> +#   BSD LICENSE
> +#
> +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> +#   Copyright(c) 2014 6WIND S.A.
> +#   All rights reserved.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Intel Corporation nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +#
> +# library name
> +#
> +LIB = librte_pmd_packet.a
> +
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +
> +#
> +# all source are stored in SRCS-y
> +#
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> +
> +#
> +# Export include files
> +#
> +SYMLINK-y-include += rte_eth_packet.h
> +
> +# this lib depends upon:
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> new file mode 100644
> index 000000000000..9c82d16e730f
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> @@ -0,0 +1,826 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> + *
> + *   Originally based upon librte_pmd_pcap code:
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   Copyright(c) 2014 6WIND S.A.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <rte_mbuf.h>
> +#include <rte_ethdev.h>
> +#include <rte_malloc.h>
> +#include <rte_kvargs.h>
> +#include <rte_dev.h>
> +
> +#include <linux/if_ether.h>
> +#include <linux/if_packet.h>
> +#include <arpa/inet.h>
> +#include <net/if.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +#include <poll.h>
> +
> +#include "rte_eth_packet.h"
> +
> +#define ETH_PACKET_IFACE_ARG		"iface"
> +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> +
> +#define DFLT_BLOCK_SIZE		(1 << 12)
> +#define DFLT_FRAME_SIZE		(1 << 11)
> +#define DFLT_FRAME_COUNT	(1 << 9)
> +
> +struct pkt_rx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	struct rte_mempool *mb_pool;
> +
> +	volatile unsigned long rx_pkts;
> +	volatile unsigned long err_pkts;
> +};
> +
> +struct pkt_tx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	volatile unsigned long tx_pkts;
> +	volatile unsigned long err_pkts;
> +};
> +
> +struct pmd_internals {
> +	unsigned nb_queues;
> +
> +	int if_index;
> +	struct ether_addr eth_addr;
> +
> +	struct tpacket_req req;
> +
> +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +};
> +
> +static const char *valid_arguments[] = {
> +	ETH_PACKET_IFACE_ARG,
> +	ETH_PACKET_NUM_Q_ARG,
> +	ETH_PACKET_BLOCKSIZE_ARG,
> +	ETH_PACKET_FRAMESIZE_ARG,
> +	ETH_PACKET_FRAMECOUNT_ARG,
> +	NULL
> +};
> +
> +static const char *drivername = "AF_PACKET PMD";
> +
> +static struct rte_eth_link pmd_link = {
> +	.link_speed = 10000,
> +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> +	.link_status = 0
> +};
> +
> +static uint16_t
> +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	unsigned i;
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	struct pkt_rx_queue *pkt_q = queue;
> +	uint16_t num_rx = 0;
> +	unsigned int framecount, framenum;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	/*
> +	 * Reads the given number of packets from the AF_PACKET socket one by
> +	 * one and copies the packet data into a newly allocated mbuf.
> +	 */
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* point at the next incoming frame */
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> +			break;
> +
> +		/* allocate the next mbuf */
> +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> +		if (unlikely(mbuf == NULL))
> +			break;
> +
> +		/* packet will fit in the mbuf, go ahead and receive it */
> +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> +
> +		/* release incoming frame and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_KERNEL;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +
> +		/* account for the receive frame */
> +		bufs[i] = mbuf;
> +		num_rx++;
> +	}
> +	pkt_q->framenum = framenum;
> +	pkt_q->rx_pkts += num_rx;
> +	return num_rx;
> +}
> +
> +/*
> + * Callback to handle sending packets through a real NIC.
> + */
> +static uint16_t
> +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	unsigned int framecount, framenum;
> +	struct pollfd pfd;
> +	struct pkt_tx_queue *pkt_q = queue;
> +	uint16_t num_tx = 0;
> +	int i;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	memset(&pfd, 0, sizeof(pfd));
> +	pfd.fd = pkt_q->sockfd;
> +	pfd.events = POLLOUT;
> +	pfd.revents = 0;
> +
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* point at the next incoming frame */
> +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> +		    (poll(&pfd, 1, -1) < 0))
> +				continue;
> +
> +		/* copy the tx frame data */
> +		mbuf = bufs[num_tx];
> +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> +			sizeof(struct sockaddr_ll);
> +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> +
> +		/* release incoming frame and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +
> +		num_tx++;
> +		rte_pktmbuf_free(mbuf);
> +	}
> +
> +	/* kick-off transmits */
> +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> +
> +	pkt_q->framenum = framenum;
> +	pkt_q->tx_pkts += num_tx;
> +	pkt_q->err_pkts += nb_pkts - num_tx;
> +	return num_tx;
> +}
> +
> +static int
> +eth_dev_start(struct rte_eth_dev *dev)
> +{
> +	dev->data->dev_link.link_status = 1;
> +	return 0;
> +}
> +
> +/*
> + * This function gets called when the current port gets stopped.
> + */
> +static void
> +eth_dev_stop(struct rte_eth_dev *dev)
> +{
> +	unsigned i;
> +	int sockfd;
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	for (i = 0; i < internals->nb_queues; i++) {
> +		sockfd = internals->rx_queue[i].sockfd;
> +		if (sockfd != -1)
> +			close(sockfd);
> +		sockfd = internals->tx_queue[i].sockfd;
> +		if (sockfd != -1)
> +			close(sockfd);
> +	}
> +
> +	dev->data->dev_link.link_status = 0;
> +}
> +
> +static int
> +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> +{
> +	return 0;
> +}
> +
> +static void
> +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> +{
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev_info->driver_name = drivername;
> +	dev_info->if_index = internals->if_index;
> +	dev_info->max_mac_addrs = 1;
> +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->min_rx_bufsize = 0;
> +	dev_info->pci_dev = NULL;
> +}
> +
> +static void
> +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> +{
> +	unsigned i, imax;
> +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> +	const struct pmd_internals *internal = dev->data->dev_private;
> +
> +	memset(igb_stats, 0, sizeof(*igb_stats));
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> +		rx_total += igb_stats->q_ipackets[i];
> +	}
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> +		tx_total += igb_stats->q_opackets[i];
> +		tx_err_total += igb_stats->q_errors[i];
> +	}
> +
> +	igb_stats->ipackets = rx_total;
> +	igb_stats->opackets = tx_total;
> +	igb_stats->oerrors = tx_err_total;
> +}
> +
> +static void
> +eth_stats_reset(struct rte_eth_dev *dev)
> +{
> +	unsigned i;
> +	struct pmd_internals *internal = dev->data->dev_private;
> +
> +	for (i = 0; i < internal->nb_queues; i++)
> +		internal->rx_queue[i].rx_pkts = 0;
> +
> +	for (i = 0; i < internal->nb_queues; i++) {
> +		internal->tx_queue[i].tx_pkts = 0;
> +		internal->tx_queue[i].err_pkts = 0;
> +	}
> +}
> +
> +static void
> +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> +{
> +}
> +
> +static void
> +eth_queue_release(void *q __rte_unused)
> +{
> +}
> +
> +static int
> +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> +                int wait_to_complete __rte_unused)
> +{
> +	return 0;
> +}
> +
> +static int
> +eth_rx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t rx_queue_id,
> +                   uint16_t nb_rx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> +                   struct rte_mempool *mb_pool)
> +{
> +	struct pmd_internals *internals = dev->data->dev_private;
> +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> +	struct rte_pktmbuf_pool_private *mbp_priv;
> +	uint16_t buf_size;
> +
> +	pkt_q->mb_pool = mb_pool;
> +
> +	/* Now get the space available for data in the mbuf */
> +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> +	                       RTE_PKTMBUF_HEADROOM);
> +
> +	if (ETH_FRAME_LEN > buf_size) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> +			dev->data->name, ETH_FRAME_LEN, buf_size);
> +		return -ENOMEM;
> +	}
> +
> +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> +
> +	return 0;
> +}
> +
> +static int
> +eth_tx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t tx_queue_id,
> +                   uint16_t nb_tx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> +{
> +
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> +	return 0;
> +}
> +
> +static struct eth_dev_ops ops = {
> +	.dev_start = eth_dev_start,
> +	.dev_stop = eth_dev_stop,
> +	.dev_close = eth_dev_close,
> +	.dev_configure = eth_dev_configure,
> +	.dev_infos_get = eth_dev_info,
> +	.rx_queue_setup = eth_rx_queue_setup,
> +	.tx_queue_setup = eth_tx_queue_setup,
> +	.rx_queue_release = eth_queue_release,
> +	.tx_queue_release = eth_queue_release,
> +	.link_update = eth_link_update,
> +	.stats_get = eth_stats_get,
> +	.stats_reset = eth_stats_reset,
> +};
> +
> +/*
> + * Opens an AF_PACKET socket
> + */
> +static int
> +open_packet_iface(const char *key __rte_unused,
> +                  const char *value __rte_unused,
> +                  void *extra_args)
> +{
> +	int *sockfd = extra_args;
> +
> +	/* Open an AF_PACKET socket... */
> +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +	if (*sockfd == -1) {
> +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +rte_pmd_init_internals(const char *name,
> +                       const int sockfd,
> +                       const unsigned nb_queues,
> +                       unsigned int blocksize,
> +                       unsigned int blockcnt,
> +                       unsigned int framesize,
> +                       unsigned int framecnt,
> +                       const unsigned numa_node,
> +                       struct pmd_internals **internals,
> +                       struct rte_eth_dev **eth_dev,
> +                       struct rte_kvargs *kvlist)
> +{
> +	struct rte_eth_dev_data *data = NULL;
> +	struct rte_pci_device *pci_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	struct ifreq ifr;
> +	size_t ifnamelen;
> +	unsigned k_idx;
> +	struct sockaddr_ll sockaddr;
> +	struct tpacket_req *req;
> +	struct pkt_rx_queue *rx_queue;
> +	struct pkt_tx_queue *tx_queue;
> +	int rc, tpver, discard, bypass;
> +	unsigned int i, q, rdsize;
> +	int qsockfd, fanout_arg;
> +
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> +			break;
> +	}
> +	if (pair == NULL) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: no interface specified for AF_PACKET ethdev\n",
> +		        name);
> +		goto error;
> +	}
> +
> +	RTE_LOG(INFO, PMD,
> +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> +		name, numa_node);
> +
> +	/*
> +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> +	 * and internal (private) data
> +	 */
> +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> +	if (data == NULL)
> +		goto error;
> +
> +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> +	if (pci_dev == NULL)
> +		goto error;
> +
> +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> +	                                0, numa_node);
> +	if (*internals == NULL)
> +		goto error;
> +
> +	req = &((*internals)->req);
> +
> +	req->tp_block_size = blocksize;
> +	req->tp_block_nr = blockcnt;
> +	req->tp_frame_size = framesize;
> +	req->tp_frame_nr = framecnt;
> +
> +	ifnamelen = strlen(pair->value);
> +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> +		ifr.ifr_name[ifnamelen] = '\0';
> +	} else {
> +		RTE_LOG(ERR, PMD,
> +			"%s: I/F name too long (%s)\n",
> +			name, pair->value);
> +		goto error;
> +	}
> +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> +		        name);
> +		goto error;
> +	}
> +	(*internals)->if_index = ifr.ifr_ifindex;
> +
> +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> +		        name);
> +		goto error;
> +	}
> +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> +
> +	memset(&sockaddr, 0, sizeof(sockaddr));
> +	sockaddr.sll_family = AF_PACKET;
> +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> +	sockaddr.sll_ifindex = (*internals)->if_index;
> +
> +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> +
> +	for (q = 0; q < nb_queues; q++) {
> +		/* Open an AF_PACKET socket for this queue... */
> +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +		if (qsockfd == -1) {
> +			RTE_LOG(ERR, PMD,
> +			        "%s: could not open AF_PACKET socket\n",
> +			        name);
> +			return -1;
> +		}
> +
> +		tpver = TPACKET_V2;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> +				&tpver, sizeof(tpver));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_VERSION on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		discard = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> +				&discard, sizeof(discard));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_LOSS on "
> +			        "AF_PACKET socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		bypass = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> +				&bypass, sizeof(bypass));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_QDISC_BYPASS "
> +			        "on AF_PACKET socket for %s\n", name,
> +			        pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rx_queue = &((*internals)->rx_queue[q]);
> +		rx_queue->framecount = req->tp_frame_nr;
> +
> +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> +				    qsockfd, 0);
> +		if (rx_queue->map == MAP_FAILED) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> +				name, pair->value);
> +			goto error;
> +		}
> +
> +		/* rdsize is same for both Tx and Rx */
> +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> +
> +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		rx_queue->sockfd = qsockfd;
> +
> +		tx_queue = &((*internals)->tx_queue[q]);
> +		tx_queue->framecount = req->tp_frame_nr;
> +
> +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> +
> +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		tx_queue->sockfd = qsockfd;
> +
> +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not bind AF_PACKET socket to %s\n",
> +			        name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> +				&fanout_arg, sizeof(fanout_arg));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> +				"for %s\n", name, pair->value);
> +			goto error;
> +		}
> +	}
> +
> +	/* reserve an ethdev entry */
> +	*eth_dev = rte_eth_dev_allocate(name);
> +	if (*eth_dev == NULL)
> +		goto error;
> +
> +	/*
> +	 * now put it all together
> +	 * - store queue data in internals,
> +	 * - store numa_node info in pci_driver
> +	 * - point eth_dev_data to internals and pci_driver
> +	 * - and point eth_dev structure to new eth_dev_data structure
> +	 */
> +
> +	(*internals)->nb_queues = nb_queues;
> +
> +	data->dev_private = *internals;
> +	data->port_id = (*eth_dev)->data->port_id;
> +	data->nb_rx_queues = (uint16_t)nb_queues;
> +	data->nb_tx_queues = (uint16_t)nb_queues;
> +	data->dev_link = pmd_link;
> +	data->mac_addrs = &(*internals)->eth_addr;
> +
> +	pci_dev->numa_node = numa_node;
> +
> +	(*eth_dev)->data = data;
> +	(*eth_dev)->dev_ops = &ops;
> +	(*eth_dev)->pci_dev = pci_dev;
> +
> +	return 0;
> +
> +error:
> +	if (data)
> +		rte_free(data);
> +	if (pci_dev)
> +		rte_free(pci_dev);
> +	for (q = 0; q < nb_queues; q++) {
> +		if ((*internals)->rx_queue[q].rd)
> +			rte_free((*internals)->rx_queue[q].rd);
> +		if ((*internals)->tx_queue[q].rd)
> +			rte_free((*internals)->tx_queue[q].rd);
> +	}
> +	if (*internals)
> +		rte_free(*internals);
> +	return -1;
> +}
> +
> +static int
> +rte_eth_from_packet(const char *name,
> +                    int const *sockfd,
> +                    const unsigned numa_node,
> +                    struct rte_kvargs *kvlist)
> +{
> +	struct pmd_internals *internals = NULL;
> +	struct rte_eth_dev *eth_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	unsigned k_idx;
> +	unsigned int blockcount;
> +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> +	unsigned int framesize = DFLT_FRAME_SIZE;
> +	unsigned int framecount = DFLT_FRAME_COUNT;
> +	unsigned int qpairs = 1;
> +
> +	/* do some parameter checking */
> +	if (*sockfd < 0)
> +		return -1;
> +
> +	/*
> +	 * Walk arguments for configurable settings
> +	 */
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> +			qpairs = atoi(pair->value);
> +			if (qpairs < 1 ||
> +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid qpairs value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> +			blocksize = atoi(pair->value);
> +			if (!blocksize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid blocksize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> +			framesize = atoi(pair->value);
> +			if (!framesize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framesize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> +			framecount = atoi(pair->value);
> +			if (!framecount) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framecount value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +	}
> +
> +	if (framesize > blocksize) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> +		        name);
> +		return -1;
> +	}
> +
> +	blockcount = framecount / (blocksize / framesize);
> +	if (!blockcount) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> +		return -1;
> +	}
> +
> +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> +
> +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> +	                           blocksize, blockcount,
> +	                           framesize, framecount,
> +	                           numa_node, &internals, &eth_dev,
> +	                           kvlist) < 0)
> +		return -1;
> +
> +	eth_dev->rx_pkt_burst = eth_packet_rx;
> +	eth_dev->tx_pkt_burst = eth_packet_tx;
> +
> +	return 0;
> +}
> +
> +int
> +rte_pmd_packet_devinit(const char *name, const char *params)
> +{
> +	unsigned numa_node;
> +	int ret;
> +	struct rte_kvargs *kvlist;
> +	int sockfd = -1;
> +
> +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> +
> +	numa_node = rte_socket_id();
> +
> +	kvlist = rte_kvargs_parse(params, valid_arguments);
> +	if (kvlist == NULL)
> +		return -1;
> +
> +	/*
> +	 * If iface argument is passed we open the NICs and use them for
> +	 * reading / writing
> +	 */
> +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> +
> +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> +		                         &open_packet_iface, &sockfd);
> +		if (ret < 0)
> +			return -1;
> +	}
> +
> +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> +	close(sockfd); /* no longer needed */
> +
> +	if (ret < 0)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static struct rte_driver pmd_packet_drv = {
> +	.name = "eth_packet",
> +	.type = PMD_VDEV,
> +	.init = rte_pmd_packet_devinit,
> +};
> +
> +PMD_REGISTER_DRIVER(pmd_packet_drv);
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> new file mode 100644
> index 000000000000..f685611da3e9
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> @@ -0,0 +1,55 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_ETH_PACKET_H_
> +#define _RTE_ETH_PACKET_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> +
> +#define RTE_PMD_PACKET_MAX_RINGS 16
> +
> +/**
> + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> + * configured on command line.
> + */
> +int rte_pmd_packet_devinit(const char *name, const char *params);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 34dff2a02a05..a6994c4dbe93 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
>  LDLIBS += -lrte_pmd_pcap -lpcap
>  endif
>  
> +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> +LDLIBS += -lrte_pmd_packet
> +endif
> +
>  endif # plugins
>  
>  LDLIBS += $(EXECENV_LDLIBS)
> -- 
> 1.9.3
> 
> 

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-12 18:05   ` John W. Linville
@ 2014-09-12 18:31     ` Zhou, Danny
  2014-09-12 18:54       ` John W. Linville
  2014-09-16 20:16     ` Neil Horman
  1 sibling, 1 reply; 76+ messages in thread
From: Zhou, Danny @ 2014-09-12 18:31 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

I am concerned about the performance cost of the many memcpy() calls. On the Rx side, the kernel NIC driver copies each packet into an skb, af_packet then copies it into the AF_PACKET buffer that is mapped to user space, and the PMD copies it again into a DPDK mbuf. The Tx side needs 3 copies as well. So in a simple DPDK L2/L3 forwarding benchmark, each packet is copied 6 times, which has a significant negative impact on performance. We have built a bifurcated driver prototype that does zero-copy and achieves native DPDK performance, but it depends on changes to the base driver and to the AF_PACKET code in the kernel. John R will be presenting it at the upcoming Linux Plumbers Conference. Once the kernel adopts it, the relevant PMD will be submitted to dpdk.org.

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> Sent: Saturday, September 13, 2014 2:05 AM
> To: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> 
> Ping?  Are there objections to this patch from mid-July?
> 
> John
> 
> On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > AF_PACKET is used for frame reception.  In the current implementation,
> > Tx and Rx queues are always paired, and therefore are always equal
> > in number -- changing this would be a Simple Matter Of Programming.
> >
> > Interfaces of this type are created with a command line option like
> > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > as arguments:
> >
> >  - Interface is chosen by "iface" (required)
> >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> >
> > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > ---
> > This PMD is intended to provide a means for using DPDK on a broad
> > range of hardware without hardware-specific PMDs and (hopefully)
> > with better performance than what PCAP offers in Linux.  This might
> > be useful as a development platform for DPDK applications when
> > DPDK-supported hardware is expensive or unavailable.
> >
> > New in v2:
> >
> > -- fixup some style issues found by check patch
> > -- use if_index as part of fanout group ID
> > -- set default number of queue pairs to 1
> >
> >  config/common_bsdapp                   |   5 +
> >  config/common_linuxapp                 |   5 +
> >  lib/Makefile                           |   1 +
> >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> >  lib/librte_pmd_packet/Makefile         |  60 +++
> >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> >  mk/rte.app.mk                          |   4 +
> >  8 files changed, 957 insertions(+)
> >  create mode 100644 lib/librte_pmd_packet/Makefile
> >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> >
> > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > index 943dce8f1ede..c317f031278e 100644
> > --- a/config/common_bsdapp
> > +++ b/config/common_bsdapp
> > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> >
> >  #
> > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > +#
> > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > +
> > +#
> >  # Do prefetch of packet data within PMD driver receive function
> >  #
> >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > --- a/config/common_linuxapp
> > +++ b/config/common_linuxapp
> > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> >
> >  #
> > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > +#
> > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > +
> > +#
> >  # Compile Xen PMD
> >  #
> >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > diff --git a/lib/Makefile b/lib/Makefile
> > index 10c5bb3045bc..930fadf29898 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > index 756d6b0c9301..feed24a63272 100644
> > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> >  CFLAGS += $(WERROR_FLAGS) -O3
> >
> > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > new file mode 100644
> > index 000000000000..e1266fb992cd
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/Makefile
> > @@ -0,0 +1,60 @@
> > +#   BSD LICENSE
> > +#
> > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > +#   Copyright(c) 2014 6WIND S.A.
> > +#   All rights reserved.
> > +#
> > +#   Redistribution and use in source and binary forms, with or without
> > +#   modification, are permitted provided that the following conditions
> > +#   are met:
> > +#
> > +#     * Redistributions of source code must retain the above copyright
> > +#       notice, this list of conditions and the following disclaimer.
> > +#     * Redistributions in binary form must reproduce the above copyright
> > +#       notice, this list of conditions and the following disclaimer in
> > +#       the documentation and/or other materials provided with the
> > +#       distribution.
> > +#     * Neither the name of Intel Corporation nor the names of its
> > +#       contributors may be used to endorse or promote products derived
> > +#       from this software without specific prior written permission.
> > +#
> > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > +
> > +include $(RTE_SDK)/mk/rte.vars.mk
> > +
> > +#
> > +# library name
> > +#
> > +LIB = librte_pmd_packet.a
> > +
> > +CFLAGS += -O3
> > +CFLAGS += $(WERROR_FLAGS)
> > +
> > +#
> > +# all source are stored in SRCS-y
> > +#
> > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > +
> > +#
> > +# Export include files
> > +#
> > +SYMLINK-y-include += rte_eth_packet.h
> > +
> > +# this lib depends upon:
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > +
> > +include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > new file mode 100644
> > index 000000000000..9c82d16e730f
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > @@ -0,0 +1,826 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > + *
> > + *   Originally based upon librte_pmd_pcap code:
> > + *
> > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > + *   Copyright(c) 2014 6WIND S.A.
> > + *   All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#include <rte_mbuf.h>
> > +#include <rte_ethdev.h>
> > +#include <rte_malloc.h>
> > +#include <rte_kvargs.h>
> > +#include <rte_dev.h>
> > +
> > +#include <linux/if_ether.h>
> > +#include <linux/if_packet.h>
> > +#include <arpa/inet.h>
> > +#include <net/if.h>
> > +#include <sys/types.h>
> > +#include <sys/socket.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +#include <poll.h>
> > +
> > +#include "rte_eth_packet.h"
> > +
> > +#define ETH_PACKET_IFACE_ARG		"iface"
> > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > +
> > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > +#define DFLT_FRAME_SIZE		(1 << 11)
> > +#define DFLT_FRAME_COUNT	(1 << 9)
> > +
> > +struct pkt_rx_queue {
> > +	int sockfd;
> > +
> > +	struct iovec *rd;
> > +	uint8_t *map;
> > +	unsigned int framecount;
> > +	unsigned int framenum;
> > +
> > +	struct rte_mempool *mb_pool;
> > +
> > +	volatile unsigned long rx_pkts;
> > +	volatile unsigned long err_pkts;
> > +};
> > +
> > +struct pkt_tx_queue {
> > +	int sockfd;
> > +
> > +	struct iovec *rd;
> > +	uint8_t *map;
> > +	unsigned int framecount;
> > +	unsigned int framenum;
> > +
> > +	volatile unsigned long tx_pkts;
> > +	volatile unsigned long err_pkts;
> > +};
> > +
> > +struct pmd_internals {
> > +	unsigned nb_queues;
> > +
> > +	int if_index;
> > +	struct ether_addr eth_addr;
> > +
> > +	struct tpacket_req req;
> > +
> > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > +};
> > +
> > +static const char *valid_arguments[] = {
> > +	ETH_PACKET_IFACE_ARG,
> > +	ETH_PACKET_NUM_Q_ARG,
> > +	ETH_PACKET_BLOCKSIZE_ARG,
> > +	ETH_PACKET_FRAMESIZE_ARG,
> > +	ETH_PACKET_FRAMECOUNT_ARG,
> > +	NULL
> > +};
> > +
> > +static const char *drivername = "AF_PACKET PMD";
> > +
> > +static struct rte_eth_link pmd_link = {
> > +	.link_speed = 10000,
> > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > +	.link_status = 0
> > +};
> > +
> > +static uint16_t
> > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > +{
> > +	unsigned i;
> > +	struct tpacket2_hdr *ppd;
> > +	struct rte_mbuf *mbuf;
> > +	uint8_t *pbuf;
> > +	struct pkt_rx_queue *pkt_q = queue;
> > +	uint16_t num_rx = 0;
> > +	unsigned int framecount, framenum;
> > +
> > +	if (unlikely(nb_pkts == 0))
> > +		return 0;
> > +
> > +	/*
> > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > +	 * one and copies the packet data into a newly allocated mbuf.
> > +	 */
> > +	framecount = pkt_q->framecount;
> > +	framenum = pkt_q->framenum;
> > +	for (i = 0; i < nb_pkts; i++) {
> > +		/* point at the next incoming frame */
> > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > +			break;
> > +
> > +		/* allocate the next mbuf */
> > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > +		if (unlikely(mbuf == NULL))
> > +			break;
> > +
> > +		/* packet will fit in the mbuf, go ahead and receive it */
> > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > +
> > +		/* release incoming frame and advance ring buffer */
> > +		ppd->tp_status = TP_STATUS_KERNEL;
> > +		if (++framenum >= framecount)
> > +			framenum = 0;
> > +
> > +		/* account for the receive frame */
> > +		bufs[i] = mbuf;
> > +		num_rx++;
> > +	}
> > +	pkt_q->framenum = framenum;
> > +	pkt_q->rx_pkts += num_rx;
> > +	return num_rx;
> > +}
> > +
> > +/*
> > + * Callback to handle sending packets through a real NIC.
> > + */
> > +static uint16_t
> > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > +{
> > +	struct tpacket2_hdr *ppd;
> > +	struct rte_mbuf *mbuf;
> > +	uint8_t *pbuf;
> > +	unsigned int framecount, framenum;
> > +	struct pollfd pfd;
> > +	struct pkt_tx_queue *pkt_q = queue;
> > +	uint16_t num_tx = 0;
> > +	int i;
> > +
> > +	if (unlikely(nb_pkts == 0))
> > +		return 0;
> > +
> > +	memset(&pfd, 0, sizeof(pfd));
> > +	pfd.fd = pkt_q->sockfd;
> > +	pfd.events = POLLOUT;
> > +	pfd.revents = 0;
> > +
> > +	framecount = pkt_q->framecount;
> > +	framenum = pkt_q->framenum;
> > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +	for (i = 0; i < nb_pkts; i++) {
> > +		/* point at the next incoming frame */
> > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > +		    (poll(&pfd, 1, -1) < 0))
> > +				continue;
> > +
> > +		/* copy the tx frame data */
> > +		mbuf = bufs[num_tx];
> > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > +			sizeof(struct sockaddr_ll);
> > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > +
> > +		/* release incoming frame and advance ring buffer */
> > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > +		if (++framenum >= framecount)
> > +			framenum = 0;
> > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +
> > +		num_tx++;
> > +		rte_pktmbuf_free(mbuf);
> > +	}
> > +
> > +	/* kick-off transmits */
> > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > +
> > +	pkt_q->framenum = framenum;
> > +	pkt_q->tx_pkts += num_tx;
> > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > +	return num_tx;
> > +}
> > +
> > +static int
> > +eth_dev_start(struct rte_eth_dev *dev)
> > +{
> > +	dev->data->dev_link.link_status = 1;
> > +	return 0;
> > +}
> > +
> > +/*
> > + * This function gets called when the current port gets stopped.
> > + */
> > +static void
> > +eth_dev_stop(struct rte_eth_dev *dev)
> > +{
> > +	unsigned i;
> > +	int sockfd;
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	for (i = 0; i < internals->nb_queues; i++) {
> > +		sockfd = internals->rx_queue[i].sockfd;
> > +		if (sockfd != -1)
> > +			close(sockfd);
> > +		sockfd = internals->tx_queue[i].sockfd;
> > +		if (sockfd != -1)
> > +			close(sockfd);
> > +	}
> > +
> > +	dev->data->dev_link.link_status = 0;
> > +}
> > +
> > +static int
> > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > +{
> > +	return 0;
> > +}
> > +
> > +static void
> > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > +{
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	dev_info->driver_name = drivername;
> > +	dev_info->if_index = internals->if_index;
> > +	dev_info->max_mac_addrs = 1;
> > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > +	dev_info->min_rx_bufsize = 0;
> > +	dev_info->pci_dev = NULL;
> > +}
> > +
> > +static void
> > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > +{
> > +	unsigned i, imax;
> > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > +	const struct pmd_internals *internal = dev->data->dev_private;
> > +
> > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > +
> > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > +	for (i = 0; i < imax; i++) {
> > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > +		rx_total += igb_stats->q_ipackets[i];
> > +	}
> > +
> > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > +	for (i = 0; i < imax; i++) {
> > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > +		tx_total += igb_stats->q_opackets[i];
> > +		tx_err_total += igb_stats->q_errors[i];
> > +	}
> > +
> > +	igb_stats->ipackets = rx_total;
> > +	igb_stats->opackets = tx_total;
> > +	igb_stats->oerrors = tx_err_total;
> > +}
> > +
> > +static void
> > +eth_stats_reset(struct rte_eth_dev *dev)
> > +{
> > +	unsigned i;
> > +	struct pmd_internals *internal = dev->data->dev_private;
> > +
> > +	for (i = 0; i < internal->nb_queues; i++)
> > +		internal->rx_queue[i].rx_pkts = 0;
> > +
> > +	for (i = 0; i < internal->nb_queues; i++) {
> > +		internal->tx_queue[i].tx_pkts = 0;
> > +		internal->tx_queue[i].err_pkts = 0;
> > +	}
> > +}
> > +
> > +static void
> > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > +{
> > +}
> > +
> > +static void
> > +eth_queue_release(void *q __rte_unused)
> > +{
> > +}
> > +
> > +static int
> > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > +                int wait_to_complete __rte_unused)
> > +{
> > +	return 0;
> > +}
> > +
> > +static int
> > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > +                   uint16_t rx_queue_id,
> > +                   uint16_t nb_rx_desc __rte_unused,
> > +                   unsigned int socket_id __rte_unused,
> > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > +                   struct rte_mempool *mb_pool)
> > +{
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > +	uint16_t buf_size;
> > +
> > +	pkt_q->mb_pool = mb_pool;
> > +
> > +	/* Now get the space available for data in the mbuf */
> > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > +	                       RTE_PKTMBUF_HEADROOM);
> > +
> > +	if (ETH_FRAME_LEN > buf_size) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > +                   uint16_t tx_queue_id,
> > +                   uint16_t nb_tx_desc __rte_unused,
> > +                   unsigned int socket_id __rte_unused,
> > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > +{
> > +
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > +	return 0;
> > +}
> > +
> > +static struct eth_dev_ops ops = {
> > +	.dev_start = eth_dev_start,
> > +	.dev_stop = eth_dev_stop,
> > +	.dev_close = eth_dev_close,
> > +	.dev_configure = eth_dev_configure,
> > +	.dev_infos_get = eth_dev_info,
> > +	.rx_queue_setup = eth_rx_queue_setup,
> > +	.tx_queue_setup = eth_tx_queue_setup,
> > +	.rx_queue_release = eth_queue_release,
> > +	.tx_queue_release = eth_queue_release,
> > +	.link_update = eth_link_update,
> > +	.stats_get = eth_stats_get,
> > +	.stats_reset = eth_stats_reset,
> > +};
> > +
> > +/*
> > + * Opens an AF_PACKET socket
> > + */
> > +static int
> > +open_packet_iface(const char *key __rte_unused,
> > +                  const char *value __rte_unused,
> > +                  void *extra_args)
> > +{
> > +	int *sockfd = extra_args;
> > +
> > +	/* Open an AF_PACKET socket... */
> > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > +	if (*sockfd == -1) {
> > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > +		return -1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +rte_pmd_init_internals(const char *name,
> > +                       const int sockfd,
> > +                       const unsigned nb_queues,
> > +                       unsigned int blocksize,
> > +                       unsigned int blockcnt,
> > +                       unsigned int framesize,
> > +                       unsigned int framecnt,
> > +                       const unsigned numa_node,
> > +                       struct pmd_internals **internals,
> > +                       struct rte_eth_dev **eth_dev,
> > +                       struct rte_kvargs *kvlist)
> > +{
> > +	struct rte_eth_dev_data *data = NULL;
> > +	struct rte_pci_device *pci_dev = NULL;
> > +	struct rte_kvargs_pair *pair = NULL;
> > +	struct ifreq ifr;
> > +	size_t ifnamelen;
> > +	unsigned k_idx;
> > +	struct sockaddr_ll sockaddr;
> > +	struct tpacket_req *req;
> > +	struct pkt_rx_queue *rx_queue;
> > +	struct pkt_tx_queue *tx_queue;
> > +	int rc, tpver, discard, bypass;
> > +	unsigned int i, q, rdsize;
> > +	int qsockfd, fanout_arg;
> > +
> > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > +		pair = &kvlist->pairs[k_idx];
> > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > +			break;
> > +	}
> > +	if (pair == NULL) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +
> > +	RTE_LOG(INFO, PMD,
> > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > +		name, numa_node);
> > +
> > +	/*
> > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > +	 * and internal (private) data
> > +	 */
> > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > +	if (data == NULL)
> > +		goto error;
> > +
> > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > +	if (pci_dev == NULL)
> > +		goto error;
> > +
> > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > +	                                0, numa_node);
> > +	if (*internals == NULL)
> > +		goto error;
> > +
> > +	req = &((*internals)->req);
> > +
> > +	req->tp_block_size = blocksize;
> > +	req->tp_block_nr = blockcnt;
> > +	req->tp_frame_size = framesize;
> > +	req->tp_frame_nr = framecnt;
> > +
> > +	ifnamelen = strlen(pair->value);
> > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > +		ifr.ifr_name[ifnamelen] = '\0';
> > +	} else {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: I/F name too long (%s)\n",
> > +			name, pair->value);
> > +		goto error;
> > +	}
> > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +	(*internals)->if_index = ifr.ifr_ifindex;
> > +
> > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > +
> > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > +	sockaddr.sll_family = AF_PACKET;
> > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > +
> > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > +
> > +	for (q = 0; q < nb_queues; q++) {
> > +		/* Open an AF_PACKET socket for this queue... */
> > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > +		if (qsockfd == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +			        "%s: could not open AF_PACKET socket\n",
> > +			        name);
> > +			return -1;
> > +		}
> > +
> > +		tpver = TPACKET_V2;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > +				&tpver, sizeof(tpver));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		discard = 1;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > +				&discard, sizeof(discard));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_LOSS on "
> > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		bypass = 1;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > +				&bypass, sizeof(bypass));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_QDISC_BYPASS "
> > +			        "on AF_PACKET socket for %s\n", name,
> > +			        pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rx_queue = &((*internals)->rx_queue[q]);
> > +		rx_queue->framecount = req->tp_frame_nr;
> > +
> > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > +				    qsockfd, 0);
> > +		if (rx_queue->map == MAP_FAILED) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > +				name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		/* rdsize is same for both Tx and Rx */
> > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > +
> > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > +		}
> > +		rx_queue->sockfd = qsockfd;
> > +
> > +		tx_queue = &((*internals)->tx_queue[q]);
> > +		tx_queue->framecount = req->tp_frame_nr;
> > +
> > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > +
> > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > +		}
> > +		tx_queue->sockfd = qsockfd;
> > +
> > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not bind AF_PACKET socket to %s\n",
> > +			        name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > +				&fanout_arg, sizeof(fanout_arg));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > +				"for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +	}
> > +
> > +	/* reserve an ethdev entry */
> > +	*eth_dev = rte_eth_dev_allocate(name);
> > +	if (*eth_dev == NULL)
> > +		goto error;
> > +
> > +	/*
> > +	 * now put it all together
> > +	 * - store queue data in internals,
> > +	 * - store numa_node info in pci_driver
> > +	 * - point eth_dev_data to internals and pci_driver
> > +	 * - and point eth_dev structure to new eth_dev_data structure
> > +	 */
> > +
> > +	(*internals)->nb_queues = nb_queues;
> > +
> > +	data->dev_private = *internals;
> > +	data->port_id = (*eth_dev)->data->port_id;
> > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > +	data->dev_link = pmd_link;
> > +	data->mac_addrs = &(*internals)->eth_addr;
> > +
> > +	pci_dev->numa_node = numa_node;
> > +
> > +	(*eth_dev)->data = data;
> > +	(*eth_dev)->dev_ops = &ops;
> > +	(*eth_dev)->pci_dev = pci_dev;
> > +
> > +	return 0;
> > +
> > +error:
> > +	if (data)
> > +		rte_free(data);
> > +	if (pci_dev)
> > +		rte_free(pci_dev);
> > +	for (q = 0; q < nb_queues; q++) {
> > +		if ((*internals)->rx_queue[q].rd)
> > +			rte_free((*internals)->rx_queue[q].rd);
> > +		if ((*internals)->tx_queue[q].rd)
> > +			rte_free((*internals)->tx_queue[q].rd);
> > +	}
> > +	if (*internals)
> > +		rte_free(*internals);
> > +	return -1;
> > +}
> > +
> > +static int
> > +rte_eth_from_packet(const char *name,
> > +                    int const *sockfd,
> > +                    const unsigned numa_node,
> > +                    struct rte_kvargs *kvlist)
> > +{
> > +	struct pmd_internals *internals = NULL;
> > +	struct rte_eth_dev *eth_dev = NULL;
> > +	struct rte_kvargs_pair *pair = NULL;
> > +	unsigned k_idx;
> > +	unsigned int blockcount;
> > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > +	unsigned int qpairs = 1;
> > +
> > +	/* do some parameter checking */
> > +	if (*sockfd < 0)
> > +		return -1;
> > +
> > +	/*
> > +	 * Walk arguments for configurable settings
> > +	 */
> > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > +		pair = &kvlist->pairs[k_idx];
> > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > +			qpairs = atoi(pair->value);
> > +			if (qpairs < 1 ||
> > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid qpairs value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > +			blocksize = atoi(pair->value);
> > +			if (!blocksize) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid blocksize value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > +			framesize = atoi(pair->value);
> > +			if (!framesize) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid framesize value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > +			framecount = atoi(pair->value);
> > +			if (!framecount) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid framecount value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +	}
> > +
> > +	if (framesize > blocksize) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > +		        name);
> > +		return -1;
> > +	}
> > +
> > +	blockcount = framecount / (blocksize / framesize);
> > +	if (!blockcount) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > +		return -1;
> > +	}
> > +
> > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > +
> > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > +	                           blocksize, blockcount,
> > +	                           framesize, framecount,
> > +	                           numa_node, &internals, &eth_dev,
> > +	                           kvlist) < 0)
> > +		return -1;
> > +
> > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > +
> > +	return 0;
> > +}
> > +
> > +int
> > +rte_pmd_packet_devinit(const char *name, const char *params)
> > +{
> > +	unsigned numa_node;
> > +	int ret;
> > +	struct rte_kvargs *kvlist;
> > +	int sockfd = -1;
> > +
> > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > +
> > +	numa_node = rte_socket_id();
> > +
> > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > +	if (kvlist == NULL)
> > +		return -1;
> > +
> > +	/*
> > +	 * If iface argument is passed we open the NICs and use them for
> > +	 * reading / writing
> > +	 */
> > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > +
> > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > +		                         &open_packet_iface, &sockfd);
> > +		if (ret < 0)
> > +			return -1;
> > +	}
> > +
> > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > +	close(sockfd); /* no longer needed */
> > +
> > +	if (ret < 0)
> > +		return -1;
> > +
> > +	return 0;
> > +}
> > +
> > +static struct rte_driver pmd_packet_drv = {
> > +	.name = "eth_packet",
> > +	.type = PMD_VDEV,
> > +	.init = rte_pmd_packet_devinit,
> > +};
> > +
> > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > new file mode 100644
> > index 000000000000..f685611da3e9
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > @@ -0,0 +1,55 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > + *   All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#ifndef _RTE_ETH_PACKET_H_
> > +#define _RTE_ETH_PACKET_H_
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > +
> > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > +
> > +/**
> > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > + * configured on command line.
> > + */
> > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> > +
> > +#endif
> > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > index 34dff2a02a05..a6994c4dbe93 100644
> > --- a/mk/rte.app.mk
> > +++ b/mk/rte.app.mk
> > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> >  LDLIBS += -lrte_pmd_pcap -lpcap
> >  endif
> >
> > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > +LDLIBS += -lrte_pmd_packet
> > +endif
> > +
> >  endif # plugins
> >
> >  LDLIBS += $(EXECENV_LDLIBS)
> > --
> > 1.9.3
> >
> >
> 
> --
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.

* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-12 18:31     ` Zhou, Danny
@ 2014-09-12 18:54       ` John W. Linville
  2014-09-12 20:35         ` Zhou, Danny
  0 siblings, 1 reply; 76+ messages in thread
From: John W. Linville @ 2014-09-12 18:54 UTC (permalink / raw)
  To: Zhou, Danny; +Cc: dev

On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> I am concerned about its performance caused by too many
> memcpy(). Specifically, on Rx side, kernel NIC driver needs to copy
> packets to skb, then af_packet copies packets to AF_PACKET buffer
> which are mapped to user space, and then those packets to be copied
> to DPDK mbuf. In addition, 3 copies needed on Tx side. So to run a
> simple DPDK L2/L3 forwarding benchmark, each packet needs 6 packet
> copies which brings significant negative performance impact. We
> had a bifurcated driver prototype that can do zero-copy and achieve
> native DPDK performance, but it depends on base driver and AF_PACKET
> code changes in kernel, John R will be presenting it in coming Linux
> Plumbers Conference. Once kernel adopts it, the relevant PMD will be
> submitted to dpdk.org.

Admittedly, this is not as good a performer as most of the existing
PMDs.  It serves a different purpose, after all.  FWIW, you did
previously indicate that it performed better than the pcap-based PMD.
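
To make the user-space share of the copying concrete, here is a minimal
sketch of the receive-side ring walk, written against a simulated ring
so that it runs without a socket or root privileges.  Everything named
sim_* is hypothetical and exists only for illustration; the real driver
uses struct tpacket2_hdr with TP_STATUS_USER / TP_STATUS_KERNEL from
<linux/if_packet.h>.  The point is only that the consumer does exactly
one memcpy() per frame before handing the slot back to the producer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for TP_STATUS_KERNEL / TP_STATUS_USER. */
#define SIM_STATUS_KERNEL 0u	/* slot owned by the producer */
#define SIM_STATUS_USER   1u	/* slot holds a frame for the consumer */
#define SIM_FRAME_DATA    64u
#define SIM_FRAME_COUNT   4u

struct sim_frame {
	uint32_t status;
	uint32_t len;
	uint8_t data[SIM_FRAME_DATA];
};

struct sim_ring {
	struct sim_frame frames[SIM_FRAME_COUNT];
	unsigned int framenum;	/* next slot the consumer inspects */
};

/* Producer side (stands in for the kernel): fill slot idx if it is free. */
static int
sim_produce(struct sim_ring *ring, unsigned int idx,
	    const void *data, uint32_t len)
{
	struct sim_frame *f;

	if (idx >= SIM_FRAME_COUNT || len > SIM_FRAME_DATA)
		return -1;
	f = &ring->frames[idx];
	if (f->status != SIM_STATUS_KERNEL)
		return -1;
	memcpy(f->data, data, len);
	f->len = len;
	f->status = SIM_STATUS_USER;
	return 0;
}

/*
 * Consumer side: copy out up to nb_pkts ready frames, one memcpy() each,
 * then return every slot to the producer and advance around the ring.
 */
static unsigned int
sim_rx_burst(struct sim_ring *ring, uint8_t out[][SIM_FRAME_DATA],
	     uint32_t *out_len, unsigned int nb_pkts)
{
	unsigned int num_rx = 0;

	while (num_rx < nb_pkts) {
		struct sim_frame *f = &ring->frames[ring->framenum];

		if ((f->status & SIM_STATUS_USER) == 0)
			break;	/* nothing more to receive */

		assert(f->len <= SIM_FRAME_DATA);
		memcpy(out[num_rx], f->data, f->len);
		out_len[num_rx] = f->len;

		f->status = SIM_STATUS_KERNEL;	/* release the slot */
		if (++ring->framenum >= SIM_FRAME_COUNT)
			ring->framenum = 0;
		num_rx++;
	}
	return num_rx;
}
```

A caller would zero a struct sim_ring, queue frames with sim_produce(),
and then drain them with sim_rx_burst().  The simulation omits the
memory barriers and the kernel-side copy that a real mmap'ed ring
involves.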

I look forward to seeing the changes you mention -- they sound very
exciting.  But, they will still require both networking core and
driver changes in the kernel.  And as I understand things today,
the userland code will still need at least some knowledge of specific
devices and how they lay out their packet descriptors, etc.  So while
those changes sound very promising, they will still have certain
drawbacks in common with the current situation.

It seems like the changes you mention will still need some sort of
AF_PACKET-based PMD driver.  Have you implemented that completely
separate from the code I already posted?  Or did you add that work
on top of mine?

John
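
For anyone trying the options described in the quoted patch, the ring
geometry it derives from "blocksz", "framesz" and "framecnt" reduces to
a small piece of arithmetic.  The sketch below restates it as a
standalone helper; ring_block_count() is a hypothetical name used only
here, and it folds the patch's separate error returns into a single
"0 means unusable" result.

```c
#include <assert.h>

/* Defaults as listed in the patch description. */
#define DFLT_BLOCK_SIZE		(1u << 12)	/* 4096 */
#define DFLT_FRAME_SIZE		(1u << 11)	/* 2048 */
#define DFLT_FRAME_COUNT	(1u << 9)	/* 512 */

/*
 * Hypothetical helper, not part of the patch: return the number of
 * mmap blocks needed to hold framecount frames, or 0 when the
 * parameters cannot describe a usable ring.
 */
static unsigned int
ring_block_count(unsigned int blocksize, unsigned int framesize,
		 unsigned int framecount)
{
	unsigned int frames_per_block;
	unsigned int blockcount;

	if (blocksize == 0 || framesize == 0 || framecount == 0)
		return 0;
	if (framesize > blocksize)
		return 0;	/* a frame must fit inside one block */

	frames_per_block = blocksize / framesize;
	blockcount = framecount / frames_per_block;

	/* At least one frame per block, so blocks never outnumber frames. */
	assert(blockcount <= framecount);
	return blockcount;
}
```

With the defaults, ring_block_count(DFLT_BLOCK_SIZE, DFLT_FRAME_SIZE,
DFLT_FRAME_COUNT) works out to 512 / (4096 / 2048) = 256 blocks.  A
frame count smaller than the number of frames per block truncates to
zero blocks, which the patch rejects as invalid.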

> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > Sent: Saturday, September 13, 2014 2:05 AM
> > To: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > 
> > Ping?  Are there objections to this patch from mid-July?
> > 
> > John
> > 
> > On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > AF_PACKET is used for frame reception.  In the current implementation,
> > > Tx and Rx queues are always paired, and therefore are always equal
> > > in number -- changing this would be a Simple Matter Of Programming.
> > >
> > > Interfaces of this type are created with a command line option like
> > > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > > as arguments:
> > >
> > >  - Interface is chosen by "iface" (required)
> > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > >
> > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > ---
> > > This PMD is intended to provide a means for using DPDK on a broad
> > > range of hardware without hardware-specific PMDs and (hopefully)
> > > with better performance than what PCAP offers in Linux.  This might
> > > be useful as a development platform for DPDK applications when
> > > DPDK-supported hardware is expensive or unavailable.
> > >
> > > New in v2:
> > >
> > > -- fixup some style issues found by check patch
> > > -- use if_index as part of fanout group ID
> > > -- set default number of queue pairs to 1
> > >
> > >  config/common_bsdapp                   |   5 +
> > >  config/common_linuxapp                 |   5 +
> > >  lib/Makefile                           |   1 +
> > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > >  mk/rte.app.mk                          |   4 +
> > >  8 files changed, 957 insertions(+)
> > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > >
> > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > index 943dce8f1ede..c317f031278e 100644
> > > --- a/config/common_bsdapp
> > > +++ b/config/common_bsdapp
> > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > >
> > >  #
> > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > +#
> > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > +
> > > +#
> > >  # Do prefetch of packet data within PMD driver receive function
> > >  #
> > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > --- a/config/common_linuxapp
> > > +++ b/config/common_linuxapp
> > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > >
> > >  #
> > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > +#
> > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > +
> > > +#
> > >  # Compile Xen PMD
> > >  #
> > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > diff --git a/lib/Makefile b/lib/Makefile
> > > index 10c5bb3045bc..930fadf29898 100644
> > > --- a/lib/Makefile
> > > +++ b/lib/Makefile
> > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > index 756d6b0c9301..feed24a63272 100644
> > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > >  CFLAGS += $(WERROR_FLAGS) -O3
> > >
> > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > new file mode 100644
> > > index 000000000000..e1266fb992cd
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/Makefile
> > > @@ -0,0 +1,60 @@
> > > +#   BSD LICENSE
> > > +#
> > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > +#   Copyright(c) 2014 6WIND S.A.
> > > +#   All rights reserved.
> > > +#
> > > +#   Redistribution and use in source and binary forms, with or without
> > > +#   modification, are permitted provided that the following conditions
> > > +#   are met:
> > > +#
> > > +#     * Redistributions of source code must retain the above copyright
> > > +#       notice, this list of conditions and the following disclaimer.
> > > +#     * Redistributions in binary form must reproduce the above copyright
> > > +#       notice, this list of conditions and the following disclaimer in
> > > +#       the documentation and/or other materials provided with the
> > > +#       distribution.
> > > +#     * Neither the name of Intel Corporation nor the names of its
> > > +#       contributors may be used to endorse or promote products derived
> > > +#       from this software without specific prior written permission.
> > > +#
> > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > +
> > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > +
> > > +#
> > > +# library name
> > > +#
> > > +LIB = librte_pmd_packet.a
> > > +
> > > +CFLAGS += -O3
> > > +CFLAGS += $(WERROR_FLAGS)
> > > +
> > > +#
> > > +# all source are stored in SRCS-y
> > > +#
> > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > +
> > > +#
> > > +# Export include files
> > > +#
> > > +SYMLINK-y-include += rte_eth_packet.h
> > > +
> > > +# this lib depends upon:
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > +
> > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > new file mode 100644
> > > index 000000000000..9c82d16e730f
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > @@ -0,0 +1,826 @@
> > > +/*-
> > > + *   BSD LICENSE
> > > + *
> > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > + *
> > > + *   Originally based upon librte_pmd_pcap code:
> > > + *
> > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > + *   Copyright(c) 2014 6WIND S.A.
> > > + *   All rights reserved.
> > > + *
> > > + *   Redistribution and use in source and binary forms, with or without
> > > + *   modification, are permitted provided that the following conditions
> > > + *   are met:
> > > + *
> > > + *     * Redistributions of source code must retain the above copyright
> > > + *       notice, this list of conditions and the following disclaimer.
> > > + *     * Redistributions in binary form must reproduce the above copyright
> > > + *       notice, this list of conditions and the following disclaimer in
> > > + *       the documentation and/or other materials provided with the
> > > + *       distribution.
> > > + *     * Neither the name of Intel Corporation nor the names of its
> > > + *       contributors may be used to endorse or promote products derived
> > > + *       from this software without specific prior written permission.
> > > + *
> > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > + */
> > > +
> > > +#include <rte_mbuf.h>
> > > +#include <rte_ethdev.h>
> > > +#include <rte_malloc.h>
> > > +#include <rte_kvargs.h>
> > > +#include <rte_dev.h>
> > > +
> > > +#include <linux/if_ether.h>
> > > +#include <linux/if_packet.h>
> > > +#include <arpa/inet.h>
> > > +#include <net/if.h>
> > > +#include <sys/types.h>
> > > +#include <sys/socket.h>
> > > +#include <sys/ioctl.h>
> > > +#include <sys/mman.h>
> > > +#include <unistd.h>
> > > +#include <poll.h>
> > > +
> > > +#include "rte_eth_packet.h"
> > > +
> > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > +
> > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > +
> > > +struct pkt_rx_queue {
> > > +	int sockfd;
> > > +
> > > +	struct iovec *rd;
> > > +	uint8_t *map;
> > > +	unsigned int framecount;
> > > +	unsigned int framenum;
> > > +
> > > +	struct rte_mempool *mb_pool;
> > > +
> > > +	volatile unsigned long rx_pkts;
> > > +	volatile unsigned long err_pkts;
> > > +};
> > > +
> > > +struct pkt_tx_queue {
> > > +	int sockfd;
> > > +
> > > +	struct iovec *rd;
> > > +	uint8_t *map;
> > > +	unsigned int framecount;
> > > +	unsigned int framenum;
> > > +
> > > +	volatile unsigned long tx_pkts;
> > > +	volatile unsigned long err_pkts;
> > > +};
> > > +
> > > +struct pmd_internals {
> > > +	unsigned nb_queues;
> > > +
> > > +	int if_index;
> > > +	struct ether_addr eth_addr;
> > > +
> > > +	struct tpacket_req req;
> > > +
> > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > +};
> > > +
> > > +static const char *valid_arguments[] = {
> > > +	ETH_PACKET_IFACE_ARG,
> > > +	ETH_PACKET_NUM_Q_ARG,
> > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > +	NULL
> > > +};
> > > +
> > > +static const char *drivername = "AF_PACKET PMD";
> > > +
> > > +static struct rte_eth_link pmd_link = {
> > > +	.link_speed = 10000,
> > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > +	.link_status = 0
> > > +};
> > > +
> > > +static uint16_t
> > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > +{
> > > +	unsigned i;
> > > +	struct tpacket2_hdr *ppd;
> > > +	struct rte_mbuf *mbuf;
> > > +	uint8_t *pbuf;
> > > +	struct pkt_rx_queue *pkt_q = queue;
> > > +	uint16_t num_rx = 0;
> > > +	unsigned int framecount, framenum;
> > > +
> > > +	if (unlikely(nb_pkts == 0))
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > +	 */
> > > +	framecount = pkt_q->framecount;
> > > +	framenum = pkt_q->framenum;
> > > +	for (i = 0; i < nb_pkts; i++) {
> > > +		/* point at the next incoming frame */
> > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > +			break;
> > > +
> > > +		/* allocate the next mbuf */
> > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > +		if (unlikely(mbuf == NULL))
> > > +			break;
> > > +
> > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > +
> > > +		/* release incoming frame and advance ring buffer */
> > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > +		if (++framenum >= framecount)
> > > +			framenum = 0;
> > > +
> > > +		/* account for the receive frame */
> > > +		bufs[i] = mbuf;
> > > +		num_rx++;
> > > +	}
> > > +	pkt_q->framenum = framenum;
> > > +	pkt_q->rx_pkts += num_rx;
> > > +	return num_rx;
> > > +}
> > > +
> > > +/*
> > > + * Callback to handle sending packets through a real NIC.
> > > + */
> > > +static uint16_t
> > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > +{
> > > +	struct tpacket2_hdr *ppd;
> > > +	struct rte_mbuf *mbuf;
> > > +	uint8_t *pbuf;
> > > +	unsigned int framecount, framenum;
> > > +	struct pollfd pfd;
> > > +	struct pkt_tx_queue *pkt_q = queue;
> > > +	uint16_t num_tx = 0;
> > > +	int i;
> > > +
> > > +	if (unlikely(nb_pkts == 0))
> > > +		return 0;
> > > +
> > > +	memset(&pfd, 0, sizeof(pfd));
> > > +	pfd.fd = pkt_q->sockfd;
> > > +	pfd.events = POLLOUT;
> > > +	pfd.revents = 0;
> > > +
> > > +	framecount = pkt_q->framecount;
> > > +	framenum = pkt_q->framenum;
> > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +	for (i = 0; i < nb_pkts; i++) {
> > > +		/* point at the next incoming frame */
> > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > +		    (poll(&pfd, 1, -1) < 0))
> > > +				continue;
> > > +
> > > +		/* copy the tx frame data */
> > > +		mbuf = bufs[num_tx];
> > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > +			sizeof(struct sockaddr_ll);
> > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > +
> > > +		/* release incoming frame and advance ring buffer */
> > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > +		if (++framenum >= framecount)
> > > +			framenum = 0;
> > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +
> > > +		num_tx++;
> > > +		rte_pktmbuf_free(mbuf);
> > > +	}
> > > +
> > > +	/* kick-off transmits */
> > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > +
> > > +	pkt_q->framenum = framenum;
> > > +	pkt_q->tx_pkts += num_tx;
> > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > +	return num_tx;
> > > +}
> > > +
> > > +static int
> > > +eth_dev_start(struct rte_eth_dev *dev)
> > > +{
> > > +	dev->data->dev_link.link_status = 1;
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * This function gets called when the current port gets stopped.
> > > + */
> > > +static void
> > > +eth_dev_stop(struct rte_eth_dev *dev)
> > > +{
> > > +	unsigned i;
> > > +	int sockfd;
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > +		sockfd = internals->rx_queue[i].sockfd;
> > > +		if (sockfd != -1)
> > > +			close(sockfd);
> > > +		sockfd = internals->tx_queue[i].sockfd;
> > > +		if (sockfd != -1)
> > > +			close(sockfd);
> > > +	}
> > > +
> > > +	dev->data->dev_link.link_status = 0;
> > > +}
> > > +
> > > +static int
> > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static void
> > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > > +{
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	dev_info->driver_name = drivername;
> > > +	dev_info->if_index = internals->if_index;
> > > +	dev_info->max_mac_addrs = 1;
> > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > +	dev_info->min_rx_bufsize = 0;
> > > +	dev_info->pci_dev = NULL;
> > > +}
> > > +
> > > +static void
> > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > > +{
> > > +	unsigned i, imax;
> > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > +
> > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > +
> > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > +	for (i = 0; i < imax; i++) {
> > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > +		rx_total += igb_stats->q_ipackets[i];
> > > +	}
> > > +
> > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > +	for (i = 0; i < imax; i++) {
> > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > +		tx_total += igb_stats->q_opackets[i];
> > > +		tx_err_total += igb_stats->q_errors[i];
> > > +	}
> > > +
> > > +	igb_stats->ipackets = rx_total;
> > > +	igb_stats->opackets = tx_total;
> > > +	igb_stats->oerrors = tx_err_total;
> > > +}
> > > +
> > > +static void
> > > +eth_stats_reset(struct rte_eth_dev *dev)
> > > +{
> > > +	unsigned i;
> > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > +
> > > +	for (i = 0; i < internal->nb_queues; i++)
> > > +		internal->rx_queue[i].rx_pkts = 0;
> > > +
> > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > +		internal->tx_queue[i].tx_pkts = 0;
> > > +		internal->tx_queue[i].err_pkts = 0;
> > > +	}
> > > +}
> > > +
> > > +static void
> > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > > +{
> > > +}
> > > +
> > > +static void
> > > +eth_queue_release(void *q __rte_unused)
> > > +{
> > > +}
> > > +
> > > +static int
> > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > +                int wait_to_complete __rte_unused)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > +                   uint16_t rx_queue_id,
> > > +                   uint16_t nb_rx_desc __rte_unused,
> > > +                   unsigned int socket_id __rte_unused,
> > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > +                   struct rte_mempool *mb_pool)
> > > +{
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > +	uint16_t buf_size;
> > > +
> > > +	pkt_q->mb_pool = mb_pool;
> > > +
> > > +	/* Now get the space available for data in the mbuf */
> > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > +	                       RTE_PKTMBUF_HEADROOM);
> > > +
> > > +	if (ETH_FRAME_LEN > buf_size) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > +                   uint16_t tx_queue_id,
> > > +                   uint16_t nb_tx_desc __rte_unused,
> > > +                   unsigned int socket_id __rte_unused,
> > > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > > +{
> > > +
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > +	return 0;
> > > +}
> > > +
> > > +static struct eth_dev_ops ops = {
> > > +	.dev_start = eth_dev_start,
> > > +	.dev_stop = eth_dev_stop,
> > > +	.dev_close = eth_dev_close,
> > > +	.dev_configure = eth_dev_configure,
> > > +	.dev_infos_get = eth_dev_info,
> > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > +	.rx_queue_release = eth_queue_release,
> > > +	.tx_queue_release = eth_queue_release,
> > > +	.link_update = eth_link_update,
> > > +	.stats_get = eth_stats_get,
> > > +	.stats_reset = eth_stats_reset,
> > > +};
> > > +
> > > +/*
> > > + * Opens an AF_PACKET socket
> > > + */
> > > +static int
> > > +open_packet_iface(const char *key __rte_unused,
> > > +                  const char *value __rte_unused,
> > > +                  void *extra_args)
> > > +{
> > > +	int *sockfd = extra_args;
> > > +
> > > +	/* Open an AF_PACKET socket... */
> > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > +	if (*sockfd == -1) {
> > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > +		return -1;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +rte_pmd_init_internals(const char *name,
> > > +                       const int sockfd,
> > > +                       const unsigned nb_queues,
> > > +                       unsigned int blocksize,
> > > +                       unsigned int blockcnt,
> > > +                       unsigned int framesize,
> > > +                       unsigned int framecnt,
> > > +                       const unsigned numa_node,
> > > +                       struct pmd_internals **internals,
> > > +                       struct rte_eth_dev **eth_dev,
> > > +                       struct rte_kvargs *kvlist)
> > > +{
> > > +	struct rte_eth_dev_data *data = NULL;
> > > +	struct rte_pci_device *pci_dev = NULL;
> > > +	struct rte_kvargs_pair *pair = NULL;
> > > +	struct ifreq ifr;
> > > +	size_t ifnamelen;
> > > +	unsigned k_idx;
> > > +	struct sockaddr_ll sockaddr;
> > > +	struct tpacket_req *req;
> > > +	struct pkt_rx_queue *rx_queue;
> > > +	struct pkt_tx_queue *tx_queue;
> > > +	int rc, tpver, discard, bypass;
> > > +	unsigned int i, q, rdsize;
> > > +	int qsockfd, fanout_arg;
> > > +
> > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > +		pair = &kvlist->pairs[k_idx];
> > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > +			break;
> > > +	}
> > > +	if (pair == NULL) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +
> > > +	RTE_LOG(INFO, PMD,
> > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > +		name, numa_node);
> > > +
> > > +	/*
> > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > +	 * and internal (private) data
> > > +	 */
> > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > +	if (data == NULL)
> > > +		goto error;
> > > +
> > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > +	if (pci_dev == NULL)
> > > +		goto error;
> > > +
> > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > +	                                0, numa_node);
> > > +	if (*internals == NULL)
> > > +		goto error;
> > > +
> > > +	req = &((*internals)->req);
> > > +
> > > +	req->tp_block_size = blocksize;
> > > +	req->tp_block_nr = blockcnt;
> > > +	req->tp_frame_size = framesize;
> > > +	req->tp_frame_nr = framecnt;
> > > +
> > > +	ifnamelen = strlen(pair->value);
> > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > +	} else {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: I/F name too long (%s)\n",
> > > +			name, pair->value);
> > > +		goto error;
> > > +	}
> > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > +
> > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > +
> > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > +	sockaddr.sll_family = AF_PACKET;
> > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > +
> > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > +
> > > +	for (q = 0; q < nb_queues; q++) {
> > > +		/* Open an AF_PACKET socket for this queue... */
> > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > +		if (qsockfd == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +			        "%s: could not open AF_PACKET socket\n",
> > > +			        name);
> > > +			return -1;
> > > +		}
> > > +
> > > +		tpver = TPACKET_V2;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > +				&tpver, sizeof(tpver));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		discard = 1;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > +				&discard, sizeof(discard));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_LOSS on "
> > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		bypass = 1;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > +				&bypass, sizeof(bypass));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > +			        "on AF_PACKET socket for %s\n", name,
> > > +			        pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > +		rx_queue->framecount = req->tp_frame_nr;
> > > +
> > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > > +				    qsockfd, 0);
> > > +		if (rx_queue->map == MAP_FAILED) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > +				name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		/* rdsize is same for both Tx and Rx */
> > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > +
> > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > +		}
> > > +		rx_queue->sockfd = qsockfd;
> > > +
> > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > +		tx_queue->framecount = req->tp_frame_nr;
> > > +
> > > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > > +
> > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > +		}
> > > +		tx_queue->sockfd = qsockfd;
> > > +
> > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > +			        name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > +				&fanout_arg, sizeof(fanout_arg));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > > +				"for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +	}
> > > +
> > > +	/* reserve an ethdev entry */
> > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > +	if (*eth_dev == NULL)
> > > +		goto error;
> > > +
> > > +	/*
> > > +	 * now put it all together
> > > +	 * - store queue data in internals,
> > > +	 * - store numa_node info in pci_driver
> > > +	 * - point eth_dev_data to internals and pci_driver
> > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > +	 */
> > > +
> > > +	(*internals)->nb_queues = nb_queues;
> > > +
> > > +	data->dev_private = *internals;
> > > +	data->port_id = (*eth_dev)->data->port_id;
> > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > +	data->dev_link = pmd_link;
> > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > +
> > > +	pci_dev->numa_node = numa_node;
> > > +
> > > +	(*eth_dev)->data = data;
> > > +	(*eth_dev)->dev_ops = &ops;
> > > +	(*eth_dev)->pci_dev = pci_dev;
> > > +
> > > +	return 0;
> > > +
> > > +error:
> > > +	if (data)
> > > +		rte_free(data);
> > > +	if (pci_dev)
> > > +		rte_free(pci_dev);
> > > +	for (q = 0; q < nb_queues; q++) {
> > > +		if ((*internals)->rx_queue[q].rd)
> > > +			rte_free((*internals)->rx_queue[q].rd);
> > > +		if ((*internals)->tx_queue[q].rd)
> > > +			rte_free((*internals)->tx_queue[q].rd);
> > > +	}
> > > +	if (*internals)
> > > +		rte_free(*internals);
> > > +	return -1;
> > > +}
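On the error path above: the per-queue loop reads `(*internals)->rx_queue[q]` before `*internals` is checked for NULL, the failed `socket()` call returns directly instead of going through `error:`, and the queue sockets and mmap'ed rings are not released. A minimal sketch of a teardown helper that avoids the first and last of these is below. It uses plain malloc/free and stub types (`queue_stub`, `internals_stub`, both hypothetical) so it can be built without DPDK, and it leaves out the munmap() step.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_RINGS 16	/* mirrors RTE_PMD_PACKET_MAX_RINGS */

/* Hypothetical stand-ins for pkt_rx_queue/pkt_tx_queue and pmd_internals. */
struct queue_stub {
	int sockfd;
	void *rd;
};

struct internals_stub {
	struct queue_stub rx_queue[MAX_RINGS];
	struct queue_stub tx_queue[MAX_RINGS];
};

/* Allocate zeroed internals and mark every socket slot as unused (-1),
 * because 0 is a valid file descriptor. */
static struct internals_stub *
internals_stub_create(void)
{
	struct internals_stub *in = calloc(1, sizeof(*in));
	unsigned int q;

	if (in == NULL)
		return NULL;
	for (q = 0; q < MAX_RINGS; q++) {
		in->rx_queue[q].sockfd = -1;
		in->tx_queue[q].sockfd = -1;
	}
	return in;
}

/* Safe to call with NULL and with partially initialised queues. */
static void
internals_stub_destroy(struct internals_stub *in, unsigned int nb_queues)
{
	unsigned int q;

	if (in == NULL)
		return;
	for (q = 0; q < nb_queues && q < MAX_RINGS; q++) {
		free(in->rx_queue[q].rd);
		free(in->tx_queue[q].rd);
		/* Rx and Tx share one socket per queue pair; close it once. */
		if (in->rx_queue[q].sockfd >= 0)
			close(in->rx_queue[q].sockfd);
	}
	free(in);
}
```

The point of the -1 initialisation is that a zeroed structure would otherwise make the teardown close descriptor 0 for every queue that never got a socket.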
> > > +
> > > +static int
> > > +rte_eth_from_packet(const char *name,
> > > +                    int const *sockfd,
> > > +                    const unsigned numa_node,
> > > +                    struct rte_kvargs *kvlist)
> > > +{
> > > +	struct pmd_internals *internals = NULL;
> > > +	struct rte_eth_dev *eth_dev = NULL;
> > > +	struct rte_kvargs_pair *pair = NULL;
> > > +	unsigned k_idx;
> > > +	unsigned int blockcount;
> > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > +	unsigned int qpairs = 1;
> > > +
> > > +	/* do some parameter checking */
> > > +	if (*sockfd < 0)
> > > +		return -1;
> > > +
> > > +	/*
> > > +	 * Walk arguments for configurable settings
> > > +	 */
> > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > +		pair = &kvlist->pairs[k_idx];
> > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > +			qpairs = atoi(pair->value);
> > > +			if (qpairs < 1 ||
> > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid qpairs value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > +			blocksize = atoi(pair->value);
> > > +			if (!blocksize) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid blocksize value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > +			framesize = atoi(pair->value);
> > > +			if (!framesize) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid framesize value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > +			framecount = atoi(pair->value);
> > > +			if (!framecount) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid framecount value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +	}
> > > +
> > > +	if (framesize > blocksize) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > +		        name);
> > > +		return -1;
> > > +	}
> > > +
> > > +	blockcount = framecount / (blocksize / framesize);
> > > +	if (!blockcount) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > +		return -1;
> > > +	}
> > > +
> > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > +
> > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > +	                           blocksize, blockcount,
> > > +	                           framesize, framecount,
> > > +	                           numa_node, &internals, &eth_dev,
> > > +	                           kvlist) < 0)
> > > +		return -1;
> > > +
> > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > +
> > > +	return 0;
> > > +}
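To make the parameter arithmetic above concrete, here is a small sketch that repeats the same checks and the same block-count formula outside the driver. `struct ring_geometry` and `ring_geometry_init()` are hypothetical names introduced only for this illustration; the mapping size is taken from the mmap() call in the patch, which maps the Rx and Tx rings back to back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the derived AF_PACKET ring layout. */
struct ring_geometry {
	unsigned int block_size;
	unsigned int block_count;
	unsigned int frame_size;
	unsigned int frame_count;
	size_t ring_bytes;	/* one ring (Rx or Tx) */
	size_t map_bytes;	/* Rx ring followed by Tx ring */
};

/* Returns 0 on success, -1 when the values cannot describe a ring. */
static int
ring_geometry_init(struct ring_geometry *g, unsigned int block_size,
		   unsigned int frame_size, unsigned int frame_count)
{
	unsigned int frames_per_block;

	assert(g != NULL);
	if (block_size == 0 || frame_size == 0 || frame_count == 0)
		return -1;
	if (frame_size > block_size)
		return -1;

	frames_per_block = block_size / frame_size;	/* at least 1 here */
	g->block_count = frame_count / frames_per_block;
	if (g->block_count == 0)
		return -1;

	g->block_size = block_size;
	g->frame_size = frame_size;
	g->frame_count = frame_count;
	g->ring_bytes = (size_t)block_size * g->block_count;
	g->map_bytes = 2 * g->ring_bytes;
	return 0;
}
```

With the defaults (blocksz 4096, framesz 2048, framecnt 512) this gives 2 frames per block, 256 blocks, 1 MiB per ring and a 2 MiB mapping per queue pair. As far as I recall, the kernel also expects the frame count to equal frames-per-block times the block count, so a framecnt that is not a multiple of frames-per-block would likely be rejected when the ring is set up; that check is not part of the sketch.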
> > > +
> > > +int
> > > +rte_pmd_packet_devinit(const char *name, const char *params)
> > > +{
> > > +	unsigned numa_node;
> > > +	int ret;
> > > +	struct rte_kvargs *kvlist;
> > > +	int sockfd = -1;
> > > +
> > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > +
> > > +	numa_node = rte_socket_id();
> > > +
> > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > +	if (kvlist == NULL)
> > > +		return -1;
> > > +
> > > +	/*
> > > +	 * If iface argument is passed we open the NICs and use them for
> > > +	 * reading / writing
> > > +	 */
> > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > +
> > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > +		                         &open_packet_iface, &sockfd);
> > > +		if (ret < 0)
> > > +			return -1;
> > > +	}
> > > +
> > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > +	close(sockfd); /* no longer needed */
> > > +
> > > +	if (ret < 0)
> > > +		return -1;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static struct rte_driver pmd_packet_drv = {
> > > +	.name = "eth_packet",
> > > +	.type = PMD_VDEV,
> > > +	.init = rte_pmd_packet_devinit,
> > > +};
> > > +
> > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > new file mode 100644
> > > index 000000000000..f685611da3e9
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > @@ -0,0 +1,55 @@
> > > +/*-
> > > + *   BSD LICENSE
> > > + *
> > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > + *   All rights reserved.
> > > + *
> > > + *   Redistribution and use in source and binary forms, with or without
> > > + *   modification, are permitted provided that the following conditions
> > > + *   are met:
> > > + *
> > > + *     * Redistributions of source code must retain the above copyright
> > > + *       notice, this list of conditions and the following disclaimer.
> > > + *     * Redistributions in binary form must reproduce the above copyright
> > > + *       notice, this list of conditions and the following disclaimer in
> > > + *       the documentation and/or other materials provided with the
> > > + *       distribution.
> > > + *     * Neither the name of Intel Corporation nor the names of its
> > > + *       contributors may be used to endorse or promote products derived
> > > + *       from this software without specific prior written permission.
> > > + *
> > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > + */
> > > +
> > > +#ifndef _RTE_ETH_PACKET_H_
> > > +#define _RTE_ETH_PACKET_H_
> > > +
> > > +#ifdef __cplusplus
> > > +extern "C" {
> > > +#endif
> > > +
> > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > +
> > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > +
> > > +/**
> > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > + * configured on command line.
> > > + */
> > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > +
> > > +#ifdef __cplusplus
> > > +}
> > > +#endif
> > > +
> > > +#endif
> > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > index 34dff2a02a05..a6994c4dbe93 100644
> > > --- a/mk/rte.app.mk
> > > +++ b/mk/rte.app.mk
> > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > >  endif
> > >
> > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > +LDLIBS += -lrte_pmd_packet
> > > +endif
> > > +
> > >  endif # plugins
> > >
> > >  LDLIBS += $(EXECENV_LDLIBS)
> > > --
> > > 1.9.3
> > >
> > >
> > 
> > --
> > John W. Linville		Someday the world will need a hero, and you
> > linville@tuxdriver.com			might be all we have.  Be ready.
> 

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-12 18:54       ` John W. Linville
@ 2014-09-12 20:35         ` Zhou, Danny
  2014-09-15 15:09           ` Neil Horman
  0 siblings, 1 reply; 76+ messages in thread
From: Zhou, Danny @ 2014-09-12 20:35 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

> -----Original Message-----
> From: John W. Linville [mailto:linville@tuxdriver.com]
> Sent: Saturday, September 13, 2014 2:54 AM
> To: Zhou, Danny
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> 
> On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > I am concerned about the performance cost of so many memcpy()
> > calls. Specifically, on the Rx side, the kernel NIC driver copies
> > packets into skbs, then af_packet copies them into the AF_PACKET
> > buffers that are mapped to user space, and then the packets are
> > copied into DPDK mbufs. Three more copies are needed on the Tx side.
> > So to run a simple DPDK L2/L3 forwarding benchmark, each packet needs
> > 6 copies, which has a significant negative performance impact. We
> > had a bifurcated driver prototype that can do zero-copy and achieve
> > native DPDK performance, but it depends on base driver and AF_PACKET
> > code changes in the kernel. John R will be presenting it at the
> > coming Linux Plumbers Conference. Once the kernel adopts it, the
> > relevant PMD will be submitted to dpdk.org.
> 
> Admittedly, this is not as good a performer as most of the existing
> PMDs.  It serves a different purpose, after all.  FWIW, you did
> previously indicate that it performed better than the pcap-based PMD.

Yes, slightly higher, but it makes no big difference.

> I look forward to seeing the changes you mention -- they sound very
> exciting.  But, they will still require both networking core and
> driver changes in the kernel.  And as I understand things today,
> the userland code will still need at least some knowledge of specific
> devices and how they layout their packet descriptors, etc.  So while
> those changes sound very promising, they will still have certain
> drawbacks in common with the current situation.

Yes, we would like the DPDK performance optimization techniques, such as
huge pages, efficient rx/tx routines that manipulate device-specific
packet descriptors, and the polling model, to remain usable. We have to
trade off performance against commonality. But we believe it will be much
easier to develop DPDK PMDs for non-Intel NICs than to port entire kernel
drivers to DPDK.

> It seems like the changes you mention will still need some sort of
> AF_PACKET-based PMD driver.  Have you implemented that completely
> separate from the code I already posted?  Or did you add that work
> on top of mine?
> 

For the userland code, it certainly uses some of your code related to raw
sockets, but highly modified. A layer will be added to the eth_dev library
to do device probing and support new socket options.

> John
> 
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > > Sent: Saturday, September 13, 2014 2:05 AM
> > > To: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > >
> > > Ping?  Are there objections to this patch from mid-July?
> > >
> > > John
> > >
> > > On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > > AF_PACKET is used for frame reception.  In the current implementation,
> > > > Tx and Rx queues are always paired, and therefore are always equal
> > > > in number -- changing this would be a Simple Matter Of Programming.
> > > >
> > > > Interfaces of this type are created with a command line option like
> > > > "--vdev=eth_packet0,iface=...".  There are a number of options availabe
> > > > as arguments:
> > > >
> > > >  - Interface is chosen by "iface" (required)
> > > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > > >
> > > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > > ---
> > > > This PMD is intended to provide a means for using DPDK on a broad
> > > > range of hardware without hardware-specific PMDs and (hopefully)
> > > > with better performance than what PCAP offers in Linux.  This might
> > > > be useful as a development platform for DPDK applications when
> > > > DPDK-supported hardware is expensive or unavailable.
> > > >
> > > > New in v2:
> > > >
> > > > > -- fixup some style issues found by checkpatch
> > > > -- use if_index as part of fanout group ID
> > > > -- set default number of queue pairs to 1
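For readers following the fanout change in the v2 list above, a minimal sketch of how the PACKET_FANOUT argument is packed: the low 16 bits carry the group ID (pid XOR if_index), the high 16 bits carry the fanout type and flags. `make_fanout_arg()` is a hypothetical helper name. Using unsigned arithmetic is my own choice here, since shifting the flag bits left by 16 in a signed int would, as far as I can tell, overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: build the 32-bit value passed to
 * setsockopt(fd, SOL_PACKET, PACKET_FANOUT, ...).
 */
static uint32_t
make_fanout_arg(int pid, int if_index)
{
	/* Group ID: unique per process and per interface, 16 bits wide. */
	uint32_t group_id = ((uint32_t)pid ^ (uint32_t)if_index) & 0xffff;
	/* Hash-based distribution, with defragmentation and rollover. */
	uint32_t type_flags = PACKET_FANOUT_HASH |
			      PACKET_FANOUT_FLAG_DEFRAG |
			      PACKET_FANOUT_FLAG_ROLLOVER;

	return group_id | (type_flags << 16);
}
```

Because if_index is mixed into the group ID, two eth_packet devices created by the same process on different interfaces end up in different fanout groups, while all queue sockets of one device share a group.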
> > > >
> > > >  config/common_bsdapp                   |   5 +
> > > >  config/common_linuxapp                 |   5 +
> > > >  lib/Makefile                           |   1 +
> > > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > > >  mk/rte.app.mk                          |   4 +
> > > >  8 files changed, 957 insertions(+)
> > > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > > >
> > > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > > index 943dce8f1ede..c317f031278e 100644
> > > > --- a/config/common_bsdapp
> > > > +++ b/config/common_bsdapp
> > > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > >
> > > >  #
> > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > +#
> > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > > +
> > > > +#
> > > >  # Do prefetch of packet data within PMD driver receive function
> > > >  #
> > > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > > --- a/config/common_linuxapp
> > > > +++ b/config/common_linuxapp
> > > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > >
> > > >  #
> > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > +#
> > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > > +
> > > > +#
> > > >  # Compile Xen PMD
> > > >  #
> > > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > index 10c5bb3045bc..930fadf29898 100644
> > > > --- a/lib/Makefile
> > > > +++ b/lib/Makefile
> > > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > > index 756d6b0c9301..feed24a63272 100644
> > > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > > >  CFLAGS += $(WERROR_FLAGS) -O3
> > > >
> > > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > > new file mode 100644
> > > > index 000000000000..e1266fb992cd
> > > > --- /dev/null
> > > > +++ b/lib/librte_pmd_packet/Makefile
> > > > @@ -0,0 +1,60 @@
> > > > +#   BSD LICENSE
> > > > +#
> > > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > +#   Copyright(c) 2014 6WIND S.A.
> > > > +#   All rights reserved.
> > > > +#
> > > > +#   Redistribution and use in source and binary forms, with or without
> > > > +#   modification, are permitted provided that the following conditions
> > > > +#   are met:
> > > > +#
> > > > +#     * Redistributions of source code must retain the above copyright
> > > > +#       notice, this list of conditions and the following disclaimer.
> > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > +#       notice, this list of conditions and the following disclaimer in
> > > > +#       the documentation and/or other materials provided with the
> > > > +#       distribution.
> > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > +#       contributors may be used to endorse or promote products derived
> > > > +#       from this software without specific prior written permission.
> > > > +#
> > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > +
> > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > +
> > > > +#
> > > > +# library name
> > > > +#
> > > > +LIB = librte_pmd_packet.a
> > > > +
> > > > +CFLAGS += -O3
> > > > +CFLAGS += $(WERROR_FLAGS)
> > > > +
> > > > +#
> > > > +# all source are stored in SRCS-y
> > > > +#
> > > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > > +
> > > > +#
> > > > +# Export include files
> > > > +#
> > > > +SYMLINK-y-include += rte_eth_packet.h
> > > > +
> > > > +# this lib depends upon:
> > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > > +
> > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > new file mode 100644
> > > > index 000000000000..9c82d16e730f
> > > > --- /dev/null
> > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > @@ -0,0 +1,826 @@
> > > > +/*-
> > > > + *   BSD LICENSE
> > > > + *
> > > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > > + *
> > > > + *   Originally based upon librte_pmd_pcap code:
> > > > + *
> > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > + *   Copyright(c) 2014 6WIND S.A.
> > > > + *   All rights reserved.
> > > > + *
> > > > + *   Redistribution and use in source and binary forms, with or without
> > > > + *   modification, are permitted provided that the following conditions
> > > > + *   are met:
> > > > + *
> > > > + *     * Redistributions of source code must retain the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer.
> > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer in
> > > > + *       the documentation and/or other materials provided with the
> > > > + *       distribution.
> > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > + *       contributors may be used to endorse or promote products derived
> > > > + *       from this software without specific prior written permission.
> > > > + *
> > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > + */
> > > > +
> > > > +#include <rte_mbuf.h>
> > > > +#include <rte_ethdev.h>
> > > > +#include <rte_malloc.h>
> > > > +#include <rte_kvargs.h>
> > > > +#include <rte_dev.h>
> > > > +
> > > > +#include <linux/if_ether.h>
> > > > +#include <linux/if_packet.h>
> > > > +#include <arpa/inet.h>
> > > > +#include <net/if.h>
> > > > +#include <sys/types.h>
> > > > +#include <sys/socket.h>
> > > > +#include <sys/ioctl.h>
> > > > +#include <sys/mman.h>
> > > > +#include <unistd.h>
> > > > +#include <poll.h>
> > > > +
> > > > +#include "rte_eth_packet.h"
> > > > +
> > > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > > +
> > > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > > +
> > > > +struct pkt_rx_queue {
> > > > +	int sockfd;
> > > > +
> > > > +	struct iovec *rd;
> > > > +	uint8_t *map;
> > > > +	unsigned int framecount;
> > > > +	unsigned int framenum;
> > > > +
> > > > +	struct rte_mempool *mb_pool;
> > > > +
> > > > +	volatile unsigned long rx_pkts;
> > > > +	volatile unsigned long err_pkts;
> > > > +};
> > > > +
> > > > +struct pkt_tx_queue {
> > > > +	int sockfd;
> > > > +
> > > > +	struct iovec *rd;
> > > > +	uint8_t *map;
> > > > +	unsigned int framecount;
> > > > +	unsigned int framenum;
> > > > +
> > > > +	volatile unsigned long tx_pkts;
> > > > +	volatile unsigned long err_pkts;
> > > > +};
> > > > +
> > > > +struct pmd_internals {
> > > > +	unsigned nb_queues;
> > > > +
> > > > +	int if_index;
> > > > +	struct ether_addr eth_addr;
> > > > +
> > > > +	struct tpacket_req req;
> > > > +
> > > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > +};
> > > > +
> > > > +static const char *valid_arguments[] = {
> > > > +	ETH_PACKET_IFACE_ARG,
> > > > +	ETH_PACKET_NUM_Q_ARG,
> > > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > > +	NULL
> > > > +};
> > > > +
> > > > +static const char *drivername = "AF_PACKET PMD";
> > > > +
> > > > +static struct rte_eth_link pmd_link = {
> > > > +	.link_speed = 10000,
> > > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > > +	.link_status = 0
> > > > +};
> > > > +
> > > > +static uint16_t
> > > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > +{
> > > > +	unsigned i;
> > > > +	struct tpacket2_hdr *ppd;
> > > > +	struct rte_mbuf *mbuf;
> > > > +	uint8_t *pbuf;
> > > > +	struct pkt_rx_queue *pkt_q = queue;
> > > > +	uint16_t num_rx = 0;
> > > > +	unsigned int framecount, framenum;
> > > > +
> > > > +	if (unlikely(nb_pkts == 0))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > > +	 */
> > > > +	framecount = pkt_q->framecount;
> > > > +	framenum = pkt_q->framenum;
> > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > +		/* point at the next incoming frame */
> > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > > +			break;
> > > > +
> > > > +		/* allocate the next mbuf */
> > > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > > +		if (unlikely(mbuf == NULL))
> > > > +			break;
> > > > +
> > > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > > +
> > > > +		/* release incoming frame and advance ring buffer */
> > > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > > +		if (++framenum >= framecount)
> > > > +			framenum = 0;
> > > > +
> > > > +		/* account for the receive frame */
> > > > +		bufs[i] = mbuf;
> > > > +		num_rx++;
> > > > +	}
> > > > +	pkt_q->framenum = framenum;
> > > > +	pkt_q->rx_pkts += num_rx;
> > > > +	return num_rx;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Callback to handle sending packets through a real NIC.
> > > > + */
> > > > +static uint16_t
> > > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > +{
> > > > +	struct tpacket2_hdr *ppd;
> > > > +	struct rte_mbuf *mbuf;
> > > > +	uint8_t *pbuf;
> > > > +	unsigned int framecount, framenum;
> > > > +	struct pollfd pfd;
> > > > +	struct pkt_tx_queue *pkt_q = queue;
> > > > +	uint16_t num_tx = 0;
> > > > +	int i;
> > > > +
> > > > +	if (unlikely(nb_pkts == 0))
> > > > +		return 0;
> > > > +
> > > > +	memset(&pfd, 0, sizeof(pfd));
> > > > +	pfd.fd = pkt_q->sockfd;
> > > > +	pfd.events = POLLOUT;
> > > > +	pfd.revents = 0;
> > > > +
> > > > +	framecount = pkt_q->framecount;
> > > > +	framenum = pkt_q->framenum;
> > > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > +		/* point at the next incoming frame */
> > > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > > +		    (poll(&pfd, 1, -1) < 0))
> > > > +				continue;
> > > > +
> > > > +		/* copy the tx frame data */
> > > > +		mbuf = bufs[num_tx];
> > > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > > +			sizeof(struct sockaddr_ll);
> > > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > > +
> > > > +		/* release incoming frame and advance ring buffer */
> > > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > > +		if (++framenum >= framecount)
> > > > +			framenum = 0;
> > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > +
> > > > +		num_tx++;
> > > > +		rte_pktmbuf_free(mbuf);
> > > > +	}
> > > > +
> > > > +	/* kick-off transmits */
> > > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > > +
> > > > +	pkt_q->framenum = framenum;
> > > > +	pkt_q->tx_pkts += num_tx;
> > > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > > +	return num_tx;
> > > > +}
> > > > +
> > > > +static int
> > > > +eth_dev_start(struct rte_eth_dev *dev)
> > > > +{
> > > > +	dev->data->dev_link.link_status = 1;
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/*
> > > > + * This function gets called when the current port gets stopped.
> > > > + */
> > > > +static void
> > > > +eth_dev_stop(struct rte_eth_dev *dev)
> > > > +{
> > > > +	unsigned i;
> > > > +	int sockfd;
> > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > +
> > > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > > +		sockfd = internals->rx_queue[i].sockfd;
> > > > +		if (sockfd != -1)
> > > > +			close(sockfd);
> > > > +		sockfd = internals->tx_queue[i].sockfd;
> > > > +		if (sockfd != -1)
> > > > +			close(sockfd);
> > > > +	}
> > > > +
> > > > +	dev->data->dev_link.link_status = 0;
> > > > +}
> > > > +
> > > > +static int
> > > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > > > +{
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static void
> > > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > > > +{
> > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > +
> > > > +	dev_info->driver_name = drivername;
> > > > +	dev_info->if_index = internals->if_index;
> > > > +	dev_info->max_mac_addrs = 1;
> > > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > > +	dev_info->min_rx_bufsize = 0;
> > > > +	dev_info->pci_dev = NULL;
> > > > +}
> > > > +
> > > > +static void
> > > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > > > +{
> > > > +	unsigned i, imax;
> > > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > > +
> > > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > > +
> > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > +	for (i = 0; i < imax; i++) {
> > > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > > +		rx_total += igb_stats->q_ipackets[i];
> > > > +	}
> > > > +
> > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > +	for (i = 0; i < imax; i++) {
> > > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > > +		tx_total += igb_stats->q_opackets[i];
> > > > +		tx_err_total += igb_stats->q_errors[i];
> > > > +	}
> > > > +
> > > > +	igb_stats->ipackets = rx_total;
> > > > +	igb_stats->opackets = tx_total;
> > > > +	igb_stats->oerrors = tx_err_total;
> > > > +}
> > > > +
> > > > +static void
> > > > +eth_stats_reset(struct rte_eth_dev *dev)
> > > > +{
> > > > +	unsigned i;
> > > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > > +
> > > > +	for (i = 0; i < internal->nb_queues; i++)
> > > > +		internal->rx_queue[i].rx_pkts = 0;
> > > > +
> > > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > > +		internal->tx_queue[i].tx_pkts = 0;
> > > > +		internal->tx_queue[i].err_pkts = 0;
> > > > +	}
> > > > +}
> > > > +
> > > > +static void
> > > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > > > +{
> > > > +}
> > > > +
> > > > +static void
> > > > +eth_queue_release(void *q __rte_unused)
> > > > +{
> > > > +}
> > > > +
> > > > +static int
> > > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > > +                int wait_to_complete __rte_unused)
> > > > +{
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int
> > > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > > +                   uint16_t rx_queue_id,
> > > > +                   uint16_t nb_rx_desc __rte_unused,
> > > > +                   unsigned int socket_id __rte_unused,
> > > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > > +                   struct rte_mempool *mb_pool)
> > > > +{
> > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > > +	uint16_t buf_size;
> > > > +
> > > > +	pkt_q->mb_pool = mb_pool;
> > > > +
> > > > +	/* Now get the space available for data in the mbuf */
> > > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > > +	                       RTE_PKTMBUF_HEADROOM);
> > > > +
> > > > +	if (ETH_FRAME_LEN > buf_size) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > > +		return -ENOMEM;
> > > > +	}
> > > > +
> > > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int
> > > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > > +                   uint16_t tx_queue_id,
> > > > +                   uint16_t nb_tx_desc __rte_unused,
> > > > +                   unsigned int socket_id __rte_unused,
> > > > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > > > +{
> > > > +
> > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > +
> > > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static struct eth_dev_ops ops = {
> > > > +	.dev_start = eth_dev_start,
> > > > +	.dev_stop = eth_dev_stop,
> > > > +	.dev_close = eth_dev_close,
> > > > +	.dev_configure = eth_dev_configure,
> > > > +	.dev_infos_get = eth_dev_info,
> > > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > > +	.rx_queue_release = eth_queue_release,
> > > > +	.tx_queue_release = eth_queue_release,
> > > > +	.link_update = eth_link_update,
> > > > +	.stats_get = eth_stats_get,
> > > > +	.stats_reset = eth_stats_reset,
> > > > +};
> > > > +
> > > > +/*
> > > > + * Opens an AF_PACKET socket
> > > > + */
> > > > +static int
> > > > +open_packet_iface(const char *key __rte_unused,
> > > > +                  const char *value __rte_unused,
> > > > +                  void *extra_args)
> > > > +{
> > > > +	int *sockfd = extra_args;
> > > > +
> > > > +	/* Open an AF_PACKET socket... */
> > > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > +	if (*sockfd == -1) {
> > > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > > +		return -1;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int
> > > > +rte_pmd_init_internals(const char *name,
> > > > +                       const int sockfd,
> > > > +                       const unsigned nb_queues,
> > > > +                       unsigned int blocksize,
> > > > +                       unsigned int blockcnt,
> > > > +                       unsigned int framesize,
> > > > +                       unsigned int framecnt,
> > > > +                       const unsigned numa_node,
> > > > +                       struct pmd_internals **internals,
> > > > +                       struct rte_eth_dev **eth_dev,
> > > > +                       struct rte_kvargs *kvlist)
> > > > +{
> > > > +	struct rte_eth_dev_data *data = NULL;
> > > > +	struct rte_pci_device *pci_dev = NULL;
> > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > +	struct ifreq ifr;
> > > > +	size_t ifnamelen;
> > > > +	unsigned k_idx;
> > > > +	struct sockaddr_ll sockaddr;
> > > > +	struct tpacket_req *req;
> > > > +	struct pkt_rx_queue *rx_queue;
> > > > +	struct pkt_tx_queue *tx_queue;
> > > > +	int rc, tpver, discard, bypass;
> > > > +	unsigned int i, q, rdsize;
> > > > +	int qsockfd, fanout_arg;
> > > > +
> > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > +		pair = &kvlist->pairs[k_idx];
> > > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > > +			break;
> > > > +	}
> > > > +	if (pair == NULL) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > > +		        name);
> > > > +		goto error;
> > > > +	}
> > > > +
> > > > +	RTE_LOG(INFO, PMD,
> > > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > > +		name, numa_node);
> > > > +
> > > > +	/*
> > > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > > +	 * and internal (private) data
> > > > +	 */
> > > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > > +	if (data == NULL)
> > > > +		goto error;
> > > > +
> > > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > > +	if (pci_dev == NULL)
> > > > +		goto error;
> > > > +
> > > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > > +	                                0, numa_node);
> > > > +	if (*internals == NULL)
> > > > +		goto error;
> > > > +
> > > > +	req = &((*internals)->req);
> > > > +
> > > > +	req->tp_block_size = blocksize;
> > > > +	req->tp_block_nr = blockcnt;
> > > > +	req->tp_frame_size = framesize;
> > > > +	req->tp_frame_nr = framecnt;
> > > > +
> > > > +	ifnamelen = strlen(pair->value);
> > > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > > +	} else {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: I/F name too long (%s)\n",
> > > > +			name, pair->value);
> > > > +		goto error;
> > > > +	}
> > > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > > +		        name);
> > > > +		goto error;
> > > > +	}
> > > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > > +
> > > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > > +		        name);
> > > > +		goto error;
> > > > +	}
> > > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > > +
> > > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > > +	sockaddr.sll_family = AF_PACKET;
> > > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > > +
> > > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > > +
> > > > +	for (q = 0; q < nb_queues; q++) {
> > > > +		/* Open an AF_PACKET socket for this queue... */
> > > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > +		if (qsockfd == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +			        "%s: could not open AF_PACKET socket\n",
> > > > +			        name);
> > > > +			return -1;
> > > > +		}
> > > > +
> > > > +		tpver = TPACKET_V2;
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > > +				&tpver, sizeof(tpver));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > > +				"socket for %s\n", name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		discard = 1;
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > > +				&discard, sizeof(discard));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_LOSS on "
> > > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		bypass = 1;
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > > +				&bypass, sizeof(bypass));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > > +			        "on AF_PACKET socket for %s\n", name,
> > > > +			        pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > > +				"socket for %s\n", name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > > +				"socket for %s\n", name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > > +		rx_queue->framecount = req->tp_frame_nr;
> > > > +
> > > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > > > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > > > +				    qsockfd, 0);
> > > > +		if (rx_queue->map == MAP_FAILED) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > > +				name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		/* rdsize is same for both Tx and Rx */
> > > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > > +
> > > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > +		}
> > > > +		rx_queue->sockfd = qsockfd;
> > > > +
> > > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > > +		tx_queue->framecount = req->tp_frame_nr;
> > > > +
> > > > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > > > +
> > > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > +		}
> > > > +		tx_queue->sockfd = qsockfd;
> > > > +
> > > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > > +			        name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > > +				&fanout_arg, sizeof(fanout_arg));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > > > +				"for %s\n", name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	/* reserve an ethdev entry */
> > > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > > +	if (*eth_dev == NULL)
> > > > +		goto error;
> > > > +
> > > > +	/*
> > > > +	 * now put it all together
> > > > +	 * - store queue data in internals,
> > > > +	 * - store numa_node info in pci_driver
> > > > +	 * - point eth_dev_data to internals and pci_driver
> > > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > > +	 */
> > > > +
> > > > +	(*internals)->nb_queues = nb_queues;
> > > > +
> > > > +	data->dev_private = *internals;
> > > > +	data->port_id = (*eth_dev)->data->port_id;
> > > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > > +	data->dev_link = pmd_link;
> > > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > > +
> > > > +	pci_dev->numa_node = numa_node;
> > > > +
> > > > +	(*eth_dev)->data = data;
> > > > +	(*eth_dev)->dev_ops = &ops;
> > > > +	(*eth_dev)->pci_dev = pci_dev;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +error:
> > > > +	if (data)
> > > > +		rte_free(data);
> > > > +	if (pci_dev)
> > > > +		rte_free(pci_dev);
> > > > +	for (q = 0; q < nb_queues; q++) {
> > > > +		if ((*internals)->rx_queue[q].rd)
> > > > +			rte_free((*internals)->rx_queue[q].rd);
> > > > +		if ((*internals)->tx_queue[q].rd)
> > > > +			rte_free((*internals)->tx_queue[q].rd);
> > > > +	}
> > > > +	if (*internals)
> > > > +		rte_free(*internals);
> > > > +	return -1;
> > > > +}
> > > > +
> > > > +static int
> > > > +rte_eth_from_packet(const char *name,
> > > > +                    int const *sockfd,
> > > > +                    const unsigned numa_node,
> > > > +                    struct rte_kvargs *kvlist)
> > > > +{
> > > > +	struct pmd_internals *internals = NULL;
> > > > +	struct rte_eth_dev *eth_dev = NULL;
> > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > +	unsigned k_idx;
> > > > +	unsigned int blockcount;
> > > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > > +	unsigned int qpairs = 1;
> > > > +
> > > > +	/* do some parameter checking */
> > > > +	if (*sockfd < 0)
> > > > +		return -1;
> > > > +
> > > > +	/*
> > > > +	 * Walk arguments for configurable settings
> > > > +	 */
> > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > +		pair = &kvlist->pairs[k_idx];
> > > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > > +			qpairs = atoi(pair->value);
> > > > +			if (qpairs < 1 ||
> > > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > > +				RTE_LOG(ERR, PMD,
> > > > +					"%s: invalid qpairs value\n",
> > > > +				        name);
> > > > +				return -1;
> > > > +			}
> > > > +			continue;
> > > > +		}
> > > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > > +			blocksize = atoi(pair->value);
> > > > +			if (!blocksize) {
> > > > +				RTE_LOG(ERR, PMD,
> > > > +					"%s: invalid blocksize value\n",
> > > > +				        name);
> > > > +				return -1;
> > > > +			}
> > > > +			continue;
> > > > +		}
> > > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > > +			framesize = atoi(pair->value);
> > > > +			if (!framesize) {
> > > > +				RTE_LOG(ERR, PMD,
> > > > +					"%s: invalid framesize value\n",
> > > > +				        name);
> > > > +				return -1;
> > > > +			}
> > > > +			continue;
> > > > +		}
> > > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > > +			framecount = atoi(pair->value);
> > > > +			if (!framecount) {
> > > > +				RTE_LOG(ERR, PMD,
> > > > +					"%s: invalid framecount value\n",
> > > > +				        name);
> > > > +				return -1;
> > > > +			}
> > > > +			continue;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	if (framesize > blocksize) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > > +		        name);
> > > > +		return -1;
> > > > +	}
> > > > +
> > > > +	blockcount = framecount / (blocksize / framesize);
> > > > +	if (!blockcount) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > > +		return -1;
> > > > +	}
> > > > +
> > > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > > +
> > > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > > +	                           blocksize, blockcount,
> > > > +	                           framesize, framecount,
> > > > +	                           numa_node, &internals, &eth_dev,
> > > > +	                           kvlist) < 0)
> > > > +		return -1;
> > > > +
> > > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +int
> > > > +rte_pmd_packet_devinit(const char *name, const char *params)
> > > > +{
> > > > +	unsigned numa_node;
> > > > +	int ret;
> > > > +	struct rte_kvargs *kvlist;
> > > > +	int sockfd = -1;
> > > > +
> > > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > > +
> > > > +	numa_node = rte_socket_id();
> > > > +
> > > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > > +	if (kvlist == NULL)
> > > > +		return -1;
> > > > +
> > > > +	/*
> > > > +	 * If iface argument is passed we open the NICs and use them for
> > > > +	 * reading / writing
> > > > +	 */
> > > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > > +
> > > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > > +		                         &open_packet_iface, &sockfd);
> > > > +		if (ret < 0)
> > > > +			return -1;
> > > > +	}
> > > > +
> > > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > > +	close(sockfd); /* no longer needed */
> > > > +
> > > > +	if (ret < 0)
> > > > +		return -1;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static struct rte_driver pmd_packet_drv = {
> > > > +	.name = "eth_packet",
> > > > +	.type = PMD_VDEV,
> > > > +	.init = rte_pmd_packet_devinit,
> > > > +};
> > > > +
> > > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > new file mode 100644
> > > > index 000000000000..f685611da3e9
> > > > --- /dev/null
> > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > @@ -0,0 +1,55 @@
> > > > +/*-
> > > > + *   BSD LICENSE
> > > > + *
> > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > + *   All rights reserved.
> > > > + *
> > > > + *   Redistribution and use in source and binary forms, with or without
> > > > + *   modification, are permitted provided that the following conditions
> > > > + *   are met:
> > > > + *
> > > > + *     * Redistributions of source code must retain the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer.
> > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer in
> > > > + *       the documentation and/or other materials provided with the
> > > > + *       distribution.
> > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > + *       contributors may be used to endorse or promote products derived
> > > > + *       from this software without specific prior written permission.
> > > > + *
> > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > + */
> > > > +
> > > > +#ifndef _RTE_ETH_PACKET_H_
> > > > +#define _RTE_ETH_PACKET_H_
> > > > +
> > > > +#ifdef __cplusplus
> > > > +extern "C" {
> > > > +#endif
> > > > +
> > > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > > +
> > > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > > +
> > > > +/**
> > > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > > + * configured on command line.
> > > > + */
> > > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > > +
> > > > +#ifdef __cplusplus
> > > > +}
> > > > +#endif
> > > > +
> > > > +#endif
> > > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > > index 34dff2a02a05..a6994c4dbe93 100644
> > > > --- a/mk/rte.app.mk
> > > > +++ b/mk/rte.app.mk
> > > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > > >  endif
> > > >
> > > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > > +LDLIBS += -lrte_pmd_packet
> > > > +endif
> > > > +
> > > >  endif # plugins
> > > >
> > > >  LDLIBS += $(EXECENV_LDLIBS)
> > > > --
> > > > 1.9.3
> > > >
> > > >
> > >
> > > --
> > > John W. Linville		Someday the world will need a hero, and you
> > > linville@tuxdriver.com			might be all we have.  Be ready.
> >
> 
> --
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-12 20:35         ` Zhou, Danny
@ 2014-09-15 15:09           ` Neil Horman
  2014-09-15 15:15             ` John W. Linville
  2014-09-15 15:43             ` Zhou, Danny
  0 siblings, 2 replies; 76+ messages in thread
From: Neil Horman @ 2014-09-15 15:09 UTC (permalink / raw)
  To: Zhou, Danny; +Cc: dev

On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > -----Original Message-----
> > From: John W. Linville [mailto:linville@tuxdriver.com]
> > Sent: Saturday, September 13, 2014 2:54 AM
> > To: Zhou, Danny
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > 
> > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > I am concerned about the performance cost of so many memcpy()
> > > calls. Specifically, on the Rx side, the kernel NIC driver needs to copy
> > > packets to an skb, then af_packet copies packets to the AF_PACKET buffers,
> > > which are mapped to user space, and then those packets are copied
> > > to DPDK mbufs. In addition, 3 copies are needed on the Tx side. So to run a
> > > simple DPDK L2/L3 forwarding benchmark, each packet needs 6 packet
> > > copies, which has a significant negative performance impact. We
> > > had a bifurcated driver prototype that can do zero-copy and achieve
> > > native DPDK performance, but it depends on base driver and AF_PACKET
> > > code changes in the kernel. John R will be presenting it at the coming Linux
> > > Plumbers Conference. Once the kernel adopts it, the relevant PMD will be
> > > submitted to dpdk.org.
> > 
> > Admittedly, this is not as good a performer as most of the existing
> > PMDs.  It serves a different purpose, after all.  FWIW, you did
> > previously indicate that it performed better than the pcap-based PMD.
> 
> Yes, slightly higher but makes no big difference.
> 
Do you have numbers for this?  It seems to me faster is faster as long as it's
statistically significant.  Even if it's not, John's AF_PACKET PMD has the ability
to scale to multiple CPUs more easily than the pcap PMD, as it can make use of
the AF_PACKET fanout feature.
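
To make the fanout point concrete, here is a minimal sketch of how the
PACKET_FANOUT option value is composed, mirroring the fanout_arg computation
in the quoted patch: the group id sits in the low 16 bits and the mode plus
flags in the high 16 bits.  The helper names and the SKETCH_* macros are
hypothetical; the macro values are copied from <linux/if_packet.h> only so
the snippet builds without kernel headers.

```c
#include <stdint.h>

/* Local copies of the <linux/if_packet.h> values (assumption: unchanged). */
#define SKETCH_FANOUT_HASH          0x0000 /* PACKET_FANOUT_HASH */
#define SKETCH_FANOUT_FLAG_ROLLOVER 0x1000 /* PACKET_FANOUT_FLAG_ROLLOVER */
#define SKETCH_FANOUT_FLAG_DEFRAG   0x8000 /* PACKET_FANOUT_FLAG_DEFRAG */

/*
 * Build the 32-bit PACKET_FANOUT argument.  Sockets that pass the same
 * group id join one fanout group, and the kernel spreads flows across
 * them by hash.  The patch derives the id from getpid() and the ifindex.
 */
static uint32_t make_fanout_arg(uint32_t pid, uint32_t if_index)
{
	uint32_t group_id = (pid ^ if_index) & 0xffff;
	uint32_t type_flags = SKETCH_FANOUT_HASH |
			      SKETCH_FANOUT_FLAG_DEFRAG |
			      SKETCH_FANOUT_FLAG_ROLLOVER;

	return group_id | (type_flags << 16);
}

/* Split the argument back into its halves, for logging or checks. */
static uint16_t fanout_group_id(uint32_t arg)
{
	return (uint16_t)(arg & 0xffff);
}

static uint16_t fanout_type_flags(uint32_t arg)
{
	return (uint16_t)(arg >> 16);
}
```

Each per-queue socket would then pass the same value to
setsockopt(fd, SOL_PACKET, PACKET_FANOUT, ...), as the patch does after
bind(), so that one socket per queue (and so per core) shares the receive load.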

> > I look forward to seeing the changes you mention -- they sound very
> > exciting.  But, they will still require both networking core and
> > driver changes in the kernel.  And as I understand things today,
> > the userland code will still need at least some knowledge of specific
> > devices and how they layout their packet descriptors, etc.  So while
> > those changes sound very promising, they will still have certain
> > drawbacks in common with the current situation.
> 
> Yes, we would like DPDK performance optimization techniques, such as huge pages, efficient rx/tx routines that manipulate device-specific
> packet descriptors, and the polling model, to remain usable. We have to trade off between performance and commonality. But we believe it will be much easier
> to develop a DPDK PMD for non-Intel NICs than to port entire kernel drivers to DPDK.
> 

Not sure how this relates.  What you're describing is the feature Intel has been
working on to augment kernel drivers to provide better throughput via direct
hardware access to user space.  John's PMD provides ubiquitous function on all
hardware. I'm not sure how the desire for one implies the other isn't valuable.

> > It seems like the changes you mention will still need some sort of
> > AF_PACKET-based PMD driver.  Have you implemented that completely
> > separate from the code I already posted?  Or did you add that work
> > on top of mine?
> > 
> 
> For userland code, it certainly uses some of your code related to raw sockets, but highly modified. A layer will be added into the eth_dev library to do device
> probe and support new socket options.
> 

Ok, but again, PMDs are independent, and serve different needs.  If their use
is at all overlapping from a functional standpoint, take this one now, and
deprecate it when a better one comes along.  Though from your description it
seems like both have a valid place in the ecosystem.

Neil

> > John
> > 
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > > > Sent: Saturday, September 13, 2014 2:05 AM
> > > > To: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > >
> > > > Ping?  Are there objections to this patch from mid-July?
> > > >
> > > > John
> > > >
> > > > On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > > > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > > > AF_PACKET is used for frame reception.  In the current implementation,
> > > > > Tx and Rx queues are always paired, and therefore are always equal
> > > > > in number -- changing this would be a Simple Matter Of Programming.
> > > > >
> > > > > Interfaces of this type are created with a command line option like
> > > > > > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > > > > as arguments:
> > > > >
> > > > >  - Interface is chosen by "iface" (required)
> > > > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > > > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > > > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > > > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > > > >
> > > > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > > > ---
> > > > > This PMD is intended to provide a means for using DPDK on a broad
> > > > > range of hardware without hardware-specific PMDs and (hopefully)
> > > > > with better performance than what PCAP offers in Linux.  This might
> > > > > be useful as a development platform for DPDK applications when
> > > > > DPDK-supported hardware is expensive or unavailable.
> > > > >
> > > > > New in v2:
> > > > >
> > > > > > -- fixup some style issues found by checkpatch
> > > > > -- use if_index as part of fanout group ID
> > > > > -- set default number of queue pairs to 1
> > > > >
> > > > >  config/common_bsdapp                   |   5 +
> > > > >  config/common_linuxapp                 |   5 +
> > > > >  lib/Makefile                           |   1 +
> > > > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > > > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > > > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > > > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > > > >  mk/rte.app.mk                          |   4 +
> > > > >  8 files changed, 957 insertions(+)
> > > > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > > > >
> > > > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > > > index 943dce8f1ede..c317f031278e 100644
> > > > > --- a/config/common_bsdapp
> > > > > +++ b/config/common_bsdapp
> > > > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > >
> > > > >  #
> > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > +#
> > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > > > +
> > > > > +#
> > > > >  # Do prefetch of packet data within PMD driver receive function
> > > > >  #
> > > > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > > > --- a/config/common_linuxapp
> > > > > +++ b/config/common_linuxapp
> > > > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > >
> > > > >  #
> > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > +#
> > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > > > +
> > > > > +#
> > > > >  # Compile Xen PMD
> > > > >  #
> > > > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > > index 10c5bb3045bc..930fadf29898 100644
> > > > > --- a/lib/Makefile
> > > > > +++ b/lib/Makefile
> > > > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > index 756d6b0c9301..feed24a63272 100644
> > > > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > > > >  CFLAGS += $(WERROR_FLAGS) -O3
> > > > >
> > > > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > > > new file mode 100644
> > > > > index 000000000000..e1266fb992cd
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_pmd_packet/Makefile
> > > > > @@ -0,0 +1,60 @@
> > > > > +#   BSD LICENSE
> > > > > +#
> > > > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > +#   Copyright(c) 2014 6WIND S.A.
> > > > > +#   All rights reserved.
> > > > > +#
> > > > > +#   Redistribution and use in source and binary forms, with or without
> > > > > +#   modification, are permitted provided that the following conditions
> > > > > +#   are met:
> > > > > +#
> > > > > +#     * Redistributions of source code must retain the above copyright
> > > > > +#       notice, this list of conditions and the following disclaimer.
> > > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > > +#       notice, this list of conditions and the following disclaimer in
> > > > > +#       the documentation and/or other materials provided with the
> > > > > +#       distribution.
> > > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > > +#       contributors may be used to endorse or promote products derived
> > > > > +#       from this software without specific prior written permission.
> > > > > +#
> > > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > +
> > > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > > +
> > > > > +#
> > > > > +# library name
> > > > > +#
> > > > > +LIB = librte_pmd_packet.a
> > > > > +
> > > > > +CFLAGS += -O3
> > > > > +CFLAGS += $(WERROR_FLAGS)
> > > > > +
> > > > > +#
> > > > > +# all source are stored in SRCS-y
> > > > > +#
> > > > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > > > +
> > > > > +#
> > > > > +# Export include files
> > > > > +#
> > > > > +SYMLINK-y-include += rte_eth_packet.h
> > > > > +
> > > > > +# this lib depends upon:
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > > > +
> > > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > new file mode 100644
> > > > > index 000000000000..9c82d16e730f
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > @@ -0,0 +1,826 @@
> > > > > +/*-
> > > > > + *   BSD LICENSE
> > > > > + *
> > > > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > > > + *
> > > > > + *   Originally based upon librte_pmd_pcap code:
> > > > > + *
> > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > + *   Copyright(c) 2014 6WIND S.A.
> > > > > + *   All rights reserved.
> > > > > + *
> > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > + *   modification, are permitted provided that the following conditions
> > > > > + *   are met:
> > > > > + *
> > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > + *       the documentation and/or other materials provided with the
> > > > > + *       distribution.
> > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > + *       contributors may be used to endorse or promote products derived
> > > > > + *       from this software without specific prior written permission.
> > > > > + *
> > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > + */
> > > > > +
> > > > > +#include <rte_mbuf.h>
> > > > > +#include <rte_ethdev.h>
> > > > > +#include <rte_malloc.h>
> > > > > +#include <rte_kvargs.h>
> > > > > +#include <rte_dev.h>
> > > > > +
> > > > > +#include <linux/if_ether.h>
> > > > > +#include <linux/if_packet.h>
> > > > > +#include <arpa/inet.h>
> > > > > +#include <net/if.h>
> > > > > +#include <sys/types.h>
> > > > > +#include <sys/socket.h>
> > > > > +#include <sys/ioctl.h>
> > > > > +#include <sys/mman.h>
> > > > > +#include <unistd.h>
> > > > > +#include <poll.h>
> > > > > +
> > > > > +#include "rte_eth_packet.h"
> > > > > +
> > > > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > > > +
> > > > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > > > +
> > > > > +struct pkt_rx_queue {
> > > > > +	int sockfd;
> > > > > +
> > > > > +	struct iovec *rd;
> > > > > +	uint8_t *map;
> > > > > +	unsigned int framecount;
> > > > > +	unsigned int framenum;
> > > > > +
> > > > > +	struct rte_mempool *mb_pool;
> > > > > +
> > > > > +	volatile unsigned long rx_pkts;
> > > > > +	volatile unsigned long err_pkts;
> > > > > +};
> > > > > +
> > > > > +struct pkt_tx_queue {
> > > > > +	int sockfd;
> > > > > +
> > > > > +	struct iovec *rd;
> > > > > +	uint8_t *map;
> > > > > +	unsigned int framecount;
> > > > > +	unsigned int framenum;
> > > > > +
> > > > > +	volatile unsigned long tx_pkts;
> > > > > +	volatile unsigned long err_pkts;
> > > > > +};
> > > > > +
> > > > > +struct pmd_internals {
> > > > > +	unsigned nb_queues;
> > > > > +
> > > > > +	int if_index;
> > > > > +	struct ether_addr eth_addr;
> > > > > +
> > > > > +	struct tpacket_req req;
> > > > > +
> > > > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > +};
> > > > > +
> > > > > +static const char *valid_arguments[] = {
> > > > > +	ETH_PACKET_IFACE_ARG,
> > > > > +	ETH_PACKET_NUM_Q_ARG,
> > > > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > > > +	NULL
> > > > > +};
> > > > > +
> > > > > +static const char *drivername = "AF_PACKET PMD";
> > > > > +
> > > > > +static struct rte_eth_link pmd_link = {
> > > > > +	.link_speed = 10000,
> > > > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > > > +	.link_status = 0
> > > > > +};
> > > > > +
> > > > > +static uint16_t
> > > > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > +{
> > > > > +	unsigned i;
> > > > > +	struct tpacket2_hdr *ppd;
> > > > > +	struct rte_mbuf *mbuf;
> > > > > +	uint8_t *pbuf;
> > > > > +	struct pkt_rx_queue *pkt_q = queue;
> > > > > +	uint16_t num_rx = 0;
> > > > > +	unsigned int framecount, framenum;
> > > > > +
> > > > > +	if (unlikely(nb_pkts == 0))
> > > > > +		return 0;
> > > > > +
> > > > > +	/*
> > > > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > > > +	 */
> > > > > +	framecount = pkt_q->framecount;
> > > > > +	framenum = pkt_q->framenum;
> > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > +		/* point at the next incoming frame */
> > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > > > +			break;
> > > > > +
> > > > > +		/* allocate the next mbuf */
> > > > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > > > +		if (unlikely(mbuf == NULL))
> > > > > +			break;
> > > > > +
> > > > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > > > +
> > > > > +		/* release incoming frame and advance ring buffer */
> > > > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > > > +		if (++framenum >= framecount)
> > > > > +			framenum = 0;
> > > > > +
> > > > > +		/* account for the receive frame */
> > > > > +		bufs[i] = mbuf;
> > > > > +		num_rx++;
> > > > > +	}
> > > > > +	pkt_q->framenum = framenum;
> > > > > +	pkt_q->rx_pkts += num_rx;
> > > > > +	return num_rx;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Callback to handle sending packets through a real NIC.
> > > > > + */
> > > > > +static uint16_t
> > > > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > +{
> > > > > +	struct tpacket2_hdr *ppd;
> > > > > +	struct rte_mbuf *mbuf;
> > > > > +	uint8_t *pbuf;
> > > > > +	unsigned int framecount, framenum;
> > > > > +	struct pollfd pfd;
> > > > > +	struct pkt_tx_queue *pkt_q = queue;
> > > > > +	uint16_t num_tx = 0;
> > > > > +	int i;
> > > > > +
> > > > > +	if (unlikely(nb_pkts == 0))
> > > > > +		return 0;
> > > > > +
> > > > > +	memset(&pfd, 0, sizeof(pfd));
> > > > > +	pfd.fd = pkt_q->sockfd;
> > > > > +	pfd.events = POLLOUT;
> > > > > +	pfd.revents = 0;
> > > > > +
> > > > > +	framecount = pkt_q->framecount;
> > > > > +	framenum = pkt_q->framenum;
> > > > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > > +		/* wait until the next outgoing frame is available */
> > > > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > > > +		    (poll(&pfd, 1, -1) < 0))
> > > > > +				continue;
> > > > > +
> > > > > +		/* copy the tx frame data */
> > > > > +		mbuf = bufs[num_tx];
> > > > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > > > +			sizeof(struct sockaddr_ll);
> > > > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > > > +
> > > > > > +		/* release outgoing frame and advance ring buffer */
> > > > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > > > +		if (++framenum >= framecount)
> > > > > +			framenum = 0;
> > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > +
> > > > > +		num_tx++;
> > > > > +		rte_pktmbuf_free(mbuf);
> > > > > +	}
> > > > > +
> > > > > +	/* kick-off transmits */
> > > > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > > > +
> > > > > +	pkt_q->framenum = framenum;
> > > > > +	pkt_q->tx_pkts += num_tx;
> > > > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > > > +	return num_tx;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +eth_dev_start(struct rte_eth_dev *dev)
> > > > > +{
> > > > > +	dev->data->dev_link.link_status = 1;
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * This function gets called when the current port gets stopped.
> > > > > + */
> > > > > +static void
> > > > > +eth_dev_stop(struct rte_eth_dev *dev)
> > > > > +{
> > > > > +	unsigned i;
> > > > > +	int sockfd;
> > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > +
> > > > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > > > +		sockfd = internals->rx_queue[i].sockfd;
> > > > > +		if (sockfd != -1)
> > > > > +			close(sockfd);
> > > > > +		sockfd = internals->tx_queue[i].sockfd;
> > > > > +		if (sockfd != -1)
> > > > > +			close(sockfd);
> > > > > +	}
> > > > > +
> > > > > +	dev->data->dev_link.link_status = 0;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > > > > +{
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > > > > +{
> > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > +
> > > > > +	dev_info->driver_name = drivername;
> > > > > +	dev_info->if_index = internals->if_index;
> > > > > +	dev_info->max_mac_addrs = 1;
> > > > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > > > +	dev_info->min_rx_bufsize = 0;
> > > > > +	dev_info->pci_dev = NULL;
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > > > > +{
> > > > > +	unsigned i, imax;
> > > > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > > > +
> > > > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > > > +
> > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > +	for (i = 0; i < imax; i++) {
> > > > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > > > +		rx_total += igb_stats->q_ipackets[i];
> > > > > +	}
> > > > > +
> > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > +	for (i = 0; i < imax; i++) {
> > > > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > > > +		tx_total += igb_stats->q_opackets[i];
> > > > > +		tx_err_total += igb_stats->q_errors[i];
> > > > > +	}
> > > > > +
> > > > > +	igb_stats->ipackets = rx_total;
> > > > > +	igb_stats->opackets = tx_total;
> > > > > +	igb_stats->oerrors = tx_err_total;
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +eth_stats_reset(struct rte_eth_dev *dev)
> > > > > +{
> > > > > +	unsigned i;
> > > > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > > > +
> > > > > +	for (i = 0; i < internal->nb_queues; i++)
> > > > > +		internal->rx_queue[i].rx_pkts = 0;
> > > > > +
> > > > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > > > +		internal->tx_queue[i].tx_pkts = 0;
> > > > > +		internal->tx_queue[i].err_pkts = 0;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > > > > +{
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +eth_queue_release(void *q __rte_unused)
> > > > > +{
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > > > +                int wait_to_complete __rte_unused)
> > > > > +{
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > > > +                   uint16_t rx_queue_id,
> > > > > +                   uint16_t nb_rx_desc __rte_unused,
> > > > > +                   unsigned int socket_id __rte_unused,
> > > > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > > > +                   struct rte_mempool *mb_pool)
> > > > > +{
> > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > > > +	uint16_t buf_size;
> > > > > +
> > > > > +	pkt_q->mb_pool = mb_pool;
> > > > > +
> > > > > +	/* Now get the space available for data in the mbuf */
> > > > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > > > +	                       RTE_PKTMBUF_HEADROOM);
> > > > > +
> > > > > +	if (ETH_FRAME_LEN > buf_size) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > > > +		return -ENOMEM;
> > > > > +	}
> > > > > +
> > > > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > > > +                   uint16_t tx_queue_id,
> > > > > +                   uint16_t nb_tx_desc __rte_unused,
> > > > > +                   unsigned int socket_id __rte_unused,
> > > > > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > > > > +{
> > > > > +
> > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > +
> > > > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static struct eth_dev_ops ops = {
> > > > > +	.dev_start = eth_dev_start,
> > > > > +	.dev_stop = eth_dev_stop,
> > > > > +	.dev_close = eth_dev_close,
> > > > > +	.dev_configure = eth_dev_configure,
> > > > > +	.dev_infos_get = eth_dev_info,
> > > > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > > > +	.rx_queue_release = eth_queue_release,
> > > > > +	.tx_queue_release = eth_queue_release,
> > > > > +	.link_update = eth_link_update,
> > > > > +	.stats_get = eth_stats_get,
> > > > > +	.stats_reset = eth_stats_reset,
> > > > > +};
> > > > > +
> > > > > +/*
> > > > > + * Opens an AF_PACKET socket
> > > > > + */
> > > > > +static int
> > > > > +open_packet_iface(const char *key __rte_unused,
> > > > > +                  const char *value __rte_unused,
> > > > > +                  void *extra_args)
> > > > > +{
> > > > > +	int *sockfd = extra_args;
> > > > > +
> > > > > +	/* Open an AF_PACKET socket... */
> > > > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > +	if (*sockfd == -1) {
> > > > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > > > +		return -1;
> > > > > +	}
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +rte_pmd_init_internals(const char *name,
> > > > > +                       const int sockfd,
> > > > > +                       const unsigned nb_queues,
> > > > > +                       unsigned int blocksize,
> > > > > +                       unsigned int blockcnt,
> > > > > +                       unsigned int framesize,
> > > > > +                       unsigned int framecnt,
> > > > > +                       const unsigned numa_node,
> > > > > +                       struct pmd_internals **internals,
> > > > > +                       struct rte_eth_dev **eth_dev,
> > > > > +                       struct rte_kvargs *kvlist)
> > > > > +{
> > > > > +	struct rte_eth_dev_data *data = NULL;
> > > > > +	struct rte_pci_device *pci_dev = NULL;
> > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > +	struct ifreq ifr;
> > > > > +	size_t ifnamelen;
> > > > > +	unsigned k_idx;
> > > > > +	struct sockaddr_ll sockaddr;
> > > > > +	struct tpacket_req *req;
> > > > > +	struct pkt_rx_queue *rx_queue;
> > > > > +	struct pkt_tx_queue *tx_queue;
> > > > > +	int rc, tpver, discard, bypass;
> > > > > +	unsigned int i, q, rdsize;
> > > > > +	int qsockfd, fanout_arg;
> > > > > +
> > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > > > +			break;
> > > > > +	}
> > > > > +	if (pair == NULL) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > > > +		        name);
> > > > > +		goto error;
> > > > > +	}
> > > > > +
> > > > > +	RTE_LOG(INFO, PMD,
> > > > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > > > +		name, numa_node);
> > > > > +
> > > > > +	/*
> > > > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > > > +	 * and internal (private) data
> > > > > +	 */
> > > > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > > > +	if (data == NULL)
> > > > > +		goto error;
> > > > > +
> > > > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > > > +	if (pci_dev == NULL)
> > > > > +		goto error;
> > > > > +
> > > > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > > > +	                                0, numa_node);
> > > > > +	if (*internals == NULL)
> > > > > +		goto error;
> > > > > +
> > > > > +	req = &((*internals)->req);
> > > > > +
> > > > > +	req->tp_block_size = blocksize;
> > > > > +	req->tp_block_nr = blockcnt;
> > > > > +	req->tp_frame_size = framesize;
> > > > > +	req->tp_frame_nr = framecnt;
> > > > > +
> > > > > +	ifnamelen = strlen(pair->value);
> > > > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > > > +	} else {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: I/F name too long (%s)\n",
> > > > > +			name, pair->value);
> > > > > +		goto error;
> > > > > +	}
> > > > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > > > +		        name);
> > > > > +		goto error;
> > > > > +	}
> > > > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > > > +
> > > > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > > > +		        name);
> > > > > +		goto error;
> > > > > +	}
> > > > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > > > +
> > > > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > > > +	sockaddr.sll_family = AF_PACKET;
> > > > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > > > +
> > > > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > > > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > > > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > > > +
> > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > +		/* Open an AF_PACKET socket for this queue... */
> > > > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > +		if (qsockfd == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +			        "%s: could not open AF_PACKET socket\n",
> > > > > +			        name);
> > > > > +			return -1;
> > > > > +		}
> > > > > +
> > > > > +		tpver = TPACKET_V2;
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > > > +				&tpver, sizeof(tpver));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > > > +				"socket for %s\n", name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		discard = 1;
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > > > +				&discard, sizeof(discard));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_LOSS on "
> > > > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		bypass = 1;
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > > > +				&bypass, sizeof(bypass));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > > > +			        "on AF_PACKET socket for %s\n", name,
> > > > > +			        pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > > > +				"socket for %s\n", name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > > > +				"socket for %s\n", name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > > > +		rx_queue->framecount = req->tp_frame_nr;
> > > > > +
> > > > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > > > > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > > > > +				    qsockfd, 0);
> > > > > +		if (rx_queue->map == MAP_FAILED) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > > > +				name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		/* rdsize is same for both Tx and Rx */
> > > > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > > > +
> > > > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > +		}
> > > > > +		rx_queue->sockfd = qsockfd;
> > > > > +
> > > > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > > > +		tx_queue->framecount = req->tp_frame_nr;
> > > > > +
> > > > > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > > > > +
> > > > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > +		}
> > > > > +		tx_queue->sockfd = qsockfd;
> > > > > +
> > > > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > > > +			        name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > > > +				&fanout_arg, sizeof(fanout_arg));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > > > > +				"for %s\n", name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	/* reserve an ethdev entry */
> > > > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > > > +	if (*eth_dev == NULL)
> > > > > +		goto error;
> > > > > +
> > > > > +	/*
> > > > > +	 * now put it all together
> > > > > +	 * - store queue data in internals,
> > > > > +	 * - store numa_node info in pci_driver
> > > > > +	 * - point eth_dev_data to internals and pci_driver
> > > > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > > > +	 */
> > > > > +
> > > > > +	(*internals)->nb_queues = nb_queues;
> > > > > +
> > > > > +	data->dev_private = *internals;
> > > > > +	data->port_id = (*eth_dev)->data->port_id;
> > > > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > > > +	data->dev_link = pmd_link;
> > > > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > > > +
> > > > > +	pci_dev->numa_node = numa_node;
> > > > > +
> > > > > +	(*eth_dev)->data = data;
> > > > > +	(*eth_dev)->dev_ops = &ops;
> > > > > +	(*eth_dev)->pci_dev = pci_dev;
> > > > > +
> > > > > +	return 0;
> > > > > +
> > > > > +error:
> > > > > +	if (data)
> > > > > +		rte_free(data);
> > > > > +	if (pci_dev)
> > > > > +		rte_free(pci_dev);
> > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > +		if ((*internals)->rx_queue[q].rd)
> > > > > +			rte_free((*internals)->rx_queue[q].rd);
> > > > > +		if ((*internals)->tx_queue[q].rd)
> > > > > +			rte_free((*internals)->tx_queue[q].rd);
> > > > > +	}
> > > > > +	if (*internals)
> > > > > +		rte_free(*internals);
> > > > > +	return -1;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +rte_eth_from_packet(const char *name,
> > > > > +                    int const *sockfd,
> > > > > +                    const unsigned numa_node,
> > > > > +                    struct rte_kvargs *kvlist)
> > > > > +{
> > > > > +	struct pmd_internals *internals = NULL;
> > > > > +	struct rte_eth_dev *eth_dev = NULL;
> > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > +	unsigned k_idx;
> > > > > +	unsigned int blockcount;
> > > > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > > > +	unsigned int qpairs = 1;
> > > > > +
> > > > > +	/* do some parameter checking */
> > > > > +	if (*sockfd < 0)
> > > > > +		return -1;
> > > > > +
> > > > > +	/*
> > > > > +	 * Walk arguments for configurable settings
> > > > > +	 */
> > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > > > +			qpairs = atoi(pair->value);
> > > > > +			if (qpairs < 1 ||
> > > > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > > > +				RTE_LOG(ERR, PMD,
> > > > > +					"%s: invalid qpairs value\n",
> > > > > +				        name);
> > > > > +				return -1;
> > > > > +			}
> > > > > +			continue;
> > > > > +		}
> > > > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > > > +			blocksize = atoi(pair->value);
> > > > > +			if (!blocksize) {
> > > > > +				RTE_LOG(ERR, PMD,
> > > > > +					"%s: invalid blocksize value\n",
> > > > > +				        name);
> > > > > +				return -1;
> > > > > +			}
> > > > > +			continue;
> > > > > +		}
> > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > > > +			framesize = atoi(pair->value);
> > > > > +			if (!framesize) {
> > > > > +				RTE_LOG(ERR, PMD,
> > > > > +					"%s: invalid framesize value\n",
> > > > > +				        name);
> > > > > +				return -1;
> > > > > +			}
> > > > > +			continue;
> > > > > +		}
> > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > > > +			framecount = atoi(pair->value);
> > > > > +			if (!framecount) {
> > > > > +				RTE_LOG(ERR, PMD,
> > > > > +					"%s: invalid framecount value\n",
> > > > > +				        name);
> > > > > +				return -1;
> > > > > +			}
> > > > > +			continue;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	if (framesize > blocksize) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > > > +		        name);
> > > > > +		return -1;
> > > > > +	}
> > > > > +
> > > > > +	blockcount = framecount / (blocksize / framesize);
> > > > > +	if (!blockcount) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > > > +		return -1;
> > > > > +	}
> > > > > +
> > > > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > > > +
> > > > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > > > +	                           blocksize, blockcount,
> > > > > +	                           framesize, framecount,
> > > > > +	                           numa_node, &internals, &eth_dev,
> > > > > +	                           kvlist) < 0)
> > > > > +		return -1;
> > > > > +
> > > > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +int
> > > > > +rte_pmd_packet_devinit(const char *name, const char *params)
> > > > > +{
> > > > > +	unsigned numa_node;
> > > > > +	int ret;
> > > > > +	struct rte_kvargs *kvlist;
> > > > > +	int sockfd = -1;
> > > > > +
> > > > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > > > +
> > > > > +	numa_node = rte_socket_id();
> > > > > +
> > > > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > > > +	if (kvlist == NULL)
> > > > > +		return -1;
> > > > > +
> > > > > +	/*
> > > > > +	 * If iface argument is passed we open the NICs and use them for
> > > > > +	 * reading / writing
> > > > > +	 */
> > > > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > > > +
> > > > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > > > +		                         &open_packet_iface, &sockfd);
> > > > > +		if (ret < 0)
> > > > > +			return -1;
> > > > > +	}
> > > > > +
> > > > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > > > +	close(sockfd); /* no longer needed */
> > > > > +
> > > > > +	if (ret < 0)
> > > > > +		return -1;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static struct rte_driver pmd_packet_drv = {
> > > > > +	.name = "eth_packet",
> > > > > +	.type = PMD_VDEV,
> > > > > +	.init = rte_pmd_packet_devinit,
> > > > > +};
> > > > > +
> > > > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > new file mode 100644
> > > > > index 000000000000..f685611da3e9
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > @@ -0,0 +1,55 @@
> > > > > +/*-
> > > > > + *   BSD LICENSE
> > > > > + *
> > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > + *   All rights reserved.
> > > > > + *
> > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > + *   modification, are permitted provided that the following conditions
> > > > > + *   are met:
> > > > > + *
> > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > + *       the documentation and/or other materials provided with the
> > > > > + *       distribution.
> > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > + *       contributors may be used to endorse or promote products derived
> > > > > + *       from this software without specific prior written permission.
> > > > > + *
> > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > + */
> > > > > +
> > > > > +#ifndef _RTE_ETH_PACKET_H_
> > > > > +#define _RTE_ETH_PACKET_H_
> > > > > +
> > > > > +#ifdef __cplusplus
> > > > > +extern "C" {
> > > > > +#endif
> > > > > +
> > > > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > > > +
> > > > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > > > +
> > > > > +/**
> > > > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > > > + * configured on command line.
> > > > > + */
> > > > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > > > +
> > > > > +#ifdef __cplusplus
> > > > > +}
> > > > > +#endif
> > > > > +
> > > > > +#endif
> > > > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > > > index 34dff2a02a05..a6994c4dbe93 100644
> > > > > --- a/mk/rte.app.mk
> > > > > +++ b/mk/rte.app.mk
> > > > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > > > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > > > >  endif
> > > > >
> > > > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > > > +LDLIBS += -lrte_pmd_packet
> > > > > +endif
> > > > > +
> > > > >  endif # plugins
> > > > >
> > > > >  LDLIBS += $(EXECENV_LDLIBS)
> > > > > --
> > > > > 1.9.3
> > > > >
> > > > >
> > > >
> > > > --
> > > > John W. Linville		Someday the world will need a hero, and you
> > > > linville@tuxdriver.com			might be all we have.  Be ready.
> > >
> > 
> > --
> > John W. Linville		Someday the world will need a hero, and you
> > linville@tuxdriver.com			might be all we have.  Be ready.
> 
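
For readers following the ring sizing in the quoted rte_eth_from_packet(),
the arithmetic can be sketched on its own. This is a minimal sketch, not code
from the patch: ring_block_count() and ring_map_bytes() are hypothetical
helper names, and the checks only mirror the ones visible in the quoted code
(non-zero values, frame size no larger than block size, and a factor of two
in the mmap length because the Rx and Tx rings share one mapping).

```c
#include <stddef.h>

/*
 * Hypothetical helper mirroring the quoted block-count calculation.
 * Returns the number of blocks, or 0 when the parameters would be
 * rejected (zero values, frame larger than block, or fewer frames
 * than fit in a single block).
 */
static unsigned int
ring_block_count(unsigned int blocksize, unsigned int framesize,
		 unsigned int framecount)
{
	unsigned int frames_per_block;

	if (blocksize == 0 || framesize == 0 || framecount == 0)
		return 0;
	if (framesize > blocksize)
		return 0;

	/* integer division; at least 1 because framesize <= blocksize */
	frames_per_block = blocksize / framesize;

	return framecount / frames_per_block;
}

/*
 * Hypothetical helper: length of the single mmap() region that holds
 * the Rx ring followed by the Tx ring for one queue pair.
 */
static size_t
ring_map_bytes(unsigned int blocksize, unsigned int blockcount)
{
	return 2 * (size_t)blocksize * blockcount;
}
```

With the documented defaults (blocksz 4096, framesz 2048, framecnt 512) this
gives 2 frames per block, 256 blocks, and a 2 MiB mapping per queue pair.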


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-15 15:09           ` Neil Horman
@ 2014-09-15 15:15             ` John W. Linville
  2014-09-15 15:43             ` Zhou, Danny
  1 sibling, 0 replies; 76+ messages in thread
From: John W. Linville @ 2014-09-15 15:15 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Mon, Sep 15, 2014 at 11:09:46AM -0400, Neil Horman wrote:
> On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > > -----Original Message-----
> > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > Sent: Saturday, September 13, 2014 2:54 AM
> > > To: Zhou, Danny
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > 
> > > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > > I am concerned about its performance, which suffers from too many
> > > > memcpy() calls. Specifically, on the Rx side the kernel NIC driver needs
> > > > to copy packets to an skb, then af_packet copies them to the AF_PACKET
> > > > buffers that are mapped to user space, and then those packets are copied
> > > > to DPDK mbufs. In addition, 3 copies are needed on the Tx side. So to run
> > > > a simple DPDK L2/L3 forwarding benchmark, each packet needs 6 copies,
> > > > which has a significant negative performance impact. We
> > > > had a bifurcated driver prototype that can do zero-copy and achieve
> > > > native DPDK performance, but it depends on base driver and AF_PACKET
> > > > code changes in the kernel. John R will be presenting it at the coming
> > > > Linux Plumbers Conference. Once the kernel adopts it, the relevant PMD
> > > > will be submitted to dpdk.org.
> > > 
> > > Admittedly, this is not as good a performer as most of the existing
> > > PMDs.  It serves a different purpose, after all.  FWIW, you did
> > > previously indicate that it performed better than the pcap-based PMD.
> > 
> > Yes, slightly higher but makes no big difference.
> > 
> Do you have numbers for this?  It seems to me faster is faster as long as it's
> statistically significant.  Even if it's not, John's AF_PACKET PMD has the ability
> to scale to multiple CPUs more easily than the pcap PMD, as it can make use of
> the AF_PACKET fanout feature.
> 
> > > I look forward to seeing the changes you mention -- they sound very
> > > exciting.  But, they will still require both networking core and
> > > driver changes in the kernel.  And as I understand things today,
> > > the userland code will still need at least some knowledge of specific
> > > devices and how they layout their packet descriptors, etc.  So while
> > > those changes sound very promising, they will still have certain
> > > drawbacks in common with the current situation.
> > 
> > Yes, we would like the DPDK performance optimization techniques, such as huge pages, efficient rx/tx routines that manipulate device-specific
> > packet descriptors, and the polling model, to remain usable. We have to trade off performance against commonality. But we believe it will be much easier
> > to develop DPDK PMDs for non-Intel NICs than to port entire kernel drivers to DPDK.
> > 
> 
> Not sure how this relates.  What you're describing is the feature Intel has been
> working on to augment kernel drivers to provide better throughput via direct
> hardware access to user space.  John's PMD provides ubiquitous function on all
> hardware. I'm not sure how the desire for one implies the other isn't valuable?
> 
> > > It seems like the changes you mention will still need some sort of
> > > AF_PACKET-based PMD driver.  Have you implemented that completely
> > > separate from the code I already posted?  Or did you add that work
> > > on top of mine?
> > > 
> > 
> > For userland code, it certainly uses some of your code related to raw sockets, but highly modified. A layer will be added to the eth_dev library to do device
> > probing and support new socket options.
> > 
> 
> Ok, but again, PMDs are independent, and serve different needs.  If their use
> is at all overlapping from a functional standpoint, take this one now, and
> deprecate it when a better one comes along.  Though from your description it
> seems like both have a valid place in the ecosystem.

That's where I'm at as well -- I don't see anything in the above that
amounts to an argument against the AF_PACKET-based PMD I have posted.
"Wait for ours" doesn't hold much water, especially when we are trying
to address different problems.

John
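
For reference, the fanout feature mentioned above is joined per socket with
setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)), where the low
16 bits of the integer argument carry the fanout group ID and the high 16
bits carry the fanout mode and flags (see packet(7)). The sketch below only
shows how such an argument is packed. make_fanout_arg() is a hypothetical
helper, and the way the patch folds if_index into the group ID is not shown
in this excerpt, so the group ID is left as a plain parameter.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pack a PACKET_FANOUT option value.
 *   group_id       - fanout group ID shared by all sockets of one port
 *   type_and_flags - fanout mode (e.g. PACKET_FANOUT_HASH from
 *                    <linux/if_packet.h>) OR'ed with any fanout flags
 */
static uint32_t
make_fanout_arg(uint16_t group_id, uint16_t type_and_flags)
{
	/* low half: group ID; high half: mode and flags */
	return (uint32_t)group_id | ((uint32_t)type_and_flags << 16);
}
```

Every queue socket of a port would pass the same value, so the kernel spreads
incoming flows across the queues belonging to that group.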

> 
> Neil
> 
> > > John
> > > 
> > > > > -----Original Message-----
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > > > > Sent: Saturday, September 13, 2014 2:05 AM
> > > > > To: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > > >
> > > > > Ping?  Are there objections to this patch from mid-July?
> > > > >
> > > > > John
> > > > >
> > > > > On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > > > > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > > > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > > > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > > > > AF_PACKET is used for frame reception.  In the current implementation,
> > > > > > Tx and Rx queues are always paired, and therefore are always equal
> > > > > > in number -- changing this would be a Simple Matter Of Programming.
> > > > > >
> > > > > > Interfaces of this type are created with a command line option like
> > > > > > > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > > > > > as arguments:
> > > > > >
> > > > > >  - Interface is chosen by "iface" (required)
> > > > > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > > > > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > > > > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > > > > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > > > > >
> > > > > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > > > > ---
> > > > > > This PMD is intended to provide a means for using DPDK on a broad
> > > > > > range of hardware without hardware-specific PMDs and (hopefully)
> > > > > > with better performance than what PCAP offers in Linux.  This might
> > > > > > be useful as a development platform for DPDK applications when
> > > > > > DPDK-supported hardware is expensive or unavailable.
> > > > > >
> > > > > > New in v2:
> > > > > >
> > > > > > > -- fix up some style issues found by checkpatch
> > > > > > -- use if_index as part of fanout group ID
> > > > > > -- set default number of queue pairs to 1
> > > > > >
> > > > > >  config/common_bsdapp                   |   5 +
> > > > > >  config/common_linuxapp                 |   5 +
> > > > > >  lib/Makefile                           |   1 +
> > > > > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > > > > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > > > > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > > > > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > > > > >  mk/rte.app.mk                          |   4 +
> > > > > >  8 files changed, 957 insertions(+)
> > > > > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > > > > >
> > > > > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > > > > index 943dce8f1ede..c317f031278e 100644
> > > > > > --- a/config/common_bsdapp
> > > > > > +++ b/config/common_bsdapp
> > > > > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > > >
> > > > > >  #
> > > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > > +#
> > > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > > > > +
> > > > > > +#
> > > > > >  # Do prefetch of packet data within PMD driver receive function
> > > > > >  #
> > > > > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > > > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > > > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > > > > --- a/config/common_linuxapp
> > > > > > +++ b/config/common_linuxapp
> > > > > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > > >
> > > > > >  #
> > > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > > +#
> > > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > > > > +
> > > > > > +#
> > > > > >  # Compile Xen PMD
> > > > > >  #
> > > > > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > > > index 10c5bb3045bc..930fadf29898 100644
> > > > > > --- a/lib/Makefile
> > > > > > +++ b/lib/Makefile
> > > > > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > > > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > > > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > index 756d6b0c9301..feed24a63272 100644
> > > > > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > > > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > > > > >  CFLAGS += $(WERROR_FLAGS) -O3
> > > > > >
> > > > > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > > > > new file mode 100644
> > > > > > index 000000000000..e1266fb992cd
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/Makefile
> > > > > > @@ -0,0 +1,60 @@
> > > > > > +#   BSD LICENSE
> > > > > > +#
> > > > > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > > > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > +#   Copyright(c) 2014 6WIND S.A.
> > > > > > +#   All rights reserved.
> > > > > > +#
> > > > > > +#   Redistribution and use in source and binary forms, with or without
> > > > > > +#   modification, are permitted provided that the following conditions
> > > > > > +#   are met:
> > > > > > +#
> > > > > > +#     * Redistributions of source code must retain the above copyright
> > > > > > +#       notice, this list of conditions and the following disclaimer.
> > > > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > > > +#       notice, this list of conditions and the following disclaimer in
> > > > > > +#       the documentation and/or other materials provided with the
> > > > > > +#       distribution.
> > > > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > > > +#       contributors may be used to endorse or promote products derived
> > > > > > +#       from this software without specific prior written permission.
> > > > > > +#
> > > > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > +
> > > > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > > > +
> > > > > > +#
> > > > > > +# library name
> > > > > > +#
> > > > > > +LIB = librte_pmd_packet.a
> > > > > > +
> > > > > > +CFLAGS += -O3
> > > > > > +CFLAGS += $(WERROR_FLAGS)
> > > > > > +
> > > > > > +#
> > > > > > +# all source are stored in SRCS-y
> > > > > > +#
> > > > > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > > > > +
> > > > > > +#
> > > > > > +# Export include files
> > > > > > +#
> > > > > > +SYMLINK-y-include += rte_eth_packet.h
> > > > > > +
> > > > > > +# this lib depends upon:
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > > > > +
> > > > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > > new file mode 100644
> > > > > > index 000000000000..9c82d16e730f
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > > @@ -0,0 +1,826 @@
> > > > > > +/*-
> > > > > > + *   BSD LICENSE
> > > > > > + *
> > > > > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > > > > + *
> > > > > > + *   Originally based upon librte_pmd_pcap code:
> > > > > > + *
> > > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > + *   Copyright(c) 2014 6WIND S.A.
> > > > > > + *   All rights reserved.
> > > > > > + *
> > > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > > + *   modification, are permitted provided that the following conditions
> > > > > > + *   are met:
> > > > > > + *
> > > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > > + *       the documentation and/or other materials provided with the
> > > > > > + *       distribution.
> > > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > > + *       contributors may be used to endorse or promote products derived
> > > > > > + *       from this software without specific prior written permission.
> > > > > > + *
> > > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > + */
> > > > > > +
> > > > > > +#include <rte_mbuf.h>
> > > > > > +#include <rte_ethdev.h>
> > > > > > +#include <rte_malloc.h>
> > > > > > +#include <rte_kvargs.h>
> > > > > > +#include <rte_dev.h>
> > > > > > +
> > > > > > +#include <linux/if_ether.h>
> > > > > > +#include <linux/if_packet.h>
> > > > > > +#include <arpa/inet.h>
> > > > > > +#include <net/if.h>
> > > > > > +#include <sys/types.h>
> > > > > > +#include <sys/socket.h>
> > > > > > +#include <sys/ioctl.h>
> > > > > > +#include <sys/mman.h>
> > > > > > +#include <unistd.h>
> > > > > > +#include <poll.h>
> > > > > > +
> > > > > > +#include "rte_eth_packet.h"
> > > > > > +
> > > > > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > > > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > > > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > > > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > > > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > > > > +
> > > > > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > > > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > > > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > > > > +
> > > > > > +struct pkt_rx_queue {
> > > > > > +	int sockfd;
> > > > > > +
> > > > > > +	struct iovec *rd;
> > > > > > +	uint8_t *map;
> > > > > > +	unsigned int framecount;
> > > > > > +	unsigned int framenum;
> > > > > > +
> > > > > > +	struct rte_mempool *mb_pool;
> > > > > > +
> > > > > > +	volatile unsigned long rx_pkts;
> > > > > > +	volatile unsigned long err_pkts;
> > > > > > +};
> > > > > > +
> > > > > > +struct pkt_tx_queue {
> > > > > > +	int sockfd;
> > > > > > +
> > > > > > +	struct iovec *rd;
> > > > > > +	uint8_t *map;
> > > > > > +	unsigned int framecount;
> > > > > > +	unsigned int framenum;
> > > > > > +
> > > > > > +	volatile unsigned long tx_pkts;
> > > > > > +	volatile unsigned long err_pkts;
> > > > > > +};
> > > > > > +
> > > > > > +struct pmd_internals {
> > > > > > +	unsigned nb_queues;
> > > > > > +
> > > > > > +	int if_index;
> > > > > > +	struct ether_addr eth_addr;
> > > > > > +
> > > > > > +	struct tpacket_req req;
> > > > > > +
> > > > > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > > +};
> > > > > > +
> > > > > > +static const char *valid_arguments[] = {
> > > > > > +	ETH_PACKET_IFACE_ARG,
> > > > > > +	ETH_PACKET_NUM_Q_ARG,
> > > > > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > > > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > > > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > > > > +	NULL
> > > > > > +};
> > > > > > +
> > > > > > +static const char *drivername = "AF_PACKET PMD";
> > > > > > +
> > > > > > +static struct rte_eth_link pmd_link = {
> > > > > > +	.link_speed = 10000,
> > > > > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > > > > +	.link_status = 0
> > > > > > +};
> > > > > > +
> > > > > > +static uint16_t
> > > > > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	struct tpacket2_hdr *ppd;
> > > > > > +	struct rte_mbuf *mbuf;
> > > > > > +	uint8_t *pbuf;
> > > > > > +	struct pkt_rx_queue *pkt_q = queue;
> > > > > > +	uint16_t num_rx = 0;
> > > > > > +	unsigned int framecount, framenum;
> > > > > > +
> > > > > > +	if (unlikely(nb_pkts == 0))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > > > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > > > > +	 */
> > > > > > +	framecount = pkt_q->framecount;
> > > > > > +	framenum = pkt_q->framenum;
> > > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > > +		/* point at the next incoming frame */
> > > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > > > > +			break;
> > > > > > +
> > > > > > +		/* allocate the next mbuf */
> > > > > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > > > > +		if (unlikely(mbuf == NULL))
> > > > > > +			break;
> > > > > > +
> > > > > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > > > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > > > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > > > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > > > > +
> > > > > > +		/* release incoming frame and advance ring buffer */
> > > > > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > > > > +		if (++framenum >= framecount)
> > > > > > +			framenum = 0;
> > > > > > +
> > > > > > +		/* account for the receive frame */
> > > > > > +		bufs[i] = mbuf;
> > > > > > +		num_rx++;
> > > > > > +	}
> > > > > > +	pkt_q->framenum = framenum;
> > > > > > +	pkt_q->rx_pkts += num_rx;
> > > > > > +	return num_rx;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * Callback to handle sending packets through a real NIC.
> > > > > > + */
> > > > > > +static uint16_t
> > > > > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > > +{
> > > > > > +	struct tpacket2_hdr *ppd;
> > > > > > +	struct rte_mbuf *mbuf;
> > > > > > +	uint8_t *pbuf;
> > > > > > +	unsigned int framecount, framenum;
> > > > > > +	struct pollfd pfd;
> > > > > > +	struct pkt_tx_queue *pkt_q = queue;
> > > > > > +	uint16_t num_tx = 0;
> > > > > > +	int i;
> > > > > > +
> > > > > > +	if (unlikely(nb_pkts == 0))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	memset(&pfd, 0, sizeof(pfd));
> > > > > > +	pfd.fd = pkt_q->sockfd;
> > > > > > +	pfd.events = POLLOUT;
> > > > > > +	pfd.revents = 0;
> > > > > > +
> > > > > > +	framecount = pkt_q->framecount;
> > > > > > +	framenum = pkt_q->framenum;
> > > > > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > > +		/* point at the next incoming frame */
> > > > > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > > > > +		    (poll(&pfd, 1, -1) < 0))
> > > > > > +				continue;
> > > > > > +
> > > > > > +		/* copy the tx frame data */
> > > > > > +		mbuf = bufs[num_tx];
> > > > > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > > > > +			sizeof(struct sockaddr_ll);
> > > > > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > > > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > > > > +
> > > > > > +		/* release incoming frame and advance ring buffer */
> > > > > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > > > > +		if (++framenum >= framecount)
> > > > > > +			framenum = 0;
> > > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +
> > > > > > +		num_tx++;
> > > > > > +		rte_pktmbuf_free(mbuf);
> > > > > > +	}
> > > > > > +
> > > > > > +	/* kick-off transmits */
> > > > > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > > > > +
> > > > > > +	pkt_q->framenum = framenum;
> > > > > > +	pkt_q->tx_pkts += num_tx;
> > > > > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > > > > +	return num_tx;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_dev_start(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	dev->data->dev_link.link_status = 1;
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * This function gets called when the current port gets stopped.
> > > > > > + */
> > > > > > +static void
> > > > > > +eth_dev_stop(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	int sockfd;
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > > > > +		sockfd = internals->rx_queue[i].sockfd;
> > > > > > +		if (sockfd != -1)
> > > > > > +			close(sockfd);
> > > > > > +		sockfd = internals->tx_queue[i].sockfd;
> > > > > > +		if (sockfd != -1)
> > > > > > +			close(sockfd);
> > > > > > +	}
> > > > > > +
> > > > > > +	dev->data->dev_link.link_status = 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > > > > > +{
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	dev_info->driver_name = drivername;
> > > > > > +	dev_info->if_index = internals->if_index;
> > > > > > +	dev_info->max_mac_addrs = 1;
> > > > > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > > > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > > > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > > > > +	dev_info->min_rx_bufsize = 0;
> > > > > > +	dev_info->pci_dev = NULL;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > > > > > +{
> > > > > > +	unsigned i, imax;
> > > > > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > > > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > > > > +
> > > > > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > > > > +
> > > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > > +	for (i = 0; i < imax; i++) {
> > > > > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > > > > +		rx_total += igb_stats->q_ipackets[i];
> > > > > > +	}
> > > > > > +
> > > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > > +	for (i = 0; i < imax; i++) {
> > > > > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > > > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > > > > +		tx_total += igb_stats->q_opackets[i];
> > > > > > +		tx_err_total += igb_stats->q_errors[i];
> > > > > > +	}
> > > > > > +
> > > > > > +	igb_stats->ipackets = rx_total;
> > > > > > +	igb_stats->opackets = tx_total;
> > > > > > +	igb_stats->oerrors = tx_err_total;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_stats_reset(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > > > > +
> > > > > > +	for (i = 0; i < internal->nb_queues; i++)
> > > > > > +		internal->rx_queue[i].rx_pkts = 0;
> > > > > > +
> > > > > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > > > > +		internal->tx_queue[i].tx_pkts = 0;
> > > > > > +		internal->tx_queue[i].err_pkts = 0;
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > > > > > +{
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_queue_release(void *q __rte_unused)
> > > > > > +{
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > > > > +                int wait_to_complete __rte_unused)
> > > > > > +{
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > > > > +                   uint16_t rx_queue_id,
> > > > > > +                   uint16_t nb_rx_desc __rte_unused,
> > > > > > +                   unsigned int socket_id __rte_unused,
> > > > > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > > > > +                   struct rte_mempool *mb_pool)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > > > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > > > > +	uint16_t buf_size;
> > > > > > +
> > > > > > +	pkt_q->mb_pool = mb_pool;
> > > > > > +
> > > > > > +	/* Now get the space available for data in the mbuf */
> > > > > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > > > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > > > > +	                       RTE_PKTMBUF_HEADROOM);
> > > > > > +
> > > > > > +	if (ETH_FRAME_LEN > buf_size) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > > > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > > > > +		return -ENOMEM;
> > > > > > +	}
> > > > > > +
> > > > > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > > > > +                   uint16_t tx_queue_id,
> > > > > > +                   uint16_t nb_tx_desc __rte_unused,
> > > > > > +                   unsigned int socket_id __rte_unused,
> > > > > > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > > > > > +{
> > > > > > +
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static struct eth_dev_ops ops = {
> > > > > > +	.dev_start = eth_dev_start,
> > > > > > +	.dev_stop = eth_dev_stop,
> > > > > > +	.dev_close = eth_dev_close,
> > > > > > +	.dev_configure = eth_dev_configure,
> > > > > > +	.dev_infos_get = eth_dev_info,
> > > > > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > > > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > > > > +	.rx_queue_release = eth_queue_release,
> > > > > > +	.tx_queue_release = eth_queue_release,
> > > > > > +	.link_update = eth_link_update,
> > > > > > +	.stats_get = eth_stats_get,
> > > > > > +	.stats_reset = eth_stats_reset,
> > > > > > +};
> > > > > > +
> > > > > > +/*
> > > > > > + * Opens an AF_PACKET socket
> > > > > > + */
> > > > > > +static int
> > > > > > +open_packet_iface(const char *key __rte_unused,
> > > > > > +                  const char *value __rte_unused,
> > > > > > +                  void *extra_args)
> > > > > > +{
> > > > > > +	int *sockfd = extra_args;
> > > > > > +
> > > > > > +	/* Open an AF_PACKET socket... */
> > > > > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > > +	if (*sockfd == -1) {
> > > > > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +rte_pmd_init_internals(const char *name,
> > > > > > +                       const int sockfd,
> > > > > > +                       const unsigned nb_queues,
> > > > > > +                       unsigned int blocksize,
> > > > > > +                       unsigned int blockcnt,
> > > > > > +                       unsigned int framesize,
> > > > > > +                       unsigned int framecnt,
> > > > > > +                       const unsigned numa_node,
> > > > > > +                       struct pmd_internals **internals,
> > > > > > +                       struct rte_eth_dev **eth_dev,
> > > > > > +                       struct rte_kvargs *kvlist)
> > > > > > +{
> > > > > > +	struct rte_eth_dev_data *data = NULL;
> > > > > > +	struct rte_pci_device *pci_dev = NULL;
> > > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > > +	struct ifreq ifr;
> > > > > > +	size_t ifnamelen;
> > > > > > +	unsigned k_idx;
> > > > > > +	struct sockaddr_ll sockaddr;
> > > > > > +	struct tpacket_req *req;
> > > > > > +	struct pkt_rx_queue *rx_queue;
> > > > > > +	struct pkt_tx_queue *tx_queue;
> > > > > > +	int rc, tpver, discard, bypass;
> > > > > > +	unsigned int i, q, rdsize;
> > > > > > +	int qsockfd, fanout_arg;
> > > > > > +
> > > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > > > > +			break;
> > > > > > +	}
> > > > > > +	if (pair == NULL) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD,
> > > > > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > > > > +		name, numa_node);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > > > > +	 * and internal (private) data
> > > > > > +	 */
> > > > > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > > > > +	if (data == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > > > > +	if (pci_dev == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > > > > +	                                0, numa_node);
> > > > > > +	if (*internals == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	req = &((*internals)->req);
> > > > > > +
> > > > > > +	req->tp_block_size = blocksize;
> > > > > > +	req->tp_block_nr = blockcnt;
> > > > > > +	req->tp_frame_size = framesize;
> > > > > > +	req->tp_frame_nr = framecnt;
> > > > > > +
> > > > > > +	ifnamelen = strlen(pair->value);
> > > > > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > > > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > > > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > > > > +	} else {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: I/F name too long (%s)\n",
> > > > > > +			name, pair->value);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > > > > +
> > > > > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > > > > +
> > > > > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > > > > +	sockaddr.sll_family = AF_PACKET;
> > > > > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > > > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > > > > +
> > > > > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > > > > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > > > > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > > > > +
> > > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > > +		/* Open an AF_PACKET socket for this queue... */
> > > > > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > > +		if (qsockfd == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +			        "%s: could not open AF_PACKET socket\n",
> > > > > > +			        name);
> > > > > > +			return -1;
> > > > > > +		}
> > > > > > +
> > > > > > +		tpver = TPACKET_V2;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > > > > +				&tpver, sizeof(tpver));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		discard = 1;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > > > > +				&discard, sizeof(discard));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_LOSS on "
> > > > > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		bypass = 1;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > > > > +				&bypass, sizeof(bypass));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > > > > +			        "on AF_PACKET socket for %s\n", name,
> > > > > > +			        pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > > > > +		rx_queue->framecount = req->tp_frame_nr;
> > > > > > +
> > > > > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > > > > > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > > > > > +				    qsockfd, 0);
> > > > > > +		if (rx_queue->map == MAP_FAILED) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > > > > +				name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		/* rdsize is same for both Tx and Rx */
> > > > > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > > > > +
> > > > > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > > > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > > +		}
> > > > > > +		rx_queue->sockfd = qsockfd;
> > > > > > +
> > > > > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > > > > +		tx_queue->framecount = req->tp_frame_nr;
> > > > > > +
> > > > > > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > > > > > +
> > > > > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > > > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > > +		}
> > > > > > +		tx_queue->sockfd = qsockfd;
> > > > > > +
> > > > > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > > > > +			        name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > > > > +				&fanout_arg, sizeof(fanout_arg));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > > > > > +				"for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	/* reserve an ethdev entry */
> > > > > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > > > > +	if (*eth_dev == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * now put it all together
> > > > > > +	 * - store queue data in internals,
> > > > > > +	 * - store numa_node info in pci_driver
> > > > > > +	 * - point eth_dev_data to internals and pci_driver
> > > > > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > > > > +	 */
> > > > > > +
> > > > > > +	(*internals)->nb_queues = nb_queues;
> > > > > > +
> > > > > > +	data->dev_private = *internals;
> > > > > > +	data->port_id = (*eth_dev)->data->port_id;
> > > > > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > > > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > > > > +	data->dev_link = pmd_link;
> > > > > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > > > > +
> > > > > > +	pci_dev->numa_node = numa_node;
> > > > > > +
> > > > > > +	(*eth_dev)->data = data;
> > > > > > +	(*eth_dev)->dev_ops = &ops;
> > > > > > +	(*eth_dev)->pci_dev = pci_dev;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +
> > > > > > +error:
> > > > > > +	if (data)
> > > > > > +		rte_free(data);
> > > > > > +	if (pci_dev)
> > > > > > +		rte_free(pci_dev);
> > > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > > +		if ((*internals)->rx_queue[q].rd)
> > > > > > +			rte_free((*internals)->rx_queue[q].rd);
> > > > > > +		if ((*internals)->tx_queue[q].rd)
> > > > > > +			rte_free((*internals)->tx_queue[q].rd);
> > > > > > +	}
> > > > > > +	if (*internals)
> > > > > > +		rte_free(*internals);
> > > > > > +	return -1;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +rte_eth_from_packet(const char *name,
> > > > > > +                    int const *sockfd,
> > > > > > +                    const unsigned numa_node,
> > > > > > +                    struct rte_kvargs *kvlist)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = NULL;
> > > > > > +	struct rte_eth_dev *eth_dev = NULL;
> > > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > > +	unsigned k_idx;
> > > > > > +	unsigned int blockcount;
> > > > > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > > > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > > > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > > > > +	unsigned int qpairs = 1;
> > > > > > +
> > > > > > +	/* do some parameter checking */
> > > > > > +	if (*sockfd < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Walk arguments for configurable settings
> > > > > > +	 */
> > > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > > > > +			qpairs = atoi(pair->value);
> > > > > > +			if (qpairs < 1 ||
> > > > > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid qpairs value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > > > > +			blocksize = atoi(pair->value);
> > > > > > +			if (!blocksize) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid blocksize value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > > > > +			framesize = atoi(pair->value);
> > > > > > +			if (!framesize) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid framesize value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > > > > +			framecount = atoi(pair->value);
> > > > > > +			if (!framecount) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid framecount value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	if (framesize > blocksize) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > > > > +		        name);
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	blockcount = framecount / (blocksize / framesize);
> > > > > > +	if (!blockcount) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > > > > +
> > > > > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > > > > +	                           blocksize, blockcount,
> > > > > > +	                           framesize, framecount,
> > > > > > +	                           numa_node, &internals, &eth_dev,
> > > > > > +	                           kvlist) < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > > > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +int
> > > > > > +rte_pmd_packet_devinit(const char *name, const char *params)
> > > > > > +{
> > > > > > +	unsigned numa_node;
> > > > > > +	int ret;
> > > > > > +	struct rte_kvargs *kvlist;
> > > > > > +	int sockfd = -1;
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > > > > +
> > > > > > +	numa_node = rte_socket_id();
> > > > > > +
> > > > > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > > > > +	if (kvlist == NULL)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * If iface argument is passed we open the NICs and use them for
> > > > > > +	 * reading / writing
> > > > > > +	 */
> > > > > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > > > > +
> > > > > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > > > > +		                         &open_packet_iface, &sockfd);
> > > > > > +		if (ret < 0)
> > > > > > +			return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > > > > +	close(sockfd); /* no longer needed */
> > > > > > +
> > > > > > +	if (ret < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static struct rte_driver pmd_packet_drv = {
> > > > > > +	.name = "eth_packet",
> > > > > > +	.type = PMD_VDEV,
> > > > > > +	.init = rte_pmd_packet_devinit,
> > > > > > +};
> > > > > > +
> > > > > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..f685611da3e9
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > > @@ -0,0 +1,55 @@
> > > > > > +/*-
> > > > > > + *   BSD LICENSE
> > > > > > + *
> > > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > + *   All rights reserved.
> > > > > > + *
> > > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > > + *   modification, are permitted provided that the following conditions
> > > > > > + *   are met:
> > > > > > + *
> > > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > > + *       the documentation and/or other materials provided with the
> > > > > > + *       distribution.
> > > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > > + *       contributors may be used to endorse or promote products derived
> > > > > > + *       from this software without specific prior written permission.
> > > > > > + *
> > > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef _RTE_ETH_PACKET_H_
> > > > > > +#define _RTE_ETH_PACKET_H_
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +extern "C" {
> > > > > > +#endif
> > > > > > +
> > > > > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > > > > +
> > > > > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > > > > +
> > > > > > +/**
> > > > > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > > > > + * configured on command line.
> > > > > > + */
> > > > > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +}
> > > > > > +#endif
> > > > > > +
> > > > > > +#endif
> > > > > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > > > > index 34dff2a02a05..a6994c4dbe93 100644
> > > > > > --- a/mk/rte.app.mk
> > > > > > +++ b/mk/rte.app.mk
> > > > > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > > > > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > > > > >  endif
> > > > > >
> > > > > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > > > > +LDLIBS += -lrte_pmd_packet
> > > > > > +endif
> > > > > > +
> > > > > >  endif # plugins
> > > > > >
> > > > > >  LDLIBS += $(EXECENV_LDLIBS)
> > > > > > --
> > > > > > 1.9.3
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > John W. Linville		Someday the world will need a hero, and you
> > > > > linville@tuxdriver.com			might be all we have.  Be ready.
> > > >
> > > 
> > > --
> > > John W. Linville		Someday the world will need a hero, and you
> > > linville@tuxdriver.com			might be all we have.  Be ready.
> > 
> 

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-15 15:09           ` Neil Horman
  2014-09-15 15:15             ` John W. Linville
@ 2014-09-15 15:43             ` Zhou, Danny
  2014-09-15 16:22               ` Neil Horman
  1 sibling, 1 reply; 76+ messages in thread
From: Zhou, Danny @ 2014-09-15 15:43 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev


> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Monday, September 15, 2014 11:10 PM
> To: Zhou, Danny
> Cc: John W. Linville; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> 
> On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > > -----Original Message-----
> > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > Sent: Saturday, September 13, 2014 2:54 AM
> > > To: Zhou, Danny
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > >
> > > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > > > I am concerned about its performance, given the number of
> > > > > memcpy() calls. Specifically, on the Rx side, the kernel NIC driver needs
> > > > > to copy packets to an skb, then af_packet copies them to the AF_PACKET
> > > > > buffer, which is mapped to user space, and then those packets are copied
> > > > > to DPDK mbufs. In addition, 3 copies are needed on the Tx side. So to run a
> > > > > simple DPDK L2/L3 forwarding benchmark, each packet needs 6 packet
> > > > > copies, which has a significant negative performance impact. We
> > > > > had a bifurcated driver prototype that can do zero-copy and achieve
> > > > > native DPDK performance, but it depends on base driver and AF_PACKET
> > > > > code changes in the kernel. John R will be presenting it at the coming
> > > > > Linux Plumbers Conference. Once the kernel adopts it, the relevant PMD
> > > > > will be submitted to dpdk.org.
> > >
> > > Admittedly, this is not as good a performer as most of the existing
> > > > PMDs.  It serves a different purpose, after all.  FWIW, you did
> > > previously indicate that it performed better than the pcap-based PMD.
> >
> > Yes, slightly higher but makes no big difference.
> >
> Do you have numbers for this?  It seems to me faster is faster as long as it's
> statistically significant.  Even if it's not, John's AF_PACKET pmd has the ability
> to scale to multiple cpus more easily than the pcap pmd, as it can make use of
> the AF_PACKET fanout feature.

For 64B small packets, 1.35M pps with 1 queue. As both the pcap and AF_PACKET PMDs depend on
interrupt-based kernel NIC drivers, none of the DPDK performance optimization techniques are utilized. Why should
DPDK adopt two similar, poorly performing PMDs which cannot demonstrate DPDK's key value, "high performance"?
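
To put the 1.35M pps figure in context, here is a rough sketch of the theoretical packet rate for a given frame size. It assumes a 10 Gb/s link, which this thread does not actually state, and the usual 20 bytes of per-frame Ethernet wire overhead. The helper name line_rate_pps is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Wire overhead per frame: 7-byte preamble + 1-byte SFD + 12-byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/*
 * Hypothetical helper: theoretical maximum frames per second for a link
 * speed in bits/s and a frame size in bytes (frame size includes the FCS).
 */
static uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	assert(frame_bytes > 0);
	return link_bps / ((uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u);
}
```

Under that 10 Gb/s assumption, line_rate_pps(10000000000ULL, 64) comes out at about 14.88M pps, so 1.35M pps would be roughly 9% of line rate for 64B frames.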

> 
> > > I look forward to seeing the changes you mention -- they sound very
> > > exciting.  But, they will still require both networking core and
> > > driver changes in the kernel.  And as I understand things today,
> > > the userland code will still need at least some knowledge of specific
> > > devices and how they layout their packet descriptors, etc.  So while
> > > those changes sound very promising, they will still have certain
> > > drawbacks in common with the current situation.
> >
> > Yes, we would like DPDK performance optimization techniques such as huge pages,
> > efficient rx/tx routines that manipulate device-specific packet descriptors, and
> > the polling model to remain usable. We have to trade off between performance and
> > commonality. But we believe it will be much easier to develop a DPDK PMD for
> > non-Intel NICs than to port entire kernel drivers to DPDK.
> >
> 
> Not sure how this relates, what you're describing is the feature Intel has been
> working on to augment kernel drivers to provide better throughput via direct
> hardware access to user space.  John's PMD provides ubiquitous function on all
> hardware. I'm not sure how the desire for one implies the other isn't valuable?
> 

Performance, rather than commonality, is the key value of DPDK. But we are trying to improve the commonality of our
solution so that other NIC vendors can adopt it easily.

> > > It seems like the changes you mention will still need some sort of
> > > AF_PACKET-based PMD driver.  Have you implemented that completely
> > > separate from the code I already posted?  Or did you add that work
> > > on top of mine?
> > >
> >
> > For userland code, it certainly uses some of your code related to raw sockets, but highly
> > modified. A layer will be added into the eth_dev library to do device probe and support
> > new socket options.
> >
> 
> Ok, but again, PMDs are independent, and serve different needs.  If their use
> is at all overlapping from a functional standpoint, take this one now, and
> deprecate it when a better one comes along.  Though from your description it
> seems like both have a valid place in the ecosystem.
> 

I am ok with this approach, as long as this AF_PACKET PMD does not add extra maintenance effort. Thomas might make the call.

> Neil
> 
> > > John
> > >
> > > > > -----Original Message-----
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > > > > Sent: Saturday, September 13, 2014 2:05 AM
> > > > > To: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > > >
> > > > > Ping?  Are there objections to this patch from mid-July?
> > > > >
> > > > > John
> > > > >
> > > > > On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > > > > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > > > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > > > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > > > > AF_PACKET is used for frame reception.  In the current implementation,
> > > > > > Tx and Rx queues are always paired, and therefore are always equal
> > > > > > in number -- changing this would be a Simple Matter Of Programming.
> > > > > >
> > > > > > Interfaces of this type are created with a command line option like
> > > > > > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > > > > > as arguments:
> > > > > >
> > > > > >  - Interface is chosen by "iface" (required)
> > > > > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > > > > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > > > > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > > > > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > > > > >
> > > > > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > > > > ---
> > > > > > This PMD is intended to provide a means for using DPDK on a broad
> > > > > > range of hardware without hardware-specific PMDs and (hopefully)
> > > > > > with better performance than what PCAP offers in Linux.  This might
> > > > > > be useful as a development platform for DPDK applications when
> > > > > > DPDK-supported hardware is expensive or unavailable.
> > > > > >
> > > > > > New in v2:
> > > > > >
> > > > > > -- fixup some style issues found by checkpatch
> > > > > > -- use if_index as part of fanout group ID
> > > > > > -- set default number of queue pairs to 1
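
For readers unfamiliar with the fanout group ID mentioned in the v2 notes, a minimal sketch of how such an option value could be built follows. Per packet(7), the PACKET_FANOUT option value carries the group ID in the low 16 bits and the mode in the high 16 bits. Mixing the process ID with if_index is only an assumption here, and make_fanout_arg is a made-up name, not something taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: build the integer value for the PACKET_FANOUT
 * socket option. Low 16 bits: group ID. High 16 bits: fanout mode.
 * Deriving the ID from pid ^ if_index is an assumption for illustration.
 */
static int
make_fanout_arg(int pid, int if_index)
{
	uint32_t group_id;

	assert(if_index > 0);
	group_id = ((uint32_t)pid ^ (uint32_t)if_index) & 0xffffu;
	return (int)(group_id | ((uint32_t)PACKET_FANOUT_HASH << 16));
}
```

Every queue socket that should share the load would pass the same value to setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)), and the kernel then spreads flows across the group by hash.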
> > > > > >
> > > > > >  config/common_bsdapp                   |   5 +
> > > > > >  config/common_linuxapp                 |   5 +
> > > > > >  lib/Makefile                           |   1 +
> > > > > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > > > > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > > > > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > > > > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > > > > >  mk/rte.app.mk                          |   4 +
> > > > > >  8 files changed, 957 insertions(+)
> > > > > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > > > > >
> > > > > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > > > > index 943dce8f1ede..c317f031278e 100644
> > > > > > --- a/config/common_bsdapp
> > > > > > +++ b/config/common_bsdapp
> > > > > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > > >
> > > > > >  #
> > > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > > +#
> > > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > > > > +
> > > > > > +#
> > > > > >  # Do prefetch of packet data within PMD driver receive function
> > > > > >  #
> > > > > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > > > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > > > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > > > > --- a/config/common_linuxapp
> > > > > > +++ b/config/common_linuxapp
> > > > > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > > >
> > > > > >  #
> > > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > > +#
> > > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > > > > +
> > > > > > +#
> > > > > >  # Compile Xen PMD
> > > > > >  #
> > > > > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > > > index 10c5bb3045bc..930fadf29898 100644
> > > > > > --- a/lib/Makefile
> > > > > > +++ b/lib/Makefile
> > > > > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > > > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > > > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > index 756d6b0c9301..feed24a63272 100644
> > > > > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > > > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > > > > >  CFLAGS += $(WERROR_FLAGS) -O3
> > > > > >
> > > > > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > > > > new file mode 100644
> > > > > > index 000000000000..e1266fb992cd
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/Makefile
> > > > > > @@ -0,0 +1,60 @@
> > > > > > +#   BSD LICENSE
> > > > > > +#
> > > > > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > > > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > +#   Copyright(c) 2014 6WIND S.A.
> > > > > > +#   All rights reserved.
> > > > > > +#
> > > > > > +#   Redistribution and use in source and binary forms, with or without
> > > > > > +#   modification, are permitted provided that the following conditions
> > > > > > +#   are met:
> > > > > > +#
> > > > > > +#     * Redistributions of source code must retain the above copyright
> > > > > > +#       notice, this list of conditions and the following disclaimer.
> > > > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > > > +#       notice, this list of conditions and the following disclaimer in
> > > > > > +#       the documentation and/or other materials provided with the
> > > > > > +#       distribution.
> > > > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > > > +#       contributors may be used to endorse or promote products derived
> > > > > > +#       from this software without specific prior written permission.
> > > > > > +#
> > > > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > +
> > > > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > > > +
> > > > > > +#
> > > > > > +# library name
> > > > > > +#
> > > > > > +LIB = librte_pmd_packet.a
> > > > > > +
> > > > > > +CFLAGS += -O3
> > > > > > +CFLAGS += $(WERROR_FLAGS)
> > > > > > +
> > > > > > +#
> > > > > > +# all source are stored in SRCS-y
> > > > > > +#
> > > > > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > > > > +
> > > > > > +#
> > > > > > +# Export include files
> > > > > > +#
> > > > > > +SYMLINK-y-include += rte_eth_packet.h
> > > > > > +
> > > > > > +# this lib depends upon:
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > > > > +
> > > > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > > new file mode 100644
> > > > > > index 000000000000..9c82d16e730f
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > > @@ -0,0 +1,826 @@
> > > > > > +/*-
> > > > > > + *   BSD LICENSE
> > > > > > + *
> > > > > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > > > > + *
> > > > > > + *   Originally based upon librte_pmd_pcap code:
> > > > > > + *
> > > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > + *   Copyright(c) 2014 6WIND S.A.
> > > > > > + *   All rights reserved.
> > > > > > + *
> > > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > > + *   modification, are permitted provided that the following conditions
> > > > > > + *   are met:
> > > > > > + *
> > > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > > + *       the documentation and/or other materials provided with the
> > > > > > + *       distribution.
> > > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > > + *       contributors may be used to endorse or promote products derived
> > > > > > + *       from this software without specific prior written permission.
> > > > > > + *
> > > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > + */
> > > > > > +
> > > > > > +#include <rte_mbuf.h>
> > > > > > +#include <rte_ethdev.h>
> > > > > > +#include <rte_malloc.h>
> > > > > > +#include <rte_kvargs.h>
> > > > > > +#include <rte_dev.h>
> > > > > > +
> > > > > > +#include <linux/if_ether.h>
> > > > > > +#include <linux/if_packet.h>
> > > > > > +#include <arpa/inet.h>
> > > > > > +#include <net/if.h>
> > > > > > +#include <sys/types.h>
> > > > > > +#include <sys/socket.h>
> > > > > > +#include <sys/ioctl.h>
> > > > > > +#include <sys/mman.h>
> > > > > > +#include <unistd.h>
> > > > > > +#include <poll.h>
> > > > > > +
> > > > > > +#include "rte_eth_packet.h"
> > > > > > +
> > > > > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > > > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > > > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > > > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > > > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > > > > +
> > > > > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > > > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > > > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > > > > +
> > > > > > +struct pkt_rx_queue {
> > > > > > +	int sockfd;
> > > > > > +
> > > > > > +	struct iovec *rd;
> > > > > > +	uint8_t *map;
> > > > > > +	unsigned int framecount;
> > > > > > +	unsigned int framenum;
> > > > > > +
> > > > > > +	struct rte_mempool *mb_pool;
> > > > > > +
> > > > > > +	volatile unsigned long rx_pkts;
> > > > > > +	volatile unsigned long err_pkts;
> > > > > > +};
> > > > > > +
> > > > > > +struct pkt_tx_queue {
> > > > > > +	int sockfd;
> > > > > > +
> > > > > > +	struct iovec *rd;
> > > > > > +	uint8_t *map;
> > > > > > +	unsigned int framecount;
> > > > > > +	unsigned int framenum;
> > > > > > +
> > > > > > +	volatile unsigned long tx_pkts;
> > > > > > +	volatile unsigned long err_pkts;
> > > > > > +};
> > > > > > +
> > > > > > +struct pmd_internals {
> > > > > > +	unsigned nb_queues;
> > > > > > +
> > > > > > +	int if_index;
> > > > > > +	struct ether_addr eth_addr;
> > > > > > +
> > > > > > +	struct tpacket_req req;
> > > > > > +
> > > > > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > > +};
> > > > > > +
> > > > > > +static const char *valid_arguments[] = {
> > > > > > +	ETH_PACKET_IFACE_ARG,
> > > > > > +	ETH_PACKET_NUM_Q_ARG,
> > > > > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > > > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > > > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > > > > +	NULL
> > > > > > +};
> > > > > > +
> > > > > > +static const char *drivername = "AF_PACKET PMD";
> > > > > > +
> > > > > > +static struct rte_eth_link pmd_link = {
> > > > > > +	.link_speed = 10000,
> > > > > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > > > > +	.link_status = 0
> > > > > > +};
> > > > > > +
> > > > > > +static uint16_t
> > > > > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	struct tpacket2_hdr *ppd;
> > > > > > +	struct rte_mbuf *mbuf;
> > > > > > +	uint8_t *pbuf;
> > > > > > +	struct pkt_rx_queue *pkt_q = queue;
> > > > > > +	uint16_t num_rx = 0;
> > > > > > +	unsigned int framecount, framenum;
> > > > > > +
> > > > > > +	if (unlikely(nb_pkts == 0))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > > > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > > > > +	 */
> > > > > > +	framecount = pkt_q->framecount;
> > > > > > +	framenum = pkt_q->framenum;
> > > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > > +		/* point at the next incoming frame */
> > > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > > > > +			break;
> > > > > > +
> > > > > > +		/* allocate the next mbuf */
> > > > > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > > > > +		if (unlikely(mbuf == NULL))
> > > > > > +			break;
> > > > > > +
> > > > > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > > > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > > > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > > > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > > > > +
> > > > > > +		/* release incoming frame and advance ring buffer */
> > > > > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > > > > +		if (++framenum >= framecount)
> > > > > > +			framenum = 0;
> > > > > > +
> > > > > > +		/* account for the receive frame */
> > > > > > +		bufs[i] = mbuf;
> > > > > > +		num_rx++;
> > > > > > +	}
> > > > > > +	pkt_q->framenum = framenum;
> > > > > > +	pkt_q->rx_pkts += num_rx;
> > > > > > +	return num_rx;
> > > > > > +}
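
The ownership handoff in the Rx loop above can be shown in isolation. The following is a simplified sketch, not the driver code: it uses a plain array of tpacket2_hdr as a stand-in for the mmap'ed ring (real frames are tp_frame_size bytes apart), and the demo_ring names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>

/* Hypothetical stand-in for the mmap'ed Rx ring: one header per frame slot. */
struct demo_ring {
	struct tpacket2_hdr *hdr;
	unsigned int framecount;   /* number of slots in the ring */
	unsigned int framenum;     /* next slot to examine */
};

/*
 * Consume up to max frames that the kernel has handed to user space.
 * Each consumed slot is returned to the kernel, and the index wraps
 * at the end of the ring. Returns the number of frames consumed.
 */
static unsigned int
demo_ring_consume(struct demo_ring *r, unsigned int max)
{
	unsigned int n;

	assert(r != NULL && r->framenum < r->framecount);
	for (n = 0; n < max; n++) {
		struct tpacket2_hdr *ppd = &r->hdr[r->framenum];

		if ((ppd->tp_status & TP_STATUS_USER) == 0)
			break;                     /* kernel still owns this slot */
		/* a real consumer would copy tp_snaplen bytes from tp_mac here */
		ppd->tp_status = TP_STATUS_KERNEL; /* hand the slot back */
		if (++r->framenum >= r->framecount)
			r->framenum = 0;
	}
	return n;
}
```

The point of the sketch is that no system call is needed on the receive path: the status word in each frame header is the only synchronization between kernel and user space.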
> > > > > > +
> > > > > > +/*
> > > > > > + * Callback to handle sending packets through a real NIC.
> > > > > > + */
> > > > > > +static uint16_t
> > > > > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > > +{
> > > > > > +	struct tpacket2_hdr *ppd;
> > > > > > +	struct rte_mbuf *mbuf;
> > > > > > +	uint8_t *pbuf;
> > > > > > +	unsigned int framecount, framenum;
> > > > > > +	struct pollfd pfd;
> > > > > > +	struct pkt_tx_queue *pkt_q = queue;
> > > > > > +	uint16_t num_tx = 0;
> > > > > > +	int i;
> > > > > > +
> > > > > > +	if (unlikely(nb_pkts == 0))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	memset(&pfd, 0, sizeof(pfd));
> > > > > > +	pfd.fd = pkt_q->sockfd;
> > > > > > +	pfd.events = POLLOUT;
> > > > > > +	pfd.revents = 0;
> > > > > > +
> > > > > > +	framecount = pkt_q->framecount;
> > > > > > +	framenum = pkt_q->framenum;
> > > > > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > > +		/* point at the next incoming frame */
> > > > > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > > > > +		    (poll(&pfd, 1, -1) < 0))
> > > > > > +				continue;
> > > > > > +
> > > > > > +		/* copy the tx frame data */
> > > > > > +		mbuf = bufs[num_tx];
> > > > > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > > > > +			sizeof(struct sockaddr_ll);
> > > > > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > > > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > > > > +
> > > > > > +		/* release incoming frame and advance ring buffer */
> > > > > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > > > > +		if (++framenum >= framecount)
> > > > > > +			framenum = 0;
> > > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +
> > > > > > +		num_tx++;
> > > > > > +		rte_pktmbuf_free(mbuf);
> > > > > > +	}
> > > > > > +
> > > > > > +	/* kick-off transmits */
> > > > > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > > > > +
> > > > > > +	pkt_q->framenum = framenum;
> > > > > > +	pkt_q->tx_pkts += num_tx;
> > > > > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > > > > +	return num_tx;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_dev_start(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	dev->data->dev_link.link_status = 1;
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * This function gets called when the current port gets stopped.
> > > > > > + */
> > > > > > +static void
> > > > > > +eth_dev_stop(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	int sockfd;
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > > > > +		sockfd = internals->rx_queue[i].sockfd;
> > > > > > +		if (sockfd != -1)
> > > > > > +			close(sockfd);
> > > > > > +		sockfd = internals->tx_queue[i].sockfd;
> > > > > > +		if (sockfd != -1)
> > > > > > +			close(sockfd);
> > > > > > +	}
> > > > > > +
> > > > > > +	dev->data->dev_link.link_status = 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > > > > > +{
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	dev_info->driver_name = drivername;
> > > > > > +	dev_info->if_index = internals->if_index;
> > > > > > +	dev_info->max_mac_addrs = 1;
> > > > > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > > > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > > > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > > > > +	dev_info->min_rx_bufsize = 0;
> > > > > > +	dev_info->pci_dev = NULL;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > > > > > +{
> > > > > > +	unsigned i, imax;
> > > > > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > > > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > > > > +
> > > > > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > > > > +
> > > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > > +	for (i = 0; i < imax; i++) {
> > > > > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > > > > +		rx_total += igb_stats->q_ipackets[i];
> > > > > > +	}
> > > > > > +
> > > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > > +	for (i = 0; i < imax; i++) {
> > > > > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > > > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > > > > +		tx_total += igb_stats->q_opackets[i];
> > > > > > +		tx_err_total += igb_stats->q_errors[i];
> > > > > > +	}
> > > > > > +
> > > > > > +	igb_stats->ipackets = rx_total;
> > > > > > +	igb_stats->opackets = tx_total;
> > > > > > +	igb_stats->oerrors = tx_err_total;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_stats_reset(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > > > > +
> > > > > > +	for (i = 0; i < internal->nb_queues; i++)
> > > > > > +		internal->rx_queue[i].rx_pkts = 0;
> > > > > > +
> > > > > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > > > > +		internal->tx_queue[i].tx_pkts = 0;
> > > > > > +		internal->tx_queue[i].err_pkts = 0;
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > > > > > +{
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_queue_release(void *q __rte_unused)
> > > > > > +{
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > > > > +                int wait_to_complete __rte_unused)
> > > > > > +{
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > > > > +                   uint16_t rx_queue_id,
> > > > > > +                   uint16_t nb_rx_desc __rte_unused,
> > > > > > +                   unsigned int socket_id __rte_unused,
> > > > > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > > > > +                   struct rte_mempool *mb_pool)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > > > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > > > > +	uint16_t buf_size;
> > > > > > +
> > > > > > +	pkt_q->mb_pool = mb_pool;
> > > > > > +
> > > > > > +	/* Now get the space available for data in the mbuf */
> > > > > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > > > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > > > > +	                       RTE_PKTMBUF_HEADROOM);
> > > > > > +
> > > > > > +	if (ETH_FRAME_LEN > buf_size) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > > > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > > > > +		return -ENOMEM;
> > > > > > +	}
> > > > > > +
> > > > > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > > > > +                   uint16_t tx_queue_id,
> > > > > > +                   uint16_t nb_tx_desc __rte_unused,
> > > > > > +                   unsigned int socket_id __rte_unused,
> > > > > > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > > > > > +{
> > > > > > +
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static struct eth_dev_ops ops = {
> > > > > > +	.dev_start = eth_dev_start,
> > > > > > +	.dev_stop = eth_dev_stop,
> > > > > > +	.dev_close = eth_dev_close,
> > > > > > +	.dev_configure = eth_dev_configure,
> > > > > > +	.dev_infos_get = eth_dev_info,
> > > > > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > > > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > > > > +	.rx_queue_release = eth_queue_release,
> > > > > > +	.tx_queue_release = eth_queue_release,
> > > > > > +	.link_update = eth_link_update,
> > > > > > +	.stats_get = eth_stats_get,
> > > > > > +	.stats_reset = eth_stats_reset,
> > > > > > +};
> > > > > > +
> > > > > > +/*
> > > > > > + * Opens an AF_PACKET socket
> > > > > > + */
> > > > > > +static int
> > > > > > +open_packet_iface(const char *key __rte_unused,
> > > > > > +                  const char *value __rte_unused,
> > > > > > +                  void *extra_args)
> > > > > > +{
> > > > > > +	int *sockfd = extra_args;
> > > > > > +
> > > > > > +	/* Open an AF_PACKET socket... */
> > > > > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > > +	if (*sockfd == -1) {
> > > > > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +rte_pmd_init_internals(const char *name,
> > > > > > +                       const int sockfd,
> > > > > > +                       const unsigned nb_queues,
> > > > > > +                       unsigned int blocksize,
> > > > > > +                       unsigned int blockcnt,
> > > > > > +                       unsigned int framesize,
> > > > > > +                       unsigned int framecnt,
> > > > > > +                       const unsigned numa_node,
> > > > > > +                       struct pmd_internals **internals,
> > > > > > +                       struct rte_eth_dev **eth_dev,
> > > > > > +                       struct rte_kvargs *kvlist)
> > > > > > +{
> > > > > > +	struct rte_eth_dev_data *data = NULL;
> > > > > > +	struct rte_pci_device *pci_dev = NULL;
> > > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > > +	struct ifreq ifr;
> > > > > > +	size_t ifnamelen;
> > > > > > +	unsigned k_idx;
> > > > > > +	struct sockaddr_ll sockaddr;
> > > > > > +	struct tpacket_req *req;
> > > > > > +	struct pkt_rx_queue *rx_queue;
> > > > > > +	struct pkt_tx_queue *tx_queue;
> > > > > > +	int rc, tpver, discard, bypass;
> > > > > > +	unsigned int i, q, rdsize;
> > > > > > +	int qsockfd, fanout_arg;
> > > > > > +
> > > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > > > > +			break;
> > > > > > +	}
> > > > > > +	if (pair == NULL) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD,
> > > > > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > > > > +		name, numa_node);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > > > > +	 * and internal (private) data
> > > > > > +	 */
> > > > > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > > > > +	if (data == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > > > > +	if (pci_dev == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > > > > +	                                0, numa_node);
> > > > > > +	if (*internals == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	req = &((*internals)->req);
> > > > > > +
> > > > > > +	req->tp_block_size = blocksize;
> > > > > > +	req->tp_block_nr = blockcnt;
> > > > > > +	req->tp_frame_size = framesize;
> > > > > > +	req->tp_frame_nr = framecnt;
> > > > > > +
> > > > > > +	ifnamelen = strlen(pair->value);
> > > > > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > > > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > > > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > > > > +	} else {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: I/F name too long (%s)\n",
> > > > > > +			name, pair->value);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > > > > +
> > > > > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > > > > +
> > > > > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > > > > +	sockaddr.sll_family = AF_PACKET;
> > > > > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > > > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > > > > +
> > > > > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > > > > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > > > > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > > > > +
> > > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > > +		/* Open an AF_PACKET socket for this queue... */
> > > > > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > > +		if (qsockfd == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +			        "%s: could not open AF_PACKET socket\n",
> > > > > > +			        name);
> > > > > > +			return -1;
> > > > > > +		}
> > > > > > +
> > > > > > +		tpver = TPACKET_V2;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > > > > +				&tpver, sizeof(tpver));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		discard = 1;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > > > > +				&discard, sizeof(discard));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_LOSS on "
> > > > > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		bypass = 1;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > > > > +				&bypass, sizeof(bypass));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > > > > +			        "on AF_PACKET socket for %s\n", name,
> > > > > > +			        pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > > > > +		rx_queue->framecount = req->tp_frame_nr;
> > > > > > +
> > > > > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > > > > > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > > > > > +				    qsockfd, 0);
> > > > > > +		if (rx_queue->map == MAP_FAILED) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > > > > +				name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		/* rdsize is same for both Tx and Rx */
> > > > > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > > > > +
> > > > > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > > > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > > +		}
> > > > > > +		rx_queue->sockfd = qsockfd;
> > > > > > +
> > > > > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > > > > +		tx_queue->framecount = req->tp_frame_nr;
> > > > > > +
> > > > > > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > > > > > +
> > > > > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > > > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > > +		}
> > > > > > +		tx_queue->sockfd = qsockfd;
> > > > > > +
> > > > > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > > > > +			        name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > > > > +				&fanout_arg, sizeof(fanout_arg));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > > > > > +				"for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	/* reserve an ethdev entry */
> > > > > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > > > > +	if (*eth_dev == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * now put it all together
> > > > > > +	 * - store queue data in internals,
> > > > > > +	 * - store numa_node info in pci_driver
> > > > > > +	 * - point eth_dev_data to internals and pci_driver
> > > > > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > > > > +	 */
> > > > > > +
> > > > > > +	(*internals)->nb_queues = nb_queues;
> > > > > > +
> > > > > > +	data->dev_private = *internals;
> > > > > > +	data->port_id = (*eth_dev)->data->port_id;
> > > > > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > > > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > > > > +	data->dev_link = pmd_link;
> > > > > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > > > > +
> > > > > > +	pci_dev->numa_node = numa_node;
> > > > > > +
> > > > > > +	(*eth_dev)->data = data;
> > > > > > +	(*eth_dev)->dev_ops = &ops;
> > > > > > +	(*eth_dev)->pci_dev = pci_dev;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +
> > > > > > +error:
> > > > > > +	if (data)
> > > > > > +		rte_free(data);
> > > > > > +	if (pci_dev)
> > > > > > +		rte_free(pci_dev);
> > > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > > +		if ((*internals)->rx_queue[q].rd)
> > > > > > +			rte_free((*internals)->rx_queue[q].rd);
> > > > > > +		if ((*internals)->tx_queue[q].rd)
> > > > > > +			rte_free((*internals)->tx_queue[q].rd);
> > > > > > +	}
> > > > > > +	if (*internals)
> > > > > > +		rte_free(*internals);
> > > > > > +	return -1;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +rte_eth_from_packet(const char *name,
> > > > > > +                    int const *sockfd,
> > > > > > +                    const unsigned numa_node,
> > > > > > +                    struct rte_kvargs *kvlist)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = NULL;
> > > > > > +	struct rte_eth_dev *eth_dev = NULL;
> > > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > > +	unsigned k_idx;
> > > > > > +	unsigned int blockcount;
> > > > > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > > > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > > > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > > > > +	unsigned int qpairs = 1;
> > > > > > +
> > > > > > +	/* do some parameter checking */
> > > > > > +	if (*sockfd < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Walk arguments for configurable settings
> > > > > > +	 */
> > > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > > > > +			qpairs = atoi(pair->value);
> > > > > > +			if (qpairs < 1 ||
> > > > > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid qpairs value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > > > > +			blocksize = atoi(pair->value);
> > > > > > +			if (!blocksize) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid blocksize value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > > > > +			framesize = atoi(pair->value);
> > > > > > +			if (!framesize) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid framesize value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > > > > +			framecount = atoi(pair->value);
> > > > > > +			if (!framecount) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid framecount value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	if (framesize > blocksize) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > > > > +		        name);
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	blockcount = framecount / (blocksize / framesize);
> > > > > > +	if (!blockcount) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > > > > +
> > > > > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > > > > +	                           blocksize, blockcount,
> > > > > > +	                           framesize, framecount,
> > > > > > +	                           numa_node, &internals, &eth_dev,
> > > > > > +	                           kvlist) < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > > > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +int
> > > > > > +rte_pmd_packet_devinit(const char *name, const char *params)
> > > > > > +{
> > > > > > +	unsigned numa_node;
> > > > > > +	int ret;
> > > > > > +	struct rte_kvargs *kvlist;
> > > > > > +	int sockfd = -1;
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > > > > +
> > > > > > +	numa_node = rte_socket_id();
> > > > > > +
> > > > > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > > > > +	if (kvlist == NULL)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * If iface argument is passed we open the NICs and use them for
> > > > > > +	 * reading / writing
> > > > > > +	 */
> > > > > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > > > > +
> > > > > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > > > > +		                         &open_packet_iface, &sockfd);
> > > > > > +		if (ret < 0)
> > > > > > +			return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > > > > +	close(sockfd); /* no longer needed */
> > > > > > +
> > > > > > +	if (ret < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static struct rte_driver pmd_packet_drv = {
> > > > > > +	.name = "eth_packet",
> > > > > > +	.type = PMD_VDEV,
> > > > > > +	.init = rte_pmd_packet_devinit,
> > > > > > +};
> > > > > > +
> > > > > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..f685611da3e9
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > > @@ -0,0 +1,55 @@
> > > > > > +/*-
> > > > > > + *   BSD LICENSE
> > > > > > + *
> > > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > + *   All rights reserved.
> > > > > > + *
> > > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > > + *   modification, are permitted provided that the following conditions
> > > > > > + *   are met:
> > > > > > + *
> > > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > > + *       the documentation and/or other materials provided with the
> > > > > > + *       distribution.
> > > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > > + *       contributors may be used to endorse or promote products derived
> > > > > > + *       from this software without specific prior written permission.
> > > > > > + *
> > > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef _RTE_ETH_PACKET_H_
> > > > > > +#define _RTE_ETH_PACKET_H_
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +extern "C" {
> > > > > > +#endif
> > > > > > +
> > > > > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > > > > +
> > > > > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > > > > +
> > > > > > +/**
> > > > > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > > > > + * configured on command line.
> > > > > > + */
> > > > > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +}
> > > > > > +#endif
> > > > > > +
> > > > > > +#endif
> > > > > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > > > > index 34dff2a02a05..a6994c4dbe93 100644
> > > > > > --- a/mk/rte.app.mk
> > > > > > +++ b/mk/rte.app.mk
> > > > > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > > > > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > > > > >  endif
> > > > > >
> > > > > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > > > > +LDLIBS += -lrte_pmd_packet
> > > > > > +endif
> > > > > > +
> > > > > >  endif # plugins
> > > > > >
> > > > > >  LDLIBS += $(EXECENV_LDLIBS)
> > > > > > --
> > > > > > 1.9.3
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > John W. Linville		Someday the world will need a hero, and you
> > > > > linville@tuxdriver.com			might be all we have.  Be ready.
> > > >
> > >
> > > --
> > > John W. Linville		Someday the world will need a hero, and you
> > > linville@tuxdriver.com			might be all we have.  Be ready.
> >

* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-15 15:43             ` Zhou, Danny
@ 2014-09-15 16:22               ` Neil Horman
  2014-09-15 17:48                 ` John W. Linville
  0 siblings, 1 reply; 76+ messages in thread
From: Neil Horman @ 2014-09-15 16:22 UTC (permalink / raw)
  To: Zhou, Danny; +Cc: dev

On Mon, Sep 15, 2014 at 03:43:07PM +0000, Zhou, Danny wrote:
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Monday, September 15, 2014 11:10 PM
> > To: Zhou, Danny
> > Cc: John W. Linville; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > 
> > On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > > > -----Original Message-----
> > > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > > Sent: Saturday, September 13, 2014 2:54 AM
> > > > To: Zhou, Danny
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > >
> > > > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > > > I am concerned about its performance, which is hurt by too many
> > > > > memcpy() calls. Specifically, on the Rx side, the kernel NIC driver
> > > > > needs to copy packets to an skb, then af_packet copies the packets to
> > > > > the AF_PACKET buffer, which is mapped to user space, and then those
> > > > > packets are copied to a DPDK mbuf. In addition, 3 copies are needed on
> > > > > the Tx side. So to run a simple DPDK L2/L3 forwarding benchmark, each
> > > > > packet needs 6 packet copies, which has a significant negative
> > > > > performance impact. We had a bifurcated driver prototype that can do
> > > > > zero-copy and achieve native DPDK performance, but it depends on base
> > > > > driver and AF_PACKET code changes in the kernel. John R will be
> > > > > presenting it at the coming Linux Plumbers Conference. Once the kernel
> > > > > adopts it, the relevant PMD will be submitted to dpdk.org.
> > > >
> > > > Admittedly, this is not as good a performer as most of the existing
> > > > PMDs.  It serves a different purpose, after all.  FWIW, you did
> > > > previously indicate that it performed better than the pcap-based PMD.
> > >
> > > Yes, slightly higher but makes no big difference.
> > >
> > Do you have numbers for this?  It seems to me faster is faster as long as it's
> > statistically significant.  Even if it's not, John's AF_PACKET pmd has the ability
> > to scale to multiple cpus more easily than the pcap pmd, as it can make use of
> > the AF_PACKET fanout feature.
> 
> For 64B small packet, 1.35M pps with 1 queue.
Why did you only test with a single queue?  Multiqueue operation was one of the
big advantages of the AF_PACKET based pmd.  I would expect a single queue setup
to perform in a very similar fashion to the pcap PMD.

> As both pcap and AF_PACKET PMDs depend on interrupt-based NIC kernel
> drivers, none of the DPDK performance optimization techniques are utilized.
> Why should DPDK adopt two similar and poorly performing PMDs which cannot
> demonstrate DPDK's key value, "high performance"?
Several reasons:
* "High performance" isn't always the key need for end users.  Consider the
development phase before hardware is available.

* Better hardware modeling (consider AF_PACKET's multiqueue ability)

* Better scaling (pcap doesn't make use of the fanout features that AF_PACKET
does)

* Space savings: building the AF_PACKET pmd doesn't require the additional
building/storage of the pcap driver.
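
For reference, the fanout mechanism behind the scaling point above can be
sketched roughly as follows.  This is a minimal illustration and not the code
from the patch: the helper names fanout_arg() and join_hash_fanout() are made
up for this example, and it assumes a Linux system whose headers provide
PACKET_FANOUT and the two flag constants.  Every socket that joins the same
group id receives a share of the incoming traffic, and with the hash mode all
packets of one flow land on the same socket.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Build the 32-bit PACKET_FANOUT option value: the low 16 bits carry the
 * fanout group id, the high 16 bits carry the fanout mode OR'ed with flags.
 * (Hypothetical helper, named for this example only.)
 */
static inline uint32_t
fanout_arg(uint16_t group_id, uint16_t mode_and_flags)
{
	return ((uint32_t)mode_and_flags << 16) | group_id;
}

/*
 * Join an already-bound AF_PACKET socket to a hash-based fanout group.
 * Returns 0 on success, -1 (with errno set) on failure.
 * (Hypothetical helper, named for this example only.)
 */
static inline int
join_hash_fanout(int sockfd, uint16_t group_id)
{
	uint32_t arg = fanout_arg(group_id,
	                          PACKET_FANOUT_HASH |
	                          PACKET_FANOUT_FLAG_DEFRAG |
	                          PACKET_FANOUT_FLAG_ROLLOVER);

	return setsockopt(sockfd, SOL_PACKET, PACKET_FANOUT,
	                  &arg, sizeof(arg));
}
```

One socket per queue, each calling join_hash_fanout() with the same group id
after bind(), is the general shape of a multiqueue setup; the group id only
needs to be unique among fanout groups on the system.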


> 
> > 
> > > > I look forward to seeing the changes you mention -- they sound very
> > > > exciting.  But, they will still require both networking core and
> > > > driver changes in the kernel.  And as I understand things today,
> > > > the userland code will still need at least some knowledge of specific
> > > > devices and how they layout their packet descriptors, etc.  So while
> > > > those changes sound very promising, they will still have certain
> > > > drawbacks in common with the current situation.
> > >
> > > Yes, we would like the DPDK performance optimization techniques, such as
> > > huge pages, efficient rx/tx routines to manipulate device-specific packet
> > > descriptors, and the polling model, to remain usable. We have to trade off
> > > between performance and commonality. But we believe it will be much easier
> > > to develop a DPDK PMD for non-Intel NICs than to port entire kernel drivers
> > > to DPDK.
> > >
> > 
> > Not sure how this relates.  What you're describing is the feature Intel has been
> > working on to augment kernel drivers to provide better throughput via direct
> > hardware access to user space.  John's PMD provides ubiquitous function on all
> > hardware. I'm not sure how the desire for one implies the other isn't valuable?
> > 
> 
> Performance is the key value of DPDK, rather than commonality. But we are
> trying to improve the commonality of our solution to make it easy for other
> NIC vendors to adopt.
> 
That's completely irrelevant to the question at hand.  To go with your reasoning,
if performance is the key value of the DPDK, then you should remove all driver
support save for the most performant hardware you have.  By that same token,
you should deprecate the pcap driver in favor of this AF_PACKET driver, because
it has shown performance improvement.

I'm being facetious, of course, but the facts remain: Lack of superior
performance from one PMD to the next does not immediately obviate the need for
one PMD over another, as they quite likely address differing needs.  As you note,
the DPDK seeks performance as a key goal, but it's an open source project, and there
are other needs from other users in play here.  The AF_PACKET pmd provides
superior performance on Linux platforms when hardware independence is required.
It differs from the pcap PMD as it uses features that are only available on the
Linux platform, so it stands to reason we should have both.

> > > > It seems like the changes you mention will still need some sort of
> > > > AF_PACKET-based PMD driver.  Have you implemented that completely
> > > > separate from the code I already posted?  Or did you add that work
> > > > on top of mine?
> > > >
> > >
> > > For userland code, it certainly uses some of your code related to raw sockets,
> > > but highly modified. A layer will be added to the eth_dev library to do device
> > > probe and support new socket options.
> > >
> > 
> > Ok, but again, PMDs are independent, and serve different needs.  If their use
> > is at all overlapping from a functional standpoint, take this one now, and
> > deprecate it when a better one comes along.  Though from your description it
> > seems like both have a valid place in the ecosystem.
> > 
> 
> I am ok with this approach, as long as this AF_PACKET PMD does not add extra
> maintenance effort. Thomas might make the call.
> 
What extra maintainer efforts do you think are required here, that wouldn't be
required for any PMD?  To suggest that a given PMD shouldn't be included because
it would require additional effort to maintain holds it to a higher standard
than the PMDs already included.  I don't recall anyone asking if the i40e or
bonding pmds would require additional effort before being integrated.

Neil

* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-15 16:22               ` Neil Horman
@ 2014-09-15 17:48                 ` John W. Linville
  2014-09-15 19:11                   ` Zhou, Danny
  0 siblings, 1 reply; 76+ messages in thread
From: John W. Linville @ 2014-09-15 17:48 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Mon, Sep 15, 2014 at 12:22:44PM -0400, Neil Horman wrote:
> On Mon, Sep 15, 2014 at 03:43:07PM +0000, Zhou, Danny wrote:
> > 
> > > -----Original Message-----
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Monday, September 15, 2014 11:10 PM
> > > To: Zhou, Danny
> > > Cc: John W. Linville; dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > 
> > > On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > > > > -----Original Message-----
> > > > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > > > Sent: Saturday, September 13, 2014 2:54 AM
> > > > > To: Zhou, Danny
> > > > > Cc: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > > >
> > > > > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > > > > I am concerned about its performance, which is hurt by too many
> > > > > > memcpy() calls. Specifically, on the Rx side, the kernel NIC driver
> > > > > > needs to copy packets to an skb, then af_packet copies the packets to
> > > > > > the AF_PACKET buffer, which is mapped to user space, and then those
> > > > > > packets are copied to a DPDK mbuf. In addition, 3 copies are needed on
> > > > > > the Tx side. So to run a simple DPDK L2/L3 forwarding benchmark, each
> > > > > > packet needs 6 packet copies, which has a significant negative
> > > > > > performance impact. We had a bifurcated driver prototype that can do
> > > > > > zero-copy and achieve native DPDK performance, but it depends on base
> > > > > > driver and AF_PACKET code changes in the kernel. John R will be
> > > > > > presenting it at the coming Linux Plumbers Conference. Once the kernel
> > > > > > adopts it, the relevant PMD will be submitted to dpdk.org.
> > > > >
> > > > > Admittedly, this is not as good a performer as most of the existing
> > > > > PMDs.  It serves a different purpose, after all.  FWIW, you did
> > > > > previously indicate that it performed better than the pcap-based PMD.
> > > >
> > > > Yes, slightly higher but makes no big difference.
> > > >
> > > Do you have numbers for this?  It seems to me faster is faster as long as it's
> > > statistically significant.  Even if it's not, John's AF_PACKET pmd has the ability
> > > to scale to multiple cpus more easily than the pcap pmd, as it can make use of
> > > the AF_PACKET fanout feature.
> > 
> > For 64B small packet, 1.35M pps with 1 queue.
> Why did you only test with a single queue?  Multiqueue operation was one of the
> big advantages of the AF_PACKET based pmd.  I would expect a single queue setup
> to perform in a very similar fashion to the pcap PMD.
> 
> > As both pcap and AF_PACKET PMDs depend on interrupt-based NIC kernel
> > drivers, none of the DPDK performance optimization techniques are utilized.
> > Why should DPDK adopt two similar and poorly performing PMDs which cannot
> > demonstrate DPDK's key value, "high performance"?
> Several reasons:
> * "High performance" isn't always the key need for end users.  Consider the
> development phase before hardware is available.
> 
> * Better hardware modeling (consider AF_PACKET's multiqueue ability)
> 
> * Better scaling (pcap doesn't make use of the fanout features that AF_PACKET
> does)
> 
> * Space savings: building the AF_PACKET pmd doesn't require the additional
> building/storage of the pcap driver.

This would include not requiring a dependency on libpcap, if nothing else.
 
> > 
> > > 
> > > > > I look forward to seeing the changes you mention -- they sound very
> > > > > exciting.  But, they will still require both networking core and
> > > > > driver changes in the kernel.  And as I understand things today,
> > > > > the userland code will still need at least some knowledge of specific
> > > > > devices and how they layout their packet descriptors, etc.  So while
> > > > > those changes sound very promising, they will still have certain
> > > > > drawbacks in common with the current situation.
> > > >
> > > > Yes, we would like the DPDK performance optimization techniques, such as
> > > > huge pages, efficient rx/tx routines to manipulate device-specific packet
> > > > descriptors, and the polling model, to remain usable. We have to trade off
> > > > between performance and commonality. But we believe it will be much easier
> > > > to develop a DPDK PMD for non-Intel NICs than to port entire kernel drivers
> > > > to DPDK.
> > > >
> > > 
> > > Not sure how this relates.  What you're describing is the feature Intel has been
> > > working on to augment kernel drivers to provide better throughput via direct
> > > hardware access to user space.  John's PMD provides ubiquitous function on all
> > > hardware. I'm not sure how the desire for one implies the other isn't valuable?
> > > 
> > 
> > Performance is the key value of DPDK, rather than commonality. But we are
> > trying to improve the commonality of our solution to make it easy for other
> > NIC vendors to adopt.
> > 
> That's completely irrelevant to the question at hand.  To go with your reasoning,
> if performance is the key value of the DPDK, then you should remove all driver
> support save for the most performant hardware you have.  By that same token,
> you should deprecate the pcap driver in favor of this AF_PACKET driver, because
> it has shown performance improvement.
> 
> I'm being facetious, of course, but the facts remain: Lack of superior
> performance from one PMD to the next does not immediately obviate the need for
> one PMD over another, as they quite likely address differing needs.  As you note,
> the DPDK seeks performance as a key goal, but it's an open source project, and there
> are other needs from other users in play here.  The AF_PACKET pmd provides
> superior performance on Linux platforms when hardware independence is required.
> It differs from the pcap PMD as it uses features that are only available on the
> Linux platform, so it stands to reason we should have both.

IMHO, the biggest deficiency in DPDK is the lack of apps.  Let's face
it, no one really cares about running l2fwd except for testing the
drivers.  What people want is applications.  Providing a PMD to use
while developing an app without requiring specific hardware seems like
a win to me.  The pcap PMD addresses some of that, but it is more of
a stop-gap or special purpose thing (like for playing back captures).

> > > > > It seems like the changes you mention will still need some sort of
> > > > > AF_PACKET-based PMD driver.  Have you implemented that completely
> > > > > separate from the code I already posted?  Or did you add that work
> > > > > on top of mine?
> > > > >
> > > >
> > > > For userland code, it certainly uses some of your code related to the raw socket, but highly modified. A layer will be added to the eth_dev
> > > > library to do device probing and support new socket options.
> > > >
> > > 
> > > Ok, but again, PMDs are independent, and serve different needs.  If their use
> > > is at all overlapping from a functional standpoint, take this one now, and
> > > deprecate it when a better one comes along.  Though from your description it
> > > seems like both have a valid place in the ecosystem.
> > > 
> > 
> > I am ok with this approach, as long as this AF_PACKET PMD does not add extra maintenance effort. Thomas might make the call.
> > 
> What extra maintainer efforts do you think are required here, that wouldn't be
> required for any PMD?  To suggest that a given PMD shouldn't be included because
> it would require additional effort to maintain holds it to a higher standard
> than the PMD's already included.  I don't recall anyone asking if the i40e or
> bonding pmds would require additional effort before being integrated.

Right -- how much maintainer effort is put into the pcap driver
these days?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-15 17:48                 ` John W. Linville
@ 2014-09-15 19:11                   ` Zhou, Danny
  0 siblings, 0 replies; 76+ messages in thread
From: Zhou, Danny @ 2014-09-15 19:11 UTC (permalink / raw)
  To: John W. Linville, Neil Horman; +Cc: dev

> -----Original Message-----
> From: John W. Linville [mailto:linville@tuxdriver.com]
> Sent: Tuesday, September 16, 2014 1:48 AM
> To: Neil Horman
> Cc: Zhou, Danny; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> 
> On Mon, Sep 15, 2014 at 12:22:44PM -0400, Neil Horman wrote:
> > On Mon, Sep 15, 2014 at 03:43:07PM +0000, Zhou, Danny wrote:
> > >
> > > > -----Original Message-----
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Monday, September 15, 2014 11:10 PM
> > > > To: Zhou, Danny
> > > > Cc: John W. Linville; dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > >
> > > > On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > > > > > -----Original Message-----
> > > > > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > > > > Sent: Saturday, September 13, 2014 2:54 AM
> > > > > > To: Zhou, Danny
> > > > > > Cc: dev@dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > > > >
> > > > > > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > > > > > I am concerned about the performance cost of too many
> > > > > > > memcpy() calls. Specifically, on the Rx side, the kernel NIC driver needs to copy
> > > > > > > packets to an skb, then af_packet copies the packets to the AF_PACKET buffers
> > > > > > > which are mapped to user space, and then those packets are copied
> > > > > > > to DPDK mbufs. In addition, 3 copies are needed on the Tx side. So to run a
> > > > > > > simple DPDK L2/L3 forwarding benchmark, each packet needs 6 packet
> > > > > > > copies, which has a significant negative performance impact. We
> > > > > > > had a bifurcated driver prototype that can do zero-copy and achieve
> > > > > > > native DPDK performance, but it depends on base driver and AF_PACKET
> > > > > > > code changes in kernel, John R will be presenting it in coming Linux
> > > > > > > Plumbers Conference. Once kernel adopts it, the relevant PMD will be
> > > > > > > submitted to dpdk.org.
> > > > > >
> > > > > > Admittedly, this is not as good a performer as most of the existing
> > > > > > PMDs.  It serves a different purpose, after all.  FWIW, you did
> > > > > > previously indicate that it performed better than the pcap-based PMD.
> > > > >
> > > > > Yes, slightly higher but makes no big difference.
> > > > >
> > > > Do you have numbers for this?  It seems to me faster is faster as long as it's
> > > > statistically significant.  Even if it's not, John's AF_PACKET PMD has the ability
> > > > to scale to multiple CPUs more easily than the pcap PMD, as it can make use of
> > > > the AF_PACKET fanout feature.
> > >
> > > For 64B small packet, 1.35M pps with 1 queue.
> > Why did you only test with a single queue?  Multiqueue operation was one of the
> > big advantages of the AF_PACKET based pmd.  I would expect a single queue setup
> > > to perform in a very similar fashion to the pcap PMD.
> >
> > > > As both the pcap and AF_PACKET PMDs depend on interrupt-based
> > > > kernel NIC drivers, none of the DPDK performance optimization techniques are utilized. Why should DPDK adopt
> > > > two similar and poorly performing PMDs which cannot demonstrate DPDK's key value, "high performance"?
> > Several reasons:
> > * "High performance" isn't always the key need for end users.  Consider
> > > pre-hardware availability development phase.
> > >
> > > * Better hardware modeling (consider AF_PACKET's multiqueue ability)
> >
> > * Better scaling (pcap doesn't make use of the fanout features that AF_PACKET
> > does)
> >
> > * Space savings, Building the AF_PACKET pmd doesn't require the additional
> > building/storage of the pcap driver.
> 
> This would include not requiring a dependency on libpcap, if nothing else.

librte_pmd_pcap and librte_pmd_packet are both DPDK wrapper libraries, on top of the libpcap library and the kernel AF_PACKET module respectively,
so they were not designed for high performance, which is understandable. DPDK is opening up to a wider audience of data center
users who do not need very high performance, so from that angle it makes sense to me to adopt librte_pmd_packet.
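
To put the single-queue figure quoted above (1.35M pps with 64B packets) in context, the sketch below converts a packet rate into wire bandwidth and computes the theoretical packet rate of a link. The function names are made up for this example, and the 10 Gbit/s link speed in the note that follows is an assumption chosen only for comparison; the thread does not say which NIC was measured.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Per-frame bytes on the wire that are not part of the Ethernet frame:
 * 7B preamble + 1B start-of-frame delimiter + 12B inter-frame gap.
 */
#define ETH_WIRE_OVERHEAD 20u

/* Wire bandwidth, in bits per second, used by `pps` frames of `frame_bytes`. */
uint64_t wire_bits_per_sec(uint64_t pps, uint32_t frame_bytes)
{
	assert(frame_bytes > 0);
	return pps * (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD) * 8u;
}

/* Maximum frames per second a link of `link_bps` can carry at `frame_bytes`. */
uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	assert(frame_bytes > 0);
	return link_bps / ((uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD) * 8u);
}
```

Under these assumptions, wire_bits_per_sec(1350000, 64) gives 907,200,000 bit/s and line_rate_pps(10000000000, 64) gives 14,880,952 packets/s, so the quoted rate would be roughly 9% of the 64B line rate of a 10 Gbit/s link.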

> 
> > >
> > > >
> > > > > > I look forward to seeing the changes you mention -- they sound very
> > > > > > exciting.  But, they will still require both networking core and
> > > > > > driver changes in the kernel.  And as I understand things today,
> > > > > > the userland code will still need at least some knowledge of specific
> > > > > > devices and how they layout their packet descriptors, etc.  So while
> > > > > > those changes sound very promising, they will still have certain
> > > > > > drawbacks in common with the current situation.
> > > > >
> > > > > > Yes, we would like the DPDK performance optimization techniques, such as huge pages, efficient rx/tx routines that manipulate
> > > > > > device-specific packet descriptors, and the polling model, to remain usable. We have to trade off performance against commonality. But we believe
> > > > > > it will be much easier to develop a DPDK PMD for non-Intel NICs than to port entire kernel drivers to DPDK.
> > > > >
> > > >
> > > > Not sure how this relates, what you're describing is the feature intel has been
> > > > working on to augment kernel drivers to provide better throughput via direct
> > > > > hardware access to user space.  John's PMD provides ubiquitous function on all
> > > > hardware. I'm not sure how the desire for one implies the other isn't valuable?
> > > >
> > >
> > > > Performance, rather than commonality, is the key value of DPDK. But we are trying to improve the commonality of our solution so that
> > > > other NIC vendors can adopt it easily.
> > >
> > > That's completely irrelevant to the question at hand.  To go with your reasoning,
> > if performance is the key value of the DPDK, then you should remove all driver
> > support save for the most performant hardware you have.  By that same token,
> > you should deprecate the pcap driver in favor of this AF_PACKET driver, because
> > it has shown performance improvement.
> >
> > I'm being facetious, of course, but the facts remain: Lack of superior
> > performance from one PMD to the next does not immediately obviate the need for
> > one PMD over another, as they quite likely address differing needs.  As you note
> > > the DPDK seeks performance as a key goal, but it's an open source project, and there
> > are other needs from other users in play here.  The AF_PACKET pmd provides
> > superior performance on linux platforms when hardware independence is required.
> > It differs from the pcap PMD as it uses features that are only available on the
> > Linux platform, so it stands to reason we should have both.
> 
> IMHO, the biggest deficiency in DPDK is the lack of apps.  Let's face
> it, no one really cares about running l2fwd except for testing the
> drivers.  What people want is applications.  Providing a PMD to use
> while developing an app without requiring specific hardware seems like
> a win to me.  The pcap PMD addresses some of that, but it is more of
> a stop-gap or special purpose thing (like for playing back captures).
> 

That is not true for network middleboxes, which solve L2/L3 packet processing problems (the main problem DPDK was created to solve),
but it might be true for data center or endpoint applications that primarily address L4-L7 packet processing problems. Those applications
do not care much about high L2/L3 throughput or packet latency, as their performance bottlenecks are in the L4-L7 routines.

> > > > > > It seems like the changes you mention will still need some sort of
> > > > > > AF_PACKET-based PMD driver.  Have you implemented that completely
> > > > > > separate from the code I already posted?  Or did you add that work
> > > > > > on top of mine?
> > > > > >
> > > > >
> > > > > > For userland code, it certainly uses some of your code related to the raw socket, but highly modified. A layer will be added to the eth_dev
> > > > > > library to do device probing and support new socket options.
> > > > >
> > > >
> > > > > Ok, but again, PMDs are independent, and serve different needs.  If their use
> > > > is at all overlapping from a functional standpoint, take this one now, and
> > > > deprecate it when a better one comes along.  Though from your description it
> > > > seems like both have a valid place in the ecosystem.
> > > >
> > >
> > > > I am ok with this approach, as long as this AF_PACKET PMD does not add extra maintenance effort. Thomas might make the call.
> > >
> > What extra maintainer efforts do you think are required here, that wouldn't be
> > required for any PMD?  To suggest that a given PMD shouldn't be included because
> > it would require additional effort to maintain holds it to a higher standard
> > than the PMD's already included.  I don't recall anyone asking if the i40e or
> > bonding pmds would require additional effort before being integrated.
> 
> Right -- how much maintainer effort is put into the pcap driver
> these days?

I do not know the details, but I DO know that the validation team has to put a lot of effort into measuring its performance on different platforms.
An automated performance test suite would probably help a lot.

> 
> John
> --
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-12 18:05   ` John W. Linville
  2014-09-12 18:31     ` Zhou, Danny
@ 2014-09-16 20:16     ` Neil Horman
  2014-09-26  9:28       ` Thomas Monjalon
  1 sibling, 1 reply; 76+ messages in thread
From: Neil Horman @ 2014-09-16 20:16 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

On Fri, Sep 12, 2014 at 02:05:23PM -0400, John W. Linville wrote:
> Ping?  Are there objections to this patch from mid-July?
> 
> John
> 
Thomas, Where are you on this?  It seems like if you don't have any objections
to this patch, it should go in, in light of the lack of further commentary.

Neil


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-16 20:16     ` Neil Horman
@ 2014-09-26  9:28       ` Thomas Monjalon
  2014-09-26 14:08         ` Neil Horman
  0 siblings, 1 reply; 76+ messages in thread
From: Thomas Monjalon @ 2014-09-26  9:28 UTC (permalink / raw)
  To: Neil Horman, John W. Linville; +Cc: dev

2014-09-16 16:16, Neil Horman:
> On Fri, Sep 12, 2014 at 02:05:23PM -0400, John W. Linville wrote:
> > Ping?  Are there objections to this patch from mid-July?
> 
> Thomas, Where are you on this?  It seems like if you don't have any objections
> to this patch, it should go in, in light of the lack of further commentary.

1) It doesn't appear as a top priority.
2) It's competing with pcap PMD and bifurcated PMD to come
   (http://dpdk.org/ml/archives/dev/2014-September/005379.html)
3) There is no test associated with this PMD.
If any one of these items ceases to hold, it should go in.

Currently, 2 projects are being initiated for validation (dcts) and
documentation. Keeping new things outside of the DPDK core makes it
clear that they do not yet have to be supported by dcts and doc.
So, it is better to have an external PMD, like memnic, acting as a
staging area.

During this time, keeping this PMD separately will allow you to update it
with a maintainer account in dpdk.org. I just need your SSH public key.

Thank you
-- 
Thomas


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-26  9:28       ` Thomas Monjalon
@ 2014-09-26 14:08         ` Neil Horman
  2014-09-29 10:05           ` Bruce Richardson
  0 siblings, 1 reply; 76+ messages in thread
From: Neil Horman @ 2014-09-26 14:08 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> 2014-09-16 16:16, Neil Horman:
> > On Fri, Sep 12, 2014 at 02:05:23PM -0400, John W. Linville wrote:
> > > Ping?  Are there objections to this patch from mid-July?
> > 
> > Thomas, Where are you on this?  It seems like if you don't have any objections
> > to this patch, it should go in, in light of the lack of further commentary.
> 
> 1) It doesn't appear as a top priority.
That's your responsibility.  Patches can't languish and rot on a list forever
just because others aren't willing to test it.  If there's further testing that
you feel it needs, ask. But from my read, it's been tested for functionality and
performance (though high performance is never expected from an AF_PACKET PMD).
Given that any one PMD will not affect the performance of another in isolation,
I'm not sure what more you're waiting for here.

> 2) It's competing with pcap PMD and bifurcated PMD to come
>    (http://dpdk.org/ml/archives/dev/2014-September/005379.html)
Regarding the pcap PMD, so?  It's an alternate implementation that provides
different features with different limitations.  The fact that they are similar
is irrelevant.  If similarity were the test, then we wouldn't bother with the
bifurcated driver either, because the pcap pmd already exists.

Regarding the bifurcated driver, you can't hold existing patches on the promise
of another PMD that's coming at an indeterminate time in the future.  There's no
reason not to take this now and deprecate it in the future if there is
sufficient overlap with the bifurcated driver, though to my point above, they
still address different needs with different limitations, so I don't see doing
so as necessary.
 
> 3) There is no test associated with this PMD.
That would have been a great comment to make a few months back, though what's
wrong with testpmd here?  That seems to be the same test that every other pmd
uses. What exactly are you looking for?


> If any one of these items ceases to hold, it should go in.
> 

> Currently, 2 projects are being initiated for validation (dcts) and
> documentation. Keeping new things outside of the DPDK core makes it
> clear that they do not yet have to be supported by dcts and doc.
> So, it is better to have an external PMD, like memnic, acting as a
> staging area.
> 
So, this brings up an excellent point - Validation and support.  Commonly open
source projects don't provide support at the upstream HEAD. Those items are
applied and enforced by distributors.  There's no need to ensure that the
upstream head is always the most performant and stable point of the tree.  It's
that need that keeps the development pace slow, and creates frustrations like
this one, where a patch sits unaddressed for long periods of time.  Commonly the
workflow for most open source projects is for there to be a window of time where
visual review and basic functional testing are sufficient for acceptance into
the head of the tree.  After the development window closes there is a
stabilization period where testing/validation is done to ensure that no
regressions have been encountered, optionally with a -next branch temporarily
being created to accept patches for upcoming future releases.  If regressions
are found, it's a simple matter in git to bisect back to the offending patch,
allow the contributing developer an opportunity to fix the issue, or to drop the
patch.  Using a workflow like this we can have a reasonable balance of needs
(good patch turn around time, as well as reasonable testing).  We've discussed
this when I posted the PMD_REGISTER_DRIVER patch months ago, and I thought you
were going to move in the direction of this workflow.  What happened?

> During this time, keeping this PMD separately will allow you to update it
> with a maintainer account in dpdk.org. I just need your SSH public key.
> 
We've discussed this too, keeping PMDs maintained separately is a very bad idea.
Doing so means developers have to constantly be aware of changes to the core
tree and try to keep up individually.  Integrating them all means that API
changes can be easily propagated to all PMDs when needed without making work
for many people.  It's exactly the reason we encourage driver writers to open
source drivers in Linux, because not doing so closes developers off from the
free maintenance they get when optimizations are made to APIs.  And if you
follow the development model above, you don't need to worry about implied
support, as that correctly becomes a distributor issue.


Neil


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-26 14:08         ` Neil Horman
@ 2014-09-29 10:05           ` Bruce Richardson
  2014-10-08 15:57             ` Thomas Monjalon
  0 siblings, 1 reply; 76+ messages in thread
From: Bruce Richardson @ 2014-09-29 10:05 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Fri, Sep 26, 2014 at 10:08:55AM -0400, Neil Horman wrote:
> On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> > 2014-09-16 16:16, Neil Horman:
> > > On Fri, Sep 12, 2014 at 02:05:23PM -0400, John W. Linville wrote:
> > > > Ping?  Are there objections to this patch from mid-July?
> > > 
> > > Thomas, Where are you on this?  It seems like if you don't have any objections
> > > to this patch, it should go in, in light of the lack of further commentary.
> > 
> > 1) It doesn't appear as a top priority.
> That's your responsibility.  Patches can't languish and rot on a list forever
> just because others aren't willing to test it.  If there's further testing that
> you feel it needs, ask. But from my read, it's been tested for functionality and
> performance (though high performance is never expected from an AF_PACKET PMD).
> Given that any one PMD will not affect the performance of another in isolation,
> I'm not sure what more you're waiting for here.
> 
> > 2) It's competing with pcap PMD and bifurcated PMD to come
> >    (http://dpdk.org/ml/archives/dev/2014-September/005379.html)
> Regarding the pcap PMD, so?  It's an alternate implementation that provides
> different features with different limitations.  The fact that they are similar
> is irrelevant.  If similarity were the test, then we wouldn't bother with the
> bifurcated driver either, because the pcap pmd already exists.
> 
> Regarding the bifurcated driver, you can't hold existing patches on the promise
> of another PMD that's coming at an indeterminate time in the future.  There's no
> reason not to take this now and deprecate it in the future if there is
> sufficient overlap with the bifurcated driver, though to my point above, they
> still address different needs with different limitations, so I don't see doing
> so as necessary.
>  
> > 3) There is no test associated with this PMD.
> That would have been a great comment to make a few months back, though what's
> wrong with testpmd here?  That seems to be the same test that every other pmd
> uses. What exactly are you looking for?
> 
> 
> > If any one of these items ceases to hold, it should go in.
> > 
> 
> > Currently, 2 projects are being initiated for validation (dcts) and
> > documentation. Keeping new things outside of the DPDK core makes it
> > clear that they do not yet have to be supported by dcts and doc.
> > So, it is better to have an external PMD, like memnic, acting as a
> > staging area.
> > 
> So, this brings up an excellent point - Validation and support.  Commonly open
> source projects don't provide support at the upstream HEAD. Those items are
> applied and enforced by distributors.  There's no need to ensure that the
> upstream head is always the most performant and stable point of the tree.  It's
> that need that keeps the development pace slow, and creates frustrations like
> this one, where a patch sits unaddressed for long periods of time.  Commonly the
> workflow for most open source projects is for there to be a window of time where
> visual review and basic functional testing are sufficient for acceptance into
> the head of the tree.  After the development window closes there is a
> stabilization period where testing/validation is done to ensure that no
> regressions have been encountered, optionally with a -next branch temporarily
> being created to accept patches for upcoming future releases.  If regressions
> are found, it's a simple matter in git to bisect back to the offending patch,
> allow the contributing developer an opportunity to fix the issue, or to drop the
> patch.  Using a workflow like this we can have a reasonable balance of needs
> (good patch turn around time, as well as reasonable testing).  We've discussed
> this when I posted the PMD_REGISTER_DRIVER patch months ago, and I thought you
> were going to move in the direction of this workflow.  What happened?
> 
> > During this time, keeping this PMD separately will allow you to update it
> > with a maintainer account in dpdk.org. I just need your SSH public key.
> > 
> We've discussed this too, keeping PMDs maintained separately is a very bad idea.
> Doing so means developers have to constantly be aware of changes to the core
> tree and try to keep up individually.  Integrating them all means that API
> changes can be easily propagated to all PMDs when needed without making work
> for many people.  It's exactly the reason we encourage driver writers to open
> source drivers in Linux, because not doing so closes developers off from the
> free maintenance they get when optimizations are made to APIs.  And if you
> follow the development model above, you don't need to worry about implied
> support, as that correctly becomes a distributor issue.
> 
> 
> Neil

While not wanting to get too involved in the discussion, I'd just like to 
express my support for getting this new PMD merged in.

/Bruce


* Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-09-29 10:05           ` Bruce Richardson
@ 2014-10-08 15:57             ` Thomas Monjalon
  2014-10-08 19:14               ` Neil Horman
  0 siblings, 1 reply; 76+ messages in thread
From: Thomas Monjalon @ 2014-10-08 15:57 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

2014-09-29 11:05, Bruce Richardson:
> On Fri, Sep 26, 2014 at 10:08:55AM -0400, Neil Horman wrote:
> > On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> > > 2014-09-16 16:16, Neil Horman:
> > > > On Fri, Sep 12, 2014 at 02:05:23PM -0400, John W. Linville wrote:
> > > > > Ping?  Are there objections to this patch from mid-July?
> > > > 
> > > > Thomas, Where are you on this?  It seems like if you don't have any objections
> > > > to this patch, it should go in, in light of the lack of further commentary.
> > > 
> > > 1) It doesn't appear as a top priority.
> > That's your responsibility.  Patches can't languish and rot on a list forever
> > just because others aren't willing to test it.  If there's further testing that
> > you feel it needs, ask. But from my read, it's been tested for functionality and
> > performance (though high performance is never expected from an AF_PACKET PMD).
> > Given that any one PMD will not affect the performance of another in isolation,
> > I'm not sure what more you're waiting for here.

Yes, integration of new PMDs must be accelerated.

> > > 2) It's competing with pcap PMD and bifurcated PMD to come
> > >    (http://dpdk.org/ml/archives/dev/2014-September/005379.html)
> > Regarding the pcap PMD, so?  It's an alternate implementation that provides
> > different features with different limitations.  The fact that they are similar
> > is irrelevant.  If similarity were the test, then we wouldn't bother with the
> > bifurcated driver either, because the pcap pmd already exists.
> > 
> > Regarding the bifurcated driver, you can't hold existing patches on the promise
> > of another PMD that's coming at an indeterminate time in the future.  There's no
> > reason not to take this now and deprecate it in the future if there is
> > sufficient overlap with the bifurcated driver, though to my point above, they
> > still address different needs with different limitations, so I don't see doing
> > so as necessary.

Yes, we'll discuss it when the bifurcated driver is released.

> > > 3) There is no test associated with this PMD.
> > That would have been a great comment to make a few months back, though what's
> > wrong with testpmd here?  That seems to be the same test that every other pmd
> > uses. What exactly are you looking for?

I was thinking of testing behaviour with different kernel configurations and
unit tests for --vdev options. But it's not a major blocker.
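
As an illustration of what a unit test for the --vdev options could exercise, here is a minimal sketch of a parser for the option keys documented in the patch description (iface, qpairs, blocksz, framesz, framecnt) and their stated defaults. The struct pkt_opts type and the parse_pkt_opts() function are hypothetical names made up for this example, and the input is assumed to be the argument string that follows the device name; this is not the parser used by the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical option holder; not a structure from the patch. */
struct pkt_opts {
	char iface[32];
	unsigned int qpairs, blocksz, framesz, framecnt;
};

/*
 * Parse "key=value[,key=value...]" into *o.
 * Returns 0 on success, or -1 on a malformed string, an unknown key,
 * a non-numeric or zero numeric value, or a missing "iface".
 */
int parse_pkt_opts(const char *args, struct pkt_opts *o)
{
	char buf[256];
	char *p;

	assert(args != NULL && o != NULL);

	/* Defaults as listed in the patch description. */
	memset(o, 0, sizeof(*o));
	o->qpairs = 16;
	o->blocksz = 4096;
	o->framesz = 2048;
	o->framecnt = 512;

	if (strlen(args) >= sizeof(buf))
		return -1;
	strcpy(buf, args);

	p = buf;
	while (p != NULL && *p != '\0') {
		char *next = strchr(p, ',');
		char *val, *end;
		unsigned long v;

		/* Terminate the current "key=value" token at the comma. */
		if (next != NULL)
			*next++ = '\0';
		val = strchr(p, '=');
		if (val == NULL)
			return -1;
		*val++ = '\0';

		if (strcmp(p, "iface") == 0) {
			if (*val == '\0' || strlen(val) >= sizeof(o->iface))
				return -1;
			strcpy(o->iface, val);
		} else {
			/* All remaining keys take a positive decimal number. */
			if (*val < '0' || *val > '9')
				return -1;
			v = strtoul(val, &end, 10);
			if (*end != '\0' || v == 0 || v > UINT_MAX)
				return -1;
			if (strcmp(p, "qpairs") == 0)
				o->qpairs = (unsigned int)v;
			else if (strcmp(p, "blocksz") == 0)
				o->blocksz = (unsigned int)v;
			else if (strcmp(p, "framesz") == 0)
				o->framesz = (unsigned int)v;
			else if (strcmp(p, "framecnt") == 0)
				o->framecnt = (unsigned int)v;
			else
				return -1;
		}
		p = next;
	}
	/* "iface" is the only required option. */
	return o->iface[0] != '\0' ? 0 : -1;
}
```

A test along these lines would check that "iface=eth0,qpairs=4" keeps the other defaults, and that a string without iface or with an unknown key is rejected.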

> > > If any one of these items ceases to hold, it should go in.
> > 
> > > Currently, 2 projects are being initiated for validation (dcts) and
> > > documentation. Keeping new things outside of the DPDK core makes it
> > > clear that they do not yet have to be supported by dcts and doc.
> > > So, it is better to have an external PMD, like memnic, acting as a
> > > staging area.
> > > 
> > So, this brings up an excellent point - Validation and support.  Commonly open
> > source projects don't provide support at the upstream HEAD. Those items are
> > applied and enforced by distributors.  There's no need to ensure that the
> > upstream head is always the most performant and stable point of the tree.  It's
> > that need that keeps the development pace slow, and creates frustrations like
> > this one, where a patch sits unaddressed for long periods of time.  Commonly the
> > workflow for most open source projects is for there to be a window of time where
> > visual review and basic functional testing are sufficient for acceptance into
> > the head of the tree.  After the development window closes there is a
> > stabilization period where testing/validation is done to ensure that no
> > regressions have been encountered, optionally with a -next branch temporarily
> > being created to accept patches for upcoming future releases.  If regressions
> > are found, it's a simple matter in git to bisect back to the offending patch,
> > allow the contributing developer an opportunity to fix the issue, or to drop the
> > patch.  Using a workflow like this we can have a reasonable balance of needs
> > (good patch turn around time, as well as reasonable testing).  We've discussed
> > this when I posted the PMD_REGISTER_DRIVER patch months ago, and I thought you
> > were going to move in the direction of this workflow.  What happened?

Yes, we are moving to a "merge window" workflow.

> > > During this time, keeping this PMD separately will allow you to update it
> > > with a maintainer account in dpdk.org. I just need your SSH public key.
> > > 
> > We've discussed this too, keeping PMDs maintained separately is a very bad idea.
> > Doing so means developers have to constantly be aware of changes to the core
> > tree and try to keep up individually.  Integrating them all means that API
> > changes can be easily propagated to all PMDs when needed without making work
> > for many people.  It's exactly the reason we encourage driver writers to open
> > source drivers in Linux, because not doing so closes developers off from the
> > free maintenance they get when optimizations are made to APIs.  And if you
> > follow the development model above, you don't need to worry about implied
> > support, as that correctly becomes a distributor issue.
> > 
> > 
> > Neil
> 
> While not wanting to get too involved in the discussion, I'd just like to 
> express my support for getting this new PMD merged in.

If Red Hat is committed to its maintenance, it could be integrated in release 1.8.
But I'd like it to be renamed as pmd_af_packet (or a better name) instead of
pmd_packet.

Thanks
-- 
Thomas


* Re: [dpdk-dev] [P