DPDK patches and discussions
* [dpdk-dev] [RFC] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
@ 2014-06-06 19:25 John W. Linville
  2014-06-06 19:47 ` Chris Wright
  2014-06-06 20:30 ` Neil Horman
  0 siblings, 2 replies; 8+ messages in thread
From: John W. Linville @ 2014-06-06 19:25 UTC (permalink / raw)
  To: dev

This is a Linux-specific virtual PMD driver backed by an AF_PACKET
socket.  The current implementation uses mmap'ed ring buffers to
limit copying and user/kernel transitions.  The intent is also to take
advantage of fanout and any future AF_PACKET optimizations as well.

This is intended to provide a means for using DPDK on a broad range
of hardware without hardware-specific PMDs and hopefully with better
performance than what PCAP offers in Linux.  This might be useful
as a development platform for DPDK applications when DPDK-supported
hardware is expensive or unavailable.
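
As a rough illustration of the ring model described above, the
receive-side walk can be sketched without a socket at all.  What follows
is a user-space simulation only, not the driver code: sim_frame,
sim_ring, sim_ring_rx_burst and the SIM_* constants are hypothetical
stand-ins for the mmap'ed frames, the per-queue state and the
TP_STATUS_* flags.  A real implementation shares the frames with the
kernel, which this sketch does not model.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the ring geometry; the count must be a power of two. */
#define SIM_FRAME_COUNT   8u
#define SIM_FRAME_MASK    (SIM_FRAME_COUNT - 1u)

/* Stand-ins for the frame ownership flags. */
#define SIM_STATUS_KERNEL 0u	/* frame owned by the producer */
#define SIM_STATUS_USER   1u	/* frame ready for the consumer */

struct sim_frame {
	uint32_t status;	/* ownership flag */
	uint32_t len;		/* payload length in bytes */
};

struct sim_ring {
	struct sim_frame frames[SIM_FRAME_COUNT];
	unsigned int next;	/* index of the next frame to inspect */
};

/*
 * Consume up to max ready frames, recording their lengths in lens[].
 * Stops at the first frame that is not marked ready, hands each
 * consumed frame back to the producer, and wraps the index with a mask.
 */
static unsigned int
sim_ring_rx_burst(struct sim_ring *r, uint32_t *lens, unsigned int max)
{
	unsigned int n = 0;
	unsigned int idx;

	assert(r != 0 && lens != 0);

	idx = r->next;
	while (n < max) {
		struct sim_frame *f = &r->frames[idx];

		if ((f->status & SIM_STATUS_USER) == 0)
			break;

		lens[n++] = f->len;
		f->status = SIM_STATUS_KERNEL;
		idx = (idx + 1) & SIM_FRAME_MASK;
	}
	r->next = idx;
	return n;
}
```

The burst returns early rather than blocking when no frame is ready,
which is the behaviour a polling receive path wants.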

Signed-off-by: John W. Linville <linville@tuxdriver.com>
---
I've been toying with this for a while without a lot of progress.
I was about to post the original RFC patch just as the PMD
initialization flows got rewritten.  I set this down while that was
settling-out, and only just recently got back to it.

Anyway, I figure it is better to get this out now and let people
comment on it and/or get some use out of it if they can.  I have
posted this as RFC as it has only had very limited testing locally
and I'm sure it still could use some clean-ups and improvements
(like parameterizing block/frame size/count).
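
One possible shape for that parameterization is sketched here.  It is
only a sketch under assumptions: struct ring_geometry and
ring_geometry_init are hypothetical names that do not exist in this
patch, and the checks cover only what the driver's own ring walk relies
on (whole frames per block, and a power-of-two frame count so the index
can wrap with a mask).  The kernel applies its own validation when the
ring is requested, which this does not attempt to reproduce.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical container for the values the patch hard-codes as macros. */
struct ring_geometry {
	unsigned int block_size;	/* bytes per block, like BLOCKSIZ */
	unsigned int block_count;	/* number of blocks, like BLOCKCNT */
	unsigned int frame_size;	/* bytes per frame, like FRAMESIZ */
	unsigned int frame_count;	/* derived, like FRAMECNT */
	unsigned int frame_mask;	/* derived, like FRMCNTMSK */
};

static int
is_power_of_two(unsigned int v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Derive the frame count and mask from caller-supplied sizes.
 * Returns 0 and fills *g on success; returns -1 and leaves *g
 * untouched if the values cannot be used with a masked index.
 */
static int
ring_geometry_init(struct ring_geometry *g, unsigned int block_size,
		   unsigned int block_count, unsigned int frame_size)
{
	unsigned int frames_per_block, frame_count;

	assert(g != NULL);

	if (block_size == 0 || block_count == 0 || frame_size == 0)
		return -1;

	/* each block must hold a whole number of frames */
	if (block_size % frame_size != 0)
		return -1;

	frames_per_block = block_size / frame_size;
	if (frames_per_block > UINT_MAX / block_count)
		return -1;

	/* the ring index wraps with a mask, so require a power of two */
	frame_count = frames_per_block * block_count;
	if (!is_power_of_two(frame_count))
		return -1;

	g->block_size = block_size;
	g->block_count = block_count;
	g->frame_size = frame_size;
	g->frame_count = frame_count;
	g->frame_mask = frame_count - 1;
	return 0;
}
```

With the current defaults (block size 1 << 16, block count 1 << 2, frame
size 1 << 11) this yields 128 frames and a mask of 127, matching
FRAMECNT and FRMCNTMSK.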

 config/common_bsdapp                   |   5 +
 config/common_linuxapp                 |   5 +
 lib/Makefile                           |   1 +
 lib/librte_eal/linuxapp/eal/Makefile   |   1 +
 lib/librte_pmd_packet/Makefile         |  60 +++
 lib/librte_pmd_packet/rte_eth_packet.c | 706 +++++++++++++++++++++++++++++++++
 lib/librte_pmd_packet/rte_eth_packet.h |  53 +++
 mk/rte.app.mk                          |   4 +
 8 files changed, 835 insertions(+)
 create mode 100644 lib/librte_pmd_packet/Makefile
 create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
 create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h

diff --git a/config/common_bsdapp b/config/common_bsdapp
index 2cc7b800b37b..b28c22820719 100644
--- a/config/common_bsdapp
+++ b/config/common_bsdapp
@@ -187,6 +187,11 @@ CONFIG_RTE_PMD_RING_MAX_TX_RINGS=16
 CONFIG_RTE_LIBRTE_PMD_PCAP=y
 
 #
+# Compile software PMD backed by AF_PACKET sockets (Linux only)
+#
+CONFIG_RTE_LIBRTE_PMD_PACKET=n
+
+#
 # Do prefetch of packet data within PMD driver receive function
 #
 CONFIG_RTE_PMD_PACKET_PREFETCH=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 62619c6c3a38..3ee29e1dd7ed 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -210,6 +210,11 @@ CONFIG_RTE_PMD_RING_MAX_TX_RINGS=16
 #
 CONFIG_RTE_LIBRTE_PMD_PCAP=n
 
+#
+# Compile software PMD backed by AF_PACKET sockets (Linux only)
+#
+CONFIG_RTE_LIBRTE_PMD_PACKET=y
+
 
 CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
 
diff --git a/lib/Makefile b/lib/Makefile
index b92b3921e654..d530097e132b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -44,6 +44,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_E1000_PMD) += librte_pmd_e1000
 DIRS-$(CONFIG_RTE_LIBRTE_IXGBE_PMD) += librte_pmd_ixgbe
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
 DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
 DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index b05282047709..95ab389d8d74 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
 CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
+CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
 CFLAGS += $(WERROR_FLAGS) -O3
 
diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
new file mode 100644
index 000000000000..e1266fb992cd
--- /dev/null
+++ b/lib/librte_pmd_packet/Makefile
@@ -0,0 +1,60 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
+#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+#   Copyright(c) 2014 6WIND S.A.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_pmd_packet.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+#
+# all source files are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
+
+#
+# Export include files
+#
+SYMLINK-y-include += rte_eth_packet.h
+
+# this lib depends upon:
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
new file mode 100644
index 000000000000..5fb62df40d92
--- /dev/null
+++ b/lib/librte_pmd_packet/rte_eth_packet.c
@@ -0,0 +1,706 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
+ *
+ *   Originally based upon librte_pmd_pcap code:
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2014 6WIND S.A.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_mbuf.h>
+#include <rte_ethdev.h>
+#include <rte_malloc.h>
+#include <rte_kvargs.h>
+#include <rte_dev.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <poll.h>
+
+#include "rte_eth_packet.h"
+
+#define ETH_PACKET_IFACE_ARG		"iface"
+
+/* These should be parameterized... */
+#define BLOCKSIZ	(1 << 16)
+#define BLOCKCNT	(1 << 2)
+#define FRAMESIZ	(1 << 11)
+#define FRAMECNT	((BLOCKSIZ * BLOCKCNT) / FRAMESIZ)
+#define FRMCNTMSK	(FRAMECNT - 1)
+
+struct pkt_rx_queue {
+	int sockfd;
+
+	struct iovec *rd;
+	uint8_t *map;
+	unsigned int framenum;
+
+	struct rte_mempool *mb_pool;
+
+	volatile unsigned long rx_pkts;
+	volatile unsigned long err_pkts;
+};
+
+struct pkt_tx_queue {
+	int sockfd;
+
+	struct iovec *rd;
+	uint8_t *map;
+	unsigned int framenum;
+
+	volatile unsigned long tx_pkts;
+	volatile unsigned long err_pkts;
+};
+
+struct pmd_internals {
+	unsigned nb_rx_queues;
+	unsigned nb_tx_queues;
+
+	int if_index;
+	struct ether_addr eth_addr;
+
+	struct tpacket_req req;
+
+	struct pkt_rx_queue rx_queue[RTE_PMD_RING_MAX_RX_RINGS];
+	struct pkt_tx_queue tx_queue[RTE_PMD_RING_MAX_TX_RINGS];
+};
+
+static const char *valid_arguments[] = {
+	ETH_PACKET_IFACE_ARG,
+	NULL
+};
+
+static const char *drivername = "AF_PACKET PMD";
+
+static struct rte_eth_link pmd_link = {
+	.link_speed = 10000,
+	.link_duplex = ETH_LINK_FULL_DUPLEX,
+	.link_status = 0
+};
+
+static uint16_t
+eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	unsigned i;
+	struct tpacket2_hdr *ppd;
+	struct rte_mbuf *mbuf;
+	uint8_t *pbuf;
+	struct pkt_rx_queue *pkt_q = queue;
+	uint16_t num_rx = 0;
+	unsigned int framenum;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	/*
+	 * Reads the given number of packets from the AF_PACKET socket one by
+	 * one and copies the packet data into a newly allocated mbuf.
+	 */
+	framenum = pkt_q->framenum;
+	for (i = 0; i < nb_pkts; i++) {
+		/* point at the next incoming frame */
+		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+		if ((ppd->tp_status & TP_STATUS_USER) == 0)
+			break;
+
+		/* allocate the next mbuf */
+		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
+		if (unlikely(mbuf == NULL))
+			break;
+
+		/* packet will fit in the mbuf, go ahead and receive it */
+		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
+		pbuf = (uint8_t *) ppd + ppd->tp_mac;
+		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
+
+		/* release incoming frame and advance ring buffer */
+		ppd->tp_status = TP_STATUS_KERNEL;
+		framenum = (framenum + 1) & FRMCNTMSK;
+
+		/* account for the received frame */
+		bufs[i] = mbuf;
+		num_rx++;
+	}
+	pkt_q->framenum = framenum;
+	pkt_q->rx_pkts += num_rx;
+	return num_rx;
+}
+
+/*
+ * Callback to handle sending packets through a real NIC.
+ */
+static uint16_t
+eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct tpacket2_hdr *ppd;
+	struct rte_mbuf *mbuf;
+	uint8_t *pbuf;
+	unsigned int framenum;
+	struct pollfd pfd;
+	struct pkt_tx_queue *pkt_q = queue;
+	uint16_t num_tx = 0;
+	int i;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	memset(&pfd, 0, sizeof(pfd));
+	pfd.fd = pkt_q->sockfd;
+	pfd.events = POLLOUT;
+	pfd.revents = 0;
+
+	framenum = pkt_q->framenum;
+	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+	for (i = 0; i < nb_pkts; i++) {
+		/* wait for the next tx frame to become available */
+		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
+		    (poll(&pfd, 1, -1) < 0))
+				continue;
+
+		/* copy the tx frame data */
+		mbuf = bufs[num_tx];
+		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
+			sizeof(struct sockaddr_ll);
+		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
+		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
+
+		/* hand the frame to the kernel and advance ring buffer */
+		ppd->tp_status = TP_STATUS_SEND_REQUEST;
+		framenum = (framenum + 1) & FRMCNTMSK;
+		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+
+		num_tx++;
+		rte_pktmbuf_free(mbuf);
+	}
+
+	/* kick-off transmits */
+	send(pkt_q->sockfd, NULL, 0, 0);
+
+	pkt_q->framenum = framenum;
+	pkt_q->tx_pkts += num_tx;
+	pkt_q->err_pkts += nb_pkts - num_tx;
+	return num_tx;
+}
+
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+	dev->data->dev_link.link_status = 1;
+	return 0;
+}
+
+/*
+ * This function gets called when the current port gets stopped.
+ */
+static void
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+	unsigned i;
+	int sockfd;
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	for (i = 0; i < internals->nb_tx_queues; i++) {
+		sockfd = internals->tx_queue[i].sockfd;
+		if (sockfd != -1)
+			close(sockfd);
+	}
+
+	dev->data->dev_link.link_status = 0;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev_info->driver_name = drivername;
+	dev_info->if_index = internals->if_index;
+	dev_info->max_mac_addrs = 1;
+	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
+	dev_info->max_rx_queues = (uint16_t)internals->nb_rx_queues;
+	dev_info->max_tx_queues = (uint16_t)internals->nb_tx_queues;
+	dev_info->min_rx_bufsize = 0;
+	dev_info->pci_dev = NULL;
+}
+
+static void
+eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
+{
+	unsigned i, imax;
+	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
+	const struct pmd_internals *internal = dev->data->dev_private;
+
+	memset(igb_stats, 0, sizeof(*igb_stats));
+
+	imax = (internal->nb_rx_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
+	        internal->nb_rx_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
+	for (i = 0; i < imax; i++) {
+		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
+		rx_total += igb_stats->q_ipackets[i];
+	}
+
+	imax = (internal->nb_tx_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
+	        internal->nb_tx_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
+	for (i = 0; i < imax; i++) {
+		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
+		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
+		tx_total += igb_stats->q_opackets[i];
+		tx_err_total += igb_stats->q_errors[i];
+	}
+
+	igb_stats->ipackets = rx_total;
+	igb_stats->opackets = tx_total;
+	igb_stats->oerrors = tx_err_total;
+}
+
+static void
+eth_stats_reset(struct rte_eth_dev *dev)
+{
+	unsigned i;
+	struct pmd_internals *internal = dev->data->dev_private;
+
+	for (i = 0; i < internal->nb_rx_queues; i++)
+		internal->rx_queue[i].rx_pkts = 0;
+
+	for (i = 0; i < internal->nb_tx_queues; i++) {
+		internal->tx_queue[i].tx_pkts = 0;
+		internal->tx_queue[i].err_pkts = 0;
+	}
+}
+
+static void
+eth_dev_close(struct rte_eth_dev *dev __rte_unused)
+{
+}
+
+static void
+eth_queue_release(void *q __rte_unused)
+{
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev __rte_unused,
+                int wait_to_complete __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev,
+                   uint16_t rx_queue_id,
+                   uint16_t nb_rx_desc __rte_unused,
+                   unsigned int socket_id __rte_unused,
+                   const struct rte_eth_rxconf *rx_conf __rte_unused,
+                   struct rte_mempool *mb_pool)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
+	struct rte_pktmbuf_pool_private *mbp_priv;
+	uint16_t buf_size;
+
+	pkt_q->mb_pool = mb_pool;
+
+	/* Now get the space available for data in the mbuf */
+	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
+	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
+	                       RTE_PKTMBUF_HEADROOM);
+
+	if (ETH_FRAME_LEN > buf_size) {
+		RTE_LOG(ERR, PMD,
+			"AF_PACKET %d bytes will not fit in mbuf (%d bytes)\n",
+			ETH_FRAME_LEN, buf_size);
+		return -ENOMEM;
+	}
+
+	dev->data->rx_queues[rx_queue_id] = pkt_q;
+
+	return 0;
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev,
+                   uint16_t tx_queue_id,
+                   uint16_t nb_tx_desc __rte_unused,
+                   unsigned int socket_id __rte_unused,
+                   const struct rte_eth_txconf *tx_conf __rte_unused)
+{
+
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
+	return 0;
+}
+
+static struct eth_dev_ops ops = {
+	.dev_start = eth_dev_start,
+	.dev_stop = eth_dev_stop,
+	.dev_close = eth_dev_close,
+	.dev_configure = eth_dev_configure,
+	.dev_infos_get = eth_dev_info,
+	.rx_queue_setup = eth_rx_queue_setup,
+	.tx_queue_setup = eth_tx_queue_setup,
+	.rx_queue_release = eth_queue_release,
+	.tx_queue_release = eth_queue_release,
+	.link_update = eth_link_update,
+	.stats_get = eth_stats_get,
+	.stats_reset = eth_stats_reset,
+};
+
+/*
+ * Opens an AF_PACKET socket
+ */
+static int
+open_packet_iface(const char *key __rte_unused,
+                  const char *value __rte_unused,
+                  void *extra_args)
+{
+	int *sockfd = extra_args;
+
+	/* Open an AF_PACKET socket... */
+	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+	if (*sockfd == -1) {
+		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+rte_pmd_init_internals(const int sockfd,
+                       const unsigned nb_rx_queues,
+                       const unsigned nb_tx_queues,
+                       const unsigned numa_node,
+                       struct pmd_internals **internals,
+                       struct rte_eth_dev **eth_dev,
+                       struct rte_kvargs *kvlist)
+{
+	struct rte_eth_dev_data *data = NULL;
+	struct rte_pci_device *pci_dev = NULL;
+	struct rte_kvargs_pair *pair = NULL;
+	struct ifreq ifr;
+	size_t ifnamelen;
+	unsigned k_idx;
+	struct sockaddr_ll sockaddr;
+	struct tpacket_req *req;
+	struct pkt_rx_queue *rx_queue;
+	struct pkt_tx_queue *tx_queue;
+	int rc, tpver, discard, bypass;
+	unsigned int i, rdsize;
+
+
+	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
+		pair = &kvlist->pairs[k_idx];
+		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
+			break;
+	}
+	if (pair == NULL) {
+		RTE_LOG(ERR, PMD,
+			"No interface specified for AF_PACKET ethdev\n");
+		goto error;
+	}
+
+	RTE_LOG(INFO, PMD,
+		"Creating AF_PACKET-backed ethdev on numa socket %u\n",
+		numa_node);
+
+	/*
+	 * now do all data allocation - for eth_dev structure, dummy pci driver
+	 * and internal (private) data
+	 */
+	data = rte_zmalloc_socket(NULL, sizeof(*data), 0, numa_node);
+	if (data == NULL)
+		goto error;
+
+	pci_dev = rte_zmalloc_socket(NULL, sizeof(*pci_dev), 0, numa_node);
+	if (pci_dev == NULL)
+		goto error;
+
+	*internals = rte_zmalloc_socket(NULL, sizeof(**internals),
+	                                0, numa_node);
+	if (*internals == NULL)
+		goto error;
+
+	tpver = TPACKET_V2;
+	rc = setsockopt(sockfd, SOL_PACKET, PACKET_VERSION,
+			&tpver, sizeof(tpver));
+	if (rc == -1) {
+		RTE_LOG(ERR, PMD,
+			"Could not set PACKET_VERSION on AF_PACKET "
+			"socket for %s\n", pair->value);
+		goto error;
+	}
+
+	discard = 1;
+	rc = setsockopt(sockfd, SOL_PACKET, PACKET_LOSS,
+			&discard, sizeof(discard));
+	if (rc == -1) {
+		RTE_LOG(ERR, PMD,
+			"Could not set PACKET_LOSS on AF_PACKET socket "
+			"for %s\n", pair->value);
+		goto error;
+	}
+
+	bypass = 1;
+	rc = setsockopt(sockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
+			&bypass, sizeof(bypass));
+	if (rc == -1) {
+		RTE_LOG(ERR, PMD,
+			"Could not set PACKET_QDISC_BYPASS on AF_PACKET socket "
+			"for %s\n", pair->value);
+		goto error;
+	}
+
+	req = &((*internals)->req);
+
+	req->tp_block_size = BLOCKSIZ;
+	req->tp_block_nr = BLOCKCNT;
+	req->tp_frame_size = FRAMESIZ;
+	req->tp_frame_nr = FRAMECNT;
+
+	rc = setsockopt(sockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
+	if (rc == -1) {
+		RTE_LOG(ERR, PMD,
+			"Could not set PACKET_RX_RING on AF_PACKET "
+			"socket for %s\n", pair->value);
+		goto error;
+	}
+
+	rc = setsockopt(sockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
+	if (rc == -1) {
+		RTE_LOG(ERR, PMD,
+			"Could not set PACKET_TX_RING on AF_PACKET "
+			"socket for %s\n", pair->value);
+		goto error;
+	}
+
+	rx_queue = &((*internals)->rx_queue[0]);
+
+	rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
+			    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
+			    sockfd, 0);
+	if (rx_queue->map == MAP_FAILED) {
+		RTE_LOG(ERR, PMD,
+			"Call to mmap failed on AF_PACKET socket for %s\n",
+			pair->value);
+		goto error;
+	}
+
+	/* rdsize is same for both Tx and Rx */
+	rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
+
+	rx_queue->rd = rte_zmalloc_socket(NULL, rdsize, 0, numa_node);
+	for (i = 0; i < req->tp_frame_nr; ++i) {
+		rx_queue->rd[i].iov_base = rx_queue->map + (i * FRAMESIZ);
+		rx_queue->rd[i].iov_len = req->tp_frame_size;
+	}
+
+	tx_queue = &((*internals)->tx_queue[0]);
+
+	tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
+
+	tx_queue->rd = rte_zmalloc_socket(NULL, rdsize, 0, numa_node);
+	for (i = 0; i < req->tp_frame_nr; ++i) {
+		tx_queue->rd[i].iov_base = tx_queue->map + (i * FRAMESIZ);
+		tx_queue->rd[i].iov_len = req->tp_frame_size;
+	}
+
+	ifnamelen = strlen(pair->value);
+	if (ifnamelen < sizeof(ifr.ifr_name)) {
+		memcpy(ifr.ifr_name, pair->value, ifnamelen);
+		ifr.ifr_name[ifnamelen] = '\0';
+	} else {
+		RTE_LOG(ERR, PMD,
+			"AF_PACKET I/F name too long (%s)\n",
+			pair->value);
+		goto error;
+	}
+	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
+		RTE_LOG(ERR, PMD,
+			"AF_PACKET ioctl failed (SIOCGIFINDEX)\n");
+		goto error;
+	}
+	(*internals)->if_index = ifr.ifr_ifindex;
+
+	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
+		RTE_LOG(ERR, PMD,
+			"AF_PACKET ioctl failed (SIOCGIFHWADDR)\n");
+		goto error;
+	}
+	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
+
+	memset(&sockaddr, 0, sizeof(sockaddr));
+	sockaddr.sll_family = AF_PACKET;
+	sockaddr.sll_protocol = htons(ETH_P_ALL);
+	sockaddr.sll_ifindex = (*internals)->if_index;
+
+	rc = bind(sockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
+	if (rc == -1) {
+		RTE_LOG(ERR, PMD,
+			"Could not bind AF_PACKET socket to %s\n", pair->value);
+		goto error;
+	}
+
+	/* reserve an ethdev entry */
+	*eth_dev = rte_eth_dev_allocate();
+	if (*eth_dev == NULL)
+		goto error;
+
+	/*
+	 * now put it all together
+	 * - store queue data in internals,
+	 * - store numa_node info in pci_driver
+	 * - point eth_dev_data to internals and pci_driver
+	 * - and point eth_dev structure to new eth_dev_data structure
+	 */
+
+	(*internals)->nb_rx_queues = nb_rx_queues;
+	(*internals)->nb_tx_queues = nb_tx_queues;
+
+	data->dev_private = *internals;
+	data->port_id = (*eth_dev)->data->port_id;
+	data->nb_rx_queues = (uint16_t)nb_rx_queues;
+	data->nb_tx_queues = (uint16_t)nb_tx_queues;
+	data->dev_link = pmd_link;
+	data->mac_addrs = &(*internals)->eth_addr;
+
+	pci_dev->numa_node = numa_node;
+
+	(*eth_dev)->data = data;
+	(*eth_dev)->dev_ops = &ops;
+	(*eth_dev)->pci_dev = pci_dev;
+
+	return 0;
+
+error:
+	if (data)
+		rte_free(data);
+	if (pci_dev)
+		rte_free(pci_dev);
+	if (*internals) {
+		/* rte_free() accepts NULL, so the rd pointers need no check */
+		rte_free((*internals)->rx_queue[0].rd);
+		rte_free((*internals)->tx_queue[0].rd);
+		rte_free(*internals);
+	}
+	return -1;
+}
+
+static int
+rte_eth_from_packet(int const *sockfd,
+                    const unsigned nb_rx_queues,
+                    const unsigned nb_tx_queues,
+                    const unsigned numa_node,
+                    struct rte_kvargs *kvlist)
+{
+	struct pmd_internals *internals = NULL;
+	struct rte_eth_dev *eth_dev = NULL;
+
+	/* do some parameter checking */
+	if (*sockfd < 0)
+		return -1;
+	if (nb_rx_queues < 1 || nb_tx_queues < 1)
+		return -1;
+
+	if (rte_pmd_init_internals(*sockfd,
+	                           nb_rx_queues, nb_tx_queues,
+	                           numa_node, &internals,
+	                           &eth_dev, kvlist) < 0)
+		return -1;
+
+	internals->rx_queue->sockfd = *sockfd;
+	internals->tx_queue->sockfd = *sockfd;
+
+	eth_dev->rx_pkt_burst = eth_packet_rx;
+	eth_dev->tx_pkt_burst = eth_packet_tx;
+
+	return 0;
+}
+
+int
+rte_pmd_packet_devinit(const char *name, const char *params)
+{
+	unsigned numa_node;
+	int ret;
+	struct rte_kvargs *kvlist;
+	int sockfd = -1;
+
+	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
+
+	numa_node = rte_socket_id();
+
+	kvlist = rte_kvargs_parse(params, valid_arguments);
+	if (kvlist == NULL)
+		return -1;
+
+	/*
+	 * If iface argument is passed we open the NICs and use them for
+	 * reading / writing
+	 */
+	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
+
+		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
+		                         &open_packet_iface, &sockfd);
+		if (ret < 0)
+			return -1;
+	}
+
+	ret = rte_eth_from_packet(&sockfd, 1, 1, numa_node, kvlist);
+
+	if (ret < 0) {
+		close(sockfd);
+		return -1;
+	}
+
+	return 0;
+}
+
+static struct rte_driver pmd_packet_drv = {
+	.name = "eth_packet",
+	.type = PMD_VDEV,
+	.init = rte_pmd_packet_devinit,
+};
+
+PMD_REGISTER_DRIVER(pmd_packet_drv);
diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
new file mode 100644
index 000000000000..957e551c1d28
--- /dev/null
+++ b/lib/librte_pmd_packet/rte_eth_packet.h
@@ -0,0 +1,53 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ETH_PACKET_H_
+#define _RTE_ETH_PACKET_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
+
+/**
+ * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
+ * configured on command line.
+ */
+int rte_pmd_packet_devinit(const char *name, const char *params);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index a8365775f176..e1439712f1d8 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -177,6 +177,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
 LDLIBS += -lrte_pmd_pcap -lpcap
 endif
 
+ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
+LDLIBS += -lrte_pmd_packet
+endif
+
 endif
 
 LDLIBS += $(EXECENV_LDLIBS)
-- 
1.9.3


* Re: [dpdk-dev] [RFC] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-06-06 19:25 [dpdk-dev] [RFC] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices John W. Linville
@ 2014-06-06 19:47 ` Chris Wright
  2014-06-06 19:57   ` John W. Linville
  2014-06-06 20:30 ` Neil Horman
  1 sibling, 1 reply; 8+ messages in thread
From: Chris Wright @ 2014-06-06 19:47 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

* John W. Linville (linville@tuxdriver.com) wrote:
> This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> socket.  The current implementation uses mmap'ed ring buffers to
> limit copying and user/kernel transitions.  The intent is also to take
> advantage of fanout and any future AF_PACKET optimizations as well.
> 
> This is intended to provide a means for using DPDK on a broad range
> of hardware without hardware-specific PMDs and hopefully with better
> performance than what PCAP offers in Linux.  This might be useful
> as a development platform for DPDK applications when DPDK-supported
> hardware is expensive or unavailable.

Nice, have you compared yet w/ PCAP numbers?


* Re: [dpdk-dev] [RFC] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-06-06 19:47 ` Chris Wright
@ 2014-06-06 19:57   ` John W. Linville
  2014-06-06 20:04     ` Richardson, Bruce
  2014-06-06 20:06     ` Chris Wright
  0 siblings, 2 replies; 8+ messages in thread
From: John W. Linville @ 2014-06-06 19:57 UTC (permalink / raw)
  To: Chris Wright; +Cc: dev

On Fri, Jun 06, 2014 at 12:47:48PM -0700, Chris Wright wrote:
> * John W. Linville (linville@tuxdriver.com) wrote:
> > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > socket.  The current implementation uses mmap'ed ring buffers to
> > limit copying and user/kernel transitions.  The intent is also to take
> > advantage of fanout and any future AF_PACKET optimizations as well.
> > 
> > This is intended to provide a means for using DPDK on a broad range
> > of hardware without hardware-specific PMDs and hopefully with better
> > performance than what PCAP offers in Linux.  This might be useful
> > as a development platform for DPDK applications when DPDK-supported
> > hardware is expensive or unavailable.
> 
> Nice, have you compared yet w/ PCAP numbers?

No, sorry -- definitely needs more testing, including performance numbers...

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [RFC] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-06-06 19:57   ` John W. Linville
@ 2014-06-06 20:04     ` Richardson, Bruce
  2014-06-06 20:06     ` Chris Wright
  1 sibling, 0 replies; 8+ messages in thread
From: Richardson, Bruce @ 2014-06-06 20:04 UTC (permalink / raw)
  To: John W. Linville, Chris Wright; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> Sent: Friday, June 06, 2014 12:57 PM
> To: Chris Wright
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [RFC] librte_pmd_packet: add PMD for AF_PACKET-
> based virtual devices
> 
> On Fri, Jun 06, 2014 at 12:47:48PM -0700, Chris Wright wrote:
> > * John W. Linville (linville@tuxdriver.com) wrote:
> > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > socket.  The current implementation uses mmap'ed ring buffers to
> > > limit copying and user/kernel transitions.  The intent is also to take
> > > advantage of fanout and any future AF_PACKET optimizations as well.
> > >
> > > This is intended to provide a means for using DPDK on a broad range
> > > of hardware without hardware-specific PMDs and hopefully with better
> > > performance than what PCAP offers in Linux.  This might be useful
> > > as a development platform for DPDK applications when DPDK-supported
> > > hardware is expensive or unavailable.
> >
> > Nice, have you compared yet w/ PCAP numbers?

Agreed. Nice to have generic PMD options for those NICs that don't have specially optimized PMDs.

> 
> No, sorry -- definitely needs more testing, including performance numbers...
> 
Looking forward to seeing those when you have them. :-)


* Re: [dpdk-dev] [RFC] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-06-06 19:57   ` John W. Linville
  2014-06-06 20:04     ` Richardson, Bruce
@ 2014-06-06 20:06     ` Chris Wright
  1 sibling, 0 replies; 8+ messages in thread
From: Chris Wright @ 2014-06-06 20:06 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

* John W. Linville (linville@tuxdriver.com) wrote:
> On Fri, Jun 06, 2014 at 12:47:48PM -0700, Chris Wright wrote:
> > * John W. Linville (linville@tuxdriver.com) wrote:
> > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > socket.  The current implementation uses mmap'ed ring buffers to
> > > limit copying and user/kernel transitions.  The intent is also to take
> > > advantage of fanout and any future AF_PACKET optimizations as well.
> > > 
> > > This is intended to provide a means for using DPDK on a broad range
> > > of hardware without hardware-specific PMDs and hopefully with better
> > > performance than what PCAP offers in Linux.  This might be useful
> > > as a development platform for DPDK applications when DPDK-supported
> > > hardware is expensive or unavailable.
> > 
> > Nice, have you compared yet w/ PCAP numbers?
> 
> No, sorry -- definitely needs more testing, including performance numbers...

No worries, just curious ;)


* Re: [dpdk-dev] [RFC] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-06-06 19:25 [dpdk-dev] [RFC] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices John W. Linville
  2014-06-06 19:47 ` Chris Wright
@ 2014-06-06 20:30 ` Neil Horman
  2014-06-06 20:36   ` John W. Linville
  1 sibling, 1 reply; 8+ messages in thread
From: Neil Horman @ 2014-06-06 20:30 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

On Fri, Jun 06, 2014 at 03:25:54PM -0400, John W. Linville wrote:
> This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> socket.  The current implementation uses mmap'ed ring buffers to
> limit copying and user/kernel transitions.  The intent is also to take
> advantage of fanout and any future AF_PACKET optimizations as well.
> 
> This is intended to provide a means for using DPDK on a broad range
> of hardware without hardware-specific PMDs and hopefully with better
> performance than what PCAP offers in Linux.  This might be useful
> as a development platform for DPDK applications when DPDK-supported
> hardware is expensive or unavailable.
> 
> Signed-off-by: John W. Linville <linville@tuxdriver.com>
> ---
> I've been toying with this for a while without a lot of progress.
> I was about to post the original RFC patch just as the PMD
> initialization flows got rewritten.  I set this down while that was
> settling-out, and only just recently got back to it.
> 
> Anyway, I figure it is better to get this out now and let people
> comment on it and/or get some use out of it if they can.  I have
> posted this as RFC as it has only had very limited testing locally
> and I'm sure it still could use some clean-ups and improvements
> (like parameterizing block/frame size/count).
> 
Looks pretty good.  I'll be interested to see how much better we can do over
standard pcap when we turn on the features like fanout and increased memory
sizing.


One thought: It's not a feature, but is there an advantage to making the transmit
batch size configurable?  e.g. how many packets you queue up for transmit in a
given memory buffer before calling send?  If you couple that with a timer, you
could trade off some initial latency for higher overall throughput, as it reduces
the number of syscall traps you have to make.
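
A rough sketch of the batch-plus-timer idea, purely for illustration.  Nothing
here comes from the patch: struct tx_batcher and the tx_batcher_*() functions
are hypothetical names, the flush routine only counts where the single send
call on the TX ring would go, and timestamps are supplied by the caller in
nanoseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical batching state; none of these names exist in the patch. */
struct tx_batcher {
	unsigned int pending;   /* frames queued in the ring, not yet sent */
	unsigned int threshold; /* flush once this many frames are pending */
	uint64_t max_delay_ns;  /* flush if the oldest frame waited this long */
	uint64_t first_ns;      /* enqueue time of the oldest pending frame */
	unsigned int flushes;   /* number of flushes, i.e. send calls made */
};

static void
tx_batcher_init(struct tx_batcher *b, unsigned int threshold,
		uint64_t max_delay_ns)
{
	assert(b != NULL && threshold > 0);
	b->pending = 0;
	b->threshold = threshold;
	b->max_delay_ns = max_delay_ns;
	b->first_ns = 0;
	b->flushes = 0;
}

/* Stand-in for the one syscall that kicks the kernel TX ring. */
static void
tx_batcher_flush(struct tx_batcher *b)
{
	if (b->pending == 0)
		return;
	b->flushes++;
	b->pending = 0;
}

/* Queue one frame at time now_ns; returns true if this call flushed. */
static bool
tx_batcher_enqueue(struct tx_batcher *b, uint64_t now_ns)
{
	if (b->pending == 0)
		b->first_ns = now_ns;
	b->pending++;
	if (b->pending >= b->threshold) {
		tx_batcher_flush(b);
		return true;
	}
	return false;
}

/* Timer hook: flush a partial batch whose oldest frame is too old. */
static bool
tx_batcher_poll(struct tx_batcher *b, uint64_t now_ns)
{
	if (b->pending > 0 && now_ns - b->first_ns >= b->max_delay_ns) {
		tx_batcher_flush(b);
		return true;
	}
	return false;
}
```

The threshold bounds how many frames share one syscall, and the timer bounds
how long a lone frame can sit in the ring when traffic is light.  Those are the
two knobs for the latency/throughput trade.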

Neil


* Re: [dpdk-dev] [RFC] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-06-06 20:30 ` Neil Horman
@ 2014-06-06 20:36   ` John W. Linville
  2014-06-06 20:51     ` Neil Horman
  0 siblings, 1 reply; 8+ messages in thread
From: John W. Linville @ 2014-06-06 20:36 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Fri, Jun 06, 2014 at 04:30:50PM -0400, Neil Horman wrote:
> On Fri, Jun 06, 2014 at 03:25:54PM -0400, John W. Linville wrote:
> > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > socket.  The current implementation uses mmap'ed ring buffers to
> > limit copying and user/kernel transitions.  The intent is also to take
> > advantage of fanout and any future AF_PACKET optimizations as well.
> > 
> > This is intended to provide a means for using DPDK on a broad range
> > of hardware without hardware-specific PMDs and hopefully with better
> > performance than what PCAP offers in Linux.  This might be useful
> > as a development platform for DPDK applications when DPDK-supported
> > hardware is expensive or unavailable.
> > 
> > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > ---
> > I've been toying with this for a while without a lot of progress.
> > I was about to post the original RFC patch just as the PMD
> > initialization flows got rewritten.  I set this down while that was
> > settling-out, and only just recently got back to it.
> > 
> > Anyway, I figure it is better to get this out now and let people
> > comment on it and/or get some use out of it if they can.  I have
> > posted this as RFC as it has only had very limited testing locally
> > and I'm sure it still could use some clean-ups and improvements
> > (like parameterizing block/frame size/count).
> > 
> Looks pretty good.  I'll be interested to see how much better we can do over
> standard pcap when we turn on the features like fanout and increased memory
> sizing.
> 
> 
> One thought: It's not a feature, but is there an advantage to making the transmit
> batch size configurable?  e.g. how many packets you queue up for transmit in a
> given memory buffer before calling send?  If you couple that with a timer, you
> could trade off some initial latency for higher overall throughput, as it reduces
> the number of syscall traps you have to make.

Sure.  For now, that is gated on the number of packets passed to the
transmit function.  But I gather you are thinking of a bigger number
that the PMD would manage across multiple transmit batches?  In concept
that is similar to how 802.11n bundles frames into aggregates to reduce
the cost of contending for the media.  As you say, latency suffers,
but throughput can be improved.

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [dpdk-dev] [RFC] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
  2014-06-06 20:36   ` John W. Linville
@ 2014-06-06 20:51     ` Neil Horman
  0 siblings, 0 replies; 8+ messages in thread
From: Neil Horman @ 2014-06-06 20:51 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

On Fri, Jun 06, 2014 at 04:36:11PM -0400, John W. Linville wrote:
> On Fri, Jun 06, 2014 at 04:30:50PM -0400, Neil Horman wrote:
> > On Fri, Jun 06, 2014 at 03:25:54PM -0400, John W. Linville wrote:
> > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > socket.  The current implementation uses mmap'ed ring buffers to
> > > limit copying and user/kernel transitions.  The intent is also to take
> > > advantage of fanout and any future AF_PACKET optimizations as well.
> > > 
> > > This is intended to provide a means for using DPDK on a broad range
> > > of hardware without hardware-specific PMDs and hopefully with better
> > > performance than what PCAP offers in Linux.  This might be useful
> > > as a development platform for DPDK applications when DPDK-supported
> > > hardware is expensive or unavailable.
> > > 
> > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > ---
> > > I've been toying with this for a while without a lot of progress.
> > > I was about to post the original RFC patch just as the PMD
> > > initialization flows got rewritten.  I set this down while that was
> > > settling-out, and only just recently got back to it.
> > > 
> > > Anyway, I figure it is better to get this out now and let people
> > > comment on it and/or get some use out of it if they can.  I have
> > > posted this as RFC as it has only had very limited testing locally
> > > and I'm sure it still could use some clean-ups and improvements
> > > (like parameterizing block/frame size/count).
> > > 
> > Looks pretty good.  I'll be interested to see how much better we can do over
> > standard pcap when we turn on the features like fanout and increased memory
> > sizing.
> > 
> > 
> > One thought: It's not a feature, but is there an advantage to making the transmit
> > batch size configurable?  e.g. how many packets you queue up for transmit in a
> > given memory buffer before calling send?  If you couple that with a timer, you
> > could trade off some initial latency for higher overall throughput, as it reduces
> > the number of syscall traps you have to make.
> 
> Sure.  For now, that is gated on the number of packets passed to the
> transmit function.  But I gather you are thinking of a bigger number
> that the PMD would manage across multiple transmit batches?  In concept
> that is similar to how 802.11n bundles frames into aggregates to reduce
> the cost of contending for the media.  As you say, latency suffers,
> but throughput can be improved.
> 
Exactly, yes.  Not sure if it's helpful here, but it might be good for future
consideration.
Neil

> John
> -- 
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.
> 

