DPDK patches and discussions
* [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
@ 2018-02-27  9:32 Qi Zhang
  2018-02-27  9:33 ` [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:32 UTC
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

This RFC patchset adds a new PMD for AF_XDP, a proposed faster
alternative to the AF_PACKET interface in Linux. See the links below
for an introduction to AF_XDP:
https://fosdem.org/2018/schedule/event/af_xdp/
https://lwn.net/Articles/745934/

This patchset is based on v18.02.
It also requires a Linux kernel with the following AF_XDP RFC patches
applied:
https://patchwork.ozlabs.org/patch/867961/
https://patchwork.ozlabs.org/patch/867960/
https://patchwork.ozlabs.org/patch/867938/
https://patchwork.ozlabs.org/patch/867939/
https://patchwork.ozlabs.org/patch/867940/
https://patchwork.ozlabs.org/patch/867941/
https://patchwork.ozlabs.org/patch/867942/
https://patchwork.ozlabs.org/patch/867943/
https://patchwork.ozlabs.org/patch/867944/
https://patchwork.ozlabs.org/patch/867945/
https://patchwork.ozlabs.org/patch/867946/
https://patchwork.ozlabs.org/patch/867947/
https://patchwork.ozlabs.org/patch/867948/
https://patchwork.ozlabs.org/patch/867949/
https://patchwork.ozlabs.org/patch/867950/
https://patchwork.ozlabs.org/patch/867951/
https://patchwork.ozlabs.org/patch/867952/
https://patchwork.ozlabs.org/patch/867953/
https://patchwork.ozlabs.org/patch/867954/
https://patchwork.ozlabs.org/patch/867955/
https://patchwork.ozlabs.org/patch/867956/
https://patchwork.ozlabs.org/patch/867957/
https://patchwork.ozlabs.org/patch/867958/
https://patchwork.ozlabs.org/patch/867959/

There is no clean upstream target yet, since the kernel patches are
still at the RFC stage. The purpose of this patchset is to let anyone
who wants to evaluate AF_XDP with a DPDK application do so, and to
gather feedback for further improvement.

To try the new PMD:
1. Compile and install a kernel with the above patches applied.
2. Set $LINUX_HEADER_DIR (the output directory of "make headers_install")
   and $TOOLS_DIR (<kernel_src>/tools) in drivers/net/af_xdp/Makefile
   before compiling DPDK.
3. Make sure libelf and libbpf are installed.
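
For reference, the port itself is then created through the usual --vdev
mechanism. The sketch below is not part of the patchset; it only formats
a devargs string from the three keys that patch 1/7 registers (iface,
queue, ringsz) under the driver name net_af_xdp. The helper name
format_af_xdp_vdev is hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not in the patchset): format the value that
 * follows the EAL --vdev option for the AF_XDP PMD.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int
format_af_xdp_vdev(char *out, size_t out_len, const char *iface,
		   int queue, int ring_size)
{
	int n = snprintf(out, out_len,
			 "net_af_xdp,iface=%s,queue=%d,ringsz=%d",
			 iface, queue, ring_size);

	/* snprintf() reports the length it wanted; detect truncation. */
	return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

For example, format_af_xdp_vdev(arg, sizeof(arg), "eth0", 0, 1024) fills
arg with a string that could follow --vdev on a testpmd command line.
"eth0" is only a placeholder; 0 and 1024 match the queue and ring size
defaults in patch 1/7.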

BTW, performance tests show that our PMD reaches 94% to 98% of the
original benchmark when shared memory is enabled.

Qi Zhang (7):
  net/af_xdp: new PMD driver
  lib/mbuf: enable parse flags when create mempool
  lib/mempool: allow page size aligned mempool
  net/af_xdp: use mbuf mempool for buffer management
  net/af_xdp: enable share mempool
  net/af_xdp: load BPF file
  app/testpmd: enable parameter for mempool flags

 app/test-pmd/parameters.c                     |  12 +
 app/test-pmd/testpmd.c                        |  15 +-
 app/test-pmd/testpmd.h                        |   1 +
 config/common_base                            |   5 +
 config/common_linuxapp                        |   1 +
 drivers/net/Makefile                          |   1 +
 drivers/net/af_xdp/Makefile                   |  60 ++
 drivers/net/af_xdp/bpf_load.c                 | 798 +++++++++++++++++++++++
 drivers/net/af_xdp/bpf_load.h                 |  65 ++
 drivers/net/af_xdp/libbpf.h                   | 199 ++++++
 drivers/net/af_xdp/meson.build                |   7 +
 drivers/net/af_xdp/rte_eth_af_xdp.c           | 878 ++++++++++++++++++++++++++
 drivers/net/af_xdp/rte_pmd_af_xdp_version.map |   4 +
 drivers/net/af_xdp/xdpsock_queue.h            |  62 ++
 lib/librte_mbuf/rte_mbuf.c                    |  15 +-
 lib/librte_mbuf/rte_mbuf.h                    |   8 +-
 lib/librte_mempool/rte_mempool.c              |   2 +
 lib/librte_mempool/rte_mempool.h              |   1 +
 mk/rte.app.mk                                 |   1 +
 19 files changed, 2125 insertions(+), 10 deletions(-)
 create mode 100644 drivers/net/af_xdp/Makefile
 create mode 100644 drivers/net/af_xdp/bpf_load.c
 create mode 100644 drivers/net/af_xdp/bpf_load.h
 create mode 100644 drivers/net/af_xdp/libbpf.h
 create mode 100644 drivers/net/af_xdp/meson.build
 create mode 100644 drivers/net/af_xdp/rte_eth_af_xdp.c
 create mode 100644 drivers/net/af_xdp/rte_pmd_af_xdp_version.map
 create mode 100644 drivers/net/af_xdp/xdpsock_queue.h

-- 
2.13.6


* [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-27  9:32 [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-02-28 23:40   ` Stephen Hemminger
                     ` (3 more replies)
  2018-02-27  9:33 ` [dpdk-dev] [RFC 2/7] lib/mbuf: enable parse flags when create mempool Qi Zhang
                   ` (6 subsequent siblings)
  7 siblings, 4 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

This is the vanilla version.
Packet data is copied between the AF_XDP memory buffer and the mbuf
mempool. Memory buffer indexes are managed by a simple FIFO ring.
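
The copy scheme can be pictured with the stand-alone sketch below. It is
a simplified model, not the driver code: a plain array-backed FIFO
stands in for the rte_ring, a static buffer stands in for the registered
umem area, and the names frame_pool, pool_get, pool_put, frame_addr and
copy_to_frame are made up for illustration. Only the 2048-byte frame
size and the shift-based address computation are taken from the patch.

```c
#include <stdint.h>
#include <string.h>

#define FRAME_SIZE_LOG2	11	/* 2048-byte frames, as in the patch */
#define FRAME_SIZE	(1u << FRAME_SIZE_LOG2)
#define NUM_FRAMES	8	/* tiny pool, for illustration only */

/* Simplified stand-in for the umem area plus the FIFO of free indexes. */
struct frame_pool {
	uint8_t buffer[NUM_FRAMES * FRAME_SIZE];
	uint32_t fifo[NUM_FRAMES];
	unsigned int head;	/* next slot to dequeue from */
	unsigned int tail;	/* next slot to enqueue to */
	unsigned int count;	/* number of free indexes held */
};

static void
pool_init(struct frame_pool *p)
{
	uint32_t i;

	memset(p, 0, sizeof(*p));
	for (i = 0; i < NUM_FRAMES; i++)
		p->fifo[i] = i;		/* every frame starts out free */
	p->count = NUM_FRAMES;
}

/* Take the oldest free frame index; -1 if none is left. */
static int
pool_get(struct frame_pool *p, uint32_t *idx)
{
	if (p->count == 0)
		return -1;
	*idx = p->fifo[p->head];
	p->head = (p->head + 1) % NUM_FRAMES;
	p->count--;
	return 0;
}

/* Give a frame index back; -1 if the FIFO is already full. */
static int
pool_put(struct frame_pool *p, uint32_t idx)
{
	if (p->count == NUM_FRAMES)
		return -1;
	p->fifo[p->tail] = idx;
	p->tail = (p->tail + 1) % NUM_FRAMES;
	p->count++;
	return 0;
}

/* Same arithmetic as get_pkt_data(): base + (index << log2) + offset. */
static void *
frame_addr(struct frame_pool *p, uint32_t idx, uint32_t offset)
{
	return p->buffer + ((size_t)idx << FRAME_SIZE_LOG2) + offset;
}

/* TX-style copy: take a free frame and copy the packet into it. */
static int
copy_to_frame(struct frame_pool *p, const void *pkt, uint32_t len,
	      uint32_t *idx)
{
	if (len > FRAME_SIZE || pool_get(p, idx) != 0)
		return -1;
	memcpy(frame_addr(p, *idx, 0), pkt, len);
	return 0;
}
```

The RX direction is the mirror image: the frame at the received index is
copied out into a freshly allocated mbuf, and the index then goes back
to the FIFO with the equivalent of pool_put().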

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 config/common_base                            |   5 +
 config/common_linuxapp                        |   1 +
 drivers/net/Makefile                          |   1 +
 drivers/net/af_xdp/Makefile                   |  56 ++
 drivers/net/af_xdp/meson.build                |   7 +
 drivers/net/af_xdp/rte_eth_af_xdp.c           | 763 ++++++++++++++++++++++++++
 drivers/net/af_xdp/rte_pmd_af_xdp_version.map |   4 +
 drivers/net/af_xdp/xdpsock_queue.h            |  62 +++
 mk/rte.app.mk                                 |   1 +
 9 files changed, 900 insertions(+)
 create mode 100644 drivers/net/af_xdp/Makefile
 create mode 100644 drivers/net/af_xdp/meson.build
 create mode 100644 drivers/net/af_xdp/rte_eth_af_xdp.c
 create mode 100644 drivers/net/af_xdp/rte_pmd_af_xdp_version.map
 create mode 100644 drivers/net/af_xdp/xdpsock_queue.h

diff --git a/config/common_base b/config/common_base
index ad03cf433..84b7b3b7e 100644
--- a/config/common_base
+++ b/config/common_base
@@ -368,6 +368,11 @@ CONFIG_RTE_LIBRTE_VMXNET3_DEBUG_TX_FREE=n
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=n
 
 #
+# Compile software PMD backed by AF_XDP sockets (Linux only)
+#
+CONFIG_RTE_LIBRTE_PMD_AF_XDP=n
+
+#
 # Compile link bonding PMD library
 #
 CONFIG_RTE_LIBRTE_PMD_BOND=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index ff98f2355..3b10695b6 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -16,6 +16,7 @@ CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
+CONFIG_RTE_LIBRTE_PMD_AF_XDP=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
 CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index e1127326b..409234ac3 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -9,6 +9,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD),d)
 endif
 
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET) += af_packet
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += af_xdp
 DIRS-$(CONFIG_RTE_LIBRTE_ARK_PMD) += ark
 DIRS-$(CONFIG_RTE_LIBRTE_AVF_PMD) += avf
 DIRS-$(CONFIG_RTE_LIBRTE_AVP_PMD) += avp
diff --git a/drivers/net/af_xdp/Makefile b/drivers/net/af_xdp/Makefile
new file mode 100644
index 000000000..ac38e20bf
--- /dev/null
+++ b/drivers/net/af_xdp/Makefile
@@ -0,0 +1,56 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
+#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+#   Copyright(c) 2014 6WIND S.A.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_pmd_af_xdp.a
+
+EXPORT_MAP := rte_pmd_af_xdp_version.map
+
+LIBABIVER := 1
+
+CFLAGS += -O3 -I/opt/af_xdp/linux_headers/include
+CFLAGS += $(WERROR_FLAGS)
+LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
+LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs
+LDLIBS += -lrte_bus_vdev
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += rte_eth_af_xdp.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/af_xdp/meson.build b/drivers/net/af_xdp/meson.build
new file mode 100644
index 000000000..4b5299c8e
--- /dev/null
+++ b/drivers/net/af_xdp/meson.build
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2017 Intel Corporation
+
+if host_machine.system() != 'linux'
+	build = false
+endif
+sources = files('rte_eth_af_xdp.c')
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
new file mode 100644
index 000000000..4eb8a2c28
--- /dev/null
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -0,0 +1,763 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
+ * Originally based upon librte_pmd_pcap code:
+ * Copyright(c) 2010-2015 Intel Corporation.
+ * Copyright(c) 2014 6WIND S.A.
+ * All rights reserved.
+ */
+
+#include <rte_mbuf.h>
+#include <rte_ethdev_driver.h>
+#include <rte_ethdev_vdev.h>
+#include <rte_malloc.h>
+#include <rte_kvargs.h>
+#include <rte_bus_vdev.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_xdp.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <poll.h>
+#include "xdpsock_queue.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+#ifndef AF_XDP
+#define AF_XDP 44
+#endif
+
+#ifndef PF_XDP
+#define PF_XDP AF_XDP
+#endif
+
+#define ETH_AF_XDP_IFACE_ARG		"iface"
+#define ETH_AF_XDP_QUEUE_IDX_ARG	"queue"
+#define ETH_AF_XDP_RING_SIZE_ARG	"ringsz"
+
+#define ETH_AF_XDP_FRAME_SIZE		2048
+#define ETH_AF_XDP_NUM_BUFFERS		131072
+#define ETH_AF_XDP_DATA_HEADROOM	0
+#define ETH_AF_XDP_DFLT_RING_SIZE	1024
+#define ETH_AF_XDP_DFLT_QUEUE_IDX	0
+
+#define ETH_AF_XDP_RX_BATCH_SIZE	32
+#define ETH_AF_XDP_TX_BATCH_SIZE	32
+
+struct xdp_umem {
+	char *buffer;
+	size_t size;
+	unsigned int frame_size;
+	unsigned int frame_size_log2;
+	unsigned int nframes;
+	int mr_fd;
+};
+
+struct pmd_internals {
+	int sfd;
+	int if_index;
+	char if_name[0x100];
+	struct ether_addr eth_addr;
+	struct xdp_queue rx;
+	struct xdp_queue tx;
+	struct xdp_umem *umem;
+	struct rte_mempool *mb_pool;
+
+	unsigned long rx_pkts;
+	unsigned long rx_bytes;
+	unsigned long rx_dropped;
+
+	unsigned long tx_pkts;
+	unsigned long err_pkts;
+	unsigned long tx_bytes;
+
+	uint16_t port_id;
+	uint16_t queue_idx;
+	int ring_size;
+	struct rte_ring *buf_ring;
+};
+
+static const char * const valid_arguments[] = {
+	ETH_AF_XDP_IFACE_ARG,
+	ETH_AF_XDP_QUEUE_IDX_ARG,
+	ETH_AF_XDP_RING_SIZE_ARG,
+	NULL
+};
+
+static struct rte_eth_link pmd_link = {
+	.link_speed = ETH_SPEED_NUM_10G,
+	.link_duplex = ETH_LINK_FULL_DUPLEX,
+	.link_status = ETH_LINK_DOWN,
+	.link_autoneg = ETH_LINK_AUTONEG
+};
+
+static void *get_pkt_data(struct pmd_internals *internals,
+			  uint32_t index,
+			  uint32_t offset)
+{
+	return (uint8_t *)(internals->umem->buffer +
+			   (index << internals->umem->frame_size_log2) +
+			   offset);
+}
+
+static uint16_t
+eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct pmd_internals *internals = queue;
+	struct xdp_queue *rxq = &internals->rx;
+	struct rte_mbuf *mbuf;
+	unsigned long dropped = 0;
+	unsigned long rx_bytes = 0;
+	uint16_t count = 0;
+
+	nb_pkts = nb_pkts < ETH_AF_XDP_RX_BATCH_SIZE ?
+		  nb_pkts : ETH_AF_XDP_RX_BATCH_SIZE;
+
+	struct xdp_desc descs[ETH_AF_XDP_RX_BATCH_SIZE];
+	void *indexes[ETH_AF_XDP_RX_BATCH_SIZE];
+	int rcvd, i;
+	/* fill rx ring */
+	if (rxq->num_free >= ETH_AF_XDP_RX_BATCH_SIZE) {
+		int n = rte_ring_dequeue_bulk(internals->buf_ring,
+					      indexes,
+					      ETH_AF_XDP_RX_BATCH_SIZE,
+					      NULL);
+		for (i = 0; i < n; i++)
+			descs[i].idx = (uint32_t)((long int)indexes[i]);
+		xq_enq(rxq, descs, n);
+	}
+
+	/* read data */
+	rcvd = xq_deq(rxq, descs, nb_pkts);
+	if (rcvd == 0)
+		return 0;
+
+	for (i = 0; i < rcvd; i++) {
+		char *pkt;
+		uint32_t idx = descs[i].idx;
+
+		mbuf = rte_pktmbuf_alloc(internals->mb_pool);
+		if (mbuf) {
+			/* set lengths only after the NULL check */
+			rte_pktmbuf_pkt_len(mbuf) = descs[i].len;
+			rte_pktmbuf_data_len(mbuf) = descs[i].len;
+			pkt = get_pkt_data(internals, idx, descs[i].offset);
+			memcpy(rte_pktmbuf_mtod(mbuf, void *),
+			       pkt, descs[i].len);
+			rx_bytes += descs[i].len;
+			bufs[count++] = mbuf;
+		} else {
+			dropped++;
+		}
+		indexes[i] = (void *)((long int)idx);
+	}
+
+	rte_ring_enqueue_bulk(internals->buf_ring, indexes, rcvd, NULL);
+
+	internals->rx_pkts += (rcvd - dropped);
+	internals->rx_bytes += rx_bytes;
+	internals->rx_dropped += dropped;
+
+	return count;
+}
+
+static void kick_tx(int fd)
+{
+	int ret;
+
+	for (;;) {
+		ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+		if (ret >= 0 || errno == ENOBUFS)
+			return;
+		if (errno != EAGAIN)
+			return; /* unexpected error: stop instead of spinning */
+	}
+}
+
+static uint16_t
+eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct pmd_internals *internals = queue;
+	struct xdp_queue *txq = &internals->tx;
+	struct rte_mbuf *mbuf;
+	struct xdp_desc descs[ETH_AF_XDP_TX_BATCH_SIZE];
+	void *indexes[ETH_AF_XDP_TX_BATCH_SIZE];
+	uint16_t i, valid;
+	unsigned long tx_bytes = 0;
+
+	nb_pkts = nb_pkts < ETH_AF_XDP_TX_BATCH_SIZE ?
+		  nb_pkts : ETH_AF_XDP_TX_BATCH_SIZE;
+
+	if (txq->num_free < ETH_AF_XDP_TX_BATCH_SIZE * 2) {
+		int n = xq_deq(txq, descs, ETH_AF_XDP_TX_BATCH_SIZE);
+
+		for (i = 0; i < n; i++)
+			indexes[i] = (void *)((long int)descs[i].idx);
+		rte_ring_enqueue_bulk(internals->buf_ring, indexes, n, NULL);
+	}
+
+	nb_pkts = nb_pkts > txq->num_free ? txq->num_free : nb_pkts;
+	nb_pkts = rte_ring_dequeue_bulk(internals->buf_ring, indexes,
+					nb_pkts, NULL);
+
+	valid = 0;
+	for (i = 0; i < nb_pkts; i++) {
+		char *pkt;
+		unsigned int buf_len =
+			internals->umem->frame_size - ETH_AF_XDP_DATA_HEADROOM;
+		mbuf = bufs[i];
+		if (mbuf->pkt_len <= buf_len) {
+			descs[valid].idx = (uint32_t)((long int)indexes[valid]);
+			descs[valid].offset = ETH_AF_XDP_DATA_HEADROOM;
+			descs[valid].flags = 0;
+			descs[valid].len = mbuf->pkt_len;
+			pkt = get_pkt_data(internals, descs[valid].idx,
+					   descs[valid].offset);
+			memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *),
+			       descs[valid].len);
+			valid++;
+			tx_bytes += mbuf->pkt_len;
+		}
+		rte_pktmbuf_free(mbuf);
+	}
+
+	xq_enq(txq, descs, valid);
+	kick_tx(internals->sfd);
+
+	if (valid < nb_pkts)
+		rte_ring_enqueue_bulk(internals->buf_ring, &indexes[valid],
+				      nb_pkts - valid, NULL);
+
+	internals->err_pkts += (nb_pkts - valid);
+	internals->tx_pkts += valid;
+	internals->tx_bytes += tx_bytes;
+
+	return valid;
+}
+
+static void
+fill_rx_desc(struct pmd_internals *internals)
+{
+	int num_free = internals->rx.num_free;
+	void *p = NULL;
+	int i;
+
+	for (i = 0; i < num_free; i++) {
+		struct xdp_desc desc = {};
+
+		rte_ring_dequeue(internals->buf_ring, &p);
+		desc.idx = (uint32_t)((long int)p);
+		xq_enq(&internals->rx, &desc, 1);
+	}
+}
+
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev->data->dev_link.link_status = ETH_LINK_UP;
+	fill_rx_desc(internals);
+
+	return 0;
+}
+
+/* This function gets called when the current port gets stopped. */
+static void
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+	dev->data->dev_link.link_status = ETH_LINK_DOWN;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev_info->if_index = internals->if_index;
+	dev_info->max_mac_addrs = 1;
+	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
+	dev_info->max_rx_queues = 1;
+	dev_info->max_tx_queues = 1;
+	dev_info->min_rx_bufsize = 0;
+}
+
+static int
+eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
+{
+	const struct pmd_internals *internal = dev->data->dev_private;
+
+	stats->ipackets = stats->q_ipackets[0] =
+		internal->rx_pkts;
+	stats->ibytes = stats->q_ibytes[0] =
+		internal->rx_bytes;
+	stats->imissed =
+		internal->rx_dropped;
+
+	stats->opackets = stats->q_opackets[0]
+		= internal->tx_pkts;
+	stats->oerrors = stats->q_errors[0] =
+		internal->err_pkts;
+	stats->obytes = stats->q_obytes[0] =
+		internal->tx_bytes;
+
+	return 0;
+}
+
+static void
+eth_stats_reset(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internal = dev->data->dev_private;
+
+	internal->rx_pkts = 0;
+	internal->rx_bytes = 0;
+	internal->rx_dropped = 0;
+
+	internal->tx_pkts = 0;
+	internal->err_pkts = 0;
+	internal->tx_bytes = 0;
+}
+
+static void
+eth_dev_close(struct rte_eth_dev *dev __rte_unused)
+{
+}
+
+static void
+eth_queue_release(void *q __rte_unused)
+{
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev __rte_unused,
+		int wait_to_complete __rte_unused)
+{
+	return 0;
+}
+
+static struct xdp_umem *xsk_alloc_and_mem_reg_buffers(int sfd, size_t nbuffers)
+{
+	struct xdp_mr_req req = { .frame_size = ETH_AF_XDP_FRAME_SIZE,
+				  .data_headroom = ETH_AF_XDP_DATA_HEADROOM };
+	struct xdp_umem *umem;
+	void *bufs;
+	int ret;
+
+	ret = posix_memalign((void **)&bufs, getpagesize(),
+			     nbuffers * req.frame_size);
+	if (ret)
+		return NULL;
+
+	umem = calloc(1, sizeof(*umem));
+	if (!umem) {
+		free(bufs);
+		return NULL;
+	}
+
+	req.addr = (unsigned long)bufs;
+	req.len = nbuffers * req.frame_size;
+	ret = setsockopt(sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
+	RTE_ASSERT(ret == 0);
+
+	umem->frame_size = ETH_AF_XDP_FRAME_SIZE;
+	umem->frame_size_log2 = 11;
+	umem->buffer = bufs;
+	umem->size = nbuffers * req.frame_size;
+	umem->nframes = nbuffers;
+	umem->mr_fd = sfd;
+
+	return umem;
+}
+
+static int
+xdp_configure(struct pmd_internals *internals)
+{
+	struct sockaddr_xdp sxdp;
+	struct xdp_ring_req req;
+	char ring_name[0x100];
+	int ret = 0;
+	long int i;
+
+	snprintf(ring_name, 0x100, "%s_%s_%d", "af_xdp_ring",
+		 internals->if_name, internals->queue_idx);
+	internals->buf_ring = rte_ring_create(ring_name,
+					      ETH_AF_XDP_NUM_BUFFERS,
+					      SOCKET_ID_ANY,
+					      0x0);
+	if (!internals->buf_ring)
+		return -1;
+
+	for (i = 0; i < ETH_AF_XDP_NUM_BUFFERS; i++)
+		rte_ring_enqueue(internals->buf_ring, (void *)i);
+
+	internals->umem = xsk_alloc_and_mem_reg_buffers(internals->sfd,
+							ETH_AF_XDP_NUM_BUFFERS);
+	if (!internals->umem)
+		goto error;
+
+	req.mr_fd = internals->umem->mr_fd;
+	req.desc_nr = internals->ring_size;
+
+	ret = setsockopt(internals->sfd, SOL_XDP, XDP_RX_RING,
+			 &req, sizeof(req));
+
+	RTE_ASSERT(ret == 0);
+
+	ret = setsockopt(internals->sfd, SOL_XDP, XDP_TX_RING,
+			 &req, sizeof(req));
+
+	RTE_ASSERT(ret == 0);
+
+	internals->rx.ring = mmap(0, req.desc_nr * sizeof(struct xdp_desc),
+				  PROT_READ | PROT_WRITE,
+				  MAP_SHARED | MAP_LOCKED | MAP_POPULATE,
+				  internals->sfd,
+				  XDP_PGOFF_RX_RING);
+	RTE_ASSERT(internals->rx.ring != MAP_FAILED);
+
+	internals->rx.num_free = req.desc_nr;
+	internals->rx.ring_mask = req.desc_nr - 1;
+
+	internals->tx.ring = mmap(0, req.desc_nr * sizeof(struct xdp_desc),
+				  PROT_READ | PROT_WRITE,
+				  MAP_SHARED | MAP_LOCKED | MAP_POPULATE,
+				  internals->sfd,
+				  XDP_PGOFF_TX_RING);
+	RTE_ASSERT(internals->tx.ring != MAP_FAILED);
+
+	internals->tx.num_free = req.desc_nr;
+	internals->tx.ring_mask = req.desc_nr - 1;
+
+	sxdp.sxdp_family = PF_XDP;
+	sxdp.sxdp_ifindex = internals->if_index;
+	sxdp.sxdp_queue_id = internals->queue_idx;
+
+	ret = bind(internals->sfd, (struct sockaddr *)&sxdp, sizeof(sxdp));
+	RTE_ASSERT(ret == 0);
+
+	return ret;
+error:
+	rte_ring_free(internals->buf_ring);
+	internals->buf_ring = NULL;
+	return -1;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev,
+		   uint16_t rx_queue_id,
+		   uint16_t nb_rx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_rxconf *rx_conf __rte_unused,
+		   struct rte_mempool *mb_pool)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	unsigned int buf_size, data_size;
+
+	RTE_ASSERT(rx_queue_id == 0);
+	internals->mb_pool = mb_pool;
+	xdp_configure(internals);
+
+	/* Now get the space available for data in the mbuf */
+	buf_size = rte_pktmbuf_data_room_size(internals->mb_pool) -
+		RTE_PKTMBUF_HEADROOM;
+	data_size = internals->umem->frame_size;
+
+	if (data_size > buf_size) {
+		RTE_LOG(ERR, PMD,
+			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
+			dev->device->name, data_size, buf_size);
+		return -ENOMEM;
+	}
+
+	dev->data->rx_queues[rx_queue_id] = internals;
+	return 0;
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev,
+		   uint16_t tx_queue_id,
+		   uint16_t nb_tx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_txconf *tx_conf __rte_unused)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	RTE_ASSERT(tx_queue_id == 0);
+	dev->data->tx_queues[tx_queue_id] = internals;
+	return 0;
+}
+
+static int
+eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	struct ifreq ifr = { .ifr_mtu = mtu };
+	int ret;
+	int s;
+
+	s = socket(PF_INET, SOCK_DGRAM, 0);
+	if (s < 0)
+		return -EINVAL;
+
+	snprintf(ifr.ifr_name, IFNAMSIZ, "%s", internals->if_name);
+	ret = ioctl(s, SIOCSIFMTU, &ifr);
+	close(s);
+
+	if (ret < 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void
+eth_dev_change_flags(char *if_name, uint32_t flags, uint32_t mask)
+{
+	struct ifreq ifr;
+	int s;
+
+	s = socket(PF_INET, SOCK_DGRAM, 0);
+	if (s < 0)
+		return;
+
+	snprintf(ifr.ifr_name, IFNAMSIZ, "%s", if_name);
+	if (ioctl(s, SIOCGIFFLAGS, &ifr) < 0)
+		goto out;
+	ifr.ifr_flags &= mask;
+	ifr.ifr_flags |= flags;
+	if (ioctl(s, SIOCSIFFLAGS, &ifr) < 0)
+		goto out;
+out:
+	close(s);
+}
+
+static void
+eth_dev_promiscuous_enable(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	eth_dev_change_flags(internals->if_name, IFF_PROMISC, ~0);
+}
+
+static void
+eth_dev_promiscuous_disable(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	eth_dev_change_flags(internals->if_name, 0, ~IFF_PROMISC);
+}
+
+static const struct eth_dev_ops ops = {
+	.dev_start = eth_dev_start,
+	.dev_stop = eth_dev_stop,
+	.dev_close = eth_dev_close,
+	.dev_configure = eth_dev_configure,
+	.dev_infos_get = eth_dev_info,
+	.mtu_set = eth_dev_mtu_set,
+	.promiscuous_enable = eth_dev_promiscuous_enable,
+	.promiscuous_disable = eth_dev_promiscuous_disable,
+	.rx_queue_setup = eth_rx_queue_setup,
+	.tx_queue_setup = eth_tx_queue_setup,
+	.rx_queue_release = eth_queue_release,
+	.tx_queue_release = eth_queue_release,
+	.link_update = eth_link_update,
+	.stats_get = eth_stats_get,
+	.stats_reset = eth_stats_reset,
+};
+
+static struct rte_vdev_driver pmd_af_xdp_drv;
+
+static void
+parse_parameters(struct rte_kvargs *kvlist,
+		 char **if_name,
+		 int *queue_idx,
+		 int *ring_size)
+{
+	struct rte_kvargs_pair *pair = NULL;
+	unsigned int k_idx;
+
+	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
+		pair = &kvlist->pairs[k_idx];
+		if (strstr(pair->key, ETH_AF_XDP_IFACE_ARG))
+			*if_name = pair->value;
+		else if (strstr(pair->key, ETH_AF_XDP_QUEUE_IDX_ARG))
+			*queue_idx = atoi(pair->value);
+		else if (strstr(pair->key, ETH_AF_XDP_RING_SIZE_ARG))
+			*ring_size = atoi(pair->value);
+	}
+}
+
+static int
+get_iface_info(const char *if_name,
+	       struct ether_addr *eth_addr,
+	       int *if_index)
+{
+	struct ifreq ifr;
+	int sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
+
+	if (sock < 0)
+		return -1;
+
+	snprintf(ifr.ifr_name, IFNAMSIZ, "%s", if_name);
+	if (ioctl(sock, SIOCGIFINDEX, &ifr))
+		goto error;
+	*if_index = ifr.ifr_ifindex;
+
+	if (ioctl(sock, SIOCGIFHWADDR, &ifr))
+		goto error;
+
+	memcpy(eth_addr, ifr.ifr_hwaddr.sa_data, 6);
+
+	close(sock);
+	return 0;
+
+error:
+	close(sock);
+	return -1;
+}
+
+static int
+init_internals(struct rte_vdev_device *dev,
+	       const char *if_name,
+	       int queue_idx,
+	       int ring_size)
+{
+	const char *name = rte_vdev_device_name(dev);
+	struct rte_eth_dev *eth_dev = NULL;
+	struct rte_eth_dev_data *data = NULL;
+	const unsigned int numa_node = dev->device.numa_node;
+	struct pmd_internals *internals = NULL;
+	int ret;
+
+	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
+	if (!data)
+		return -1;
+
+	internals = rte_zmalloc_socket(name, sizeof(*internals), 0, numa_node);
+	if (!internals)
+		goto error_1;
+
+	internals->queue_idx = queue_idx;
+	internals->ring_size = ring_size;
+	strcpy(internals->if_name, if_name);
+	internals->sfd = socket(PF_XDP, SOCK_RAW, 0);
+	if (internals->sfd < 0)
+		goto error_2;
+
+	ret = get_iface_info(if_name, &internals->eth_addr,
+			     &internals->if_index);
+	if (ret)
+		goto error_3;
+
+	eth_dev = rte_eth_vdev_allocate(dev, 0);
+	if (!eth_dev)
+		goto error_3;
+
+	rte_memcpy(data, eth_dev->data, sizeof(*data));
+	internals->port_id = eth_dev->data->port_id;
+	data->dev_private = internals;
+	data->nb_rx_queues = 1;
+	data->nb_tx_queues = 1;
+	data->dev_link = pmd_link;
+	data->mac_addrs = &internals->eth_addr;
+
+	eth_dev->data = data;
+	eth_dev->dev_ops = &ops;
+
+	eth_dev->rx_pkt_burst = eth_af_xdp_rx;
+	eth_dev->tx_pkt_burst = eth_af_xdp_tx;
+
+	return 0;
+
+error_3:
+	close(internals->sfd);
+
+error_2:
+	rte_free(internals);
+
+error_1:
+	rte_free(data);
+	return -1;
+}
+
+static int
+rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
+{
+	struct rte_kvargs *kvlist;
+	char *if_name = NULL;
+	int ring_size = ETH_AF_XDP_DFLT_RING_SIZE;
+	int queue_idx = ETH_AF_XDP_DFLT_QUEUE_IDX;
+	int ret;
+
+	RTE_LOG(INFO, PMD, "Initializing pmd_af_xdp for %s\n",
+		rte_vdev_device_name(dev));
+
+	kvlist = rte_kvargs_parse(rte_vdev_device_args(dev), valid_arguments);
+	if (!kvlist) {
+		RTE_LOG(ERR, PMD,
+			"Invalid kvargs");
+		return -1;
+	}
+
+	if (dev->device.numa_node == SOCKET_ID_ANY)
+		dev->device.numa_node = rte_socket_id();
+
+	parse_parameters(kvlist, &if_name, &queue_idx, &ring_size);
+
+	ret = init_internals(dev, if_name, queue_idx, ring_size);
+	rte_kvargs_free(kvlist);
+
+	return ret;
+}
+
+static int
+rte_pmd_af_xdp_remove(struct rte_vdev_device *dev)
+{
+	struct rte_eth_dev *eth_dev = NULL;
+	struct pmd_internals *internals;
+
+	RTE_LOG(INFO, PMD, "Closing AF_XDP ethdev on numa socket %u\n",
+		rte_socket_id());
+
+	if (!dev)
+		return -1;
+
+	/* find the ethdev entry */
+	eth_dev = rte_eth_dev_allocated(rte_vdev_device_name(dev));
+	if (!eth_dev)
+		return -1;
+
+	internals = eth_dev->data->dev_private;
+	rte_ring_free(internals->buf_ring);
+	close(internals->sfd);
+	free(internals->umem); /* calloc()'d, so not rte_free() */
+	rte_free(internals);
+	rte_free(eth_dev->data);
+
+	rte_eth_dev_release_port(eth_dev);
+
+	return 0;
+}
+
+static struct rte_vdev_driver pmd_af_xdp_drv = {
+	.probe = rte_pmd_af_xdp_probe,
+	.remove = rte_pmd_af_xdp_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_af_xdp, pmd_af_xdp_drv);
+RTE_PMD_REGISTER_ALIAS(net_af_xdp, eth_af_xdp);
+RTE_PMD_REGISTER_PARAM_STRING(net_af_xdp,
+			      "iface=<string> "
+			      "queue=<int> "
+			      "ringsz=<int> ");
diff --git a/drivers/net/af_xdp/rte_pmd_af_xdp_version.map b/drivers/net/af_xdp/rte_pmd_af_xdp_version.map
new file mode 100644
index 000000000..ef3539840
--- /dev/null
+++ b/drivers/net/af_xdp/rte_pmd_af_xdp_version.map
@@ -0,0 +1,4 @@
+DPDK_2.0 {
+
+	local: *;
+};
diff --git a/drivers/net/af_xdp/xdpsock_queue.h b/drivers/net/af_xdp/xdpsock_queue.h
new file mode 100644
index 000000000..0dc666a08
--- /dev/null
+++ b/drivers/net/af_xdp/xdpsock_queue.h
@@ -0,0 +1,62 @@
+#ifndef __XDPSOCK_QUEUE_H
+#define __XDPSOCK_QUEUE_H
+
+static inline int xq_enq(struct xdp_queue *q,
+			 const struct xdp_desc *descs,
+			 unsigned int ndescs)
+{
+	unsigned int avail_idx = q->avail_idx;
+	unsigned int i;
+	int j;
+
+	if (q->num_free < ndescs)
+		return -ENOSPC;
+
+	q->num_free -= ndescs;
+
+	for (i = 0; i < ndescs; i++) {
+		unsigned int idx = avail_idx++ & q->ring_mask;
+
+		q->ring[idx].idx	= descs[i].idx;
+		q->ring[idx].len	= descs[i].len;
+		q->ring[idx].offset	= descs[i].offset;
+		q->ring[idx].error	= 0;
+	}
+	rte_smp_wmb();
+
+	for (j = ndescs - 1; j >= 0; j--) {
+		unsigned int idx = (q->avail_idx + j) & q->ring_mask;
+
+		q->ring[idx].flags = descs[j].flags | XDP_DESC_KERNEL;
+	}
+	q->avail_idx += ndescs;
+
+	return 0;
+}
+
+static inline int xq_deq(struct xdp_queue *q,
+			 struct xdp_desc *descs,
+			 int ndescs)
+{
+	unsigned int idx, last_used_idx = q->last_used_idx;
+	int i, entries = 0;
+
+	for (i = 0; i < ndescs; i++) {
+		idx = (last_used_idx++) & q->ring_mask;
+		if (q->ring[idx].flags & XDP_DESC_KERNEL)
+			break;
+		entries++;
+	}
+	q->num_free += entries;
+
+	rte_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = q->last_used_idx++ & q->ring_mask;
+		descs[i] = q->ring[idx];
+	}
+
+	return entries;
+}
+
+#endif /* __XDPSOCK_QUEUE_H */
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 3eb41d176..bc26e1457 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -120,6 +120,7 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
 _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -lrte_mempool_stack
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET)  += -lrte_pmd_af_packet
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP)     += -lrte_pmd_af_xdp
 _LDLIBS-$(CONFIG_RTE_LIBRTE_ARK_PMD)        += -lrte_pmd_ark
 _LDLIBS-$(CONFIG_RTE_LIBRTE_AVF_PMD)        += -lrte_pmd_avf
 _LDLIBS-$(CONFIG_RTE_LIBRTE_AVP_PMD)        += -lrte_pmd_avp
-- 
2.13.6
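
As a reading aid for xq_enq()/xq_deq() in xdpsock_queue.h above, the
stand-alone sketch below models the same ownership hand-off: the
producer writes the descriptor payload first, issues a write barrier,
and only then sets the "owned by kernel" flag; the consumer treats a
slot as returned once that flag is clear. Everything here is an
assumption made for illustration: the struct layouts, the DESC_KERNEL
value, the C11 fences used in place of rte_smp_wmb()/rte_smp_rmb(), the
fake_kernel_complete() test double, and the extra bound on outstanding
slots in deq(), which the patch does not have.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define DESC_KERNEL	0x1	/* stand-in for XDP_DESC_KERNEL */
#define RING_SIZE	4	/* must be a power of two */
#define RING_MASK	(RING_SIZE - 1)

struct desc {
	uint32_t idx;
	uint32_t len;
	uint8_t flags;
};

struct queue {
	struct desc ring[RING_SIZE];
	unsigned int avail_idx;		/* next slot handed to the kernel */
	unsigned int last_used_idx;	/* next slot expected back */
	unsigned int num_free;
};

static void
queue_init(struct queue *q)
{
	memset(q, 0, sizeof(*q));
	q->num_free = RING_SIZE;
}

/* Hand n descriptors over: payload first, ownership flag last. */
static int
enq(struct queue *q, const struct desc *d, unsigned int n)
{
	unsigned int i;

	if (q->num_free < n)
		return -1;
	for (i = 0; i < n; i++) {
		struct desc *slot = &q->ring[(q->avail_idx + i) & RING_MASK];

		slot->idx = d[i].idx;
		slot->len = d[i].len;
	}
	/* payload must be visible before the flag flips */
	atomic_thread_fence(memory_order_release);
	for (i = n; i-- > 0; )
		q->ring[(q->avail_idx + i) & RING_MASK].flags = DESC_KERNEL;
	q->avail_idx += n;
	q->num_free -= n;
	return 0;
}

/* Collect up to n descriptors whose ownership flag has been cleared. */
static unsigned int
deq(struct queue *q, struct desc *out, unsigned int n)
{
	unsigned int outstanding = RING_SIZE - q->num_free;
	unsigned int i, got = 0;

	if (n > outstanding)	/* sketch-only guard, not in the patch */
		n = outstanding;
	while (got < n) {
		if (q->ring[(q->last_used_idx + got) & RING_MASK].flags &
		    DESC_KERNEL)
			break;	/* still owned by the other side */
		got++;
	}
	/* read the flag before reading the payload */
	atomic_thread_fence(memory_order_acquire);
	for (i = 0; i < got; i++)
		out[i] = q->ring[(q->last_used_idx + i) & RING_MASK];
	q->last_used_idx += got;
	q->num_free += got;
	return got;
}

/* Test double for the kernel side: complete one slot. */
static void
fake_kernel_complete(struct queue *q, unsigned int slot, uint32_t len)
{
	q->ring[slot & RING_MASK].len = len;
	q->ring[slot & RING_MASK].flags = 0;
}
```

Setting the flags in reverse order, as the patch also does, means the
first slot of a batch is released last, so a consumer scanning from the
oldest slot sees either none of the batch or a fully written prefix.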


* [dpdk-dev] [RFC 2/7] lib/mbuf: enable parse flags when create mempool
  2018-02-27  9:32 [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Qi Zhang
  2018-02-27  9:33 ` [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-02-27  9:33 ` [dpdk-dev] [RFC 3/7] lib/mempool: allow page size aligned mempool Qi Zhang
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

This gives the application the option to configure each memory
chunk's size precisely (via MEMPOOL_F_NO_SPREAD).

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 lib/librte_mbuf/rte_mbuf.c | 15 ++++++++++++---
 lib/librte_mbuf/rte_mbuf.h |  8 +++++++-
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 091d388d3..5fd91c87c 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -125,7 +125,7 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 struct rte_mempool * __rte_experimental
 rte_pktmbuf_pool_create_by_ops(const char *name, unsigned int n,
 	unsigned int cache_size, uint16_t priv_size, uint16_t data_room_size,
-	int socket_id, const char *ops_name)
+	unsigned int flags, int socket_id, const char *ops_name)
 {
 	struct rte_mempool *mp;
 	struct rte_pktmbuf_pool_private mbp_priv;
@@ -145,7 +145,7 @@ rte_pktmbuf_pool_create_by_ops(const char *name, unsigned int n,
 	mbp_priv.mbuf_priv_size = priv_size;
 
 	mp = rte_mempool_create_empty(name, n, elt_size, cache_size,
-		 sizeof(struct rte_pktmbuf_pool_private), socket_id, 0);
+		 sizeof(struct rte_pktmbuf_pool_private), socket_id, flags);
 	if (mp == NULL)
 		return NULL;
 
@@ -179,9 +179,18 @@ rte_pktmbuf_pool_create(const char *name, unsigned int n,
 	int socket_id)
 {
 	return rte_pktmbuf_pool_create_by_ops(name, n, cache_size, priv_size,
-			data_room_size, socket_id, NULL);
+			data_room_size, 0, socket_id, NULL);
 }
 
+/* helper to create an mbuf pool with mempool flags (e.g. NO_SPREAD) */
+struct rte_mempool *
+rte_pktmbuf_pool_create_with_flags(const char *name, unsigned int n,
+	unsigned int cache_size, uint16_t priv_size, uint16_t data_room_size,
+	unsigned int flags, int socket_id)
+{
+	return rte_pktmbuf_pool_create_by_ops(name, n, cache_size, priv_size,
+			data_room_size, flags, socket_id, NULL);
+}
 /* do some sanity checks on a mbuf: panic if it fails */
 void
 rte_mbuf_sanity_check(const struct rte_mbuf *m, int is_header)
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 62740254d..6f6af42a8 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1079,6 +1079,12 @@ rte_pktmbuf_pool_create(const char *name, unsigned n,
 	unsigned cache_size, uint16_t priv_size, uint16_t data_room_size,
 	int socket_id);
 
+struct rte_mempool *
+rte_pktmbuf_pool_create_with_flags(const char *name, unsigned int n,
+	unsigned cache_size, uint16_t priv_size, uint16_t data_room_size,
+	unsigned flags, int socket_id);
+
+
 /**
  * Create a mbuf pool with a given mempool ops name
  *
@@ -1119,7 +1125,7 @@ rte_pktmbuf_pool_create(const char *name, unsigned n,
 struct rte_mempool * __rte_experimental
 rte_pktmbuf_pool_create_by_ops(const char *name, unsigned int n,
 	unsigned int cache_size, uint16_t priv_size, uint16_t data_room_size,
-	int socket_id, const char *ops_name);
+	unsigned int flags, int socket_id, const char *ops_name);
 
 /**
  * Get the data room size of mbufs stored in a pktmbuf_pool
-- 
2.13.6


* [dpdk-dev] [RFC 3/7] lib/mempool: allow page size aligned mempool
  2018-02-27  9:32 [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Qi Zhang
  2018-02-27  9:33 ` [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
  2018-02-27  9:33 ` [dpdk-dev] [RFC 2/7] lib/mbuf: enable parse flags when create mempool Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-02-27  9:33 ` [dpdk-dev] [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management Qi Zhang
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

Allow creating a mempool with a page-size-aligned base address.
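
As a rough illustration of the property this flag asks for, the sketch
below allocates a buffer at a page-aligned base and checks the alignment.
It does not use the mempool API; is_page_aligned() and
alloc_page_aligned() are hypothetical helper names, and posix_memalign()
stands in for whatever allocator backs the pool.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Return 1 if addr is a multiple of the system page size, 0 otherwise. */
static int is_page_aligned(const void *addr)
{
	uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);

	return ((uintptr_t)addr % page) == 0;
}

/* Allocate len bytes at a page-aligned base; returns NULL on failure. */
static void *alloc_page_aligned(size_t len)
{
	void *p = NULL;

	if (posix_memalign(&p, (size_t)sysconf(_SC_PAGESIZE), len) != 0)
		return NULL;
	assert(is_page_aligned(p));
	return p;
}
```

A base address obtained this way can be handed to an interface that
requires page-aligned memory; the caller releases it with free().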

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 lib/librte_mempool/rte_mempool.c | 2 ++
 lib/librte_mempool/rte_mempool.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index 54f7f4ba4..f8d4814ad 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -567,6 +567,8 @@ rte_mempool_populate_default(struct rte_mempool *mp)
 		pg_shift = 0; /* not needed, zone is physically contiguous */
 		pg_sz = 0;
 		align = RTE_CACHE_LINE_SIZE;
+		if (mp->flags & MEMPOOL_F_PAGE_ALIGN)
+			align = getpagesize();
 	} else {
 		pg_sz = getpagesize();
 		pg_shift = rte_bsf32(pg_sz);
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 8b1b7f7ed..774ab0f66 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -245,6 +245,7 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_PHYS_CONTIG 0x0020 /**< Don't need physically contiguous objs. */
+#define MEMPOOL_F_PAGE_ALIGN     0x0040 /**< Base address is page aligned. */
 /**
  * This capability flag is advertised by a mempool handler, if the whole
  * memory area containing the objects must be physically contiguous.
-- 
2.13.6


* [dpdk-dev] [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management
  2018-02-27  9:32 [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Qi Zhang
                   ` (2 preceding siblings ...)
  2018-02-27  9:33 ` [dpdk-dev] [RFC 3/7] lib/mempool: allow page size aligned mempool Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-03-01  2:08   ` Stephen Hemminger
  2018-02-27  9:33 ` [dpdk-dev] [RFC 5/7] net/af_xdp: enable share mempool Qi Zhang
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

The memory buffer registered with af_xdp is now managed by rte_mempool.
An mbuf allocated from the rte_mempool can be converted to a descriptor
index and vice versa.
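
The conversion can be sketched outside DPDK as plain address arithmetic.
This is a minimal sketch, assuming 2048-byte frames (log2 = 11) and a
64-byte mempool object header in front of each mbuf, as the patch does;
addr_to_idx() and idx_to_obj() are hypothetical names for illustration,
not the driver's functions.

```c
#include <stdint.h>

#define FRAME_SIZE_LOG2 11	/* 2048-byte frames, as in the patch */
#define OBJ_HDR_SIZE    64	/* assumed mempool object header size */

/* Map any address inside a frame to that frame's index. */
static uint32_t addr_to_idx(const char *base, const void *addr)
{
	return (uint32_t)(((uintptr_t)addr - (uintptr_t)base) >>
			  FRAME_SIZE_LOG2);
}

/* Map a frame index back to the object start inside that frame. */
static void *idx_to_obj(char *base, uint32_t idx)
{
	return base + ((uintptr_t)idx << FRAME_SIZE_LOG2) + OBJ_HDR_SIZE;
}
```

The round trip only works when every object occupies exactly one frame
and the pool's base address is the start of frame 0, which is why the
pool is created with NO_SPREAD and a page-aligned base.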

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 165 +++++++++++++++++++++---------------
 1 file changed, 97 insertions(+), 68 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index 4eb8a2c28..3c534c77c 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -43,7 +43,11 @@
 
 #define ETH_AF_XDP_FRAME_SIZE		2048
 #define ETH_AF_XDP_NUM_BUFFERS		131072
-#define ETH_AF_XDP_DATA_HEADROOM	0
+/* mempool hdrobj size (64 bytes) + sizeof(struct rte_mbuf) (128 bytes) */
+#define ETH_AF_XDP_MBUF_OVERHEAD	192
+/* data start from offset 320 (192 + 128) bytes */
+#define ETH_AF_XDP_DATA_HEADROOM \
+	(ETH_AF_XDP_MBUF_OVERHEAD + RTE_PKTMBUF_HEADROOM)
 #define ETH_AF_XDP_DFLT_RING_SIZE	1024
 #define ETH_AF_XDP_DFLT_QUEUE_IDX	0
 
@@ -57,6 +61,7 @@ struct xdp_umem {
 	unsigned int frame_size_log2;
 	unsigned int nframes;
 	int mr_fd;
+	struct rte_mempool *mb_pool;
 };
 
 struct pmd_internals {
@@ -67,7 +72,7 @@ struct pmd_internals {
 	struct xdp_queue rx;
 	struct xdp_queue tx;
 	struct xdp_umem *umem;
-	struct rte_mempool *mb_pool;
+	struct rte_mempool *ext_mb_pool;
 
 	unsigned long rx_pkts;
 	unsigned long rx_bytes;
@@ -80,7 +85,6 @@ struct pmd_internals {
 	uint16_t port_id;
 	uint16_t queue_idx;
 	int ring_size;
-	struct rte_ring *buf_ring;
 };
 
 static const char * const valid_arguments[] = {
@@ -106,6 +110,21 @@ static void *get_pkt_data(struct pmd_internals *internals,
 			   offset);
 }
 
+static uint32_t
+mbuf_to_idx(struct pmd_internals *internals, struct rte_mbuf *mbuf)
+{
+	return (uint32_t)(((uint64_t)mbuf->buf_addr -
+			   (uint64_t)internals->umem->buffer) >>
+			  internals->umem->frame_size_log2);
+}
+
+static struct rte_mbuf *
+idx_to_mbuf(struct pmd_internals *internals, uint32_t idx)
+{
+	return (struct rte_mbuf *)(void *)(internals->umem->buffer + (idx
+			<< internals->umem->frame_size_log2) + 0x40);
+}
+
 static uint16_t
 eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 {
@@ -120,17 +139,18 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		  nb_pkts : ETH_AF_XDP_RX_BATCH_SIZE;
 
 	struct xdp_desc descs[ETH_AF_XDP_RX_BATCH_SIZE];
-	void *indexes[ETH_AF_XDP_RX_BATCH_SIZE];
+	struct rte_mbuf *mbufs[ETH_AF_XDP_RX_BATCH_SIZE];
 	int rcvd, i;
 	/* fill rx ring */
 	if (rxq->num_free >= ETH_AF_XDP_RX_BATCH_SIZE) {
-		int n = rte_ring_dequeue_bulk(internals->buf_ring,
-					      indexes,
-					      ETH_AF_XDP_RX_BATCH_SIZE,
-					      NULL);
-		for (i = 0; i < n; i++)
-			descs[i].idx = (uint32_t)((long int)indexes[i]);
-		xq_enq(rxq, descs, n);
+		int ret = rte_mempool_get_bulk(internals->umem->mb_pool,
+					     (void *)mbufs,
+					     ETH_AF_XDP_RX_BATCH_SIZE);
+		if (!ret) {
+			for (i = 0; i < ETH_AF_XDP_RX_BATCH_SIZE; i++)
+				descs[i].idx = mbuf_to_idx(internals, mbufs[i]);
+			xq_enq(rxq, descs, ETH_AF_XDP_RX_BATCH_SIZE);
+		}
 	}
 
 	/* read data */
@@ -142,7 +162,7 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		char *pkt;
 		uint32_t idx = descs[i].idx;
 
-		mbuf = rte_pktmbuf_alloc(internals->mb_pool);
+		mbuf = rte_pktmbuf_alloc(internals->ext_mb_pool);
 		rte_pktmbuf_pkt_len(mbuf) =
 			rte_pktmbuf_data_len(mbuf) =
 			descs[i].len;
@@ -155,11 +175,9 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		} else {
 			dropped++;
 		}
-		indexes[i] = (void *)((long int)idx);
+		rte_pktmbuf_free(idx_to_mbuf(internals, idx));
 	}
 
-	rte_ring_enqueue_bulk(internals->buf_ring, indexes, rcvd, NULL);
-
 	internals->rx_pkts += (rcvd - dropped);
 	internals->rx_bytes += rx_bytes;
 	internals->rx_dropped += dropped;
@@ -187,9 +205,10 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	struct xdp_queue *txq = &internals->tx;
 	struct rte_mbuf *mbuf;
 	struct xdp_desc descs[ETH_AF_XDP_TX_BATCH_SIZE];
-	void *indexes[ETH_AF_XDP_TX_BATCH_SIZE];
+	struct rte_mbuf *mbufs[ETH_AF_XDP_TX_BATCH_SIZE];
 	uint16_t i, valid;
 	unsigned long tx_bytes = 0;
+	int ret;
 
 	nb_pkts = nb_pkts < ETH_AF_XDP_TX_BATCH_SIZE ?
 		  nb_pkts : ETH_AF_XDP_TX_BATCH_SIZE;
@@ -198,13 +217,15 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		int n = xq_deq(txq, descs, ETH_AF_XDP_TX_BATCH_SIZE);
 
 		for (i = 0; i < n; i++)
-			indexes[i] = (void *)((long int)descs[i].idx);
-		rte_ring_enqueue_bulk(internals->buf_ring, indexes, n, NULL);
+			rte_pktmbuf_free(idx_to_mbuf(internals, descs[i].idx));
 	}
 
 	nb_pkts = nb_pkts > txq->num_free ? txq->num_free : nb_pkts;
-	nb_pkts = rte_ring_dequeue_bulk(internals->buf_ring, indexes,
-					nb_pkts, NULL);
+	ret = rte_mempool_get_bulk(internals->umem->mb_pool,
+				   (void *)mbufs,
+				   nb_pkts);
+	if (ret)
+		return 0;
 
 	valid = 0;
 	for (i = 0; i < nb_pkts; i++) {
@@ -213,14 +234,14 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			internals->umem->frame_size - ETH_AF_XDP_DATA_HEADROOM;
 		mbuf = bufs[i];
 		if (mbuf->pkt_len <= buf_len) {
-			descs[valid].idx = (uint32_t)((long int)indexes[valid]);
+			descs[valid].idx = mbuf_to_idx(internals, mbufs[i]);
 			descs[valid].offset = ETH_AF_XDP_DATA_HEADROOM;
 			descs[valid].flags = 0;
 			descs[valid].len = mbuf->pkt_len;
 			pkt = get_pkt_data(internals, descs[i].idx,
 					   descs[i].offset);
 			memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *),
-			       descs[i].len);
+					   descs[i].len);
 			valid++;
 			tx_bytes += mbuf->pkt_len;
 		}
@@ -230,9 +251,10 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	xq_enq(txq, descs, valid);
 	kick_tx(internals->sfd);
 
-	if (valid < nb_pkts)
-		rte_ring_enqueue_bulk(internals->buf_ring, &indexes[valid],
-				      nb_pkts - valid, NULL);
+	if (valid < nb_pkts) {
+		for (i = valid; i < nb_pkts; i++)
+			rte_pktmbuf_free(mbufs[i]);
+	}
 
 	internals->err_pkts += (nb_pkts - valid);
 	internals->tx_pkts += valid;
@@ -245,14 +267,13 @@ static void
 fill_rx_desc(struct pmd_internals *internals)
 {
 	int num_free = internals->rx.num_free;
-	void *p = NULL;
 	int i;
-
 	for (i = 0; i < num_free; i++) {
 		struct xdp_desc desc = {};
+		struct rte_mbuf *mbuf =
+			rte_pktmbuf_alloc(internals->umem->mb_pool);
 
-		rte_ring_dequeue(internals->buf_ring, &p);
-		desc.idx = (uint32_t)((long int)p);
+		desc.idx = mbuf_to_idx(internals, mbuf);
 		xq_enq(&internals->rx, &desc, 1);
 	}
 }
@@ -347,33 +368,53 @@ eth_link_update(struct rte_eth_dev *dev __rte_unused,
 	return 0;
 }
 
-static struct xdp_umem *xsk_alloc_and_mem_reg_buffers(int sfd, size_t nbuffers)
+static void *get_base_addr(struct rte_mempool *mb_pool)
+{
+	struct rte_mempool_memhdr *memhdr;
+
+	STAILQ_FOREACH(memhdr, &mb_pool->mem_list, next) {
+		return memhdr->addr;
+	}
+	return NULL;
+}
+
+static struct xdp_umem *xsk_alloc_and_mem_reg_buffers(int sfd,
+						      size_t nbuffers,
+						      const char *pool_name)
 {
 	struct xdp_mr_req req = { .frame_size = ETH_AF_XDP_FRAME_SIZE,
 				  .data_headroom = ETH_AF_XDP_DATA_HEADROOM };
-	struct xdp_umem *umem;
-	void *bufs;
-	int ret;
+	struct xdp_umem *umem = calloc(1, sizeof(*umem));
 
-	ret = posix_memalign((void **)&bufs, getpagesize(),
-			     nbuffers * req.frame_size);
-	if (ret)
+	if (!umem)
+		return NULL;
+
+	umem->mb_pool =
+		rte_pktmbuf_pool_create_with_flags(
+			pool_name, nbuffers,
+			250, 0,
+			(ETH_AF_XDP_FRAME_SIZE - ETH_AF_XDP_MBUF_OVERHEAD),
+			MEMPOOL_F_NO_SPREAD | MEMPOOL_F_PAGE_ALIGN,
+			SOCKET_ID_ANY);
+
+	if (!umem->mb_pool) {
+		free(umem);
 		return NULL;
+	}
 
-	umem = calloc(1, sizeof(*umem));
-	if (!umem) {
-		free(bufs);
+	if (umem->mb_pool->nb_mem_chunks > 1) {
+		rte_mempool_free(umem->mb_pool);
+		free(umem);
 		return NULL;
 	}
 
-	req.addr = (unsigned long)bufs;
+	req.addr = (uint64_t)get_base_addr(umem->mb_pool);
 	req.len = nbuffers * req.frame_size;
-	ret = setsockopt(sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
-	RTE_ASSERT(ret == 0);
+	setsockopt(sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
 
 	umem->frame_size = ETH_AF_XDP_FRAME_SIZE;
 	umem->frame_size_log2 = 11;
-	umem->buffer = bufs;
+	umem->buffer = (char *)req.addr;
 	umem->size = nbuffers * req.frame_size;
 	umem->nframes = nbuffers;
 	umem->mr_fd = sfd;
@@ -386,38 +427,27 @@ xdp_configure(struct pmd_internals *internals)
 {
 	struct sockaddr_xdp sxdp;
 	struct xdp_ring_req req;
-	char ring_name[0x100];
+	char pool_name[0x100];
+
 	int ret = 0;
-	long int i;
 
-	snprintf(ring_name, 0x100, "%s_%s_%d", "af_xdp_ring",
+	snprintf(pool_name, 0x100, "%s_%s_%d", "af_xdp_pool",
 		 internals->if_name, internals->queue_idx);
-	internals->buf_ring = rte_ring_create(ring_name,
-					      ETH_AF_XDP_NUM_BUFFERS,
-					      SOCKET_ID_ANY,
-					      0x0);
-	if (!internals->buf_ring)
-		return -1;
-
-	for (i = 0; i < ETH_AF_XDP_NUM_BUFFERS; i++)
-		rte_ring_enqueue(internals->buf_ring, (void *)i);
-
 	internals->umem = xsk_alloc_and_mem_reg_buffers(internals->sfd,
-							ETH_AF_XDP_NUM_BUFFERS);
+							ETH_AF_XDP_NUM_BUFFERS,
+							pool_name);
 	if (!internals->umem)
-		goto error;
+		return -1;
 
 	req.mr_fd = internals->umem->mr_fd;
 	req.desc_nr = internals->ring_size;
 
 	ret = setsockopt(internals->sfd, SOL_XDP, XDP_RX_RING,
 			 &req, sizeof(req));
-
 	RTE_ASSERT(ret == 0);
 
 	ret = setsockopt(internals->sfd, SOL_XDP, XDP_TX_RING,
 			 &req, sizeof(req));
-
 	RTE_ASSERT(ret == 0);
 
 	internals->rx.ring = mmap(0, req.desc_nr * sizeof(struct xdp_desc),
@@ -448,10 +478,6 @@ xdp_configure(struct pmd_internals *internals)
 	RTE_ASSERT(ret == 0);
 
 	return ret;
-error:
-	rte_ring_free(internals->buf_ring);
-	internals->buf_ring = NULL;
-	return -1;
 }
 
 static int
@@ -466,11 +492,11 @@ eth_rx_queue_setup(struct rte_eth_dev *dev,
 	unsigned int buf_size, data_size;
 
 	RTE_ASSERT(rx_queue_id == 0);
-	internals->mb_pool = mb_pool;
+	internals->ext_mb_pool = mb_pool;
 	xdp_configure(internals);
 
 	/* Now get the space available for data in the mbuf */
-	buf_size = rte_pktmbuf_data_room_size(internals->mb_pool) -
+	buf_size = rte_pktmbuf_data_room_size(internals->ext_mb_pool) -
 		RTE_PKTMBUF_HEADROOM;
 	data_size = internals->umem->frame_size;
 
@@ -739,8 +765,11 @@ rte_pmd_af_xdp_remove(struct rte_vdev_device *dev)
 		return -1;
 
 	internals = eth_dev->data->dev_private;
-	rte_ring_free(internals->buf_ring);
-	rte_free(internals->umem);
+	if (internals->umem) {
+		if (internals->umem->mb_pool)
+			rte_mempool_free(internals->umem->mb_pool);
+		rte_free(internals->umem);
+	}
 	rte_free(eth_dev->data->dev_private);
 	rte_free(eth_dev->data);
 	close(internals->sfd);
-- 
2.13.6


* [dpdk-dev] [RFC 5/7] net/af_xdp: enable share mempool
  2018-02-27  9:32 [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Qi Zhang
                   ` (3 preceding siblings ...)
  2018-02-27  9:33 ` [dpdk-dev] [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-02-27  9:33 ` [dpdk-dev] [RFC 6/7] net/af_xdp: load BPF file Qi Zhang
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

Check whether the external mempool (from rx_queue_setup) is suitable for
af_xdp. If it is, it is registered to the af_xdp socket directly and
no packet data is copied on Rx or Tx.
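
The suitability test can be sketched without DPDK types. The block below
is a simplified sketch that loosely follows the four checks in
check_mempool(): a single memory chunk, a cache-line-sized object header,
a page-aligned base, and an object size that is a multiple of the frame
size. struct pool_layout and pool_fits_umem() are hypothetical stand-ins,
not part of the patch.

```c
#include <stdint.h>

#define FRAME_SIZE 2048	/* frame size used by the driver */
#define CACHE_LINE 64	/* assumed cache line size */

/* Hypothetical stand-in for the rte_mempool fields the check reads. */
struct pool_layout {
	unsigned int nb_mem_chunks;
	uint32_t header_size;
	uint32_t elt_size;
	uint32_t trailer_size;
	uintptr_t base_addr;
};

/* Return 1 if the pool could be registered as umem directly, else 0. */
static int pool_fits_umem(const struct pool_layout *p, uintptr_t page_size)
{
	/* memory must be one contiguous chunk */
	if (p->nb_mem_chunks != 1)
		return 0;
	/* object header must be exactly one cache line */
	if (p->header_size != CACHE_LINE)
		return 0;
	/* base address must be page aligned */
	if (p->base_addr % page_size != 0)
		return 0;
	/* each object must fill a whole number of frames */
	if ((p->elt_size + p->header_size + p->trailer_size) % FRAME_SIZE != 0)
		return 0;
	return 1;
}
```

When the check fails, the driver falls back to its private pool and
copies packet data between the two pools.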

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 191 +++++++++++++++++++++++-------------
 1 file changed, 125 insertions(+), 66 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index 3c534c77c..d0939022b 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -60,7 +60,6 @@ struct xdp_umem {
 	unsigned int frame_size;
 	unsigned int frame_size_log2;
 	unsigned int nframes;
-	int mr_fd;
 	struct rte_mempool *mb_pool;
 };
 
@@ -73,6 +72,7 @@ struct pmd_internals {
 	struct xdp_queue tx;
 	struct xdp_umem *umem;
 	struct rte_mempool *ext_mb_pool;
+	uint8_t share_mb_pool;
 
 	unsigned long rx_pkts;
 	unsigned long rx_bytes;
@@ -162,20 +162,30 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		char *pkt;
 		uint32_t idx = descs[i].idx;
 
-		mbuf = rte_pktmbuf_alloc(internals->ext_mb_pool);
-		rte_pktmbuf_pkt_len(mbuf) =
-			rte_pktmbuf_data_len(mbuf) =
-			descs[i].len;
-		if (mbuf) {
-			pkt = get_pkt_data(internals, idx, descs[i].offset);
-			memcpy(rte_pktmbuf_mtod(mbuf, void *),
-			       pkt, descs[i].len);
-			rx_bytes += descs[i].len;
-			bufs[count++] = mbuf;
+		if (!internals->share_mb_pool) {
+			mbuf = rte_pktmbuf_alloc(internals->ext_mb_pool);
+			if (mbuf) {
+				rte_pktmbuf_pkt_len(mbuf) =
+					rte_pktmbuf_data_len(mbuf) =
+					descs[i].len;
+				pkt = get_pkt_data(internals, idx,
+						   descs[i].offset);
+				memcpy(rte_pktmbuf_mtod(mbuf, void *), pkt,
+				       descs[i].len);
+				rx_bytes += descs[i].len;
+				bufs[count++] = mbuf;
+			} else {
+				dropped++;
+			}
+			rte_pktmbuf_free(idx_to_mbuf(internals, idx));
 		} else {
-			dropped++;
+			mbuf = idx_to_mbuf(internals, idx);
+			rte_pktmbuf_pkt_len(mbuf) =
+				rte_pktmbuf_data_len(mbuf) =
+				descs[i].len;
+			bufs[count++] = mbuf;
+			rx_bytes += descs[i].len;
 		}
-		rte_pktmbuf_free(idx_to_mbuf(internals, idx));
 	}
 
 	internals->rx_pkts += (rcvd - dropped);
@@ -209,51 +219,71 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	uint16_t i, valid;
 	unsigned long tx_bytes = 0;
 	int ret;
+	uint8_t share_mempool = 0;
 
 	nb_pkts = nb_pkts < ETH_AF_XDP_TX_BATCH_SIZE ?
 		  nb_pkts : ETH_AF_XDP_TX_BATCH_SIZE;
 
 	if (txq->num_free < ETH_AF_XDP_TX_BATCH_SIZE * 2) {
 		int n = xq_deq(txq, descs, ETH_AF_XDP_TX_BATCH_SIZE);
-
 		for (i = 0; i < n; i++)
 			rte_pktmbuf_free(idx_to_mbuf(internals, descs[i].idx));
 	}
 
 	nb_pkts = nb_pkts > txq->num_free ? txq->num_free : nb_pkts;
-	ret = rte_mempool_get_bulk(internals->umem->mb_pool,
-				   (void *)mbufs,
-				   nb_pkts);
-	if (ret)
+	if (nb_pkts == 0)
 		return 0;
 
+	if (bufs[0]->pool == internals->ext_mb_pool && internals->share_mb_pool)
+		share_mempool = 1;
+
+	if (!share_mempool) {
+		ret = rte_mempool_get_bulk(internals->umem->mb_pool,
+					   (void *)mbufs,
+					   nb_pkts);
+		if (ret)
+			return 0;
+	}
+
 	valid = 0;
 	for (i = 0; i < nb_pkts; i++) {
 		char *pkt;
-		unsigned int buf_len =
-			internals->umem->frame_size - ETH_AF_XDP_DATA_HEADROOM;
 		mbuf = bufs[i];
-		if (mbuf->pkt_len <= buf_len) {
-			descs[valid].idx = mbuf_to_idx(internals, mbufs[i]);
-			descs[valid].offset = ETH_AF_XDP_DATA_HEADROOM;
-			descs[valid].flags = 0;
-			descs[valid].len = mbuf->pkt_len;
-			pkt = get_pkt_data(internals, descs[i].idx,
-					   descs[i].offset);
-			memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *),
-					   descs[i].len);
-			valid++;
+		if (!share_mempool) {
+			if (mbuf->pkt_len <=
+				(internals->umem->frame_size -
+				 ETH_AF_XDP_DATA_HEADROOM)) {
+				descs[valid].idx =
+					mbuf_to_idx(internals, mbufs[i]);
+				descs[valid].offset = ETH_AF_XDP_DATA_HEADROOM;
+				descs[valid].flags = 0;
+				descs[valid].len = mbuf->pkt_len;
+				pkt = get_pkt_data(internals, descs[valid].idx,
+						   descs[valid].offset);
+				memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *),
+				       descs[valid].len);
+				valid++;
+				tx_bytes += mbuf->pkt_len;
+			}
+			rte_pktmbuf_free(mbuf);
+		} else {
+			descs[i].idx = mbuf_to_idx(internals, mbuf);
+			descs[i].offset = ETH_AF_XDP_DATA_HEADROOM;
+			descs[i].flags = 0;
+			descs[i].len = mbuf->pkt_len;
 			tx_bytes += mbuf->pkt_len;
+			valid++;
 		}
-		rte_pktmbuf_free(mbuf);
 	}
 
 	xq_enq(txq, descs, valid);
 	kick_tx(internals->sfd);
 
-	if (valid < nb_pkts) {
-		for (i = valid; i < nb_pkts; i++)
-			rte_pktmbuf_free(mbufs[i]);
+	if (!share_mempool) {
+		if (valid < nb_pkts) {
+			for (i = valid; i < nb_pkts; i++)
+				rte_pktmbuf_free(mbufs[i]);
+		}
 	}
 
 	internals->err_pkts += (nb_pkts - valid);
@@ -378,46 +408,81 @@ static void *get_base_addr(struct rte_mempool *mb_pool)
 	return NULL;
 }
 
-static struct xdp_umem *xsk_alloc_and_mem_reg_buffers(int sfd,
-						      size_t nbuffers,
-						      const char *pool_name)
+static uint8_t
+check_mempool(struct rte_mempool *mp)
+{
+	RTE_ASSERT(mp);
+
+	/* memory must be contiguous */
+	if (mp->nb_mem_chunks > 1)
+		return 0;
+
+	/* check header size */
+	if (mp->header_size != RTE_CACHE_LINE_SIZE)
+		return 0;
+
+	/* check base address */
+	if ((uint64_t)get_base_addr(mp) % getpagesize() != 0)
+		return 0;
+
+	/* check chunk size */
+	if ((mp->elt_size + mp->header_size + mp->trailer_size) %
+			ETH_AF_XDP_FRAME_SIZE != 0)
+		return 0;
+
+	return 1;
+}
+
+static struct xdp_umem *
+xsk_alloc_and_mem_reg_buffers(struct pmd_internals *internals)
 {
 	struct xdp_mr_req req = { .frame_size = ETH_AF_XDP_FRAME_SIZE,
 				  .data_headroom = ETH_AF_XDP_DATA_HEADROOM };
+	char pool_name[0x100];
+	int nbuffers;
 	struct xdp_umem *umem = calloc(1, sizeof(*umem));
 
 	if (!umem)
 		return NULL;
 
-	umem->mb_pool =
-		rte_pktmbuf_pool_create_with_flags(
-			pool_name, nbuffers,
-			250, 0,
-			(ETH_AF_XDP_FRAME_SIZE - ETH_AF_XDP_MBUF_OVERHEAD),
-			MEMPOOL_F_NO_SPREAD | MEMPOOL_F_PAGE_ALIGN,
-			SOCKET_ID_ANY);
-
-	if (!umem->mb_pool) {
-		free(umem);
-		return NULL;
-	}
+	internals->share_mb_pool = check_mempool(internals->ext_mb_pool);
+	if (!internals->share_mb_pool) {
+		snprintf(pool_name, 0x100, "%s_%s_%d", "af_xdp_pool",
+			 internals->if_name, internals->queue_idx);
+		umem->mb_pool =
+			rte_pktmbuf_pool_create_with_flags(
+				pool_name,
+				ETH_AF_XDP_NUM_BUFFERS,
+				250, 0,
+				(ETH_AF_XDP_FRAME_SIZE -
+				 ETH_AF_XDP_MBUF_OVERHEAD),
+				MEMPOOL_F_NO_SPREAD | MEMPOOL_F_PAGE_ALIGN,
+				SOCKET_ID_ANY);
+		if (!umem->mb_pool) {
+			free(umem);
+			return NULL;
+		}
 
-	if (umem->mb_pool->nb_mem_chunks > 1) {
-		rte_mempool_free(umem->mb_pool);
-		free(umem);
-		return NULL;
+		if (umem->mb_pool->nb_mem_chunks > 1) {
+			rte_mempool_free(umem->mb_pool);
+			free(umem);
+			return NULL;
+		}
+		nbuffers = ETH_AF_XDP_NUM_BUFFERS;
+	} else {
+		umem->mb_pool = internals->ext_mb_pool;
+		nbuffers = umem->mb_pool->populated_size;
 	}
 
 	req.addr = (uint64_t)get_base_addr(umem->mb_pool);
-	req.len = nbuffers * req.frame_size;
-	setsockopt(sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
+	req.len = ETH_AF_XDP_NUM_BUFFERS * req.frame_size;
+	setsockopt(internals->sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
 
 	umem->frame_size = ETH_AF_XDP_FRAME_SIZE;
 	umem->frame_size_log2 = 11;
 	umem->buffer = (char *)req.addr;
 	umem->size = nbuffers * req.frame_size;
 	umem->nframes = nbuffers;
-	umem->mr_fd = sfd;
 
 	return umem;
 }
@@ -427,19 +492,13 @@ xdp_configure(struct pmd_internals *internals)
 {
 	struct sockaddr_xdp sxdp;
 	struct xdp_ring_req req;
-	char pool_name[0x100];
-
 	int ret = 0;
 
-	snprintf(pool_name, 0x100, "%s_%s_%d", "af_xdp_pool",
-		 internals->if_name, internals->queue_idx);
-	internals->umem = xsk_alloc_and_mem_reg_buffers(internals->sfd,
-							ETH_AF_XDP_NUM_BUFFERS,
-							pool_name);
+	internals->umem = xsk_alloc_and_mem_reg_buffers(internals);
 	if (!internals->umem)
 		return -1;
 
-	req.mr_fd = internals->umem->mr_fd;
+	req.mr_fd = internals->sfd;
 	req.desc_nr = internals->ring_size;
 
 	ret = setsockopt(internals->sfd, SOL_XDP, XDP_RX_RING,
@@ -500,7 +559,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev,
 		RTE_PKTMBUF_HEADROOM;
 	data_size = internals->umem->frame_size;
 
-	if (data_size > buf_size) {
+	if (data_size - ETH_AF_XDP_DATA_HEADROOM > buf_size) {
 		RTE_LOG(ERR, PMD,
 			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
 			dev->device->name, data_size, buf_size);
@@ -766,7 +825,7 @@ rte_pmd_af_xdp_remove(struct rte_vdev_device *dev)
 
 	internals = eth_dev->data->dev_private;
 	if (internals->umem) {
-		if (internals->umem->mb_pool)
+		if (internals->umem->mb_pool && !internals->share_mb_pool)
 			rte_mempool_free(internals->umem->mb_pool);
 		rte_free(internals->umem);
 	}
-- 
2.13.6


* [dpdk-dev] [RFC 6/7] net/af_xdp: load BPF file
  2018-02-27  9:32 [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Qi Zhang
                   ` (4 preceding siblings ...)
  2018-02-27  9:33 ` [dpdk-dev] [RFC 5/7] net/af_xdp: enable share mempool Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-03-01  2:10   ` Stephen Hemminger
  2018-02-27  9:33 ` [dpdk-dev] [RFC 7/7] app/testpmd: enable parameter for mempool flags Qi Zhang
  2018-03-01  2:52 ` [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Jason Wang
  7 siblings, 1 reply; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

Add libbpf and libelf dependencies to the Makefile.
During initialization, the BPF file "xdpsock_kern.o" is loaded.
The driver then always tries to link the XDP fd in DRV mode first,
and falls back to SKB mode if that fails.
The link is released during dev_close.
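
The DRV-then-SKB fallback can be sketched as a small ordered retry loop.
Everything named here is hypothetical and for illustration only:
enum xdp_mode, the xdp_attach_fn callback, xdp_link_with_fallback(), and
the attach_skb_only() stub that simulates an interface without driver
mode support. The real driver would call its own link routine in place
of the callback.

```c
#include <stddef.h>

enum xdp_mode { XDP_MODE_DRV, XDP_MODE_SKB };

/* Hypothetical attach callback: 0 on success, negative on failure. */
typedef int (*xdp_attach_fn)(int ifindex, int prog_fd, enum xdp_mode mode);

/*
 * Try driver mode first, then fall back to SKB (generic) mode.
 * Returns 0 on success and stores the mode that worked in *used.
 */
static int xdp_link_with_fallback(xdp_attach_fn attach, int ifindex,
				  int prog_fd, enum xdp_mode *used)
{
	static const enum xdp_mode order[] = { XDP_MODE_DRV, XDP_MODE_SKB };
	size_t i;

	for (i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
		if (attach(ifindex, prog_fd, order[i]) == 0) {
			*used = order[i];
			return 0;
		}
	}
	return -1;
}

/* Illustrative stub: an interface that only accepts SKB mode. */
static int attach_skb_only(int ifindex, int prog_fd, enum xdp_mode mode)
{
	(void)ifindex;
	(void)prog_fd;
	return mode == XDP_MODE_SKB ? 0 : -1;
}
```

Recording the mode that succeeded lets the close path release the link
with the same mode it was attached in.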

Note: this is a workaround; af_xdp may remove the BPF dependency
in the future.

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 drivers/net/af_xdp/Makefile         |   6 +-
 drivers/net/af_xdp/bpf_load.c       | 798 ++++++++++++++++++++++++++++++++++++
 drivers/net/af_xdp/bpf_load.h       |  65 +++
 drivers/net/af_xdp/libbpf.h         | 199 +++++++++
 drivers/net/af_xdp/rte_eth_af_xdp.c |  31 +-
 mk/rte.app.mk                       |   2 +-
 6 files changed, 1097 insertions(+), 4 deletions(-)
 create mode 100644 drivers/net/af_xdp/bpf_load.c
 create mode 100644 drivers/net/af_xdp/bpf_load.h
 create mode 100644 drivers/net/af_xdp/libbpf.h

diff --git a/drivers/net/af_xdp/Makefile b/drivers/net/af_xdp/Makefile
index ac38e20bf..a642786de 100644
--- a/drivers/net/af_xdp/Makefile
+++ b/drivers/net/af_xdp/Makefile
@@ -42,7 +42,10 @@ EXPORT_MAP := rte_pmd_af_xdp_version.map
 
 LIBABIVER := 1
 
-CFLAGS += -O3 -I/opt/af_xdp/linux_headers/include
+LINUX_HEADER_DIR := /opt/af_xdp/linux_headers/include
+TOOLS_DIR := /root/af_xdp/npg_dna-dna-linux/tools
+
+CFLAGS += -O3 -I$(LINUX_HEADER_DIR) -I$(TOOLS_DIR)/perf -I$(TOOLS_DIR)/include -Wno-error=sign-compare -Wno-error=cast-qual
 CFLAGS += $(WERROR_FLAGS)
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
 LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs
@@ -52,5 +55,6 @@ LDLIBS += -lrte_bus_vdev
 # all source are stored in SRCS-y
 #
 SRCS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += rte_eth_af_xdp.c
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += bpf_load.c
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/af_xdp/bpf_load.c b/drivers/net/af_xdp/bpf_load.c
new file mode 100644
index 000000000..aa632207f
--- /dev/null
+++ b/drivers/net/af_xdp/bpf_load.c
@@ -0,0 +1,798 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <libelf.h>
+#include <gelf.h>
+#include <errno.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/perf_event.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <linux/types.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <poll.h>
+#include <ctype.h>
+#include <assert.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "perf-sys.h"
+
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
+static char license[128];
+static int kern_version;
+static bool processed_sec[128];
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
+int map_fd[MAX_MAPS];
+int prog_fd[MAX_PROGS];
+int event_fd[MAX_PROGS];
+int prog_cnt;
+int prog_array_fd = -1;
+
+struct bpf_map_data map_data[MAX_MAPS];
+int map_data_count = 0;
+
+static int populate_prog_array(const char *event, int prog_fd)
+{
+	int ind = atoi(event), err;
+
+	err = bpf_map_update_elem(prog_array_fd, &ind, &prog_fd, BPF_ANY);
+	if (err < 0) {
+		printf("failed to store prog_fd in prog_array\n");
+		return -1;
+	}
+	return 0;
+}
+
+static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
+{
+	bool is_socket = strncmp(event, "socket", 6) == 0;
+	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
+	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
+	bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
+	bool is_xdp = strncmp(event, "xdp", 3) == 0;
+	bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
+	bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
+	bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
+	bool is_sockops = strncmp(event, "sockops", 7) == 0;
+	bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
+	size_t insns_cnt = size / sizeof(struct bpf_insn);
+	enum bpf_prog_type prog_type;
+	char buf[256];
+	int fd, efd, err, id;
+	struct perf_event_attr attr = {};
+
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+
+	if (is_socket) {
+		prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+	} else if (is_kprobe || is_kretprobe) {
+		prog_type = BPF_PROG_TYPE_KPROBE;
+	} else if (is_tracepoint) {
+		prog_type = BPF_PROG_TYPE_TRACEPOINT;
+	} else if (is_xdp) {
+		prog_type = BPF_PROG_TYPE_XDP;
+	} else if (is_perf_event) {
+		prog_type = BPF_PROG_TYPE_PERF_EVENT;
+	} else if (is_cgroup_skb) {
+		prog_type = BPF_PROG_TYPE_CGROUP_SKB;
+	} else if (is_cgroup_sk) {
+		prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
+	} else if (is_sockops) {
+		prog_type = BPF_PROG_TYPE_SOCK_OPS;
+	} else if (is_sk_skb) {
+		prog_type = BPF_PROG_TYPE_SK_SKB;
+	} else {
+		printf("Unknown event '%s'\n", event);
+		return -1;
+	}
+
+	fd = bpf_load_program(prog_type, prog, insns_cnt, license, kern_version,
+			      bpf_log_buf, BPF_LOG_BUF_SIZE);
+	if (fd < 0) {
+		printf("bpf_load_program() err=%d\n%s", errno, bpf_log_buf);
+		return -1;
+	}
+
+	prog_fd[prog_cnt++] = fd;
+
+	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
+		return 0;
+
+	if (is_socket || is_sockops || is_sk_skb) {
+		if (is_socket)
+			event += 6;
+		else
+			event += 7;
+		if (*event != '/')
+			return 0;
+		event++;
+		if (!isdigit(*event)) {
+			printf("invalid prog number\n");
+			return -1;
+		}
+		return populate_prog_array(event, fd);
+	}
+
+	if (is_kprobe || is_kretprobe) {
+		if (is_kprobe)
+			event += 7;
+		else
+			event += 10;
+
+		if (*event == 0) {
+			printf("event name cannot be empty\n");
+			return -1;
+		}
+
+		if (isdigit(*event))
+			return populate_prog_array(event, fd);
+
+		snprintf(buf, sizeof(buf),
+			 "echo '%c:%s %s' >> /sys/kernel/debug/tracing/kprobe_events",
+			 is_kprobe ? 'p' : 'r', event, event);
+		err = system(buf);
+		if (err < 0) {
+			printf("failed to create kprobe '%s' error '%s'\n",
+			       event, strerror(errno));
+			return -1;
+		}
+
+		strcpy(buf, DEBUGFS);
+		strcat(buf, "events/kprobes/");
+		strcat(buf, event);
+		strcat(buf, "/id");
+	} else if (is_tracepoint) {
+		event += 11;
+
+		if (*event == 0) {
+			printf("event name cannot be empty\n");
+			return -1;
+		}
+		strcpy(buf, DEBUGFS);
+		strcat(buf, "events/");
+		strcat(buf, event);
+		strcat(buf, "/id");
+	}
+
+	efd = open(buf, O_RDONLY, 0);
+	if (efd < 0) {
+		printf("failed to open event %s\n", event);
+		return -1;
+	}
+
+	err = read(efd, buf, sizeof(buf));
+	if (err < 0 || err >= sizeof(buf)) {
+		printf("read from '%s' failed '%s'\n", event, strerror(errno));
+		return -1;
+	}
+
+	close(efd);
+
+	buf[err] = 0;
+	id = atoi(buf);
+	attr.config = id;
+
+	efd = sys_perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
+	if (efd < 0) {
+		printf("event %d fd %d err %s\n", id, efd, strerror(errno));
+		return -1;
+	}
+	event_fd[prog_cnt - 1] = efd;
+	err = ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
+	if (err < 0) {
+		printf("ioctl PERF_EVENT_IOC_ENABLE failed err %s\n",
+		       strerror(errno));
+		return -1;
+	}
+	err = ioctl(efd, PERF_EVENT_IOC_SET_BPF, fd);
+	if (err < 0) {
+		printf("ioctl PERF_EVENT_IOC_SET_BPF failed err %s\n",
+		       strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int load_maps(struct bpf_map_data *maps, int nr_maps,
+		     fixup_map_cb fixup_map)
+{
+	int i, numa_node;
+
+	for (i = 0; i < nr_maps; i++) {
+		if (fixup_map) {
+			fixup_map(&maps[i], i);
+			/* Allow userspace to assign map FD prior to creation */
+			if (maps[i].fd != -1) {
+				map_fd[i] = maps[i].fd;
+				continue;
+			}
+		}
+
+		numa_node = maps[i].def.map_flags & BPF_F_NUMA_NODE ?
+			maps[i].def.numa_node : -1;
+
+		if (maps[i].def.type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
+		    maps[i].def.type == BPF_MAP_TYPE_HASH_OF_MAPS) {
+			int inner_map_fd = map_fd[maps[i].def.inner_map_idx];
+
+			map_fd[i] = bpf_create_map_in_map_node(maps[i].def.type,
+							maps[i].name,
+							maps[i].def.key_size,
+							inner_map_fd,
+							maps[i].def.max_entries,
+							maps[i].def.map_flags,
+							numa_node);
+		} else {
+			map_fd[i] = bpf_create_map_node(maps[i].def.type,
+							maps[i].name,
+							maps[i].def.key_size,
+							maps[i].def.value_size,
+							maps[i].def.max_entries,
+							maps[i].def.map_flags,
+							numa_node);
+		}
+		if (map_fd[i] < 0) {
+			printf("failed to create a map: %d %s\n",
+			       errno, strerror(errno));
+			return 1;
+		}
+		maps[i].fd = map_fd[i];
+
+		if (maps[i].def.type == BPF_MAP_TYPE_PROG_ARRAY)
+			prog_array_fd = map_fd[i];
+	}
+	return 0;
+}
+
+static int get_sec(Elf *elf, int i, GElf_Ehdr *ehdr, char **shname,
+		   GElf_Shdr *shdr, Elf_Data **data)
+{
+	Elf_Scn *scn;
+
+	scn = elf_getscn(elf, i);
+	if (!scn)
+		return 1;
+
+	if (gelf_getshdr(scn, shdr) != shdr)
+		return 2;
+
+	*shname = elf_strptr(elf, ehdr->e_shstrndx, shdr->sh_name);
+	if (!*shname || !shdr->sh_size)
+		return 3;
+
+	*data = elf_getdata(scn, 0);
+	if (!*data || elf_getdata(scn, *data) != NULL)
+		return 4;
+
+	return 0;
+}
+
+static int parse_relo_and_apply(Elf_Data *data, Elf_Data *symbols,
+				GElf_Shdr *shdr, struct bpf_insn *insn,
+				struct bpf_map_data *maps, int nr_maps)
+{
+	int i, nrels;
+
+	nrels = shdr->sh_size / shdr->sh_entsize;
+
+	for (i = 0; i < nrels; i++) {
+		GElf_Sym sym;
+		GElf_Rel rel;
+		unsigned int insn_idx;
+		bool match = false;
+		int map_idx;
+
+		gelf_getrel(data, i, &rel);
+
+		insn_idx = rel.r_offset / sizeof(struct bpf_insn);
+
+		gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym);
+
+		if (insn[insn_idx].code != (BPF_LD | BPF_IMM | BPF_DW)) {
+			printf("invalid relo for insn[%d].code 0x%x\n",
+			       insn_idx, insn[insn_idx].code);
+			return 1;
+		}
+		insn[insn_idx].src_reg = BPF_PSEUDO_MAP_FD;
+
+		/* Match FD relocation against recorded map_data[] offset */
+		for (map_idx = 0; map_idx < nr_maps; map_idx++) {
+			if (maps[map_idx].elf_offset == sym.st_value) {
+				match = true;
+				break;
+			}
+		}
+		if (match) {
+			insn[insn_idx].imm = maps[map_idx].fd;
+		} else {
+			printf("invalid relo for insn[%d] no map_data match\n",
+			       insn_idx);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
+static int cmp_symbols(const void *l, const void *r)
+{
+	const GElf_Sym *lsym = (const GElf_Sym *)l;
+	const GElf_Sym *rsym = (const GElf_Sym *)r;
+
+	if (lsym->st_value < rsym->st_value)
+		return -1;
+	else if (lsym->st_value > rsym->st_value)
+		return 1;
+	else
+		return 0;
+}
+
+static int load_elf_maps_section(struct bpf_map_data *maps, int maps_shndx,
+				 Elf *elf, Elf_Data *symbols, int strtabidx)
+{
+	int map_sz_elf, map_sz_copy;
+	bool validate_zero = false;
+	Elf_Data *data_maps;
+	int i, nr_maps;
+	GElf_Sym *sym;
+	Elf_Scn *scn;
+
+	if (maps_shndx < 0)
+		return -EINVAL;
+	if (!symbols)
+		return -EINVAL;
+
+	/* Get data for maps section via elf index */
+	scn = elf_getscn(elf, maps_shndx);
+	if (scn)
+		data_maps = elf_getdata(scn, NULL);
+	if (!scn || !data_maps) {
+		printf("Failed to get Elf_Data from maps section %d\n",
+		       maps_shndx);
+		return -EINVAL;
+	}
+
+	/* For each map get corresponding symbol table entry */
+	sym = calloc(MAX_MAPS+1, sizeof(GElf_Sym));
+	for (i = 0, nr_maps = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) {
+		assert(nr_maps < MAX_MAPS+1);
+		if (!gelf_getsym(symbols, i, &sym[nr_maps]))
+			continue;
+		if (sym[nr_maps].st_shndx != maps_shndx)
+			continue;
+		/* Only increment if the symbol is in the maps section */
+		nr_maps++;
+	}
+
+	/* Align to map_fd[] order, via sort on offset in sym.st_value */
+	qsort(sym, nr_maps, sizeof(GElf_Sym), cmp_symbols);
+
+	/* Keeping compatible with ELF maps section changes
+	 * ------------------------------------------------
+	 * The program size of struct bpf_map_def is known by loader
+	 * code, but struct stored in ELF file can be different.
+	 *
+	 * Unfortunately sym[i].st_size is zero.  To calculate the
+	 * struct size stored in the ELF file, assume all struct have
+	 * the same size, and simply divide with number of map
+	 * symbols.
+	 */
+	map_sz_elf = data_maps->d_size / nr_maps;
+	map_sz_copy = sizeof(struct bpf_map_def);
+	if (map_sz_elf < map_sz_copy) {
+		/*
+		 * Backward compat, loading older ELF file with
+		 * smaller struct, keeping remaining bytes zero.
+		 */
+		map_sz_copy = map_sz_elf;
+	} else if (map_sz_elf > map_sz_copy) {
+		/*
+		 * Forward compat, loading newer ELF file with larger
+		 * struct with unknown features. Assume zero means
+		 * feature not used.  Thus, validate rest of struct
+		 * data is zero.
+		 */
+		validate_zero = true;
+	}
+
+	/* Memcpy relevant part of ELF maps data to loader maps */
+	for (i = 0; i < nr_maps; i++) {
+		unsigned char *addr, *end;
+		struct bpf_map_def *def;
+		const char *map_name;
+		size_t offset;
+
+		map_name = elf_strptr(elf, strtabidx, sym[i].st_name);
+		maps[i].name = strdup(map_name);
+		if (!maps[i].name) {
+			printf("strdup(%s): %s(%d)\n", map_name,
+			       strerror(errno), errno);
+			free(sym);
+			return -errno;
+		}
+
+		/* Symbol value is offset into ELF maps section data area */
+		offset = sym[i].st_value;
+		def = (struct bpf_map_def *)((uint8_t *)data_maps->d_buf + offset);
+		maps[i].elf_offset = offset;
+		memset(&maps[i].def, 0, sizeof(struct bpf_map_def));
+		memcpy(&maps[i].def, def, map_sz_copy);
+
+		/* Verify no newer features were requested */
+		if (validate_zero) {
+			addr = (unsigned char*) def + map_sz_copy;
+			end  = (unsigned char*) def + map_sz_elf;
+			for (; addr < end; addr++) {
+				if (*addr != 0) {
+					free(sym);
+					return -EFBIG;
+				}
+			}
+		}
+	}
+
+	free(sym);
+	return nr_maps;
+}
+
+static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
+{
+	int fd, i, ret, maps_shndx = -1, strtabidx = -1;
+	Elf *elf;
+	GElf_Ehdr ehdr;
+	GElf_Shdr shdr, shdr_prog;
+	Elf_Data *data, *data_prog, *data_maps = NULL, *symbols = NULL;
+	char *shname, *shname_prog;
+	int nr_maps = 0;
+
+	/* reset global variables */
+	kern_version = 0;
+	memset(license, 0, sizeof(license));
+	memset(processed_sec, 0, sizeof(processed_sec));
+
+	if (elf_version(EV_CURRENT) == EV_NONE)
+		return 1;
+
+	fd = open(path, O_RDONLY, 0);
+	if (fd < 0)
+		return 1;
+
+	elf = elf_begin(fd, ELF_C_READ, NULL);
+
+	if (!elf)
+		return 1;
+
+	if (gelf_getehdr(elf, &ehdr) != &ehdr)
+		return 1;
+
+	/* clear all kprobes */
+	i = system("echo \"\" > /sys/kernel/debug/tracing/kprobe_events");
+
+	/* scan over all elf sections to get license and map info */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (0) /* helpful for llvm debugging */
+			printf("section %d:%s data %p size %zd link %d flags %d\n",
+			       i, shname, data->d_buf, data->d_size,
+			       shdr.sh_link, (int) shdr.sh_flags);
+
+		if (strcmp(shname, "license") == 0) {
+			processed_sec[i] = true;
+			memcpy(license, data->d_buf, data->d_size);
+		} else if (strcmp(shname, "version") == 0) {
+			processed_sec[i] = true;
+			if (data->d_size != sizeof(int)) {
+				printf("invalid size of version section %zd\n",
+				       data->d_size);
+				return 1;
+			}
+			memcpy(&kern_version, data->d_buf, sizeof(int));
+		} else if (strcmp(shname, "maps") == 0) {
+			int j;
+
+			maps_shndx = i;
+			data_maps = data;
+			for (j = 0; j < MAX_MAPS; j++)
+				map_data[j].fd = -1;
+		} else if (shdr.sh_type == SHT_SYMTAB) {
+			strtabidx = shdr.sh_link;
+			symbols = data;
+		}
+	}
+
+	ret = 1;
+
+	if (!symbols) {
+		printf("missing SHT_SYMTAB section\n");
+		goto done;
+	}
+
+	if (data_maps) {
+		nr_maps = load_elf_maps_section(map_data, maps_shndx,
+						elf, symbols, strtabidx);
+		if (nr_maps < 0) {
+			printf("Error: Failed loading ELF maps (errno:%d):%s\n",
+			       nr_maps, strerror(-nr_maps));
+			ret = 1;
+			goto done;
+		}
+		if (load_maps(map_data, nr_maps, fixup_map))
+			goto done;
+		map_data_count = nr_maps;
+
+		processed_sec[maps_shndx] = true;
+	}
+
+	/* process all relo sections, and rewrite bpf insns for maps */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+		if (processed_sec[i])
+			continue;
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (shdr.sh_type == SHT_REL) {
+			struct bpf_insn *insns;
+
+			/* locate prog sec that need map fixup (relocations) */
+			if (get_sec(elf, shdr.sh_info, &ehdr, &shname_prog,
+				    &shdr_prog, &data_prog))
+				continue;
+
+			if (shdr_prog.sh_type != SHT_PROGBITS ||
+			    !(shdr_prog.sh_flags & SHF_EXECINSTR))
+				continue;
+
+			insns = (struct bpf_insn *) data_prog->d_buf;
+			processed_sec[i] = true; /* relo section */
+
+			if (parse_relo_and_apply(data, symbols, &shdr, insns,
+						 map_data, nr_maps))
+				continue;
+		}
+	}
+
+	/* load programs */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (processed_sec[i])
+			continue;
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (memcmp(shname, "kprobe/", 7) == 0 ||
+		    memcmp(shname, "kretprobe/", 10) == 0 ||
+		    memcmp(shname, "tracepoint/", 11) == 0 ||
+		    memcmp(shname, "xdp", 3) == 0 ||
+		    memcmp(shname, "perf_event", 10) == 0 ||
+		    memcmp(shname, "socket", 6) == 0 ||
+		    memcmp(shname, "cgroup/", 7) == 0 ||
+		    memcmp(shname, "sockops", 7) == 0 ||
+		    memcmp(shname, "sk_skb", 6) == 0) {
+			ret = load_and_attach(shname, data->d_buf,
+					      data->d_size);
+			if (ret != 0)
+				goto done;
+		}
+	}
+
+	ret = 0;
+done:
+	close(fd);
+	return ret;
+}
+
+int load_bpf_file(const char *path)
+{
+	return do_load_bpf_file(path, NULL);
+}
+
+int load_bpf_file_fixup_map(const char *path, fixup_map_cb fixup_map)
+{
+	return do_load_bpf_file(path, fixup_map);
+}
+
+void read_trace_pipe(void)
+{
+	int trace_fd;
+
+	trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
+	if (trace_fd < 0)
+		return;
+
+	while (1) {
+		static char buf[4096];
+		ssize_t sz;
+
+		sz = read(trace_fd, buf, sizeof(buf));
+		if (sz > 0) {
+			buf[sz] = 0;
+			puts(buf);
+		}
+	}
+}
+
+#define MAX_SYMS 300000
+static struct ksym syms[MAX_SYMS];
+static int sym_cnt;
+
+static int ksym_cmp(const void *p1, const void *p2)
+{
+	return ((struct ksym *)p1)->addr - ((struct ksym *)p2)->addr;
+}
+
+int load_kallsyms(void)
+{
+	FILE *f = fopen("/proc/kallsyms", "r");
+	char func[256], buf[256];
+	char symbol;
+	void *addr;
+	int i = 0;
+
+	if (!f)
+		return -ENOENT;
+
+	while (!feof(f)) {
+		if (!fgets(buf, sizeof(buf), f))
+			break;
+		if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
+			break;
+		if (!addr)
+			continue;
+		syms[i].addr = (long) addr;
+		syms[i].name = strdup(func);
+		i++;
+	}
+	sym_cnt = i;
+	qsort(syms, sym_cnt, sizeof(struct ksym), ksym_cmp);
+	return 0;
+}
+
+struct ksym *ksym_search(long key)
+{
+	int start = 0, end = sym_cnt;
+	long result;
+
+	while (start < end) {
+		size_t mid = start + (end - start) / 2;
+
+		result = key - syms[mid].addr;
+		if (result < 0)
+			end = mid;
+		else if (result > 0)
+			start = mid + 1;
+		else
+			return &syms[mid];
+	}
+
+	if (start >= 1 && syms[start - 1].addr < key &&
+	    key < syms[start].addr)
+		/* valid ksym */
+		return &syms[start - 1];
+
+	/* out of range. return _stext */
+	return &syms[0];
+}
+
+int set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct nlattr *nla, *nla_xdp;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+
+	/* start nested attribute for XDP */
+	nla = (struct nlattr *)(((char *)&req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+	nla->nla_len = NLA_HDRLEN;
+
+	/* add XDP fd */
+	nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+	nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
+	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+	nla->nla_len += nla_xdp->nla_len;
+
+	/* if user passed in any flags, add those too */
+	if (flags) {
+		nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+		nla_xdp->nla_type = 3/*IFLA_XDP_FLAGS*/;
+		nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
+		memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
+		nla->nla_len += nla_xdp->nla_len;
+	}
+
+	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != getpid()) {
+			printf("Wrong pid %d, expected %d\n",
+			       nh->nlmsg_pid, getpid());
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			printf("Wrong seq %d, expected %d\n",
+			       nh->nlmsg_seq, seq);
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			printf("nlmsg error %s\n", strerror(-err->error));
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
diff --git a/drivers/net/af_xdp/bpf_load.h b/drivers/net/af_xdp/bpf_load.h
new file mode 100644
index 000000000..5450e8b19
--- /dev/null
+++ b/drivers/net/af_xdp/bpf_load.h
@@ -0,0 +1,65 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __BPF_LOAD_H
+#define __BPF_LOAD_H
+
+#include "libbpf.h"
+
+#define MAX_MAPS 32
+#define MAX_PROGS 32
+
+struct bpf_map_def {
+	unsigned int type;
+	unsigned int key_size;
+	unsigned int value_size;
+	unsigned int max_entries;
+	unsigned int map_flags;
+	unsigned int inner_map_idx;
+	unsigned int numa_node;
+};
+
+struct bpf_map_data {
+	int fd;
+	char *name;
+	size_t elf_offset;
+	struct bpf_map_def def;
+};
+
+typedef void (*fixup_map_cb)(struct bpf_map_data *map, int idx);
+
+extern int prog_fd[MAX_PROGS];
+extern int event_fd[MAX_PROGS];
+extern char bpf_log_buf[BPF_LOG_BUF_SIZE];
+extern int prog_cnt;
+
+/* There is a one-to-one mapping between map_fd[] and map_data[].
+ * The map_data[] just contains richer info on the given map.
+ */
+extern int map_fd[MAX_MAPS];
+extern struct bpf_map_data map_data[MAX_MAPS];
+extern int map_data_count;
+
+/* parses elf file compiled by llvm .c->.o
+ * . parses 'maps' section and creates maps via BPF syscall
+ * . parses 'license' section and passes it to syscall
+ * . parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns by
+ *   storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
+ * . loads eBPF programs via BPF syscall
+ *
+ * One ELF file can contain multiple BPF programs which will be loaded
+ * and their FDs stored in prog_fd array
+ *
+ * returns zero on success
+ */
+int load_bpf_file(const char *path);
+int load_bpf_file_fixup_map(const char *path, fixup_map_cb fixup_map);
+
+void read_trace_pipe(void);
+struct ksym {
+	long addr;
+	char *name;
+};
+
+int load_kallsyms(void);
+struct ksym *ksym_search(long key);
+int set_link_xdp_fd(int ifindex, int fd, __u32 flags);
+#endif
diff --git a/drivers/net/af_xdp/libbpf.h b/drivers/net/af_xdp/libbpf.h
new file mode 100644
index 000000000..18bfee5aa
--- /dev/null
+++ b/drivers/net/af_xdp/libbpf.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* eBPF mini library */
+#ifndef __LIBBPF_H
+#define __LIBBPF_H
+
+#include <bpf/bpf.h>
+
+struct bpf_insn;
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_MOV32_REG(DST, SRC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* Short form of mov, dst_reg = imm32 */
+
+#define BPF_MOV64_IMM(DST, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_MOV32_IMM(DST, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
+#define BPF_LD_IMM64(DST, IMM)					\
+	BPF_LD_IMM64_RAW(DST, 0, IMM)
+
+#define BPF_LD_IMM64_RAW(DST, SRC, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_DW | BPF_IMM,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = (__u32) (IMM) }),			\
+	((struct bpf_insn) {					\
+		.code  = 0, /* zero is reserved opcode */	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = ((__u64) (IMM)) >> 32 })
+
+#ifndef BPF_PSEUDO_MAP_FD
+# define BPF_PSEUDO_MAP_FD	1
+#endif
+
+/* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */
+#define BPF_LD_MAP_FD(DST, MAP_FD)				\
+	BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
+
+
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
+
+#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
+
+#define BPF_STX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Atomic memory add, *(uint *)(dst_reg + off16) += src_reg */
+
+#define BPF_STX_XADD(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_XADD,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
+
+#define BPF_ST_MEM(SIZE, DST, OFF, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
+
+#define BPF_JMP_REG(OP, DST, SRC, OFF)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
+
+#define BPF_JMP_IMM(OP, DST, IMM, OFF)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Raw code statement block */
+
+#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
+	((struct bpf_insn) {					\
+		.code  = CODE,					\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Program exit */
+
+#define BPF_EXIT_INSN()						\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_EXIT,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#endif
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index d0939022b..903ca0d01 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -15,6 +15,7 @@
 
 #include <linux/if_ether.h>
 #include <linux/if_xdp.h>
+#include <linux/if_link.h>
 #include <arpa/inet.h>
 #include <net/if.h>
 #include <sys/types.h>
@@ -24,6 +25,7 @@
 #include <unistd.h>
 #include <poll.h>
 #include "xdpsock_queue.h"
+#include "bpf_load.h"
 
 #ifndef SOL_XDP
 #define SOL_XDP 283
@@ -85,6 +87,8 @@ struct pmd_internals {
 	uint16_t port_id;
 	uint16_t queue_idx;
 	int ring_size;
+
+	uint32_t xdp_flags;
 };
 
 static const char * const valid_arguments[] = {
@@ -382,8 +386,12 @@ eth_stats_reset(struct rte_eth_dev *dev)
 }
 
 static void
-eth_dev_close(struct rte_eth_dev *dev __rte_unused)
+eth_dev_close(struct rte_eth_dev *dev)
 {
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	if (internals->xdp_flags)
+		set_link_xdp_fd(internals->if_index, -1, internals->xdp_flags);
 }
 
 static void
@@ -745,9 +753,25 @@ init_internals(struct rte_vdev_device *dev,
 	if (ret)
 		goto error_3;
 
+	/* FIXME: hard-coded BPF file name */
+	if (load_bpf_file("xdpsock_kern.o")) {
+		printf("load bpf file failed\n");
+		goto error_3;
+	}
+	RTE_ASSERT(prog_fd[0]);
+
+	if (!set_link_xdp_fd(internals->if_index, prog_fd[0],
+			     XDP_FLAGS_DRV_MODE))
+		internals->xdp_flags = XDP_FLAGS_DRV_MODE;
+	else if (!set_link_xdp_fd(internals->if_index, prog_fd[0],
+				  XDP_FLAGS_SKB_MODE))
+		internals->xdp_flags = XDP_FLAGS_SKB_MODE;
+	else
+		goto error_3;
+
 	eth_dev = rte_eth_vdev_allocate(dev, 0);
 	if (!eth_dev)
-		goto error_3;
+		goto error_4;
 
 	rte_memcpy(data, eth_dev->data, sizeof(*data));
 	internals->port_id = eth_dev->data->port_id;
@@ -765,6 +789,9 @@ init_internals(struct rte_vdev_device *dev,
 
 	return 0;
 
+error_4:
+	set_link_xdp_fd(internals->if_index, -1, internals->xdp_flags);
+
 error_3:
 	close(internals->sfd);
 
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index bc26e1457..d05e6c0e4 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -120,7 +120,7 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
 _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -lrte_mempool_stack
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET)  += -lrte_pmd_af_packet
-_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP)     += -lrte_pmd_af_xdp
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP)     += -lrte_pmd_af_xdp -lelf -lbpf
 _LDLIBS-$(CONFIG_RTE_LIBRTE_ARK_PMD)        += -lrte_pmd_ark
 _LDLIBS-$(CONFIG_RTE_LIBRTE_AVF_PMD)        += -lrte_pmd_avf
 _LDLIBS-$(CONFIG_RTE_LIBRTE_AVP_PMD)        += -lrte_pmd_avp
-- 
2.13.6
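
Note on load_elf_maps_section(): the size-compat handling for struct
bpf_map_def (copy the smaller of the two sizes, then require any
unknown tail to be zero) can be sketched in isolation as below. This
is a minimal sketch, not the loader's code; struct map_def_v1 and
copy_map_def() are hypothetical names made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the loader's view of struct bpf_map_def. */
struct map_def_v1 {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
};

/*
 * Copy one map definition of elf_sz bytes (as stored in the ELF file)
 * into *dst. Returns 0 on success, -1 if the ELF struct is larger than
 * ours and any of the extra bytes is non-zero (unknown feature in use).
 */
static int copy_map_def(struct map_def_v1 *dst, const void *src, size_t elf_sz)
{
	size_t copy_sz = sizeof(*dst);
	const unsigned char *p = src;
	size_t i;

	if (elf_sz < copy_sz)
		copy_sz = elf_sz;	/* older, smaller struct: rest stays zero */

	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, copy_sz);

	/* newer, larger struct: the tail we do not understand must be zero */
	for (i = copy_sz; i < elf_sz; i++)
		if (p[i] != 0)
			return -1;

	return 0;
}
```

An object file built against a shorter struct loads with the missing
fields zeroed, while one built against a longer struct loads only if
it leaves the extra fields unused.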

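Note on ksym_cmp()/ksym_search(): subtracting two long addresses and
returning the result as int can lose the sign when the difference does
not fit in an int. One possible alternative, assuming only that the
table is sorted by address, is sketched below. struct sym, sym_cmp()
and sym_lookup() are hypothetical names, and unlike ksym_search() this
sketch returns NULL (not the first entry) for keys below the table and
the last entry for keys above it.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct ksym from bpf_load.h. */
struct sym {
	long addr;
	const char *name;
};

/*
 * qsort() comparator that compares instead of subtracting, so a large
 * address difference cannot be truncated when narrowed to int.
 */
static int sym_cmp(const void *p1, const void *p2)
{
	long a = ((const struct sym *)p1)->addr;
	long b = ((const struct sym *)p2)->addr;

	return (a > b) - (a < b);
}

/*
 * Return the last symbol whose address is <= key, or NULL if key lies
 * below the first symbol. tab[] must be sorted by addr.
 */
static const struct sym *sym_lookup(const struct sym *tab, size_t n, long key)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tab[mid].addr <= key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo ? &tab[lo - 1] : NULL;
}
```
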

* [dpdk-dev] [RFC 7/7] app/testpmd: enable parameter for mempool flags
  2018-02-27  9:32 [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Qi Zhang
                   ` (5 preceding siblings ...)
  2018-02-27  9:33 ` [dpdk-dev] [RFC 6/7] net/af_xdp: load BPF file Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-03-01  2:52 ` [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Jason Wang
  7 siblings, 0 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

Add an --mp-flags parameter so that testpmd can create an af_xdp-friendly
mempool.

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 app/test-pmd/parameters.c | 12 ++++++++++++
 app/test-pmd/testpmd.c    | 15 +++++++++------
 app/test-pmd/testpmd.h    |  1 +
 3 files changed, 22 insertions(+), 6 deletions(-)
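
Note: the patch below parses --mp-flags with atoi(), which accepts
decimal only and cannot tell "0" from a malformed value. Since the
option is a bit mask, a stricter parser that also takes hex input
might look like the following. This is a minimal sketch under that
assumption; parse_mp_flags() is a hypothetical helper and not part of
testpmd. It keeps the patch's accepted range of 1..0xFFFF.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a --mp-flags= argument into a 16-bit mask.
 * Base 0 lets strtoul() accept decimal, octal ("010") and hex ("0x10").
 * Returns 0 on success, -1 on a malformed or out-of-range value.
 */
static int parse_mp_flags(const char *arg, uint16_t *flags)
{
	char *end;
	unsigned long v;

	if (arg == NULL || *arg == '\0' || *arg == '-')
		return -1;

	errno = 0;
	v = strtoul(arg, &end, 0);
	if (errno != 0 || *end != '\0' || v == 0 || v > 0xFFFF)
		return -1;

	*flags = (uint16_t)v;
	return 0;
}
```

With such a helper, launch_args_parse() could call rte_exit() whenever
the return value is non-zero instead of range-checking the atoi()
result.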

diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 97d22b860..19675671e 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -61,6 +61,7 @@ usage(char* progname)
 	       "--tx-first | --stats-period=PERIOD | "
 	       "--coremask=COREMASK --portmask=PORTMASK --numa "
 	       "--mbuf-size= | --total-num-mbufs= | "
+	       "--mp-flags= | "
 	       "--nb-cores= | --nb-ports= | "
 #ifdef RTE_LIBRTE_CMDLINE
 	       "--eth-peers-configfile= | "
@@ -105,6 +106,7 @@ usage(char* progname)
 	printf("  --socket-num=N: set socket from which all memory is allocated "
 	       "in NUMA mode.\n");
 	printf("  --mbuf-size=N: set the data size of mbuf to N bytes.\n");
+	printf("  --mp-flags=N: set the flags used when creating the mbuf memory pool.\n");
 	printf("  --total-num-mbufs=N: set the number of mbufs to be allocated "
 	       "in mbuf pools.\n");
 	printf("  --max-pkt-len=N: set the maximum size of packet to N bytes.\n");
@@ -568,6 +570,7 @@ launch_args_parse(int argc, char** argv)
 		{ "ring-numa-config",           1, 0, 0 },
 		{ "socket-num",			1, 0, 0 },
 		{ "mbuf-size",			1, 0, 0 },
+		{ "mp-flags",			1, 0, 0 },
 		{ "total-num-mbufs",		1, 0, 0 },
 		{ "max-pkt-len",		1, 0, 0 },
 		{ "pkt-filter-mode",            1, 0, 0 },
@@ -769,6 +772,15 @@ launch_args_parse(int argc, char** argv)
 					rte_exit(EXIT_FAILURE,
 						 "mbuf-size should be > 0 and < 65536\n");
 			}
+			if (!strcmp(lgopts[opt_idx].name, "mp-flags")) {
+				n = atoi(optarg);
+				if (n > 0 && n <= 0xFFFF)
+					mp_flags = (uint16_t)n;
+				else
+					rte_exit(EXIT_FAILURE,
+						 "mp-flags should be > 0 and < 65536\n");
+			}
+
 			if (!strcmp(lgopts[opt_idx].name, "total-num-mbufs")) {
 				n = atoi(optarg);
 				if (n > 1024)
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 4c0e2586c..887899919 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -171,6 +171,7 @@ uint32_t burst_tx_delay_time = BURST_TX_WAIT_US;
 uint32_t burst_tx_retry_num = BURST_TX_RETRIES;
 
 uint16_t mbuf_data_size = DEFAULT_MBUF_DATA_SIZE; /**< Mbuf data space size. */
+uint16_t mp_flags = 0; /**< flags passed when creating the mempool */
 uint32_t param_total_num_mbufs = 0;  /**< number of mbufs in all pools - if
                                       * specified on command-line. */
 uint16_t stats_period; /**< Period to show statistics (disabled by default) */
@@ -486,6 +487,7 @@ set_def_fwd_config(void)
  */
 static void
 mbuf_pool_create(uint16_t mbuf_seg_size, unsigned nb_mbuf,
+		 unsigned int flags,
 		 unsigned int socket_id)
 {
 	char pool_name[RTE_MEMPOOL_NAMESIZE];
@@ -503,7 +505,7 @@ mbuf_pool_create(uint16_t mbuf_seg_size, unsigned nb_mbuf,
 		rte_mp = rte_mempool_create_empty(pool_name, nb_mbuf,
 			mb_size, (unsigned) mb_mempool_cache,
 			sizeof(struct rte_pktmbuf_pool_private),
-			socket_id, 0);
+			socket_id, flags);
 		if (rte_mp == NULL)
 			goto err;
 
@@ -518,8 +520,8 @@ mbuf_pool_create(uint16_t mbuf_seg_size, unsigned nb_mbuf,
 		/* wrapper to rte_mempool_create() */
 		TESTPMD_LOG(INFO, "preferred mempool ops selected: %s\n",
 				rte_mbuf_best_mempool_ops());
-		rte_mp = rte_pktmbuf_pool_create(pool_name, nb_mbuf,
-			mb_mempool_cache, 0, mbuf_seg_size, socket_id);
+		rte_mp = rte_pktmbuf_pool_create_with_flags(pool_name, nb_mbuf,
+			mb_mempool_cache, 0, mbuf_seg_size, flags, socket_id);
 	}
 
 err:
@@ -735,13 +737,14 @@ init_config(void)
 
 		for (i = 0; i < num_sockets; i++)
 			mbuf_pool_create(mbuf_data_size, nb_mbuf_per_pool,
-					 socket_ids[i]);
+					 mp_flags, socket_ids[i]);
 	} else {
 		if (socket_num == UMA_NO_CONFIG)
-			mbuf_pool_create(mbuf_data_size, nb_mbuf_per_pool, 0);
+			mbuf_pool_create(mbuf_data_size, nb_mbuf_per_pool,
+					 mp_flags, 0);
 		else
 			mbuf_pool_create(mbuf_data_size, nb_mbuf_per_pool,
-						 socket_num);
+					 mp_flags, socket_num);
 	}
 
 	init_port_config();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 153abea05..11c2ea681 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -386,6 +386,7 @@ extern uint8_t dcb_config;
 extern uint8_t dcb_test;
 
 extern uint16_t mbuf_data_size; /**< Mbuf data space size. */
+extern uint16_t mp_flags;  /**< flags for mempool creation. */
 extern uint32_t param_total_num_mbufs;
 
 extern uint16_t stats_period;
-- 
2.13.6


* Re: [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-27  9:33 ` [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
@ 2018-02-28 23:40   ` Stephen Hemminger
  2018-02-28 23:42   ` Stephen Hemminger
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Stephen Hemminger @ 2018-02-28 23:40 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:00 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

> diff --git a/drivers/net/af_xdp/Makefile b/drivers/net/af_xdp/Makefile
> new file mode 100644
> index 000000000..ac38e20bf
> --- /dev/null
> +++ b/drivers/net/af_xdp/Makefile
> @@ -0,0 +1,56 @@
> +#   BSD LICENSE
> +#
> +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> +#   Copyright(c) 2014 6WIND S.A.
> +#   All rights reserved.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Intel Corporation nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVI

Please use SPDX on new files.


* Re: [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-27  9:33 ` [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
  2018-02-28 23:40   ` Stephen Hemminger
@ 2018-02-28 23:42   ` Stephen Hemminger
  2018-03-01  1:51     ` Zhang, Qi Z
  2018-02-28 23:42   ` Stephen Hemminger
  2018-02-28 23:45   ` Stephen Hemminger
  3 siblings, 1 reply; 24+ messages in thread
From: Stephen Hemminger @ 2018-02-28 23:42 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:00 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

> struct pmd_internals {
> +	int sfd;
> +	int if_index;
> +	char if_name[0x100];

why not IFNAMSIZ?

> +	struct ether_addr eth_addr;
> +	struct xdp_queue rx;
> +	struct xdp_queue tx;
> +	struct xdp_umem *umem;
> +	struct rte_mempool *mb_pool;
> +
> +	unsigned long rx_pkts;
> +	unsigned long rx_bytes;
> +	unsigned long rx_dropped;
> +
> +	unsigned long tx_pkts;
> +	unsigned long err_pkts;
> +	unsigned long tx_bytes;

why not per-queue stats? per-port stats are expensive

> +	uint16_t port_id;
> +	uint16_t queue_idx;
> +	int ring_size;
> +	struct rte_ring *buf_ring;
> +};


* Re: [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-27  9:33 ` [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
  2018-02-28 23:40   ` Stephen Hemminger
  2018-02-28 23:42   ` Stephen Hemminger
@ 2018-02-28 23:42   ` Stephen Hemminger
  2018-02-28 23:45   ` Stephen Hemminger
  3 siblings, 0 replies; 24+ messages in thread
From: Stephen Hemminger @ 2018-02-28 23:42 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:00 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

> +
> +static void *get_pkt_data(struct pmd_internals *internals,
> +			  uint32_t index,
> +			  uint32_t offset)
> +{
> +	return (uint8_t *)(internals->umem->buffer +
> +			   (index << internals->umem->frame_size_log2) +
> +			   offset);

You are returning void *, cast here is unnecessary
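
A minimal sketch of the same helper without the cast, assuming the umem
buffer is a char pointer (umem_sketch is a hypothetical, simplified
stand-in for the patch's xdp_umem). In C a char * converts to void *
implicitly, so the return statement needs no cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified umem descriptor; the real one has more fields. */
struct umem_sketch {
	char *buffer;			/* start of the frame area */
	uint32_t frame_size_log2;	/* log2 of the per-frame size */
};

/* Return the address of byte 'offset' inside frame 'index'. */
static void *
get_pkt_data(const struct umem_sketch *umem, uint32_t index, uint32_t offset)
{
	assert(umem != NULL);
	/* char * to void * is an implicit conversion: no cast required. */
	return umem->buffer + ((size_t)index << umem->frame_size_log2) + offset;
}
```

The only cast left widens the index to size_t before shifting, which avoids
overflow on large areas; it is unrelated to the return type.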


* Re: [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-27  9:33 ` [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
                     ` (2 preceding siblings ...)
  2018-02-28 23:42   ` Stephen Hemminger
@ 2018-02-28 23:45   ` Stephen Hemminger
  2018-03-01  1:59     ` Zhang, Qi Z
  3 siblings, 1 reply; 24+ messages in thread
From: Stephen Hemminger @ 2018-02-28 23:45 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:00 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

> +
> +static uint16_t
> +eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	struct pmd_internals *internals = queue;
> +	struct xdp_queue *rxq = &internals->rx;
> +	struct rte_mbuf *mbuf;
> +	unsigned long dropped = 0;
> +	unsigned long rx_bytes = 0;
> +	uint16_t count = 0;
> +
> +	nb_pkts = nb_pkts < ETH_AF_XDP_RX_BATCH_SIZE ?
> +		  nb_pkts : ETH_AF_XDP_RX_BATCH_SIZE;
> +

Put declarations first.
Why not iterate if nb_pkts is huge?

> +	struct xdp_desc descs[ETH_AF_XDP_RX_BATCH_SIZE];
> +	void *indexes[ETH_AF_XDP_RX_BATCH_SIZE];
> +	int rcvd, i;
> +	/* fill rx ring */
> +	if (rxq->num_free >= ETH_AF_XDP_RX_BATCH_SIZE) {

Blank line after declarations before code please.
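
The iteration suggested above could look roughly like the sketch below. It
is not the driver's code: fake_ring, rx_one_batch and rx_burst are
hypothetical names, and RX_BATCH_SIZE stands in for
ETH_AF_XDP_RX_BATCH_SIZE. Declarations come first, followed by a blank
line, and the loop keeps reading batches until the request is satisfied or
the ring is empty.

```c
#include <assert.h>
#include <stdint.h>

#define RX_BATCH_SIZE 32	/* stand-in for ETH_AF_XDP_RX_BATCH_SIZE */

/* Hypothetical packet source used only to exercise the loop. */
struct fake_ring {
	uint16_t avail;		/* packets still waiting in the ring */
};

/* Deliver at most n packets (n <= RX_BATCH_SIZE) and return the count. */
static uint16_t
rx_one_batch(struct fake_ring *ring, void **bufs, uint16_t n)
{
	uint16_t got, i;

	assert(n <= RX_BATCH_SIZE);
	got = n < ring->avail ? n : ring->avail;
	for (i = 0; i < got; i++)
		bufs[i] = ring;	/* placeholder payload pointer */
	ring->avail -= got;
	return got;
}

/* Keep pulling batches until nb_pkts is satisfied or the ring runs dry. */
static uint16_t
rx_burst(struct fake_ring *ring, void **bufs, uint16_t nb_pkts)
{
	uint16_t total = 0;
	uint16_t want, got;

	while (total < nb_pkts) {
		want = nb_pkts - total;
		if (want > RX_BATCH_SIZE)
			want = RX_BATCH_SIZE;

		got = rx_one_batch(ring, bufs + total, want);
		total += got;
		if (got < want)
			break;	/* short batch: nothing more to read */
	}
	return total;
}
```

The per-batch scratch arrays can stay at a fixed size this way, while the
caller is no longer capped at one batch per call.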


* Re: [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-28 23:42   ` Stephen Hemminger
@ 2018-03-01  1:51     ` Zhang, Qi Z
  0 siblings, 0 replies; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-01  1:51 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Karlsson, Magnus, Topel, Bjorn



> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Thursday, March 1, 2018 7:42 AM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>
> Cc: dev@dpdk.org; magnus.karlsson@intei.com; Topel, Bjorn
> <bjorn.topel@intel.com>
> Subject: Re: [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver
> 
> On Tue, 27 Feb 2018 17:33:00 +0800
> Qi Zhang <qi.z.zhang@intel.com> wrote:
> 
> > struct pmd_internals {
> > +	int sfd;
> > +	int if_index;
> > +	char if_name[0x100];
> 
> why not IFNAMSIZ?
> 
> > +	struct ether_addr eth_addr;
> > +	struct xdp_queue rx;
> > +	struct xdp_queue tx;
> > +	struct xdp_umem *umem;
> > +	struct rte_mempool *mb_pool;
> > +
> > +	unsigned long rx_pkts;
> > +	unsigned long rx_bytes;
> > +	unsigned long rx_dropped;
> > +
> > +	unsigned long tx_pkts;
> > +	unsigned long err_pkts;
> > +	unsigned long tx_bytes;
> 
> why not per-queue stats? per-port stats are expensive

Multi-queue is not supported in this implementation, but it will be considered.

Regards
Qi
> 
> > +	uint16_t port_id;
> > +	uint16_t queue_idx;
> > +	int ring_size;
> > +	struct rte_ring *buf_ring;
> > +};


* Re: [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-28 23:45   ` Stephen Hemminger
@ 2018-03-01  1:59     ` Zhang, Qi Z
  0 siblings, 0 replies; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-01  1:59 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, magnus.karlsson, Topel, Bjorn



> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Thursday, March 1, 2018 7:45 AM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>
> Cc: dev@dpdk.org; magnus.karlsson@intei.com; Topel, Bjorn
> <bjorn.topel@intel.com>
> Subject: Re: [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver
> 
> On Tue, 27 Feb 2018 17:33:00 +0800
> Qi Zhang <qi.z.zhang@intel.com> wrote:
> 
> > +
> > +static uint16_t
> > +eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > +{
> > +	struct pmd_internals *internals = queue;
> > +	struct xdp_queue *rxq = &internals->rx;
> > +	struct rte_mbuf *mbuf;
> > +	unsigned long dropped = 0;
> > +	unsigned long rx_bytes = 0;
> > +	uint16_t count = 0;
> > +
> > +	nb_pkts = nb_pkts < ETH_AF_XDP_RX_BATCH_SIZE ?
> > +		  nb_pkts : ETH_AF_XDP_RX_BATCH_SIZE;
> > +
> 
> Put declarations first.
> Why not iterate if nb_pkts is huge?
Yes, there is no need to read only one batch; that was just to keep the implementation simple. Will be fixed.
> 
> > +	struct xdp_desc descs[ETH_AF_XDP_RX_BATCH_SIZE];
> > +	void *indexes[ETH_AF_XDP_RX_BATCH_SIZE];
> > +	int rcvd, i;
> > +	/* fill rx ring */
> > +	if (rxq->num_free >= ETH_AF_XDP_RX_BATCH_SIZE) {
> 
> Blank line after declarations before code please.


* Re: [dpdk-dev] [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management
  2018-02-27  9:33 ` [dpdk-dev] [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management Qi Zhang
@ 2018-03-01  2:08   ` Stephen Hemminger
  0 siblings, 0 replies; 24+ messages in thread
From: Stephen Hemminger @ 2018-03-01  2:08 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:03 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

> +static uint32_t
> +mbuf_to_idx(struct pmd_internals *internals, struct rte_mbuf *mbuf)
> +{
> +	return (uint32_t)(((uint64_t)mbuf->buf_addr -
> +			   (uint64_t)internals->umem->buffer) >>
> +			  internals->umem->frame_size_log2);
> +}
> +
> +static struct rte_mbuf *
> +idx_to_mbuf(struct pmd_internals *internals, uint32_t idx)
> +{
> +	return (struct rte_mbuf *)(void *)(internals->umem->buffer + (idx
> +			<< internals->umem->frame_size_log2) + 0x40);
> +}

More unnecessary casts here.
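
One possible cast-free shape for these two helpers is sketched below, under
the assumption that the umem buffer is a char pointer. The names
umem_sketch, addr_to_idx, idx_to_frame and FRAME_HDR_OFFSET are
hypothetical; only the 0x40 value is taken from the quoted patch. Pointer
subtraction replaces the uint64_t round trip, and a void * return lets the
caller assign to any object pointer type without a cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_HDR_OFFSET 0x40	/* the quoted patch hard-codes 0x40 */

/* Hypothetical, simplified umem descriptor. */
struct umem_sketch {
	char *buffer;			/* start of the frame area */
	uint32_t frame_size_log2;	/* log2 of the per-frame size */
};

/* Map an address inside the frame area back to its frame index. */
static uint32_t
addr_to_idx(const struct umem_sketch *umem, const void *addr)
{
	const char *p = addr;	/* void * converts implicitly in C */

	assert(p >= umem->buffer);
	return (p - umem->buffer) >> umem->frame_size_log2;
}

/* Return the header location inside frame 'idx' as a generic pointer. */
static void *
idx_to_frame(const struct umem_sketch *umem, uint32_t idx)
{
	return umem->buffer + ((size_t)idx << umem->frame_size_log2) +
		FRAME_HDR_OFFSET;
}
```

A caller can then write "struct rte_mbuf *m = idx_to_frame(umem, idx);"
with no cast at the call site either.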


* Re: [dpdk-dev] [RFC 6/7] net/af_xdp: load BPF file
  2018-02-27  9:33 ` [dpdk-dev] [RFC 6/7] net/af_xdp: load BPF file Qi Zhang
@ 2018-03-01  2:10   ` Stephen Hemminger
  0 siblings, 0 replies; 24+ messages in thread
From: Stephen Hemminger @ 2018-03-01  2:10 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:05 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

>  include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/drivers/net/af_xdp/bpf_load.c b/drivers/net/af_xdp/bpf_load.c
> new file mode 100644
> index 000000000..aa632207f
> --- /dev/null
> +++ b/drivers/net/af_xdp/bpf_load.c
> @@ -0,0 +1,798 @@
> +// SPDX-License-Identifier: GPL-2.0

Sorry, all DPDK drivers must be BSD licensed. You can't use GPL-2.0 code.
Either get the code dual licensed or write a new loader.


* Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
  2018-02-27  9:32 [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Qi Zhang
                   ` (6 preceding siblings ...)
  2018-02-27  9:33 ` [dpdk-dev] [RFC 7/7] app/testpmd: enable parameter for mempool flags Qi Zhang
@ 2018-03-01  2:52 ` Jason Wang
  2018-03-01  4:18   ` Zhang, Qi Z
  7 siblings, 1 reply; 24+ messages in thread
From: Jason Wang @ 2018-03-01  2:52 UTC (permalink / raw)
  To: Qi Zhang, dev; +Cc: magnus.karlsson, bjorn.topel



On 2018-02-27 17:32, Qi Zhang wrote:
> The RFC patches add a new PMD driver for AF_XDP which is a proposed
> faster version of AF_PACKET interface in Linux, see below link for
> detail AF_XDP introduction:
> https://fosdem.org/2018/schedule/event/af_xdp/
> https://lwn.net/Articles/745934/
>
> This patchset is based on v18.02.
> It also requires a Linux kernel that has the AF_XDP RFC patches below
> applied.
> https://patchwork.ozlabs.org/patch/867961/
> https://patchwork.ozlabs.org/patch/867960/
> https://patchwork.ozlabs.org/patch/867938/
> https://patchwork.ozlabs.org/patch/867939/
> https://patchwork.ozlabs.org/patch/867940/
> https://patchwork.ozlabs.org/patch/867941/
> https://patchwork.ozlabs.org/patch/867942/
> https://patchwork.ozlabs.org/patch/867943/
> https://patchwork.ozlabs.org/patch/867944/
> https://patchwork.ozlabs.org/patch/867945/
> https://patchwork.ozlabs.org/patch/867946/
> https://patchwork.ozlabs.org/patch/867947/
> https://patchwork.ozlabs.org/patch/867948/
> https://patchwork.ozlabs.org/patch/867949/
> https://patchwork.ozlabs.org/patch/867950/
> https://patchwork.ozlabs.org/patch/867951/
> https://patchwork.ozlabs.org/patch/867952/
> https://patchwork.ozlabs.org/patch/867953/
> https://patchwork.ozlabs.org/patch/867954/
> https://patchwork.ozlabs.org/patch/867955/
> https://patchwork.ozlabs.org/patch/867956/
> https://patchwork.ozlabs.org/patch/867957/
> https://patchwork.ozlabs.org/patch/867958/
> https://patchwork.ozlabs.org/patch/867959/
>
> There is no clean upstream target yet since the kernel patches are still
> in the RFC stage. The purpose of this patchset is to let anyone who wants
> to evaluate af_xdp with a DPDK application do so, and to get feedback for
> further improvement.
>
> To try the new PMD:
> 1. Compile and install the kernel with the above patches applied.
> 2. Configure $LINUX_HEADER_DIR (dir of "make headers_install")
>     and $TOOLS_DIR (dir at <kernel_src>/tools) in drivers/net/af_xdp/Makefile
>     before compiling DPDK.
> 3. Make sure libelf and libbpf are installed.
>
> BTW, performance tests show our PMD can reach 94%~98% of the original benchmark
> when shared memory is enabled.

Hi:

Looks like zerocopy is not used in this series. Any plan to support 
that? If not, what's the advantage compared to vhost-net + tap + 
XDP_REDIRECT?

Have you measured l2fwd performance in this case? I believe the number 
you refer here is rxdrop (XDP_DRV) which is 11.6Mpps.

Thanks


* Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
  2018-03-01  2:52 ` [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Jason Wang
@ 2018-03-01  4:18   ` Zhang, Qi Z
  2018-03-01  4:20     ` Zhang, Qi Z
  0 siblings, 1 reply; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-01  4:18 UTC (permalink / raw)
  To: Jason Wang, dev; +Cc: magnus.karlsson, Topel, Bjorn



> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Thursday, March 1, 2018 10:52 AM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>; dev@dpdk.org
> Cc: magnus.karlsson@intei.com; Topel, Bjorn <bjorn.topel@intel.com>
> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> 
> Hi:
> 
> Looks like zero copy is not used in this series. Any plan to support that? 

Zero copy is enabled in patch 5: if a mempool passes check_mempool, it is registered with the af_xdp socket,
so there is no memcpy between mbuf and af_xdp.

> If not, what's the advantage compared to vhost-net + tap + XDP_REDIRECT?
> 
> Have you measured l2fwd performance in this case? I believe the number
> you refer here is rxdrop (XDP_DRV) which is 11.6Mpps.

Actually, we measured performance for rxonly / txonly / l2fwd on i40e with XDP_SKB and XDP_DRV_ZC.

Regards
Qi

> 
> Thanks



* Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
  2018-03-01  4:18   ` Zhang, Qi Z
@ 2018-03-01  4:20     ` Zhang, Qi Z
  2018-03-01  7:46       ` Jason Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-01  4:20 UTC (permalink / raw)
  To: Zhang, Qi Z, Jason Wang, dev; +Cc: Karlsson, Magnus, Topel, Bjorn

+Magnus, since there was a typo in his email address in my first batch.

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhang, Qi Z
> Sent: Thursday, March 1, 2018 12:19 PM
> To: Jason Wang <jasowang@redhat.com>; dev@dpdk.org
> Cc: magnus.karlsson@intei.com; Topel, Bjorn <bjorn.topel@intel.com>
> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> 
> 
> 
> > -----Original Message-----
> > From: Jason Wang [mailto:jasowang@redhat.com]
> > Sent: Thursday, March 1, 2018 10:52 AM
> > To: Zhang, Qi Z <qi.z.zhang@intel.com>; dev@dpdk.org
> > Cc: magnus.karlsson@intei.com; Topel, Bjorn <bjorn.topel@intel.com>
> > Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> >
> > Hi:
> >
> > Looks like zero copy is not used in this series. Any plan to support that?
> 
> Zero copy is enabled in patch 5, if a mempool passed check_mempool, it will
> be registered to af_xdp socket.
> so there will be no memcpy between mbuf and af_xdp.
> 
> > If not, what's the advantage compared to vhost-net + tap + XDP_REDIRECT?
> >
> > Have you measured l2fwd performance in this case? I believe the number
> > you refer here is rxdrop (XDP_DRV) which is 11.6Mpps.
> 
> Actually we measure the performance on rxonly / txonly / l2fwd on i40e with
> XDP_SKB and XDP_DRV_ZC
> 
> Regards
> Qi
> 
> >
> > Thanks



* Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
  2018-03-01  4:20     ` Zhang, Qi Z
@ 2018-03-01  7:46       ` Jason Wang
  2018-03-01 12:56         ` Zhang, Qi Z
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Wang @ 2018-03-01  7:46 UTC (permalink / raw)
  To: Zhang, Qi Z, dev; +Cc: Karlsson, Magnus, Topel, Bjorn



On 2018-03-01 12:20, Zhang, Qi Z wrote:
> +Magnus, since a typo in my first batch in email address.
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhang, Qi Z
>> Sent: Thursday, March 1, 2018 12:19 PM
>> To: Jason Wang<jasowang@redhat.com>;dev@dpdk.org
>> Cc:magnus.karlsson@intei.com; Topel, Bjorn<bjorn.topel@intel.com>
>> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
>>
>>
>>
>>> -----Original Message-----
>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>> Sent: Thursday, March 1, 2018 10:52 AM
>>> To: Zhang, Qi Z<qi.z.zhang@intel.com>;dev@dpdk.org
>>> Cc:magnus.karlsson@intei.com; Topel, Bjorn<bjorn.topel@intel.com>
>>> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
>>>
>>> Hi:
>>>
>>> Looks like zero copy is not used in this series. Any plan to support that?
>> Zero copy is enabled in patch 5, if a mempool passed check_mempool, it will
>> be registered to af_xdp socket.
>> so there will be no memcpy between mbuf and af_xdp.

Aha, I see. So zerocopy is limited to some specific use cases. And
if I understand correctly, zc mode cannot be used for VMs.

Thanks

>>> If not, what's the advantage compared to vhost-net + tap + XDP_REDIRECT?
>>>
>>> Have you measured l2fwd performance in this case? I believe the number
>>> you refer here is rxdrop (XDP_DRV) which is 11.6Mpps.
>> Actually we measure the performance on rxonly / txonly / l2fwd on i40e with
>> XDP_SKB and XDP_DRV_ZC
>>
>> Regards
>> Qi
>>
>>> Thanks


* Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
  2018-03-01  7:46       ` Jason Wang
@ 2018-03-01 12:56         ` Zhang, Qi Z
  2018-03-01 13:18           ` Jason Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-01 12:56 UTC (permalink / raw)
  To: 'Jason Wang'; +Cc: Karlsson, Magnus, Topel, Bjorn, dev



> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Thursday, March 1, 2018 3:46 PM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>; dev@dpdk.org
> Cc: Karlsson, Magnus <magnus.karlsson@intel.com>; Topel, Bjorn
> <bjorn.topel@intel.com>
> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> 
> 
> 
> On 2018-03-01 12:20, Zhang, Qi Z wrote:
> > +Magnus, since a typo in my first batch in email address.
> >
> >> -----Original Message-----
> >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhang, Qi Z
> >> Sent: Thursday, March 1, 2018 12:19 PM
> >> To: Jason Wang<jasowang@redhat.com>;dev@dpdk.org
> >> Cc:magnus.karlsson@intei.com; Topel, Bjorn<bjorn.topel@intel.com>
> >> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> >>
> >>
> >>
> >>> -----Original Message-----
> >>> From: Jason Wang [mailto:jasowang@redhat.com]
> >>> Sent: Thursday, March 1, 2018 10:52 AM
> >>> To: Zhang, Qi Z<qi.z.zhang@intel.com>;dev@dpdk.org
> >>> Cc:magnus.karlsson@intei.com; Topel, Bjorn<bjorn.topel@intel.com>
> >>> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> >>>
> >>> Hi:
> >>>
> >>> Looks like zero copy is not used in this series. Any plan to support that?
> >> Zero copy is enabled in patch 5, if a mempool passed check_mempool,
> >> it will be registered to af_xdp socket.
> >> so there will be no memcpy between mbuf and af_xdp.
> 
> Aha, I see. So the zerocopy was limited to some specific use case. And if I
> understand it correctly, zc mode could not be used for VM.

I think that, apart from the limitation on mempool layout, zerocopy is transparent to the DPDK application; the only difference is performance.
Sorry, I may not have understood your point. Could you explain more about the VM usage?

Regards
Qi
> 
> Thanks
> 
> >>> If not, what's the advantage compared to vhost-net + tap +
> XDP_REDIRECT?
> >>>
> >>> Have you measured l2fwd performance in this case? I believe the
> >>> number you refer here is rxdrop (XDP_DRV) which is 11.6Mpps.
> >> Actually we measure the performance on rxonly / txonly / l2fwd on
> >> i40e with XDP_SKB and XDP_DRV_ZC
> >>
> >> Regards
> >> Qi
> >>
> >>> Thanks



* Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
  2018-03-01 12:56         ` Zhang, Qi Z
@ 2018-03-01 13:18           ` Jason Wang
  2018-03-02  4:05             ` Zhang, Qi Z
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Wang @ 2018-03-01 13:18 UTC (permalink / raw)
  To: Zhang, Qi Z; +Cc: Karlsson, Magnus, Topel, Bjorn, dev



On 2018-03-01 20:56, Zhang, Qi Z wrote:
>>>>>> BTW, performance test shows our PMD can reach 94%~98% of the
>>>>>> original benchmark when shared memory is enabled.
>>>>> Hi:
>>>>>
>>>>> Looks like zero copy is not used in this series. Any plan to support that?
>>>> Zero copy is enabled in patch 5, if a mempool passed check_mempool,
>>>> it will be registered to af_xdp socket.
>>>> so there will be no memcpy between mbuf and af_xdp.
>> Aha, I see. So the zerocopy was limited to some specific use case. And if I
>> understand it correctly, zc mode could not be used for VM.
> I think except the limitation for mempool layout, zerocopy is transparent to DPDK application, only difference is performance.
> Sorry, I may not get your point, if you could explain more about the VM usage.
>
> Regards
> Qi

No problem, so the question is:

Can zerocopy be used when using testpmd to forward packets between
vhost-user and an AF_XDP socket?

Thanks


* Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
  2018-03-01 13:18           ` Jason Wang
@ 2018-03-02  4:05             ` Zhang, Qi Z
  0 siblings, 0 replies; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-02  4:05 UTC (permalink / raw)
  To: Jason Wang; +Cc: Karlsson, Magnus, Topel, Bjorn, dev



> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Thursday, March 1, 2018 9:18 PM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>
> Cc: Karlsson, Magnus <magnus.karlsson@intel.com>; Topel, Bjorn
> <bjorn.topel@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> 
> 
> 
> On 2018-03-01 20:56, Zhang, Qi Z wrote:
> >>>>>> BTW, performance tests show our PMD can reach 94%~98% of the
> >>>>>> original benchmark when shared memory is enabled.
> >>>>> Hi:
> >>>>>
> >>>>> Looks like zero copy is not used in this series. Any plan to support that?
> >>>> Zero copy is enabled in patch 5: if a mempool passes check_mempool,
> >>>> it is registered with the af_xdp socket,
> >>>> so there is no memcpy between mbuf and af_xdp.
> >> Aha, I see. So zero copy is limited to some specific use cases.
> >> And if I understand it correctly, zc mode cannot be used for VMs.
> > I think that, apart from the limitation on mempool layout, zero copy is transparent to the DPDK application; the only difference is performance.
> > Sorry, I may not have understood your point. Could you explain more about the VM usage?
> >
> > Regards
> > Qi
> 
> No problem, so the question is:
> 
> Can zero copy be used when using testpmd to forward packets between
> vhost-user and an AF_XDP socket?

I'm not very familiar with vhost-user, but I guess the answer should be the same as for forwarding packets between vhost-user and i40e
(provided vhost-user has no special mempool requirement that conflicts with the af_xdp ZC requirement).
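
Editorial sketch of the mempool requirement mentioned above. check_mempool from patch 5 is not shown in this message, so the conditions below are only inferred from patch 4/7 in this thread: the pool sits in a single memory chunk, its base is page aligned, and each object occupies exactly one 2048-byte frame. The struct pool_layout type and the layout_fits_umem name are hypothetical and are not DPDK APIs; the real check may test more or different things.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of a mempool's memory layout (not a DPDK type). */
struct pool_layout {
	uintptr_t base;          /* start address of the first memory chunk */
	unsigned int nb_chunks;  /* number of separate memory chunks */
	size_t elt_stride;       /* bytes from one object to the next */
};

/*
 * Return true when the layout could be registered as one contiguous
 * region of equally sized frames. All three conditions are assumptions
 * drawn from patch 4/7, not the actual check_mempool logic.
 */
static bool
layout_fits_umem(const struct pool_layout *p, size_t frame_size,
		 size_t page_size)
{
	assert(p != NULL);

	if (p->nb_chunks != 1)
		return false;   /* memory must be one contiguous chunk */
	if (page_size == 0 || p->base % page_size != 0)
		return false;   /* base must be page aligned */
	return p->elt_stride == frame_size; /* one object per frame */
}
```

A pool created by another component, such as vhost-user, would need to satisfy the same kind of constraints before its mbufs could be handed to the socket without a copy.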

Regards
Qi

> 
> Thanks


* [dpdk-dev] [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management
  2018-02-27  9:35 Qi Zhang
@ 2018-02-27  9:35 ` Qi Zhang
  0 siblings, 0 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:35 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

The memory buffer registered with af_xdp is now managed by an
rte_mempool. An mbuf allocated from that rte_mempool can be converted
to a descriptor index and vice versa.

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
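
Editorial note, not part of the patch: the commit message says an mbuf and a descriptor index can be converted into each other. The standalone C below sketches that arithmetic without DPDK. It assumes each 2048-byte frame starts with a 64-byte mempool object header followed by a 128-byte struct rte_mbuf, as the comments in the patch state; those sizes can differ on other builds. The names frame_to_mbuf_addr and buf_addr_to_frame are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_SIZE_LOG2   11u   /* log2 of the 2048-byte frame size */
#define FRAME_SIZE        (1u << FRAME_SIZE_LOG2)
#define MEMPOOL_HDR_SIZE  64u   /* assumed object header size (0x40 in the patch) */
#define MBUF_STRUCT_SIZE  128u  /* assumed sizeof(struct rte_mbuf) */

/* Frame index -> address where the mbuf struct of that frame would start. */
static uint8_t *
frame_to_mbuf_addr(uint8_t *base, uint32_t idx)
{
	return base + ((size_t)idx << FRAME_SIZE_LOG2) + MEMPOOL_HDR_SIZE;
}

/*
 * Address inside a frame (for example the mbuf's buf_addr) -> frame index.
 * The right shift discards the offset within the frame.
 */
static uint32_t
buf_addr_to_frame(const uint8_t *base, const uint8_t *addr)
{
	assert(addr >= base);
	return (uint32_t)((size_t)(addr - base) >> FRAME_SIZE_LOG2);
}
```

Under these assumptions, frame_to_mbuf_addr(base, i) + MBUF_STRUCT_SIZE is where buf_addr would point for frame i, and passing that address to buf_addr_to_frame gives i back. The round trip only works while every object occupies exactly one frame and the pool memory is contiguous from base.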
 drivers/net/af_xdp/rte_eth_af_xdp.c | 165 +++++++++++++++++++++---------------
 1 file changed, 97 insertions(+), 68 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index 4eb8a2c28..3c534c77c 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -43,7 +43,11 @@
 
 #define ETH_AF_XDP_FRAME_SIZE		2048
 #define ETH_AF_XDP_NUM_BUFFERS		131072
-#define ETH_AF_XDP_DATA_HEADROOM	0
+/* mempool hdrobj size (64 bytes) + sizeof(struct rte_mbuf) (128 bytes) */
+#define ETH_AF_XDP_MBUF_OVERHEAD	192
+/* data starts at offset 320 (192 + 128) bytes */
+#define ETH_AF_XDP_DATA_HEADROOM \
+	(ETH_AF_XDP_MBUF_OVERHEAD + RTE_PKTMBUF_HEADROOM)
 #define ETH_AF_XDP_DFLT_RING_SIZE	1024
 #define ETH_AF_XDP_DFLT_QUEUE_IDX	0
 
@@ -57,6 +61,7 @@ struct xdp_umem {
 	unsigned int frame_size_log2;
 	unsigned int nframes;
 	int mr_fd;
+	struct rte_mempool *mb_pool;
 };
 
 struct pmd_internals {
@@ -67,7 +72,7 @@ struct pmd_internals {
 	struct xdp_queue rx;
 	struct xdp_queue tx;
 	struct xdp_umem *umem;
-	struct rte_mempool *mb_pool;
+	struct rte_mempool *ext_mb_pool;
 
 	unsigned long rx_pkts;
 	unsigned long rx_bytes;
@@ -80,7 +85,6 @@ struct pmd_internals {
 	uint16_t port_id;
 	uint16_t queue_idx;
 	int ring_size;
-	struct rte_ring *buf_ring;
 };
 
 static const char * const valid_arguments[] = {
@@ -106,6 +110,21 @@ static void *get_pkt_data(struct pmd_internals *internals,
 			   offset);
 }
 
+static uint32_t
+mbuf_to_idx(struct pmd_internals *internals, struct rte_mbuf *mbuf)
+{
+	return (uint32_t)(((uint64_t)mbuf->buf_addr -
+			   (uint64_t)internals->umem->buffer) >>
+			  internals->umem->frame_size_log2);
+}
+
+static struct rte_mbuf *
+idx_to_mbuf(struct pmd_internals *internals, uint32_t idx)
+{
+	return (struct rte_mbuf *)(void *)(internals->umem->buffer + (idx
+			<< internals->umem->frame_size_log2) + 0x40);
+}
+
 static uint16_t
 eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 {
@@ -120,17 +139,18 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		  nb_pkts : ETH_AF_XDP_RX_BATCH_SIZE;
 
 	struct xdp_desc descs[ETH_AF_XDP_RX_BATCH_SIZE];
-	void *indexes[ETH_AF_XDP_RX_BATCH_SIZE];
+	struct rte_mbuf *mbufs[ETH_AF_XDP_RX_BATCH_SIZE];
 	int rcvd, i;
 	/* fill rx ring */
 	if (rxq->num_free >= ETH_AF_XDP_RX_BATCH_SIZE) {
-		int n = rte_ring_dequeue_bulk(internals->buf_ring,
-					      indexes,
-					      ETH_AF_XDP_RX_BATCH_SIZE,
-					      NULL);
-		for (i = 0; i < n; i++)
-			descs[i].idx = (uint32_t)((long int)indexes[i]);
-		xq_enq(rxq, descs, n);
+		int ret = rte_mempool_get_bulk(internals->umem->mb_pool,
+					     (void *)mbufs,
+					     ETH_AF_XDP_RX_BATCH_SIZE);
+		if (!ret) {
+			for (i = 0; i < ETH_AF_XDP_RX_BATCH_SIZE; i++)
+				descs[i].idx = mbuf_to_idx(internals, mbufs[i]);
+			xq_enq(rxq, descs, ETH_AF_XDP_RX_BATCH_SIZE);
+		}
 	}
 
 	/* read data */
@@ -142,7 +162,7 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		char *pkt;
 		uint32_t idx = descs[i].idx;
 
-		mbuf = rte_pktmbuf_alloc(internals->mb_pool);
+		mbuf = rte_pktmbuf_alloc(internals->ext_mb_pool);
 		rte_pktmbuf_pkt_len(mbuf) =
 			rte_pktmbuf_data_len(mbuf) =
 			descs[i].len;
@@ -155,11 +175,9 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		} else {
 			dropped++;
 		}
-		indexes[i] = (void *)((long int)idx);
+		rte_pktmbuf_free(idx_to_mbuf(internals, idx));
 	}
 
-	rte_ring_enqueue_bulk(internals->buf_ring, indexes, rcvd, NULL);
-
 	internals->rx_pkts += (rcvd - dropped);
 	internals->rx_bytes += rx_bytes;
 	internals->rx_dropped += dropped;
@@ -187,9 +205,10 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	struct xdp_queue *txq = &internals->tx;
 	struct rte_mbuf *mbuf;
 	struct xdp_desc descs[ETH_AF_XDP_TX_BATCH_SIZE];
-	void *indexes[ETH_AF_XDP_TX_BATCH_SIZE];
+	struct rte_mbuf *mbufs[ETH_AF_XDP_TX_BATCH_SIZE];
 	uint16_t i, valid;
 	unsigned long tx_bytes = 0;
+	int ret;
 
 	nb_pkts = nb_pkts < ETH_AF_XDP_TX_BATCH_SIZE ?
 		  nb_pkts : ETH_AF_XDP_TX_BATCH_SIZE;
@@ -198,13 +217,15 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		int n = xq_deq(txq, descs, ETH_AF_XDP_TX_BATCH_SIZE);
 
 		for (i = 0; i < n; i++)
-			indexes[i] = (void *)((long int)descs[i].idx);
-		rte_ring_enqueue_bulk(internals->buf_ring, indexes, n, NULL);
+			rte_pktmbuf_free(idx_to_mbuf(internals, descs[i].idx));
 	}
 
 	nb_pkts = nb_pkts > txq->num_free ? txq->num_free : nb_pkts;
-	nb_pkts = rte_ring_dequeue_bulk(internals->buf_ring, indexes,
-					nb_pkts, NULL);
+	ret = rte_mempool_get_bulk(internals->umem->mb_pool,
+				   (void *)mbufs,
+				   nb_pkts);
+	if (ret)
+		return 0;
 
 	valid = 0;
 	for (i = 0; i < nb_pkts; i++) {
@@ -213,14 +234,14 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			internals->umem->frame_size - ETH_AF_XDP_DATA_HEADROOM;
 		mbuf = bufs[i];
 		if (mbuf->pkt_len <= buf_len) {
-			descs[valid].idx = (uint32_t)((long int)indexes[valid]);
+			descs[valid].idx = mbuf_to_idx(internals, mbufs[valid]);
 			descs[valid].offset = ETH_AF_XDP_DATA_HEADROOM;
 			descs[valid].flags = 0;
 			descs[valid].len = mbuf->pkt_len;
 			pkt = get_pkt_data(internals, descs[i].idx,
 					   descs[i].offset);
 			memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *),
-			       descs[i].len);
+					   descs[i].len);
 			valid++;
 			tx_bytes += mbuf->pkt_len;
 		}
@@ -230,9 +251,10 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	xq_enq(txq, descs, valid);
 	kick_tx(internals->sfd);
 
-	if (valid < nb_pkts)
-		rte_ring_enqueue_bulk(internals->buf_ring, &indexes[valid],
-				      nb_pkts - valid, NULL);
+	if (valid < nb_pkts) {
+		for (i = valid; i < nb_pkts; i++)
+			rte_pktmbuf_free(mbufs[i]);
+	}
 
 	internals->err_pkts += (nb_pkts - valid);
 	internals->tx_pkts += valid;
@@ -245,14 +267,13 @@ static void
 fill_rx_desc(struct pmd_internals *internals)
 {
 	int num_free = internals->rx.num_free;
-	void *p = NULL;
 	int i;
-
 	for (i = 0; i < num_free; i++) {
 		struct xdp_desc desc = {};
+		struct rte_mbuf *mbuf =
+			rte_pktmbuf_alloc(internals->umem->mb_pool);
 
-		rte_ring_dequeue(internals->buf_ring, &p);
-		desc.idx = (uint32_t)((long int)p);
+		desc.idx = mbuf_to_idx(internals, mbuf);
 		xq_enq(&internals->rx, &desc, 1);
 	}
 }
@@ -347,33 +368,53 @@ eth_link_update(struct rte_eth_dev *dev __rte_unused,
 	return 0;
 }
 
-static struct xdp_umem *xsk_alloc_and_mem_reg_buffers(int sfd, size_t nbuffers)
+static void *get_base_addr(struct rte_mempool *mb_pool)
+{
+	struct rte_mempool_memhdr *memhdr;
+
+	STAILQ_FOREACH(memhdr, &mb_pool->mem_list, next) {
+		return memhdr->addr;
+	}
+	return NULL;
+}
+
+static struct xdp_umem *xsk_alloc_and_mem_reg_buffers(int sfd,
+						      size_t nbuffers,
+						      const char *pool_name)
 {
 	struct xdp_mr_req req = { .frame_size = ETH_AF_XDP_FRAME_SIZE,
 				  .data_headroom = ETH_AF_XDP_DATA_HEADROOM };
-	struct xdp_umem *umem;
-	void *bufs;
-	int ret;
+	struct xdp_umem *umem = calloc(1, sizeof(*umem));
 
-	ret = posix_memalign((void **)&bufs, getpagesize(),
-			     nbuffers * req.frame_size);
-	if (ret)
+	if (!umem)
+		return NULL;
+
+	umem->mb_pool =
+		rte_pktmbuf_pool_create_with_flags(
+			pool_name, nbuffers,
+			250, 0,
+			(ETH_AF_XDP_FRAME_SIZE - ETH_AF_XDP_MBUF_OVERHEAD),
+			MEMPOOL_F_NO_SPREAD | MEMPOOL_F_PAGE_ALIGN,
+			SOCKET_ID_ANY);
+
+	if (!umem->mb_pool) {
+		free(umem);
 		return NULL;
+	}
 
-	umem = calloc(1, sizeof(*umem));
-	if (!umem) {
-		free(bufs);
+	if (umem->mb_pool->nb_mem_chunks > 1) {
+		rte_mempool_free(umem->mb_pool);
+		free(umem);
 		return NULL;
 	}
 
-	req.addr = (unsigned long)bufs;
+	req.addr = (uint64_t)get_base_addr(umem->mb_pool);
 	req.len = nbuffers * req.frame_size;
-	ret = setsockopt(sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
-	RTE_ASSERT(ret == 0);
+	setsockopt(sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
 
 	umem->frame_size = ETH_AF_XDP_FRAME_SIZE;
 	umem->frame_size_log2 = 11;
-	umem->buffer = bufs;
+	umem->buffer = (char *)req.addr;
 	umem->size = nbuffers * req.frame_size;
 	umem->nframes = nbuffers;
 	umem->mr_fd = sfd;
@@ -386,38 +427,27 @@ xdp_configure(struct pmd_internals *internals)
 {
 	struct sockaddr_xdp sxdp;
 	struct xdp_ring_req req;
-	char ring_name[0x100];
+	char pool_name[0x100];
+
 	int ret = 0;
-	long int i;
 
-	snprintf(ring_name, 0x100, "%s_%s_%d", "af_xdp_ring",
+	snprintf(pool_name, 0x100, "%s_%s_%d", "af_xdp_pool",
 		 internals->if_name, internals->queue_idx);
-	internals->buf_ring = rte_ring_create(ring_name,
-					      ETH_AF_XDP_NUM_BUFFERS,
-					      SOCKET_ID_ANY,
-					      0x0);
-	if (!internals->buf_ring)
-		return -1;
-
-	for (i = 0; i < ETH_AF_XDP_NUM_BUFFERS; i++)
-		rte_ring_enqueue(internals->buf_ring, (void *)i);
-
 	internals->umem = xsk_alloc_and_mem_reg_buffers(internals->sfd,
-							ETH_AF_XDP_NUM_BUFFERS);
+							ETH_AF_XDP_NUM_BUFFERS,
+							pool_name);
 	if (!internals->umem)
-		goto error;
+		return -1;
 
 	req.mr_fd = internals->umem->mr_fd;
 	req.desc_nr = internals->ring_size;
 
 	ret = setsockopt(internals->sfd, SOL_XDP, XDP_RX_RING,
 			 &req, sizeof(req));
-
 	RTE_ASSERT(ret == 0);
 
 	ret = setsockopt(internals->sfd, SOL_XDP, XDP_TX_RING,
 			 &req, sizeof(req));
-
 	RTE_ASSERT(ret == 0);
 
 	internals->rx.ring = mmap(0, req.desc_nr * sizeof(struct xdp_desc),
@@ -448,10 +478,6 @@ xdp_configure(struct pmd_internals *internals)
 	RTE_ASSERT(ret == 0);
 
 	return ret;
-error:
-	rte_ring_free(internals->buf_ring);
-	internals->buf_ring = NULL;
-	return -1;
 }
 
 static int
@@ -466,11 +492,11 @@ eth_rx_queue_setup(struct rte_eth_dev *dev,
 	unsigned int buf_size, data_size;
 
 	RTE_ASSERT(rx_queue_id == 0);
-	internals->mb_pool = mb_pool;
+	internals->ext_mb_pool = mb_pool;
 	xdp_configure(internals);
 
 	/* Now get the space available for data in the mbuf */
-	buf_size = rte_pktmbuf_data_room_size(internals->mb_pool) -
+	buf_size = rte_pktmbuf_data_room_size(internals->ext_mb_pool) -
 		RTE_PKTMBUF_HEADROOM;
 	data_size = internals->umem->frame_size;
 
@@ -739,8 +765,11 @@ rte_pmd_af_xdp_remove(struct rte_vdev_device *dev)
 		return -1;
 
 	internals = eth_dev->data->dev_private;
-	rte_ring_free(internals->buf_ring);
-	rte_free(internals->umem);
+	if (internals->umem) {
+		if (internals->umem->mb_pool)
+			rte_mempool_free(internals->umem->mb_pool);
+		free(internals->umem);
+	}
 	rte_free(eth_dev->data->dev_private);
 	rte_free(eth_dev->data);
 	close(internals->sfd);
-- 
2.13.6


Thread overview: 24+ messages
2018-02-27  9:32 [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Qi Zhang
2018-02-27  9:33 ` [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
2018-02-28 23:40   ` Stephen Hemminger
2018-02-28 23:42   ` Stephen Hemminger
2018-03-01  1:51     ` Zhang, Qi Z
2018-02-28 23:42   ` Stephen Hemminger
2018-02-28 23:45   ` Stephen Hemminger
2018-03-01  1:59     ` Zhang, Qi Z
2018-02-27  9:33 ` [dpdk-dev] [RFC 2/7] lib/mbuf: enable parse flags when create mempool Qi Zhang
2018-02-27  9:33 ` [dpdk-dev] [RFC 3/7] lib/mempool: allow page size aligned mempool Qi Zhang
2018-02-27  9:33 ` [dpdk-dev] [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management Qi Zhang
2018-03-01  2:08   ` Stephen Hemminger
2018-02-27  9:33 ` [dpdk-dev] [RFC 5/7] net/af_xdp: enable share mempool Qi Zhang
2018-02-27  9:33 ` [dpdk-dev] [RFC 6/7] net/af_xdp: load BPF file Qi Zhang
2018-03-01  2:10   ` Stephen Hemminger
2018-02-27  9:33 ` [dpdk-dev] [RFC 7/7] app/testpmd: enable parameter for mempool flags Qi Zhang
2018-03-01  2:52 ` [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP Jason Wang
2018-03-01  4:18   ` Zhang, Qi Z
2018-03-01  4:20     ` Zhang, Qi Z
2018-03-01  7:46       ` Jason Wang
2018-03-01 12:56         ` Zhang, Qi Z
2018-03-01 13:18           ` Jason Wang
2018-03-02  4:05             ` Zhang, Qi Z
2018-02-27  9:35 Qi Zhang
2018-02-27  9:35 ` [dpdk-dev] [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management Qi Zhang
