From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail.lysator.liu.se (mail.lysator.liu.se [130.236.254.3]) by dpdk.org (Postfix) with ESMTP id E531011A4 for ; Tue, 19 Mar 2019 10:07:48 +0100 (CET) Received: from mail.lysator.liu.se (localhost [127.0.0.1]) by mail.lysator.liu.se (Postfix) with ESMTP id 79D3240008 for ; Tue, 19 Mar 2019 10:07:48 +0100 (CET) Received: by mail.lysator.liu.se (Postfix, from userid 1004) id 612C940012; Tue, 19 Mar 2019 10:07:48 +0100 (CET) X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on bernadotte.lysator.liu.se X-Spam-Level: X-Spam-Status: No, score=-0.6 required=5.0 tests=ALL_TRUSTED,AWL,URIBL_SBL, URIBL_SBL_A autolearn=disabled version=3.4.1 X-Spam-Score: -0.6 Received: from [192.168.1.59] (host-90-232-144-184.mobileonline.telia.com [90.232.144.184]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.lysator.liu.se (Postfix) with ESMTPSA id 24AE140008; Tue, 19 Mar 2019 10:07:43 +0100 (CET) To: Xiaolong Ye , dev@dpdk.org Cc: Qi Zhang , Karlsson Magnus , Topel Bjorn References: <20190301080947.91086-1-xiaolong.ye@intel.com> <20190319071256.26302-1-xiaolong.ye@intel.com> <20190319071256.26302-2-xiaolong.ye@intel.com> From: =?UTF-8?Q?Mattias_R=c3=b6nnblom?= Message-ID: Date: Tue, 19 Mar 2019 10:07:43 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.1 MIME-Version: 1.0 In-Reply-To: <20190319071256.26302-2-xiaolong.ye@intel.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV using ClamSMTP Subject: Re: [dpdk-dev] [PATCH v2 1/6] net/af_xdp: introduce AF XDP PMD driver X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 19 Mar 2019 09:07:49 -0000 On 2019-03-19 08:12, 
Xiaolong Ye wrote: > Add a new PMD driver for AF_XDP which is a proposed faster version of > AF_PACKET interface in Linux. More info about AF_XDP, please refer to [1] > [2]. > > This is the vanilla version PMD which just uses a raw buffer registered as > the umem. > > [1] https://fosdem.org/2018/schedule/event/af_xdp/ > [2] https://lwn.net/Articles/745934/ > > Signed-off-by: Xiaolong Ye > --- > MAINTAINERS | 6 + > config/common_base | 5 + > config/common_linux | 1 + > doc/guides/nics/af_xdp.rst | 45 + > doc/guides/nics/features/af_xdp.ini | 11 + > doc/guides/nics/index.rst | 1 + > doc/guides/rel_notes/release_19_05.rst | 7 + > drivers/net/Makefile | 1 + > drivers/net/af_xdp/Makefile | 33 + > drivers/net/af_xdp/meson.build | 21 + > drivers/net/af_xdp/rte_eth_af_xdp.c | 930 ++++++++++++++++++ > drivers/net/af_xdp/rte_pmd_af_xdp_version.map | 3 + > drivers/net/meson.build | 1 + > mk/rte.app.mk | 1 + > 14 files changed, 1066 insertions(+) > create mode 100644 doc/guides/nics/af_xdp.rst > create mode 100644 doc/guides/nics/features/af_xdp.ini > create mode 100644 drivers/net/af_xdp/Makefile > create mode 100644 drivers/net/af_xdp/meson.build > create mode 100644 drivers/net/af_xdp/rte_eth_af_xdp.c > create mode 100644 drivers/net/af_xdp/rte_pmd_af_xdp_version.map > > diff --git a/MAINTAINERS b/MAINTAINERS > index 452b8eb82..1cc54b439 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -468,6 +468,12 @@ M: John W. 
Linville > F: drivers/net/af_packet/ > F: doc/guides/nics/features/afpacket.ini > > +Linux AF_XDP > +M: Xiaolong Ye > +M: Qi Zhang > +F: drivers/net/af_xdp/ > +F: doc/guides/nics/features/af_xdp.rst > + > Amazon ENA > M: Marcin Wojtas > M: Michal Krawczyk > diff --git a/config/common_base b/config/common_base > index 0b09a9348..4044de205 100644 > --- a/config/common_base > +++ b/config/common_base > @@ -416,6 +416,11 @@ CONFIG_RTE_LIBRTE_VMXNET3_DEBUG_TX_FREE=n > # > CONFIG_RTE_LIBRTE_PMD_AF_PACKET=n > > +# > +# Compile software PMD backed by AF_XDP sockets (Linux only) > +# > +CONFIG_RTE_LIBRTE_PMD_AF_XDP=n > + > # > # Compile link bonding PMD library > # > diff --git a/config/common_linux b/config/common_linux > index 75334273d..0b1249da0 100644 > --- a/config/common_linux > +++ b/config/common_linux > @@ -19,6 +19,7 @@ CONFIG_RTE_LIBRTE_VHOST_POSTCOPY=n > CONFIG_RTE_LIBRTE_PMD_VHOST=y > CONFIG_RTE_LIBRTE_IFC_PMD=y > CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y > +CONFIG_RTE_LIBRTE_PMD_AF_XDP=y > CONFIG_RTE_LIBRTE_PMD_SOFTNIC=y > CONFIG_RTE_LIBRTE_PMD_TAP=y > CONFIG_RTE_LIBRTE_AVP_PMD=y > diff --git a/doc/guides/nics/af_xdp.rst b/doc/guides/nics/af_xdp.rst > new file mode 100644 > index 000000000..dd5654dd1 > --- /dev/null > +++ b/doc/guides/nics/af_xdp.rst > @@ -0,0 +1,45 @@ > +.. SPDX-License-Identifier: BSD-3-Clause > + Copyright(c) 2018 Intel Corporation. > + > +AF_XDP Poll Mode Driver > +========================== > + > +AF_XDP is an address family that is optimized for high performance > +packet processing. AF_XDP sockets enable the possibility for XDP program to > +redirect packets to a memory buffer in userspace. > + > +For the full details behind AF_XDP socket, you can refer to > +`AF_XDP documentation in the Kernel > +`_. > + > +This Linux-specific PMD driver creates the AF_XDP socket and binds it to a > +specific netdev queue, it allows a DPDK application to send and receive raw > +packets through the socket which would bypass the kernel network stack. 
> +Current implementation only supports single queue, multi-queues feature will > +be added later. > + > +Options > +------- > + > +The following options can be provided to set up an af_xdp port in DPDK. > + > +* ``iface`` - name of the Kernel interface to attach to (required); > +* ``queue`` - netdev queue id (optional, default 0); > + > +Prerequisites > +------------- > + > +This is a Linux-specific PMD, thus the following prerequisites apply: > + > +* A Linux Kernel (version > 4.18) with XDP sockets configuration enabled; > +* libbpf (within kernel version > 5.1) with latest af_xdp support installed > +* A Kernel bound interface to attach to. > + > +Set up an af_xdp interface > +----------------------------- > + > +The following example will set up an af_xdp interface in DPDK: > + > +.. code-block:: console > + > + --vdev eth_af_xdp,iface=ens786f1,queue=0 > diff --git a/doc/guides/nics/features/af_xdp.ini b/doc/guides/nics/features/af_xdp.ini > new file mode 100644 > index 000000000..7b8fcce00 > --- /dev/null > +++ b/doc/guides/nics/features/af_xdp.ini > @@ -0,0 +1,11 @@ > +; > +; Supported features of the 'af_xdp' network poll mode driver. > +; > +; Refer to default.ini for the full list of available PMD features. > +; > +[Features] > +Link status = Y > +MTU update = Y > +Promiscuous mode = Y > +Stats per queue = Y > +x86-64 = Y > \ No newline at end of file > diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst > index 5c80e3baa..a4b80a3d0 100644 > --- a/doc/guides/nics/index.rst > +++ b/doc/guides/nics/index.rst > @@ -12,6 +12,7 @@ Network Interface Controller Drivers > features > build_and_test > af_packet > + af_xdp > ark > atlantic > avp > diff --git a/doc/guides/rel_notes/release_19_05.rst b/doc/guides/rel_notes/release_19_05.rst > index 61a2c7383..062facf89 100644 > --- a/doc/guides/rel_notes/release_19_05.rst > +++ b/doc/guides/rel_notes/release_19_05.rst > @@ -65,6 +65,13 @@ New Features > process. 
> * Added support for Rx packet types list in a secondary process. > > +* **Added the AF_XDP PMD.** > + > + Added a Linux-specific PMD driver for AF_XDP, it can create the AF_XDP socket > + and bind it to a specific netdev queue, it allows a DPDK application to send > + and receive raw packets through the socket which would bypass the kernel > + network stack to achieve high performance packet processing. > + > * **Updated Mellanox drivers.** > > New features and improvements were done in mlx4 and mlx5 PMDs: > diff --git a/drivers/net/Makefile b/drivers/net/Makefile > index 502869a87..5d401b8c5 100644 > --- a/drivers/net/Makefile > +++ b/drivers/net/Makefile > @@ -9,6 +9,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD),d) > endif > > DIRS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET) += af_packet > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += af_xdp > DIRS-$(CONFIG_RTE_LIBRTE_ARK_PMD) += ark > DIRS-$(CONFIG_RTE_LIBRTE_ATLANTIC_PMD) += atlantic > DIRS-$(CONFIG_RTE_LIBRTE_AVP_PMD) += avp > diff --git a/drivers/net/af_xdp/Makefile b/drivers/net/af_xdp/Makefile > new file mode 100644 > index 000000000..6cf0ed7db > --- /dev/null > +++ b/drivers/net/af_xdp/Makefile > @@ -0,0 +1,33 @@ > +# SPDX-License-Identifier: BSD-3-Clause > +# Copyright(c) 2018 Intel Corporation > + > +include $(RTE_SDK)/mk/rte.vars.mk > + > +# > +# library name > +# > +LIB = librte_pmd_af_xdp.a > + > +EXPORT_MAP := rte_pmd_af_xdp_version.map > + > +LIBABIVER := 1 > + > +CFLAGS += -O3 > + > +# require kernel version >= v5.1-rc1 > +LINUX_VERSION := $(shell uname -r) > +CFLAGS += -I/lib/modules/$(LINUX_VERSION)/build/tools/include > +CFLAGS += -I/lib/modules/$(LINUX_VERSION)/build/tools/lib/bpf > + > +CFLAGS += $(WERROR_FLAGS) > +LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring > +LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs > +LDLIBS += -lrte_bus_vdev > +LDLIBS += -lbpf > + > +# > +# all source are stored in SRCS-y > +# > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += rte_eth_af_xdp.c > + > +include 
$(RTE_SDK)/mk/rte.lib.mk > diff --git a/drivers/net/af_xdp/meson.build b/drivers/net/af_xdp/meson.build > new file mode 100644 > index 000000000..635e67483 > --- /dev/null > +++ b/drivers/net/af_xdp/meson.build > @@ -0,0 +1,21 @@ > +# SPDX-License-Identifier: BSD-3-Clause > +# Copyright(c) 2018 Intel Corporation > + > +if host_machine.system() != 'linux' > + build = false > +endif > + > +bpf_dep = dependency('libbpf', required: false) > +if bpf_dep.found() > + build = true > +else > + bpf_dep = cc.find_library('libbpf', required: false) > + if bpf_dep.found() and cc.has_header('xsk.h', dependencies: bpf_dep) > + build = true > + pkgconfig_extra_libs += '-lbpf' > + else > + build = false > + endif > +endif > +sources = files('rte_eth_af_xdp.c') > +ext_deps += bpf_dep > diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c > new file mode 100644 > index 000000000..96dedc0c4 > --- /dev/null > +++ b/drivers/net/af_xdp/rte_eth_af_xdp.c > @@ -0,0 +1,930 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(c) 2019 Intel Corporation. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > +#include > +#include Is this include used? 
> +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define RTE_LOGTYPE_AF_XDP RTE_LOGTYPE_USER1 > +#ifndef SOL_XDP > +#define SOL_XDP 283 > +#endif > + > +#ifndef AF_XDP > +#define AF_XDP 44 > +#endif > + > +#ifndef PF_XDP > +#define PF_XDP AF_XDP > +#endif > + > +#define ETH_AF_XDP_IFACE_ARG "iface" > +#define ETH_AF_XDP_QUEUE_IDX_ARG "queue" > + > +#define ETH_AF_XDP_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE > +#define ETH_AF_XDP_NUM_BUFFERS 4096 > +#define ETH_AF_XDP_DATA_HEADROOM 0 > +#define ETH_AF_XDP_DFLT_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS > +#define ETH_AF_XDP_DFLT_QUEUE_IDX 0 > + > +#define ETH_AF_XDP_RX_BATCH_SIZE 32 > +#define ETH_AF_XDP_TX_BATCH_SIZE 32 > + > +#define ETH_AF_XDP_MAX_QUEUE_PAIRS 16 > + > +struct xsk_umem_info { > + struct xsk_ring_prod fq; > + struct xsk_ring_cons cq; > + struct xsk_umem *umem; > + struct rte_ring *buf_ring; > + void *buffer; > +}; > + > +struct pkt_rx_queue { > + struct xsk_ring_cons rx; > + struct xsk_umem_info *umem; > + struct xsk_socket *xsk; > + struct rte_mempool *mb_pool; > + > + uint64_t rx_pkts; > + uint64_t rx_bytes; > + uint64_t rx_dropped; > + > + struct pkt_tx_queue *pair; > + uint16_t queue_idx; > +}; > + > +struct pkt_tx_queue { > + struct xsk_ring_prod tx; > + > + uint64_t tx_pkts; > + uint64_t err_pkts; > + uint64_t tx_bytes; > + > + struct pkt_rx_queue *pair; > + uint16_t queue_idx; > +}; > + > +struct pmd_internals { > + int if_index; > + char if_name[IFNAMSIZ]; > + uint16_t queue_idx; > + struct ether_addr eth_addr; > + struct xsk_umem_info *umem; > + struct rte_mempool *mb_pool_share; > + > + struct pkt_rx_queue rx_queues[ETH_AF_XDP_MAX_QUEUE_PAIRS]; > + struct pkt_tx_queue tx_queues[ETH_AF_XDP_MAX_QUEUE_PAIRS]; > +}; > + > +static const char * const valid_arguments[] = { > + ETH_AF_XDP_IFACE_ARG, > + ETH_AF_XDP_QUEUE_IDX_ARG, > + NULL > +}; > + > +static struct rte_eth_link pmd_link = { > + .link_speed = 
ETH_SPEED_NUM_10G, > + .link_duplex = ETH_LINK_FULL_DUPLEX, > + .link_status = ETH_LINK_DOWN, > + .link_autoneg = ETH_LINK_AUTONEG > +}; > + > +static inline int > +reserve_fill_queue(struct xsk_umem_info *umem, int reserve_size) > +{ > + struct xsk_ring_prod *fq = &umem->fq; > + uint32_t idx; > + void *addr = NULL; > + int i, ret = 0; No need to initialize 'ret'. Is there a point to set 'addr'? > + > + ret = xsk_ring_prod__reserve(fq, reserve_size, &idx); > + if (!ret) { > + RTE_LOG(ERR, AF_XDP, "Failed to reserve enough fq descs.\n"); > + return ret; > + } > + > + for (i = 0; i < reserve_size; i++) { > + rte_ring_dequeue(umem->buf_ring, &addr); > + *xsk_ring_prod__fill_addr(fq, idx++) = (uint64_t)addr; Consider introducing a tmp variable to make this more readable. > + } > + > + xsk_ring_prod__submit(fq, reserve_size); > + > + return 0; > +} > + > +static uint16_t > +eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) > +{ > + struct pkt_rx_queue *rxq = queue; > + struct xsk_ring_cons *rx = &rxq->rx; > + struct xsk_umem_info *umem = rxq->umem; > + struct xsk_ring_prod *fq = &umem->fq; > + uint32_t idx_rx; > + uint32_t free_thresh = fq->size >> 1; > + struct rte_mbuf *mbuf; > + unsigned long dropped = 0; > + unsigned long rx_bytes = 0; > + uint16_t count = 0; > + int rcvd, i; > + > + nb_pkts = nb_pkts < ETH_AF_XDP_RX_BATCH_SIZE ? > + nb_pkts : ETH_AF_XDP_RX_BATCH_SIZE; > + > + rcvd = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx); > + if (!rcvd) Since peek returns the number of entries, not a boolean, do: rcvd == 0 > + return 0; > + > + if (xsk_prod_nb_free(fq, free_thresh) >= free_thresh) > + (void)reserve_fill_queue(umem, ETH_AF_XDP_RX_BATCH_SIZE); > + > + for (i = 0; i < rcvd; i++) { > + uint64_t addr = xsk_ring_cons__rx_desc(rx, idx_rx)->addr; > + uint32_t len = xsk_ring_cons__rx_desc(rx, idx_rx++)->len; Use a tmp variable, instead of two calls. > + char *pkt = xsk_umem__get_data(rxq->umem->buffer, addr); > + Don't mix declaration and code. 
Why is this a char pointer? As opposed to void. > + mbuf = rte_pktmbuf_alloc(rxq->mb_pool); > + if (mbuf) { 1.8.1 (compare pointers against NULL explicitly) > + memcpy(rte_pktmbuf_mtod(mbuf, void*), pkt, len); rte_memcpy() > + rte_pktmbuf_pkt_len(mbuf) = > + rte_pktmbuf_data_len(mbuf) = len; Consider splitting this into two statements. > + rx_bytes += len; > + bufs[count++] = mbuf; > + } else { > + dropped++; > + } > + rte_ring_enqueue(umem->buf_ring, (void *)addr); > + } > + > + xsk_ring_cons__release(rx, rcvd); > + > + /* statistics */ > + rxq->rx_pkts += (rcvd - dropped); > + rxq->rx_bytes += rx_bytes; > + rxq->rx_dropped += dropped; > + > + return count; > +} > + > +static void pull_umem_cq(struct xsk_umem_info *umem, int size) > +{ > + struct xsk_ring_cons *cq = &umem->cq; > + int i, n; > + uint32_t idx_cq; > + uint64_t addr; > + > + n = xsk_ring_cons__peek(cq, size, &idx_cq); Use size_t for n. > + if (n > 0) { > + for (i = 0; i < n; i++) { Consider declaring 'addr' in this scope. > + addr = *xsk_ring_cons__comp_addr(cq, > + idx_cq++); > + rte_ring_enqueue(umem->buf_ring, (void *)addr); > + } > + > + xsk_ring_cons__release(cq, n); > + } > +} > + > +static void kick_tx(struct pkt_tx_queue *txq) > +{ > + struct xsk_umem_info *umem = txq->pair->umem; > + int ret; > + > + while (1) { for (;;) > + ret = sendto(xsk_socket__fd(txq->pair->xsk), NULL, 0, > + MSG_DONTWAIT, NULL, 0); > + > + /* everything is ok */ > + if (ret >= 0) Use likely()? 
> + break; > + > + /* some thing unexpected */ > + if (errno != EBUSY && errno != EAGAIN) > + break; > + > + /* pull from complete qeueu to leave more space */ > + if (errno == EAGAIN) > + pull_umem_cq(umem, ETH_AF_XDP_TX_BATCH_SIZE); > + } > + pull_umem_cq(umem, ETH_AF_XDP_TX_BATCH_SIZE); > +} > + > +static uint16_t > +eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) > +{ > + struct pkt_tx_queue *txq = queue; > + struct xsk_umem_info *umem = txq->pair->umem; > + struct rte_mbuf *mbuf; > + void *addrs[ETH_AF_XDP_TX_BATCH_SIZE]; > + unsigned long tx_bytes = 0; > + int i, valid = 0; > + uint32_t idx_tx; > + > + nb_pkts = nb_pkts < ETH_AF_XDP_TX_BATCH_SIZE ? > + nb_pkts : ETH_AF_XDP_TX_BATCH_SIZE; Use RTE_MIN(). > + > + pull_umem_cq(umem, nb_pkts); > + > + nb_pkts = rte_ring_dequeue_bulk(umem->buf_ring, addrs, > + nb_pkts, NULL); > + if (!nb_pkts) nb_pkts == 0 > + return 0; > + > + if (xsk_ring_prod__reserve(&txq->tx, nb_pkts, &idx_tx) != nb_pkts) { > + kick_tx(txq); > + return 0; > + } > + > + for (i = 0; i < nb_pkts; i++) { > + struct xdp_desc *desc; > + char *pkt; Use void pointer? > + unsigned int buf_len = ETH_AF_XDP_FRAME_SIZE > + - ETH_AF_XDP_DATA_HEADROOM; Use uint32_t, as you seem to do elsewhere. 
> + desc = xsk_ring_prod__tx_desc(&txq->tx, idx_tx + i); > + mbuf = bufs[i]; > + if (mbuf->pkt_len <= buf_len) { > + desc->addr = (uint64_t)addrs[valid]; > + desc->len = mbuf->pkt_len; > + pkt = xsk_umem__get_data(umem->buffer, > + desc->addr); > + memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *), > + desc->len); rte_memcpy() > + valid++; > + tx_bytes += mbuf->pkt_len; > + } > + rte_pktmbuf_free(mbuf); > + } > + > + xsk_ring_prod__submit(&txq->tx, nb_pkts); > + > + kick_tx(txq); > + > + if (valid < nb_pkts) > + rte_ring_enqueue_bulk(umem->buf_ring, &addrs[valid], > + nb_pkts - valid, NULL); > + > + txq->err_pkts += nb_pkts - valid; > + txq->tx_pkts += valid; > + txq->tx_bytes += tx_bytes; > + > + return nb_pkts; > +} > + > +static int > +eth_dev_start(struct rte_eth_dev *dev) > +{ > + dev->data->dev_link.link_status = ETH_LINK_UP; > + > + return 0; > +} > + > +/* This function gets called when the current port gets stopped. */ > +static void > +eth_dev_stop(struct rte_eth_dev *dev) > +{ > + dev->data->dev_link.link_status = ETH_LINK_DOWN; > +} > + > +static int > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused) > +{ > + /* rx/tx must be paired */ > + if (dev->data->nb_rx_queues != dev->data->nb_tx_queues) > + return -EINVAL; > + > + return 0; > +} > + > +static void > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + > + dev_info->if_index = internals->if_index; > + dev_info->max_mac_addrs = 1; > + dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN; > + dev_info->max_rx_queues = 1; > + dev_info->max_tx_queues = 1; > + dev_info->min_rx_bufsize = 0; > + > + dev_info->default_rxportconf.nb_queues = 1; > + dev_info->default_txportconf.nb_queues = 1; > + dev_info->default_rxportconf.ring_size = ETH_AF_XDP_DFLT_NUM_DESCS; > + dev_info->default_txportconf.ring_size = ETH_AF_XDP_DFLT_NUM_DESCS; > +} > + > +static int > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats 
*stats) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + struct xdp_statistics xdp_stats; > + struct pkt_rx_queue *rxq; > + socklen_t optlen; > + int i; > + > + optlen = sizeof(struct xdp_statistics); > + for (i = 0; i < dev->data->nb_rx_queues; i++) { > + rxq = &internals->rx_queues[i]; > + stats->q_ipackets[i] = internals->rx_queues[i].rx_pkts; > + stats->q_ibytes[i] = internals->rx_queues[i].rx_bytes; > + > + stats->q_opackets[i] = internals->tx_queues[i].tx_pkts; > + stats->q_obytes[i] = internals->tx_queues[i].tx_bytes; > + > + stats->ipackets += stats->q_ipackets[i]; > + stats->ibytes += stats->q_ibytes[i]; > + stats->imissed += internals->rx_queues[i].rx_dropped; > + getsockopt(xsk_socket__fd(rxq->xsk), SOL_XDP, XDP_STATISTICS, > + &xdp_stats, &optlen); > + stats->imissed += xdp_stats.rx_dropped; > + > + stats->opackets += stats->q_opackets[i]; > + stats->oerrors += stats->q_errors[i]; > + stats->oerrors += internals->tx_queues[i].err_pkts; > + stats->obytes += stats->q_obytes[i]; > + } > + > + return 0; > +} > + > +static void > +eth_stats_reset(struct rte_eth_dev *dev) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + int i; > + > + for (i = 0; i < ETH_AF_XDP_MAX_QUEUE_PAIRS; i++) { > + internals->rx_queues[i].rx_pkts = 0; > + internals->rx_queues[i].rx_bytes = 0; > + internals->rx_queues[i].rx_dropped = 0; > + > + internals->tx_queues[i].tx_pkts = 0; > + internals->tx_queues[i].err_pkts = 0; > + internals->tx_queues[i].tx_bytes = 0; > + } > +} > + > +static void remove_xdp_program(struct pmd_internals *internals) > +{ > + uint32_t curr_prog_id = 0; > + > + if (bpf_get_link_xdp_id(internals->if_index, &curr_prog_id, > + XDP_FLAGS_UPDATE_IF_NOEXIST)) { > + RTE_LOG(ERR, AF_XDP, "bpf_get_link_xdp_id failed\n"); > + return; > + } > + bpf_set_link_xdp_fd(internals->if_index, -1, > + XDP_FLAGS_UPDATE_IF_NOEXIST); > +} > + > +static void > +eth_dev_close(struct rte_eth_dev *dev) > +{ > + struct pmd_internals *internals = 
dev->data->dev_private; > + struct pkt_rx_queue *rxq; > + int i; > + > + RTE_LOG(INFO, AF_XDP, "Closing AF_XDP ethdev on numa socket %u\n", > + rte_socket_id()); > + > + for (i = 0; i < ETH_AF_XDP_MAX_QUEUE_PAIRS; i++) { > + rxq = &internals->rx_queues[i]; > + if (!rxq->umem) > + break; > + xsk_socket__delete(rxq->xsk); > + } > + > + (void)xsk_umem__delete(internals->umem->umem); > + remove_xdp_program(internals); > +} > + > +static void > +eth_queue_release(void *q __rte_unused) > +{ > +} > + > +static int > +eth_link_update(struct rte_eth_dev *dev __rte_unused, > + int wait_to_complete __rte_unused) > +{ > + return 0; > +} > + > +static void xdp_umem_destroy(struct xsk_umem_info *umem) > +{ > + free(umem->buffer); > + umem->buffer = NULL; > + > + rte_ring_free(umem->buf_ring); > + umem->buf_ring = NULL; > + > + free(umem); > + umem = NULL; > +} > + > +static struct xsk_umem_info *xdp_umem_configure(void) > +{ > + struct xsk_umem_info *umem; > + struct xsk_umem_config usr_config = { > + .fill_size = ETH_AF_XDP_DFLT_NUM_DESCS, > + .comp_size = ETH_AF_XDP_DFLT_NUM_DESCS, > + .frame_size = ETH_AF_XDP_FRAME_SIZE, > + .frame_headroom = ETH_AF_XDP_DATA_HEADROOM }; > + void *bufs = NULL; > + char ring_name[0x100]; > + int ret; > + uint64_t i; > + > + umem = calloc(1, sizeof(*umem)); > + if (!umem) { 1.8.1 > + RTE_LOG(ERR, AF_XDP, "Failed to allocate umem info"); > + return NULL; > + } > + > + snprintf(ring_name, 0x100, "af_xdp_ring"); Again the magical 0x100. 
> + umem->buf_ring = rte_ring_create(ring_name, > + ETH_AF_XDP_NUM_BUFFERS, > + SOCKET_ID_ANY, > + 0x0); > + if (!umem->buf_ring) { 1.8.1 > + RTE_LOG(ERR, AF_XDP, > + "Failed to create rte_ring\n"); > + goto err; > + } > + > + for (i = 0; i < ETH_AF_XDP_NUM_BUFFERS; i++) > + rte_ring_enqueue(umem->buf_ring, > + (void *)(i * ETH_AF_XDP_FRAME_SIZE + > + ETH_AF_XDP_DATA_HEADROOM)); > + > + if (posix_memalign(&bufs, getpagesize(), > + ETH_AF_XDP_NUM_BUFFERS * ETH_AF_XDP_FRAME_SIZE)) { > + RTE_LOG(ERR, AF_XDP, "Failed to allocate memory pool.\n"); > + goto err; > + } > + ret = xsk_umem__create(&umem->umem, bufs, > + ETH_AF_XDP_NUM_BUFFERS * ETH_AF_XDP_FRAME_SIZE, > + &umem->fq, &umem->cq, > + &usr_config); > + > + if (ret) { > + RTE_LOG(ERR, AF_XDP, "Failed to create umem"); > + goto err; > + } > + umem->buffer = bufs; > + > + return umem; > + > +err: > + xdp_umem_destroy(umem); > + return NULL; > +} > + > +static int > +xsk_configure(struct pmd_internals *internals, struct pkt_rx_queue *rxq, > + int ring_size) > +{ > + struct xsk_socket_config cfg; > + struct pkt_tx_queue *txq = rxq->pair; > + int ret = 0; > + int reserve_size; > + > + rxq->umem = xdp_umem_configure(); > + if (!rxq->umem) { 1.8.1 > + ret = -ENOMEM; > + goto err; > + } > + > + cfg.rx_size = ring_size; > + cfg.tx_size = ring_size; > + cfg.libbpf_flags = 0; > + cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST; > + cfg.bind_flags = 0; > + ret = xsk_socket__create(&rxq->xsk, internals->if_name, > + internals->queue_idx, rxq->umem->umem, &rxq->rx, > + &txq->tx, &cfg); > + if (ret) { > + RTE_LOG(ERR, AF_XDP, "Failed to create xsk socket.\n"); > + goto err; > + } > + > + reserve_size = ETH_AF_XDP_DFLT_NUM_DESCS / 2; > + ret = reserve_fill_queue(rxq->umem, reserve_size); > + if (ret) { > + RTE_LOG(ERR, AF_XDP, "Failed to reserve fill queue.\n"); > + goto err; > + } > + > + return 0; > + > +err: > + xdp_umem_destroy(rxq->umem); > + > + return ret; > +} > + > +static void > +queue_reset(struct pmd_internals 
*internals, uint16_t queue_idx) > +{ > + struct pkt_rx_queue *rxq = &internals->rx_queues[queue_idx]; > + struct pkt_tx_queue *txq = rxq->pair; > + int xsk_fd = xsk_socket__fd(rxq->xsk); > + > + if (xsk_fd) { > + close(xsk_fd); > + if (internals->umem) { 1.8.1 > + xdp_umem_destroy(internals->umem); > + internals->umem = NULL; > + } > + } > + memset(rxq, 0, sizeof(*rxq)); > + memset(txq, 0, sizeof(*txq)); > + rxq->pair = txq; > + txq->pair = rxq; > + rxq->queue_idx = queue_idx; > + txq->queue_idx = queue_idx; > +} > + > +static int > +eth_rx_queue_setup(struct rte_eth_dev *dev, > + uint16_t rx_queue_id, > + uint16_t nb_rx_desc, > + unsigned int socket_id __rte_unused, > + const struct rte_eth_rxconf *rx_conf __rte_unused, > + struct rte_mempool *mb_pool) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + unsigned int buf_size, data_size; uint32_t > + struct pkt_rx_queue *rxq; > + int ret = 0; No need to set 'ret'. Alternatively, restructure so you always return 'ret'. 
> + > + rxq = &internals->rx_queues[rx_queue_id]; > + queue_reset(internals, rx_queue_id); > + > + /* Now get the space available for data in the mbuf */ > + buf_size = rte_pktmbuf_data_room_size(mb_pool) - > + RTE_PKTMBUF_HEADROOM; > + data_size = ETH_AF_XDP_FRAME_SIZE - ETH_AF_XDP_DATA_HEADROOM; > + > + if (data_size > buf_size) { > + RTE_LOG(ERR, AF_XDP, > + "%s: %d bytes will not fit in mbuf (%d bytes)\n", > + dev->device->name, data_size, buf_size); > + ret = -ENOMEM; > + goto err; > + } > + > + rxq->mb_pool = mb_pool; > + > + if (xsk_configure(internals, rxq, nb_rx_desc)) { > + RTE_LOG(ERR, AF_XDP, > + "Failed to configure xdp socket\n"); > + ret = -EINVAL; > + goto err; > + } > + > + internals->umem = rxq->umem; > + > + dev->data->rx_queues[rx_queue_id] = rxq; > + return 0; > + > +err: > + queue_reset(internals, rx_queue_id); > + return ret; > +} > + > +static int > +eth_tx_queue_setup(struct rte_eth_dev *dev, > + uint16_t tx_queue_id, > + uint16_t nb_tx_desc __rte_unused, > + unsigned int socket_id __rte_unused, > + const struct rte_eth_txconf *tx_conf __rte_unused) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + struct pkt_tx_queue *txq; > + > + txq = &internals->tx_queues[tx_queue_id]; > + > + dev->data->tx_queues[tx_queue_id] = txq; > + return 0; > +} > + > +static int > +eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + struct ifreq ifr = { .ifr_mtu = mtu }; > + int ret; > + int s; > + > + s = socket(PF_INET, SOCK_DGRAM, 0); > + if (s < 0) > + return -EINVAL; > + > + strlcpy(ifr.ifr_name, internals->if_name, IFNAMSIZ); > + ret = ioctl(s, SIOCSIFMTU, &ifr); > + close(s); > + > + if (ret < 0) > + return -EINVAL; > + > + return 0; > +} > + > +static void > +eth_dev_change_flags(char *if_name, uint32_t flags, uint32_t mask) > +{ > + struct ifreq ifr; > + int s; > + > + s = socket(PF_INET, SOCK_DGRAM, 0); > + if (s < 0) > + return; > + > + strlcpy(ifr.ifr_name, 
if_name, IFNAMSIZ); > + if (ioctl(s, SIOCGIFFLAGS, &ifr) < 0) > + goto out; > + ifr.ifr_flags &= mask; > + ifr.ifr_flags |= flags; > + if (ioctl(s, SIOCSIFFLAGS, &ifr) < 0) > + goto out; > +out: > + close(s); > +} > + > +static void > +eth_dev_promiscuous_enable(struct rte_eth_dev *dev) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + > + eth_dev_change_flags(internals->if_name, IFF_PROMISC, ~0); > +} > + > +static void > +eth_dev_promiscuous_disable(struct rte_eth_dev *dev) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + > + eth_dev_change_flags(internals->if_name, 0, ~IFF_PROMISC); > +} > + > +static const struct eth_dev_ops ops = { > + .dev_start = eth_dev_start, > + .dev_stop = eth_dev_stop, > + .dev_close = eth_dev_close, > + .dev_configure = eth_dev_configure, > + .dev_infos_get = eth_dev_info, > + .mtu_set = eth_dev_mtu_set, > + .promiscuous_enable = eth_dev_promiscuous_enable, > + .promiscuous_disable = eth_dev_promiscuous_disable, > + .rx_queue_setup = eth_rx_queue_setup, > + .tx_queue_setup = eth_tx_queue_setup, > + .rx_queue_release = eth_queue_release, > + .tx_queue_release = eth_queue_release, > + .link_update = eth_link_update, > + .stats_get = eth_stats_get, > + .stats_reset = eth_stats_reset, > +}; > + > +/** parse integer from integer argument */ > +static int > +parse_integer_arg(const char *key __rte_unused, > + const char *value, void *extra_args) > +{ > + int *i = (int *)extra_args; > + > + *i = atoi(value); Use strtol(). > + if (*i < 0) { > + RTE_LOG(ERR, AF_XDP, "Argument has to be positive.\n"); > + return -EINVAL; > + } > + > + return 0; > +} > + > +/** parse name argument */ > +static int > +parse_name_arg(const char *key __rte_unused, > + const char *value, void *extra_args) > +{ > + char *name = extra_args; > + > + if (strlen(value) > IFNAMSIZ) { The buffer is IFNAMSIZ bytes (which it should be), so it can't hold a string with strlen() == IFNAMSIZ. 
> + RTE_LOG(ERR, AF_XDP, "Invalid name %s, should be less than " > + "%u bytes.\n", value, IFNAMSIZ); > + return -EINVAL; > + } > + > + strlcpy(name, value, IFNAMSIZ); > + > + return 0; > +} > + > +static int > +parse_parameters(struct rte_kvargs *kvlist, > + char *if_name, > + int *queue_idx) > +{ > + int ret = 0; Should not be initialized. > + > + ret = rte_kvargs_process(kvlist, ETH_AF_XDP_IFACE_ARG, > + &parse_name_arg, if_name); > + if (ret < 0) > + goto free_kvlist; > + > + ret = rte_kvargs_process(kvlist, ETH_AF_XDP_QUEUE_IDX_ARG, > + &parse_integer_arg, queue_idx); > + if (ret < 0) > + goto free_kvlist; I fail to see the point of this goto, but maybe there's more code to follow in future patches. > + > +free_kvlist: > + rte_kvargs_free(kvlist); > + return ret; > +} > + > +static int > +get_iface_info(const char *if_name, > + struct ether_addr *eth_addr, > + int *if_index) > +{ > + struct ifreq ifr; > + int sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP); > + > + if (sock < 0) > + return -1; > + > + strlcpy(ifr.ifr_name, if_name, IFNAMSIZ); > + if (ioctl(sock, SIOCGIFINDEX, &ifr)) > + goto error; > + > + if (ioctl(sock, SIOCGIFHWADDR, &ifr)) > + goto error; > + > + rte_memcpy(eth_addr, ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN); > + > + close(sock); > + *if_index = if_nametoindex(if_name); > + return 0; > + > +error: > + close(sock); > + return -1; > +} > + > +static int > +init_internals(struct rte_vdev_device *dev, > + const char *if_name, > + int queue_idx, > + struct rte_eth_dev **eth_dev) > +{ > + const char *name = rte_vdev_device_name(dev); > + const unsigned int numa_node = dev->device.numa_node; > + struct pmd_internals *internals = NULL; > + int ret; > + int i; > + > + internals = rte_zmalloc_socket(name, sizeof(*internals), 0, numa_node); > + if (!internals) 1.8.1 > + return -ENOMEM; > + > + internals->queue_idx = queue_idx; > + strlcpy(internals->if_name, if_name, IFNAMSIZ); > + > + for (i = 0; i < ETH_AF_XDP_MAX_QUEUE_PAIRS; i++) { > + 
internals->tx_queues[i].pair = &internals->rx_queues[i]; > + internals->rx_queues[i].pair = &internals->tx_queues[i]; > + } > + > + ret = get_iface_info(if_name, &internals->eth_addr, > + &internals->if_index); > + if (ret) > + goto err; > + > + *eth_dev = rte_eth_vdev_allocate(dev, 0); > + if (!*eth_dev) 1.8.1 > + goto err; > + > + (*eth_dev)->data->dev_private = internals; > + (*eth_dev)->data->dev_link = pmd_link; > + (*eth_dev)->data->mac_addrs = &internals->eth_addr; > + (*eth_dev)->dev_ops = &ops; > + (*eth_dev)->rx_pkt_burst = eth_af_xdp_rx; > + (*eth_dev)->tx_pkt_burst = eth_af_xdp_tx; > + > + return 0; > + > +err: > + rte_free(internals); > + return -1; > +} > + > +static int > +rte_pmd_af_xdp_probe(struct rte_vdev_device *dev) > +{ > + struct rte_kvargs *kvlist; > + char if_name[IFNAMSIZ]; > + int xsk_queue_idx = ETH_AF_XDP_DFLT_QUEUE_IDX; > + struct rte_eth_dev *eth_dev = NULL; > + const char *name; > + int ret; > + > + RTE_LOG(INFO, AF_XDP, "Initializing pmd_af_xdp for %s\n", > + rte_vdev_device_name(dev)); > + > + name = rte_vdev_device_name(dev); > + if (rte_eal_process_type() == RTE_PROC_SECONDARY && > + strlen(rte_vdev_device_args(dev)) == 0) { > + eth_dev = rte_eth_dev_attach_secondary(name); > + if (!eth_dev) { > + RTE_LOG(ERR, AF_XDP, "Failed to probe %s\n", name); > + return -EINVAL; > + } > + eth_dev->dev_ops = &ops; > + rte_eth_dev_probing_finish(eth_dev); > + return 0; > + } > + > + kvlist = rte_kvargs_parse(rte_vdev_device_args(dev), valid_arguments); > + if (!kvlist) { > + RTE_LOG(ERR, AF_XDP, "Invalid kvargs key\n"); > + return -EINVAL; > + } > + > + if (dev->device.numa_node == SOCKET_ID_ANY) > + dev->device.numa_node = rte_socket_id(); > + > + if (parse_parameters(kvlist, if_name, &xsk_queue_idx) < 0) { > + RTE_LOG(ERR, AF_XDP, "Invalid kvargs value\n"); > + return -EINVAL; > + } > + > + ret = init_internals(dev, if_name, xsk_queue_idx, ð_dev); > + if (ret) { > + RTE_LOG(ERR, AF_XDP, "Failed to init internals\n"); > + return ret; > + } > 
+ > + rte_eth_dev_probing_finish(eth_dev); > + > + return 0; > +} > + > +static int > +rte_pmd_af_xdp_remove(struct rte_vdev_device *dev) > +{ > + struct rte_eth_dev *eth_dev = NULL; > + struct pmd_internals *internals; > + > + RTE_LOG(INFO, AF_XDP, "Removing AF_XDP ethdev on numa socket %u\n", > + rte_socket_id()); > + > + if (!dev) > + return -1; > + > + /* find the ethdev entry */ > + eth_dev = rte_eth_dev_allocated(rte_vdev_device_name(dev)); > + if (!eth_dev) > + return -1; > + > + internals = eth_dev->data->dev_private; > + > + rte_ring_free(internals->umem->buf_ring); > + rte_free(internals->umem->buffer); > + rte_free(internals->umem); > + > + rte_eth_dev_release_port(eth_dev); > + > + > + return 0; > +} > + > +static struct rte_vdev_driver pmd_af_xdp_drv = { > + .probe = rte_pmd_af_xdp_probe, > + .remove = rte_pmd_af_xdp_remove, > +}; > + > +RTE_PMD_REGISTER_VDEV(eth_af_xdp, pmd_af_xdp_drv); > +RTE_PMD_REGISTER_PARAM_STRING(eth_af_xdp, > + "iface= " > + "queue= "); > diff --git a/drivers/net/af_xdp/rte_pmd_af_xdp_version.map b/drivers/net/af_xdp/rte_pmd_af_xdp_version.map > new file mode 100644 > index 000000000..c6db030fe > --- /dev/null > +++ b/drivers/net/af_xdp/rte_pmd_af_xdp_version.map > @@ -0,0 +1,3 @@ > +DPDK_19.05 { > + local: *; > +}; > diff --git a/drivers/net/meson.build b/drivers/net/meson.build > index 3ecc78cee..1105e72d8 100644 > --- a/drivers/net/meson.build > +++ b/drivers/net/meson.build > @@ -2,6 +2,7 @@ > # Copyright(c) 2017 Intel Corporation > > drivers = ['af_packet', > + 'af_xdp', > 'ark', > 'atlantic', > 'avp', > diff --git a/mk/rte.app.mk b/mk/rte.app.mk > index 262132fc6..be0af73cc 100644 > --- a/mk/rte.app.mk > +++ b/mk/rte.app.mk > @@ -143,6 +143,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += -lrte_mempool_dpaa2 > endif > > _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET) += -lrte_pmd_af_packet > +_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += -lrte_pmd_af_xdp -lelf -lbpf > _LDLIBS-$(CONFIG_RTE_LIBRTE_ARK_PMD) += -lrte_pmd_ark > 
_LDLIBS-$(CONFIG_RTE_LIBRTE_ATLANTIC_PMD) += -lrte_pmd_atlantic > _LDLIBS-$(CONFIG_RTE_LIBRTE_AVP_PMD) += -lrte_pmd_avp >
On 2019-03-19 08:12, Xiaolong Ye wrote: > Add a new PMD driver for AF_XDP which is a proposed faster version of > AF_PACKET interface in Linux. More info about AF_XDP, please refer to [1] > [2]. > > This is the vanilla version PMD which just uses a raw buffer registered as > the umem. > > [1] https://fosdem.org/2018/schedule/event/af_xdp/ > [2] https://lwn.net/Articles/745934/ > > Signed-off-by: Xiaolong Ye > --- > MAINTAINERS | 6 + > config/common_base | 5 + > config/common_linux | 1 + > doc/guides/nics/af_xdp.rst | 45 + > doc/guides/nics/features/af_xdp.ini | 11 + > doc/guides/nics/index.rst | 1 + > doc/guides/rel_notes/release_19_05.rst | 7 + > drivers/net/Makefile | 1 + > drivers/net/af_xdp/Makefile | 33 + > drivers/net/af_xdp/meson.build | 21 + > drivers/net/af_xdp/rte_eth_af_xdp.c | 930 ++++++++++++++++++ > drivers/net/af_xdp/rte_pmd_af_xdp_version.map | 3 + > drivers/net/meson.build | 1 + > mk/rte.app.mk | 1 + > 14 files changed, 1066 insertions(+) > create mode 100644 doc/guides/nics/af_xdp.rst > create mode 100644 doc/guides/nics/features/af_xdp.ini > create mode 100644 drivers/net/af_xdp/Makefile > create mode 100644 drivers/net/af_xdp/meson.build > create mode 100644 drivers/net/af_xdp/rte_eth_af_xdp.c > create mode 100644 drivers/net/af_xdp/rte_pmd_af_xdp_version.map > > diff --git a/MAINTAINERS b/MAINTAINERS > index 452b8eb82..1cc54b439 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -468,6 +468,12 @@ M: John W.
Linville > F: drivers/net/af_packet/ > F: doc/guides/nics/features/afpacket.ini > > +Linux AF_XDP > +M: Xiaolong Ye > +M: Qi Zhang > +F: drivers/net/af_xdp/ > +F: doc/guides/nics/features/af_xdp.rst > + > Amazon ENA > M: Marcin Wojtas > M: Michal Krawczyk > diff --git a/config/common_base b/config/common_base > index 0b09a9348..4044de205 100644 > --- a/config/common_base > +++ b/config/common_base > @@ -416,6 +416,11 @@ CONFIG_RTE_LIBRTE_VMXNET3_DEBUG_TX_FREE=n > # > CONFIG_RTE_LIBRTE_PMD_AF_PACKET=n > > +# > +# Compile software PMD backed by AF_XDP sockets (Linux only) > +# > +CONFIG_RTE_LIBRTE_PMD_AF_XDP=n > + > # > # Compile link bonding PMD library > # > diff --git a/config/common_linux b/config/common_linux > index 75334273d..0b1249da0 100644 > --- a/config/common_linux > +++ b/config/common_linux > @@ -19,6 +19,7 @@ CONFIG_RTE_LIBRTE_VHOST_POSTCOPY=n > CONFIG_RTE_LIBRTE_PMD_VHOST=y > CONFIG_RTE_LIBRTE_IFC_PMD=y > CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y > +CONFIG_RTE_LIBRTE_PMD_AF_XDP=y > CONFIG_RTE_LIBRTE_PMD_SOFTNIC=y > CONFIG_RTE_LIBRTE_PMD_TAP=y > CONFIG_RTE_LIBRTE_AVP_PMD=y > diff --git a/doc/guides/nics/af_xdp.rst b/doc/guides/nics/af_xdp.rst > new file mode 100644 > index 000000000..dd5654dd1 > --- /dev/null > +++ b/doc/guides/nics/af_xdp.rst > @@ -0,0 +1,45 @@ > +.. SPDX-License-Identifier: BSD-3-Clause > + Copyright(c) 2018 Intel Corporation. > + > +AF_XDP Poll Mode Driver > +========================== > + > +AF_XDP is an address family that is optimized for high performance > +packet processing. AF_XDP sockets enable the possibility for XDP program to > +redirect packets to a memory buffer in userspace. > + > +For the full details behind AF_XDP socket, you can refer to > +`AF_XDP documentation in the Kernel > +`_. > + > +This Linux-specific PMD driver creates the AF_XDP socket and binds it to a > +specific netdev queue, it allows a DPDK application to send and receive raw > +packets through the socket which would bypass the kernel network stack. 
> +Current implementation only supports single queue, multi-queues feature will > +be added later. > + > +Options > +------- > + > +The following options can be provided to set up an af_xdp port in DPDK. > + > +* ``iface`` - name of the Kernel interface to attach to (required); > +* ``queue`` - netdev queue id (optional, default 0); > + > +Prerequisites > +------------- > + > +This is a Linux-specific PMD, thus the following prerequisites apply: > + > +* A Linux Kernel (version > 4.18) with XDP sockets configuration enabled; > +* libbpf (within kernel version > 5.1) with latest af_xdp support installed > +* A Kernel bound interface to attach to. > + > +Set up an af_xdp interface > +----------------------------- > + > +The following example will set up an af_xdp interface in DPDK: > + > +.. code-block:: console > + > + --vdev eth_af_xdp,iface=ens786f1,queue=0 > diff --git a/doc/guides/nics/features/af_xdp.ini b/doc/guides/nics/features/af_xdp.ini > new file mode 100644 > index 000000000..7b8fcce00 > --- /dev/null > +++ b/doc/guides/nics/features/af_xdp.ini > @@ -0,0 +1,11 @@ > +; > +; Supported features of the 'af_xdp' network poll mode driver. > +; > +; Refer to default.ini for the full list of available PMD features. > +; > +[Features] > +Link status = Y > +MTU update = Y > +Promiscuous mode = Y > +Stats per queue = Y > +x86-64 = Y > \ No newline at end of file > diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst > index 5c80e3baa..a4b80a3d0 100644 > --- a/doc/guides/nics/index.rst > +++ b/doc/guides/nics/index.rst > @@ -12,6 +12,7 @@ Network Interface Controller Drivers > features > build_and_test > af_packet > + af_xdp > ark > atlantic > avp > diff --git a/doc/guides/rel_notes/release_19_05.rst b/doc/guides/rel_notes/release_19_05.rst > index 61a2c7383..062facf89 100644 > --- a/doc/guides/rel_notes/release_19_05.rst > +++ b/doc/guides/rel_notes/release_19_05.rst > @@ -65,6 +65,13 @@ New Features > process. 
> * Added support for Rx packet types list in a secondary process. > > +* **Added the AF_XDP PMD.** > + > + Added a Linux-specific PMD driver for AF_XDP, it can create the AF_XDP socket > + and bind it to a specific netdev queue, it allows a DPDK application to send > + and receive raw packets through the socket which would bypass the kernel > + network stack to achieve high performance packet processing. > + > * **Updated Mellanox drivers.** > > New features and improvements were done in mlx4 and mlx5 PMDs: > diff --git a/drivers/net/Makefile b/drivers/net/Makefile > index 502869a87..5d401b8c5 100644 > --- a/drivers/net/Makefile > +++ b/drivers/net/Makefile > @@ -9,6 +9,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD),d) > endif > > DIRS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET) += af_packet > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += af_xdp > DIRS-$(CONFIG_RTE_LIBRTE_ARK_PMD) += ark > DIRS-$(CONFIG_RTE_LIBRTE_ATLANTIC_PMD) += atlantic > DIRS-$(CONFIG_RTE_LIBRTE_AVP_PMD) += avp > diff --git a/drivers/net/af_xdp/Makefile b/drivers/net/af_xdp/Makefile > new file mode 100644 > index 000000000..6cf0ed7db > --- /dev/null > +++ b/drivers/net/af_xdp/Makefile > @@ -0,0 +1,33 @@ > +# SPDX-License-Identifier: BSD-3-Clause > +# Copyright(c) 2018 Intel Corporation > + > +include $(RTE_SDK)/mk/rte.vars.mk > + > +# > +# library name > +# > +LIB = librte_pmd_af_xdp.a > + > +EXPORT_MAP := rte_pmd_af_xdp_version.map > + > +LIBABIVER := 1 > + > +CFLAGS += -O3 > + > +# require kernel version >= v5.1-rc1 > +LINUX_VERSION := $(shell uname -r) > +CFLAGS += -I/lib/modules/$(LINUX_VERSION)/build/tools/include > +CFLAGS += -I/lib/modules/$(LINUX_VERSION)/build/tools/lib/bpf > + > +CFLAGS += $(WERROR_FLAGS) > +LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring > +LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs > +LDLIBS += -lrte_bus_vdev > +LDLIBS += -lbpf > + > +# > +# all source are stored in SRCS-y > +# > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += rte_eth_af_xdp.c > + > +include 
$(RTE_SDK)/mk/rte.lib.mk > diff --git a/drivers/net/af_xdp/meson.build b/drivers/net/af_xdp/meson.build > new file mode 100644 > index 000000000..635e67483 > --- /dev/null > +++ b/drivers/net/af_xdp/meson.build > @@ -0,0 +1,21 @@ > +# SPDX-License-Identifier: BSD-3-Clause > +# Copyright(c) 2018 Intel Corporation > + > +if host_machine.system() != 'linux' > + build = false > +endif > + > +bpf_dep = dependency('libbpf', required: false) > +if bpf_dep.found() > + build = true > +else > + bpf_dep = cc.find_library('libbpf', required: false) > + if bpf_dep.found() and cc.has_header('xsk.h', dependencies: bpf_dep) > + build = true > + pkgconfig_extra_libs += '-lbpf' > + else > + build = false > + endif > +endif > +sources = files('rte_eth_af_xdp.c') > +ext_deps += bpf_dep > diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c > new file mode 100644 > index 000000000..96dedc0c4 > --- /dev/null > +++ b/drivers/net/af_xdp/rte_eth_af_xdp.c > @@ -0,0 +1,930 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(c) 2019 Intel Corporation. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > +#include > +#include Is this include used? 
> +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define RTE_LOGTYPE_AF_XDP RTE_LOGTYPE_USER1 > +#ifndef SOL_XDP > +#define SOL_XDP 283 > +#endif > + > +#ifndef AF_XDP > +#define AF_XDP 44 > +#endif > + > +#ifndef PF_XDP > +#define PF_XDP AF_XDP > +#endif > + > +#define ETH_AF_XDP_IFACE_ARG "iface" > +#define ETH_AF_XDP_QUEUE_IDX_ARG "queue" > + > +#define ETH_AF_XDP_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE > +#define ETH_AF_XDP_NUM_BUFFERS 4096 > +#define ETH_AF_XDP_DATA_HEADROOM 0 > +#define ETH_AF_XDP_DFLT_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS > +#define ETH_AF_XDP_DFLT_QUEUE_IDX 0 > + > +#define ETH_AF_XDP_RX_BATCH_SIZE 32 > +#define ETH_AF_XDP_TX_BATCH_SIZE 32 > + > +#define ETH_AF_XDP_MAX_QUEUE_PAIRS 16 > + > +struct xsk_umem_info { > + struct xsk_ring_prod fq; > + struct xsk_ring_cons cq; > + struct xsk_umem *umem; > + struct rte_ring *buf_ring; > + void *buffer; > +}; > + > +struct pkt_rx_queue { > + struct xsk_ring_cons rx; > + struct xsk_umem_info *umem; > + struct xsk_socket *xsk; > + struct rte_mempool *mb_pool; > + > + uint64_t rx_pkts; > + uint64_t rx_bytes; > + uint64_t rx_dropped; > + > + struct pkt_tx_queue *pair; > + uint16_t queue_idx; > +}; > + > +struct pkt_tx_queue { > + struct xsk_ring_prod tx; > + > + uint64_t tx_pkts; > + uint64_t err_pkts; > + uint64_t tx_bytes; > + > + struct pkt_rx_queue *pair; > + uint16_t queue_idx; > +}; > + > +struct pmd_internals { > + int if_index; > + char if_name[IFNAMSIZ]; > + uint16_t queue_idx; > + struct ether_addr eth_addr; > + struct xsk_umem_info *umem; > + struct rte_mempool *mb_pool_share; > + > + struct pkt_rx_queue rx_queues[ETH_AF_XDP_MAX_QUEUE_PAIRS]; > + struct pkt_tx_queue tx_queues[ETH_AF_XDP_MAX_QUEUE_PAIRS]; > +}; > + > +static const char * const valid_arguments[] = { > + ETH_AF_XDP_IFACE_ARG, > + ETH_AF_XDP_QUEUE_IDX_ARG, > + NULL > +}; > + > +static struct rte_eth_link pmd_link = { > + .link_speed = 
ETH_SPEED_NUM_10G, > + .link_duplex = ETH_LINK_FULL_DUPLEX, > + .link_status = ETH_LINK_DOWN, > + .link_autoneg = ETH_LINK_AUTONEG > +}; > + > +static inline int > +reserve_fill_queue(struct xsk_umem_info *umem, int reserve_size) > +{ > + struct xsk_ring_prod *fq = &umem->fq; > + uint32_t idx; > + void *addr = NULL; > + int i, ret = 0; No need to initialize 'ret'. Is there a point to set 'addr'? > + > + ret = xsk_ring_prod__reserve(fq, reserve_size, &idx); > + if (!ret) { > + RTE_LOG(ERR, AF_XDP, "Failed to reserve enough fq descs.\n"); > + return ret; > + } > + > + for (i = 0; i < reserve_size; i++) { > + rte_ring_dequeue(umem->buf_ring, &addr); > + *xsk_ring_prod__fill_addr(fq, idx++) = (uint64_t)addr; Consider introducing a tmp variable to make this more readable. > + } > + > + xsk_ring_prod__submit(fq, reserve_size); > + > + return 0; > +} > + > +static uint16_t > +eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) > +{ > + struct pkt_rx_queue *rxq = queue; > + struct xsk_ring_cons *rx = &rxq->rx; > + struct xsk_umem_info *umem = rxq->umem; > + struct xsk_ring_prod *fq = &umem->fq; > + uint32_t idx_rx; > + uint32_t free_thresh = fq->size >> 1; > + struct rte_mbuf *mbuf; > + unsigned long dropped = 0; > + unsigned long rx_bytes = 0; > + uint16_t count = 0; > + int rcvd, i; > + > + nb_pkts = nb_pkts < ETH_AF_XDP_RX_BATCH_SIZE ? > + nb_pkts : ETH_AF_XDP_RX_BATCH_SIZE; > + > + rcvd = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx); > + if (!rcvd) Since peek returns the number of entries, not a boolean, do: rcvd == 0 > + return 0; > + > + if (xsk_prod_nb_free(fq, free_thresh) >= free_thresh) > + (void)reserve_fill_queue(umem, ETH_AF_XDP_RX_BATCH_SIZE); > + > + for (i = 0; i < rcvd; i++) { > + uint64_t addr = xsk_ring_cons__rx_desc(rx, idx_rx)->addr; > + uint32_t len = xsk_ring_cons__rx_desc(rx, idx_rx++)->len; Use a tmp variable, instead of two calls. > + char *pkt = xsk_umem__get_data(rxq->umem->buffer, addr); > + Don't mix declaration and code. 
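The "use a tmp variable" suggestion above could look like the following, outside of DPDK; struct xdp_desc is re-declared here purely so the sketch stands alone (the real definition lives in <linux/if_xdp.h>), and the function name is made up:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct xdp_desc from <linux/if_xdp.h>, so this
 * sketch compiles on its own. */
struct xdp_desc {
	uint64_t addr;
	uint32_t len;
	uint32_t options;
};

/* One descriptor lookup through a temporary pointer, instead of
 * calling xsk_ring_cons__rx_desc() twice for the same index. */
static void read_rx_desc(const struct xdp_desc *ring, uint32_t idx,
			 uint64_t *addr, uint32_t *len)
{
	const struct xdp_desc *desc = &ring[idx];

	*addr = desc->addr;
	*len = desc->len;
}
```

Besides readability, the temporary makes it obvious to the compiler (and the reader) that both fields come from the same descriptor.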
Why is this a char pointer? As opposed to void. > + mbuf = rte_pktmbuf_alloc(rxq->mb_pool); > + if (mbuf) { 1.8.1 > + memcpy(rte_pktmbuf_mtod(mbuf, void*), pkt, len); rte_memcpy() > + rte_pktmbuf_pkt_len(mbuf) = > + rte_pktmbuf_data_len(mbuf) = len; Consider splitting this into two statements. > + rx_bytes += len; > + bufs[count++] = mbuf; > + } else { > + dropped++; > + } > + rte_ring_enqueue(umem->buf_ring, (void *)addr); > + } > + > + xsk_ring_cons__release(rx, rcvd); > + > + /* statistics */ > + rxq->rx_pkts += (rcvd - dropped); > + rxq->rx_bytes += rx_bytes; > + rxq->rx_dropped += dropped; > + > + return count; > +} > + > +static void pull_umem_cq(struct xsk_umem_info *umem, int size) > +{ > + struct xsk_ring_cons *cq = &umem->cq; > + int i, n; > + uint32_t idx_cq; > + uint64_t addr; > + > + n = xsk_ring_cons__peek(cq, size, &idx_cq); Use size_t for n. > + if (n > 0) { > + for (i = 0; i < n; i++) { Consider declaring 'addr' in this scope. > + addr = *xsk_ring_cons__comp_addr(cq, > + idx_cq++); > + rte_ring_enqueue(umem->buf_ring, (void *)addr); > + } > + > + xsk_ring_cons__release(cq, n); > + } > +} > + > +static void kick_tx(struct pkt_tx_queue *txq) > +{ > + struct xsk_umem_info *umem = txq->pair->umem; > + int ret; > + > + while (1) { for (;;) > + ret = sendto(xsk_socket__fd(txq->pair->xsk), NULL, 0, > + MSG_DONTWAIT, NULL, 0); > + > + /* everything is ok */ > + if (ret >= 0) Use likely()?
> + break; > + > + /* some thing unexpected */ > + if (errno != EBUSY && errno != EAGAIN) > + break; > + > + /* pull from complete qeueu to leave more space */ > + if (errno == EAGAIN) > + pull_umem_cq(umem, ETH_AF_XDP_TX_BATCH_SIZE); > + } > + pull_umem_cq(umem, ETH_AF_XDP_TX_BATCH_SIZE); > +} > + > +static uint16_t > +eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) > +{ > + struct pkt_tx_queue *txq = queue; > + struct xsk_umem_info *umem = txq->pair->umem; > + struct rte_mbuf *mbuf; > + void *addrs[ETH_AF_XDP_TX_BATCH_SIZE]; > + unsigned long tx_bytes = 0; > + int i, valid = 0; > + uint32_t idx_tx; > + > + nb_pkts = nb_pkts < ETH_AF_XDP_TX_BATCH_SIZE ? > + nb_pkts : ETH_AF_XDP_TX_BATCH_SIZE; Use RTE_MIN(). > + > + pull_umem_cq(umem, nb_pkts); > + > + nb_pkts = rte_ring_dequeue_bulk(umem->buf_ring, addrs, > + nb_pkts, NULL); > + if (!nb_pkts) nb_pkts == 0 > + return 0; > + > + if (xsk_ring_prod__reserve(&txq->tx, nb_pkts, &idx_tx) != nb_pkts) { > + kick_tx(txq); > + return 0; > + } > + > + for (i = 0; i < nb_pkts; i++) { > + struct xdp_desc *desc; > + char *pkt; Use void pointer? > + unsigned int buf_len = ETH_AF_XDP_FRAME_SIZE > + - ETH_AF_XDP_DATA_HEADROOM; Use uint32_t, as you seem to do elsewhere. 
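Putting the kick_tx() comments from above together — for (;;) rather than while (1), the ret >= 0 exit first (a likely() candidate), retry only on EBUSY/EAGAIN — the control flow can be sketched in plain C. Canned results stand in for sendto() so the loop runs without a socket; all names here are illustrative, not driver code:

```c
#include <assert.h>
#include <errno.h>

/* Replays a sequence of simulated sendto() results, using the loop
 * shape suggested for kick_tx(). */
static int kick_until_sent(const int *results, const int *errnos, int n)
{
	int i = 0;

	for (;;) {
		int ret = results[i];

		errno = errnos[i];
		if (i < n - 1)
			i++;
		if (ret >= 0)		/* the common case */
			return ret;
		if (errno != EBUSY && errno != EAGAIN)
			return -1;	/* something unexpected */
		/* EAGAIN: the driver would drain the completion
		 * queue here to free descriptors, then retry */
	}
}
```

The success check coming first matches the expected branch frequencies, which is also what wrapping it in likely() tells the compiler in the real hot path.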
> + desc = xsk_ring_prod__tx_desc(&txq->tx, idx_tx + i); > + mbuf = bufs[i]; > + if (mbuf->pkt_len <= buf_len) { > + desc->addr = (uint64_t)addrs[valid]; > + desc->len = mbuf->pkt_len; > + pkt = xsk_umem__get_data(umem->buffer, > + desc->addr); > + memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *), > + desc->len); rte_memcpy() > + valid++; > + tx_bytes += mbuf->pkt_len; > + } > + rte_pktmbuf_free(mbuf); > + } > + > + xsk_ring_prod__submit(&txq->tx, nb_pkts); > + > + kick_tx(txq); > + > + if (valid < nb_pkts) > + rte_ring_enqueue_bulk(umem->buf_ring, &addrs[valid], > + nb_pkts - valid, NULL); > + > + txq->err_pkts += nb_pkts - valid; > + txq->tx_pkts += valid; > + txq->tx_bytes += tx_bytes; > + > + return nb_pkts; > +} > + > +static int > +eth_dev_start(struct rte_eth_dev *dev) > +{ > + dev->data->dev_link.link_status = ETH_LINK_UP; > + > + return 0; > +} > + > +/* This function gets called when the current port gets stopped. */ > +static void > +eth_dev_stop(struct rte_eth_dev *dev) > +{ > + dev->data->dev_link.link_status = ETH_LINK_DOWN; > +} > + > +static int > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused) > +{ > + /* rx/tx must be paired */ > + if (dev->data->nb_rx_queues != dev->data->nb_tx_queues) > + return -EINVAL; > + > + return 0; > +} > + > +static void > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + > + dev_info->if_index = internals->if_index; > + dev_info->max_mac_addrs = 1; > + dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN; > + dev_info->max_rx_queues = 1; > + dev_info->max_tx_queues = 1; > + dev_info->min_rx_bufsize = 0; > + > + dev_info->default_rxportconf.nb_queues = 1; > + dev_info->default_txportconf.nb_queues = 1; > + dev_info->default_rxportconf.ring_size = ETH_AF_XDP_DFLT_NUM_DESCS; > + dev_info->default_txportconf.ring_size = ETH_AF_XDP_DFLT_NUM_DESCS; > +} > + > +static int > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats 
*stats) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + struct xdp_statistics xdp_stats; > + struct pkt_rx_queue *rxq; > + socklen_t optlen; > + int i; > + > + optlen = sizeof(struct xdp_statistics); > + for (i = 0; i < dev->data->nb_rx_queues; i++) { > + rxq = &internals->rx_queues[i]; > + stats->q_ipackets[i] = internals->rx_queues[i].rx_pkts; > + stats->q_ibytes[i] = internals->rx_queues[i].rx_bytes; > + > + stats->q_opackets[i] = internals->tx_queues[i].tx_pkts; > + stats->q_obytes[i] = internals->tx_queues[i].tx_bytes; > + > + stats->ipackets += stats->q_ipackets[i]; > + stats->ibytes += stats->q_ibytes[i]; > + stats->imissed += internals->rx_queues[i].rx_dropped; > + getsockopt(xsk_socket__fd(rxq->xsk), SOL_XDP, XDP_STATISTICS, > + &xdp_stats, &optlen); > + stats->imissed += xdp_stats.rx_dropped; > + > + stats->opackets += stats->q_opackets[i]; > + stats->oerrors += stats->q_errors[i]; > + stats->oerrors += internals->tx_queues[i].err_pkts; > + stats->obytes += stats->q_obytes[i]; > + } > + > + return 0; > +} > + > +static void > +eth_stats_reset(struct rte_eth_dev *dev) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + int i; > + > + for (i = 0; i < ETH_AF_XDP_MAX_QUEUE_PAIRS; i++) { > + internals->rx_queues[i].rx_pkts = 0; > + internals->rx_queues[i].rx_bytes = 0; > + internals->rx_queues[i].rx_dropped = 0; > + > + internals->tx_queues[i].tx_pkts = 0; > + internals->tx_queues[i].err_pkts = 0; > + internals->tx_queues[i].tx_bytes = 0; > + } > +} > + > +static void remove_xdp_program(struct pmd_internals *internals) > +{ > + uint32_t curr_prog_id = 0; > + > + if (bpf_get_link_xdp_id(internals->if_index, &curr_prog_id, > + XDP_FLAGS_UPDATE_IF_NOEXIST)) { > + RTE_LOG(ERR, AF_XDP, "bpf_get_link_xdp_id failed\n"); > + return; > + } > + bpf_set_link_xdp_fd(internals->if_index, -1, > + XDP_FLAGS_UPDATE_IF_NOEXIST); > +} > + > +static void > +eth_dev_close(struct rte_eth_dev *dev) > +{ > + struct pmd_internals *internals = 
dev->data->dev_private; > + struct pkt_rx_queue *rxq; > + int i; > + > + RTE_LOG(INFO, AF_XDP, "Closing AF_XDP ethdev on numa socket %u\n", > + rte_socket_id()); > + > + for (i = 0; i < ETH_AF_XDP_MAX_QUEUE_PAIRS; i++) { > + rxq = &internals->rx_queues[i]; > + if (!rxq->umem) > + break; > + xsk_socket__delete(rxq->xsk); > + } > + > + (void)xsk_umem__delete(internals->umem->umem); > + remove_xdp_program(internals); > +} > + > +static void > +eth_queue_release(void *q __rte_unused) > +{ > +} > + > +static int > +eth_link_update(struct rte_eth_dev *dev __rte_unused, > + int wait_to_complete __rte_unused) > +{ > + return 0; > +} > + > +static void xdp_umem_destroy(struct xsk_umem_info *umem) > +{ > + free(umem->buffer); > + umem->buffer = NULL; > + > + rte_ring_free(umem->buf_ring); > + umem->buf_ring = NULL; > + > + free(umem); > + umem = NULL; > +} > + > +static struct xsk_umem_info *xdp_umem_configure(void) > +{ > + struct xsk_umem_info *umem; > + struct xsk_umem_config usr_config = { > + .fill_size = ETH_AF_XDP_DFLT_NUM_DESCS, > + .comp_size = ETH_AF_XDP_DFLT_NUM_DESCS, > + .frame_size = ETH_AF_XDP_FRAME_SIZE, > + .frame_headroom = ETH_AF_XDP_DATA_HEADROOM }; > + void *bufs = NULL; > + char ring_name[0x100]; > + int ret; > + uint64_t i; > + > + umem = calloc(1, sizeof(*umem)); > + if (!umem) { 1.8.1 > + RTE_LOG(ERR, AF_XDP, "Failed to allocate umem info"); > + return NULL; > + } > + > + snprintf(ring_name, 0x100, "af_xdp_ring"); Again the magical 0x100. 
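One way to act on the "magical 0x100" remark is to take the bound from the array itself with sizeof at the call site, so the literal appears exactly once. A generic sketch, not the driver code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The caller passes sizeof(buf), so the buffer size and the bound
 * cannot drift apart; no repeated 0x100 literal. */
static void format_ring_name(char *buf, size_t sz)
{
	snprintf(buf, sz, "af_xdp_ring");
}
```

A named constant (or sizeof(ring_name) directly in the snprintf() call) reads the same and removes the magic number either way.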
> + umem->buf_ring = rte_ring_create(ring_name, > + ETH_AF_XDP_NUM_BUFFERS, > + SOCKET_ID_ANY, > + 0x0); > + if (!umem->buf_ring) { 1.8.1 > + RTE_LOG(ERR, AF_XDP, > + "Failed to create rte_ring\n"); > + goto err; > + } > + > + for (i = 0; i < ETH_AF_XDP_NUM_BUFFERS; i++) > + rte_ring_enqueue(umem->buf_ring, > + (void *)(i * ETH_AF_XDP_FRAME_SIZE + > + ETH_AF_XDP_DATA_HEADROOM)); > + > + if (posix_memalign(&bufs, getpagesize(), > + ETH_AF_XDP_NUM_BUFFERS * ETH_AF_XDP_FRAME_SIZE)) { > + RTE_LOG(ERR, AF_XDP, "Failed to allocate memory pool.\n"); > + goto err; > + } > + ret = xsk_umem__create(&umem->umem, bufs, > + ETH_AF_XDP_NUM_BUFFERS * ETH_AF_XDP_FRAME_SIZE, > + &umem->fq, &umem->cq, > + &usr_config); > + > + if (ret) { > + RTE_LOG(ERR, AF_XDP, "Failed to create umem"); > + goto err; > + } > + umem->buffer = bufs; > + > + return umem; > + > +err: > + xdp_umem_destroy(umem); > + return NULL; > +} > + > +static int > +xsk_configure(struct pmd_internals *internals, struct pkt_rx_queue *rxq, > + int ring_size) > +{ > + struct xsk_socket_config cfg; > + struct pkt_tx_queue *txq = rxq->pair; > + int ret = 0; > + int reserve_size; > + > + rxq->umem = xdp_umem_configure(); > + if (!rxq->umem) { 1.8.1 > + ret = -ENOMEM; > + goto err; > + } > + > + cfg.rx_size = ring_size; > + cfg.tx_size = ring_size; > + cfg.libbpf_flags = 0; > + cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST; > + cfg.bind_flags = 0; > + ret = xsk_socket__create(&rxq->xsk, internals->if_name, > + internals->queue_idx, rxq->umem->umem, &rxq->rx, > + &txq->tx, &cfg); > + if (ret) { > + RTE_LOG(ERR, AF_XDP, "Failed to create xsk socket.\n"); > + goto err; > + } > + > + reserve_size = ETH_AF_XDP_DFLT_NUM_DESCS / 2; > + ret = reserve_fill_queue(rxq->umem, reserve_size); > + if (ret) { > + RTE_LOG(ERR, AF_XDP, "Failed to reserve fill queue.\n"); > + goto err; > + } > + > + return 0; > + > +err: > + xdp_umem_destroy(rxq->umem); > + > + return ret; > +} > + > +static void > +queue_reset(struct pmd_internals 
*internals, uint16_t queue_idx) > +{ > + struct pkt_rx_queue *rxq = &internals->rx_queues[queue_idx]; > + struct pkt_tx_queue *txq = rxq->pair; > + int xsk_fd = xsk_socket__fd(rxq->xsk); > + > + if (xsk_fd) { > + close(xsk_fd); > + if (internals->umem) { 1.8.1 > + xdp_umem_destroy(internals->umem); > + internals->umem = NULL; > + } > + } > + memset(rxq, 0, sizeof(*rxq)); > + memset(txq, 0, sizeof(*txq)); > + rxq->pair = txq; > + txq->pair = rxq; > + rxq->queue_idx = queue_idx; > + txq->queue_idx = queue_idx; > +} > + > +static int > +eth_rx_queue_setup(struct rte_eth_dev *dev, > + uint16_t rx_queue_id, > + uint16_t nb_rx_desc, > + unsigned int socket_id __rte_unused, > + const struct rte_eth_rxconf *rx_conf __rte_unused, > + struct rte_mempool *mb_pool) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + unsigned int buf_size, data_size; uint32_t > + struct pkt_rx_queue *rxq; > + int ret = 0; No need to set 'ret'. Alternatively, restructure so you always return 'ret'. 
> + > + rxq = &internals->rx_queues[rx_queue_id]; > + queue_reset(internals, rx_queue_id); > + > + /* Now get the space available for data in the mbuf */ > + buf_size = rte_pktmbuf_data_room_size(mb_pool) - > + RTE_PKTMBUF_HEADROOM; > + data_size = ETH_AF_XDP_FRAME_SIZE - ETH_AF_XDP_DATA_HEADROOM; > + > + if (data_size > buf_size) { > + RTE_LOG(ERR, AF_XDP, > + "%s: %d bytes will not fit in mbuf (%d bytes)\n", > + dev->device->name, data_size, buf_size); > + ret = -ENOMEM; > + goto err; > + } > + > + rxq->mb_pool = mb_pool; > + > + if (xsk_configure(internals, rxq, nb_rx_desc)) { > + RTE_LOG(ERR, AF_XDP, > + "Failed to configure xdp socket\n"); > + ret = -EINVAL; > + goto err; > + } > + > + internals->umem = rxq->umem; > + > + dev->data->rx_queues[rx_queue_id] = rxq; > + return 0; > + > +err: > + queue_reset(internals, rx_queue_id); > + return ret; > +} > + > +static int > +eth_tx_queue_setup(struct rte_eth_dev *dev, > + uint16_t tx_queue_id, > + uint16_t nb_tx_desc __rte_unused, > + unsigned int socket_id __rte_unused, > + const struct rte_eth_txconf *tx_conf __rte_unused) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + struct pkt_tx_queue *txq; > + > + txq = &internals->tx_queues[tx_queue_id]; > + > + dev->data->tx_queues[tx_queue_id] = txq; > + return 0; > +} > + > +static int > +eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + struct ifreq ifr = { .ifr_mtu = mtu }; > + int ret; > + int s; > + > + s = socket(PF_INET, SOCK_DGRAM, 0); > + if (s < 0) > + return -EINVAL; > + > + strlcpy(ifr.ifr_name, internals->if_name, IFNAMSIZ); > + ret = ioctl(s, SIOCSIFMTU, &ifr); > + close(s); > + > + if (ret < 0) > + return -EINVAL; > + > + return 0; > +} > + > +static void > +eth_dev_change_flags(char *if_name, uint32_t flags, uint32_t mask) > +{ > + struct ifreq ifr; > + int s; > + > + s = socket(PF_INET, SOCK_DGRAM, 0); > + if (s < 0) > + return; > + > + strlcpy(ifr.ifr_name, 
if_name, IFNAMSIZ); > + if (ioctl(s, SIOCGIFFLAGS, &ifr) < 0) > + goto out; > + ifr.ifr_flags &= mask; > + ifr.ifr_flags |= flags; > + if (ioctl(s, SIOCSIFFLAGS, &ifr) < 0) > + goto out; > +out: > + close(s); > +} > + > +static void > +eth_dev_promiscuous_enable(struct rte_eth_dev *dev) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + > + eth_dev_change_flags(internals->if_name, IFF_PROMISC, ~0); > +} > + > +static void > +eth_dev_promiscuous_disable(struct rte_eth_dev *dev) > +{ > + struct pmd_internals *internals = dev->data->dev_private; > + > + eth_dev_change_flags(internals->if_name, 0, ~IFF_PROMISC); > +} > + > +static const struct eth_dev_ops ops = { > + .dev_start = eth_dev_start, > + .dev_stop = eth_dev_stop, > + .dev_close = eth_dev_close, > + .dev_configure = eth_dev_configure, > + .dev_infos_get = eth_dev_info, > + .mtu_set = eth_dev_mtu_set, > + .promiscuous_enable = eth_dev_promiscuous_enable, > + .promiscuous_disable = eth_dev_promiscuous_disable, > + .rx_queue_setup = eth_rx_queue_setup, > + .tx_queue_setup = eth_tx_queue_setup, > + .rx_queue_release = eth_queue_release, > + .tx_queue_release = eth_queue_release, > + .link_update = eth_link_update, > + .stats_get = eth_stats_get, > + .stats_reset = eth_stats_reset, > +}; > + > +/** parse integer from integer argument */ > +static int > +parse_integer_arg(const char *key __rte_unused, > + const char *value, void *extra_args) > +{ > + int *i = (int *)extra_args; > + > + *i = atoi(value); Use strtol(). > + if (*i < 0) { > + RTE_LOG(ERR, AF_XDP, "Argument has to be positive.\n"); > + return -EINVAL; > + } > + > + return 0; > +} > + > +/** parse name argument */ > +static int > +parse_name_arg(const char *key __rte_unused, > + const char *value, void *extra_args) > +{ > + char *name = extra_args; > + > + if (strlen(value) > IFNAMSIZ) { The buffer is IFNAMSIZ bytes (which it should be), so it can't hold a string with strlen() == IFNAMSIZ. 
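Both kvarg-parsing comments above — strtol() over atoi(), and the off-by-one in the IFNAMSIZ length check — can be sketched in plain C. IFNAMSIZ is defined locally so the snippet stands alone, and the function names are made up for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define IFNAMSIZ 16	/* matches <net/if.h> */

/* strtol() rejects overflow and trailing garbage, which atoi()
 * silently accepts. Returns 0 on success, -1 on a malformed value. */
static int parse_queue_idx(const char *value, int *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(value, &end, 10);
	if (errno != 0 || end == value || *end != '\0' ||
	    v < 0 || v > INT_MAX)
		return -1;
	*out = (int)v;
	return 0;
}

/* The destination buffer is IFNAMSIZ bytes including the NUL
 * terminator, so the longest storable name is IFNAMSIZ - 1 chars:
 * reject strlen(value) >= IFNAMSIZ, not only >. */
static int iface_name_fits(const char *value)
{
	return strlen(value) < IFNAMSIZ;
}
```

With the original > check, a name of exactly IFNAMSIZ characters passes validation and then gets silently truncated by strlcpy().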
> +		RTE_LOG(ERR, AF_XDP, "Invalid name %s, should be less than "
> +			"%u bytes.\n", value, IFNAMSIZ);
> +		return -EINVAL;
> +	}
> +
> +	strlcpy(name, value, IFNAMSIZ);
> +
> +	return 0;
> +}
> +
> +static int
> +parse_parameters(struct rte_kvargs *kvlist,
> +		 char *if_name,
> +		 int *queue_idx)
> +{
> +	int ret = 0;

Should not be initialized.

> +
> +	ret = rte_kvargs_process(kvlist, ETH_AF_XDP_IFACE_ARG,
> +				 &parse_name_arg, if_name);
> +	if (ret < 0)
> +		goto free_kvlist;
> +
> +	ret = rte_kvargs_process(kvlist, ETH_AF_XDP_QUEUE_IDX_ARG,
> +				 &parse_integer_arg, queue_idx);
> +	if (ret < 0)
> +		goto free_kvlist;

I fail to see the point of this goto, but maybe there's more code to 
follow in future patches.

> +
> +free_kvlist:
> +	rte_kvargs_free(kvlist);
> +	return ret;
> +}
> +
> +static int
> +get_iface_info(const char *if_name,
> +	       struct ether_addr *eth_addr,
> +	       int *if_index)
> +{
> +	struct ifreq ifr;
> +	int sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
> +
> +	if (sock < 0)
> +		return -1;
> +
> +	strlcpy(ifr.ifr_name, if_name, IFNAMSIZ);
> +	if (ioctl(sock, SIOCGIFINDEX, &ifr))
> +		goto error;
> +
> +	if (ioctl(sock, SIOCGIFHWADDR, &ifr))
> +		goto error;
> +
> +	rte_memcpy(eth_addr, ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
> +
> +	close(sock);
> +	*if_index = if_nametoindex(if_name);
> +	return 0;
> +
> +error:
> +	close(sock);
> +	return -1;
> +}
> +
> +static int
> +init_internals(struct rte_vdev_device *dev,
> +	       const char *if_name,
> +	       int queue_idx,
> +	       struct rte_eth_dev **eth_dev)
> +{
> +	const char *name = rte_vdev_device_name(dev);
> +	const unsigned int numa_node = dev->device.numa_node;
> +	struct pmd_internals *internals = NULL;
> +	int ret;
> +	int i;
> +
> +	internals = rte_zmalloc_socket(name, sizeof(*internals), 0, numa_node);
> +	if (!internals)

1.8.1 (coding style: compare pointers against NULL).

> +		return -ENOMEM;
> +
> +	internals->queue_idx = queue_idx;
> +	strlcpy(internals->if_name, if_name, IFNAMSIZ);
> +
> +	for (i = 0; i < ETH_AF_XDP_MAX_QUEUE_PAIRS; i++) {
> +		internals->tx_queues[i].pair = &internals->rx_queues[i];
> +		internals->rx_queues[i].pair = &internals->tx_queues[i];
> +	}
> +
> +	ret = get_iface_info(if_name, &internals->eth_addr,
> +			     &internals->if_index);
> +	if (ret)
> +		goto err;
> +
> +	*eth_dev = rte_eth_vdev_allocate(dev, 0);
> +	if (!*eth_dev)

1.8.1

> +		goto err;
> +
> +	(*eth_dev)->data->dev_private = internals;
> +	(*eth_dev)->data->dev_link = pmd_link;
> +	(*eth_dev)->data->mac_addrs = &internals->eth_addr;
> +	(*eth_dev)->dev_ops = &ops;
> +	(*eth_dev)->rx_pkt_burst = eth_af_xdp_rx;
> +	(*eth_dev)->tx_pkt_burst = eth_af_xdp_tx;
> +
> +	return 0;
> +
> +err:
> +	rte_free(internals);
> +	return -1;
> +}
> +
> +static int
> +rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
> +{
> +	struct rte_kvargs *kvlist;
> +	char if_name[IFNAMSIZ];
> +	int xsk_queue_idx = ETH_AF_XDP_DFLT_QUEUE_IDX;
> +	struct rte_eth_dev *eth_dev = NULL;
> +	const char *name;
> +	int ret;
> +
> +	RTE_LOG(INFO, AF_XDP, "Initializing pmd_af_xdp for %s\n",
> +		rte_vdev_device_name(dev));
> +
> +	name = rte_vdev_device_name(dev);
> +	if (rte_eal_process_type() == RTE_PROC_SECONDARY &&
> +	    strlen(rte_vdev_device_args(dev)) == 0) {
> +		eth_dev = rte_eth_dev_attach_secondary(name);
> +		if (!eth_dev) {
> +			RTE_LOG(ERR, AF_XDP, "Failed to probe %s\n", name);
> +			return -EINVAL;
> +		}
> +		eth_dev->dev_ops = &ops;
> +		rte_eth_dev_probing_finish(eth_dev);
> +		return 0;
> +	}
> +
> +	kvlist = rte_kvargs_parse(rte_vdev_device_args(dev), valid_arguments);
> +	if (!kvlist) {
> +		RTE_LOG(ERR, AF_XDP, "Invalid kvargs key\n");
> +		return -EINVAL;
> +	}
> +
> +	if (dev->device.numa_node == SOCKET_ID_ANY)
> +		dev->device.numa_node = rte_socket_id();
> +
> +	if (parse_parameters(kvlist, if_name, &xsk_queue_idx) < 0) {
> +		RTE_LOG(ERR, AF_XDP, "Invalid kvargs value\n");
> +		return -EINVAL;
> +	}
> +
> +	ret = init_internals(dev, if_name, xsk_queue_idx, &eth_dev);
> +	if (ret) {
> +		RTE_LOG(ERR, AF_XDP, "Failed to init internals\n");
> +		return ret;
> +	}
> +
> +	rte_eth_dev_probing_finish(eth_dev);
> +
> +	return 0;
> +}
> +
> +static int
> +rte_pmd_af_xdp_remove(struct rte_vdev_device *dev)
> +{
> +	struct rte_eth_dev *eth_dev = NULL;
> +	struct pmd_internals *internals;
> +
> +	RTE_LOG(INFO, AF_XDP, "Removing AF_XDP ethdev on numa socket %u\n",
> +		rte_socket_id());
> +
> +	if (!dev)
> +		return -1;
> +
> +	/* find the ethdev entry */
> +	eth_dev = rte_eth_dev_allocated(rte_vdev_device_name(dev));
> +	if (!eth_dev)
> +		return -1;
> +
> +	internals = eth_dev->data->dev_private;
> +
> +	rte_ring_free(internals->umem->buf_ring);
> +	rte_free(internals->umem->buffer);
> +	rte_free(internals->umem);
> +
> +	rte_eth_dev_release_port(eth_dev);
> +
> +
> +	return 0;
> +}
> +
> +static struct rte_vdev_driver pmd_af_xdp_drv = {
> +	.probe = rte_pmd_af_xdp_probe,
> +	.remove = rte_pmd_af_xdp_remove,
> +};
> +
> +RTE_PMD_REGISTER_VDEV(eth_af_xdp, pmd_af_xdp_drv);
> +RTE_PMD_REGISTER_PARAM_STRING(eth_af_xdp,
> +			      "iface=<string> "
> +			      "queue=<int> ");
> diff --git a/drivers/net/af_xdp/rte_pmd_af_xdp_version.map b/drivers/net/af_xdp/rte_pmd_af_xdp_version.map
> new file mode 100644
> index 000000000..c6db030fe
> --- /dev/null
> +++ b/drivers/net/af_xdp/rte_pmd_af_xdp_version.map
> @@ -0,0 +1,3 @@
> +DPDK_19.05 {
> +	local: *;
> +};
> diff --git a/drivers/net/meson.build b/drivers/net/meson.build
> index 3ecc78cee..1105e72d8 100644
> --- a/drivers/net/meson.build
> +++ b/drivers/net/meson.build
> @@ -2,6 +2,7 @@
>  # Copyright(c) 2017 Intel Corporation
>  
>  drivers = ['af_packet',
> +	'af_xdp',
>  	'ark',
>  	'atlantic',
>  	'avp',
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 262132fc6..be0af73cc 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -143,6 +143,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += -lrte_mempool_dpaa2
>  endif
>  
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET) += -lrte_pmd_af_packet
> +_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += -lrte_pmd_af_xdp -lelf -lbpf
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_ARK_PMD) += -lrte_pmd_ark
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_ATLANTIC_PMD) += -lrte_pmd_atlantic
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_AVP_PMD) += -lrte_pmd_avp