From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from na3sys009aog137.obsmtp.com (na3sys009aog137.obsmtp.com [74.125.149.18])
	by dpdk.org (Postfix) with SMTP id 632B1684A
	for ; Tue, 28 Jan 2014 02:47:18 +0100 (CET)
From: pshelar@nicira.com
To: dev@openvswitch.org, dev@dpdk.org, dpdk-ovs@lists.01.org
Date: Mon, 27 Jan 2014 17:48:35 -0800
Message-Id: <1390873715-26714-1-git-send-email-pshelar@nicira.com>
X-Mailer: git-send-email 1.7.9.5
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Cc: Gerald Rogers
Subject: [dpdk-dev] [PATCH RFC] dpif-netdev: Add support for Intel DPDK based ports.
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK
X-List-Received-Date: Tue, 28 Jan 2014 01:47:20 -0000

From: Pravin B Shelar

The following patch adds a DPDK netdev class to the userspace datapath.
The approach taken in this patch differs from Intel(R) DPDK vSwitch,
where DPDK datapath switching is done in a separate process. This patch
adds support for DPDK type ports and uses the OVS userspace datapath
for switching, so all DPDK processing and flow-miss handling is done in
a single process. This also avoids code duplication by reusing the OVS
userspace datapath switching, and it therefore supports all flow
matching and actions that the userspace datapath supports. Refer to the
INSTALL.DPDK doc for further info.

With this patch I got similar performance for netperf TCP_STREAM tests
compared to the kernel datapath.

This is based on a patch from Gerald Rogers.

Signed-off-by: Pravin B Shelar
CC: "Gerald Rogers"
---
This patch is tested on latest OVS master (commit 9d0581fdf22bec79).
---
 INSTALL                 |    1 +
 INSTALL.DPDK            |   85 ++++
 Makefile.am             |    1 +
 acinclude.m4            |   40 ++
 configure.ac            |    1 +
 lib/automake.mk         |    6 +
 lib/dpif-netdev.c       |  393 +++++++++++-----
 lib/netdev-dpdk.c       | 1152 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/netdev-dpdk.h       |    7 +
 lib/netdev-dummy.c      |   38 +-
 lib/netdev-linux.c      |   33 +-
 lib/netdev-provider.h   |   13 +-
 lib/netdev-vport.c      |    1 +
 lib/netdev.c            |   52 ++-
 lib/netdev.h            |   15 +-
 lib/ofpbuf.c            |    7 +-
 lib/ofpbuf.h            |   13 +-
 lib/packets.c           |    9 +
 lib/packets.h           |    1 +
 vswitchd/ovs-vswitchd.c |   14 +-
 20 files changed, 1702 insertions(+), 180 deletions(-)
 create mode 100644 INSTALL.DPDK
 create mode 100644 lib/netdev-dpdk.c
 create mode 100644 lib/netdev-dpdk.h

diff --git a/INSTALL b/INSTALL
index 001d3cb..74cd278 100644
--- a/INSTALL
+++ b/INSTALL
@@ -10,6 +10,7 @@ on a specific platform, please see one of these files:
   - INSTALL.RHEL
   - INSTALL.XenServer
   - INSTALL.NetBSD
+  - INSTALL.DPDK
 
 Build Requirements
 ------------------
diff --git a/INSTALL.DPDK b/INSTALL.DPDK
new file mode 100644
index 0000000..1c95104
--- /dev/null
+++ b/INSTALL.DPDK
@@ -0,0 +1,85 @@
+                Using Open vSwitch with DPDK
+                ============================
+
+Open vSwitch can use the Intel(R) DPDK lib to operate entirely in
+userspace. This file explains how to install and use Open vSwitch in
+such a mode.
+
+The DPDK support of Open vSwitch is considered experimental.
+It has not been thoroughly tested.
+
+This version of Open vSwitch should be built manually with "configure"
+and "make".
+
+Building and Installing:
+------------------------
+
+DPDK:
+cd DPDK
+make install T=x86_64-default-linuxapp-gcc
+Refer to http://dpdk.org/ for requirements and details.
+
+Linux kernel:
+Refer to intel-dpdk-getting-started-guide.pdf for understanding
+DPDK kernel requirements.
+
+OVS:
+cd $(OVS_DIR)/openvswitch
+./boot.sh
+./configure --with-dpdk=$(DPDK_BUILD)
+make
+
+Refer to INSTALL.userspace for general requirements of building
+userspace OVS.
+
+Using the DPDK with ovs-vswitchd:
+---------------------------------
+
+First setup DPDK devices:
+  - insert igb_uio.ko
+    e.g. insmod DPDK/x86_64-default-linuxapp-gcc/kmod/igb_uio.ko
+  - mount hugetlbfs
+    e.g. mount -t hugetlbfs -o pagesize=1G none /mnt/huge/
+  - Bind network device to igb_uio.
+    e.g. DPDK/tools/pci_unbind.py --bind=igb_uio eth1
+
+Refer to http://www.dpdk.org/doc/quick-start for verifying DPDK setup.
+
+Start vswitchd:
+DPDK configuration arguments can be passed to vswitchd via the `--dpdk`
+argument, e.g.:
+  ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach
+
+To use ovs-vswitchd with DPDK, create a bridge with datapath_type
+"netdev" in the configuration database. For example:
+
+    ovs-vsctl add-br br0
+    ovs-vsctl set bridge br0 datapath_type=netdev
+
+Now you can add DPDK devices. OVS expects DPDK device names to start
+with "dpdk" and end with the port id. vswitchd should print the number
+of DPDK devices found.
+
+    ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
+
+Once the first DPDK port is added to vswitchd, it creates a polling
+thread and polls the DPDK device in a continuous loop. Therefore CPU
+utilization for that thread is always 100%.
+
+Restrictions:
+-------------
+
+  - This support is for physical NICs; it has been tested with Intel
+    NICs only.
+  - The vswitchd userspace datapath does affine the polling thread, but
+    it is assumed that devices are on NUMA node 0. Therefore, if a
+    device is attached to a non-zero NUMA node, switching performance
+    will be suboptimal.
+  - There are a fixed number of polling threads and a fixed number of
+    per-device queues configured.
+  - Works with 1500 MTU only; a few changes are needed in the DPDK lib
+    to fix this issue.
+  - Currently the DPDK port does not make use of any offload
+    functionality.
+
+Bug Reporting:
+--------------
+
+Please report problems to bugs@openvswitch.org.
diff --git a/Makefile.am b/Makefile.am
index 32775cc..4d53dd4 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -58,6 +58,7 @@ EXTRA_DIST = \
 	FAQ \
 	INSTALL \
 	INSTALL.Debian \
+	INSTALL.DPDK \
 	INSTALL.Fedora \
 	INSTALL.KVM \
 	INSTALL.Libvirt \
diff --git a/acinclude.m4 b/acinclude.m4
index 8ff5828..01d39bf 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -157,6 +157,46 @@ AC_DEFUN([OVS_CHECK_LINUX], [
   AM_CONDITIONAL(LINUX_ENABLED, test -n "$KBUILD")
 ])
 
+dnl OVS_CHECK_DPDK
+dnl
+dnl Configure DPDK source tree
+AC_DEFUN([OVS_CHECK_DPDK], [
+  AC_ARG_WITH([dpdk],
+              [AC_HELP_STRING([--with-dpdk=/path/to/dpdk],
+                              [Specify the DPDK build directory])])
+
+  if test X"$with_dpdk" != X; then
+    RTE_SDK=$with_dpdk
+
+    DPDK_INCLUDE=$RTE_SDK/include
+    DPDK_LIB_DIR=$RTE_SDK/lib
+    DPDK_LIBS="$DPDK_LIB_DIR/libethdev.a \
+               $DPDK_LIB_DIR/librte_cmdline.a \
+               $DPDK_LIB_DIR/librte_hash.a \
+               $DPDK_LIB_DIR/librte_lpm.a \
+               $DPDK_LIB_DIR/librte_mbuf.a \
+               $DPDK_LIB_DIR/librte_mempool.a \
+               $DPDK_LIB_DIR/librte_eal.a \
+               $DPDK_LIB_DIR/librte_pmd_ring.a \
+               $DPDK_LIB_DIR/librte_malloc.a \
+               $DPDK_LIB_DIR/librte_pmd_ixgbe.a \
+               $DPDK_LIB_DIR/librte_pmd_e1000.a \
+               $DPDK_LIB_DIR/librte_pmd_virtio.a \
+               $DPDK_LIB_DIR/librte_ring.a"
+
+    LIBS="$DPDK_LIBS $LIBS"
+    CPPFLAGS="-I$DPDK_INCLUDE $CPPFLAGS"
+    SLICE_SIZE="4194304"
+    SLICE_SIZE_MAX="1073741824"
+    LDFLAGS="$LDFLAGS -Wl,-hugetlbfs-align,-zcommon-page-size=$SLICE_SIZE,-zmax-page-size=$SLICE_SIZE"
+    AC_DEFINE([DPDK_NETDEV], [1], [System uses the DPDK module.])
+  else
+    RTE_SDK=
+  fi
+
+  AM_CONDITIONAL(DPDK_NETDEV, test -n "$RTE_SDK")
+])
+
 dnl OVS_GREP_IFELSE(FILE, REGEX, [IF-MATCH], [IF-NO-MATCH])
 dnl
 dnl Greps FILE for REGEX. If it matches, runs IF-MATCH, otherwise IF-NO-MATCH.
diff --git a/configure.ac b/configure.ac index 19c095e..30dbe39 100644 --- a/configure.ac +++ b/configure.ac @@ -119,6 +119,7 @@ OVS_ENABLE_SPARSE AC_ARG_VAR(KARCH, [Kernel Architecture String]) AC_SUBST(KARCH) OVS_CHECK_LINUX +OVS_CHECK_DPDK AC_CONFIG_FILES(Makefile) AC_CONFIG_FILES(datapath/Makefile) diff --git a/lib/automake.mk b/lib/automake.mk index 2ef806e..ffbecdb 100644 --- a/lib/automake.mk +++ b/lib/automake.mk @@ -289,6 +289,12 @@ lib_libopenvswitch_la_SOURCES += \ lib/route-table.h endif +if DPDK_NETDEV +lib_libopenvswitch_la_SOURCES += \ + lib/netdev-dpdk.c \ + lib/netdev-dpdk.h +endif + if HAVE_POSIX_AIO lib_libopenvswitch_la_SOURCES += lib/async-append-aio.c else diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index cb64bdc..f55732b 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -44,6 +44,7 @@ #include "meta-flow.h" #include "netdev.h" #include "netdev-vport.h" +#include "netdev-dpdk.h" #include "netlink.h" #include "odp-execute.h" #include "odp-util.h" @@ -64,14 +65,12 @@ VLOG_DEFINE_THIS_MODULE(dpif_netdev); /* By default, choose a priority in the middle. */ #define NETDEV_RULE_PRIORITY 0x8000 +/* TODO: number of thread should be configurable. */ +#define NR_THREADS 8 /* Configuration parameters. */ enum { MAX_FLOWS = 65536 }; /* Maximum number of flows in flow table. */ -/* Enough headroom to add a vlan tag, plus an extra 2 bytes to allow IP - * headers to be aligned on a 4-byte boundary. */ -enum { DP_NETDEV_HEADROOM = 2 + VLAN_HEADER_LEN }; - /* Queues. */ enum { N_QUEUES = 2 }; /* Number of queues for dpif_recv(). */ enum { MAX_QUEUE_LEN = 128 }; /* Maximum number of packets per queue. */ @@ -162,8 +161,9 @@ struct dp_netdev { /* Forwarding threads. 
*/ struct latch exit_latch; - struct dp_forwarder *forwarders; - size_t n_forwarders; + struct pmd_thread *pmd_threads; + size_t n_pmd_threads; + int pmd_count; }; static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp, @@ -172,12 +172,14 @@ static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp, /* A port in a netdev-based datapath. */ struct dp_netdev_port { - struct hmap_node node; /* Node in dp_netdev's 'ports'. */ - odp_port_t port_no; + struct pkt_metadata md; + struct netdev_rx **rx; struct netdev *netdev; + odp_port_t port_no; struct netdev_saved_flags *sf; - struct netdev_rx *rx; char *type; /* Port type as requested by user. */ + struct ovs_refcount ref_cnt; + struct hmap_node node; /* Node in dp_netdev's 'ports'. */ }; /* A flow in dp_netdev's 'flow_table'. @@ -289,11 +291,12 @@ void dp_netdev_actions_unref(struct dp_netdev_actions *); /* A thread that receives packets from some ports, looks them up in the flow * table, and executes the actions it finds. */ -struct dp_forwarder { +struct pmd_thread { struct dp_netdev *dp; - pthread_t thread; + int qid; + atomic_uint change_seq; char *name; - uint32_t min_hash, max_hash; + pthread_t thread; }; /* Interface to netdev-based datapath. 
*/ @@ -332,7 +335,7 @@ static void dp_netdev_execute_actions(struct dp_netdev *dp, static void dp_netdev_port_input(struct dp_netdev *dp, struct ofpbuf *packet, struct pkt_metadata *) OVS_REQ_RDLOCK(dp->port_rwlock); -static void dp_netdev_set_threads(struct dp_netdev *, int n); +static void dp_netdev_set_pmd_threads(struct dp_netdev *, int n); static struct dpif_netdev * dpif_netdev_cast(const struct dpif *dpif) @@ -478,7 +481,6 @@ create_dp_netdev(const char *name, const struct dpif_class *class, dp_netdev_free(dp); return error; } - dp_netdev_set_threads(dp, 2); *dpp = dp; return 0; @@ -536,8 +538,8 @@ dp_netdev_free(struct dp_netdev *dp) shash_find_and_delete(&dp_netdevs, dp->name); - dp_netdev_set_threads(dp, 0); - free(dp->forwarders); + dp_netdev_set_pmd_threads(dp, 0); + free(dp->pmd_threads); dp_netdev_flow_flush(dp); ovs_rwlock_wrlock(&dp->port_rwlock); @@ -621,18 +623,30 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats) return 0; } +static void +dp_netdev_reload_pmd_threads(struct dp_netdev *dp) +{ + int i; + + for (i = 0; i < dp->n_pmd_threads; i++) { + struct pmd_thread *f = &dp->pmd_threads[i]; + int id; + + atomic_add(&f->change_seq, 1, &id); + } +} + static int do_add_port(struct dp_netdev *dp, const char *devname, const char *type, odp_port_t port_no) - OVS_REQ_WRLOCK(dp->port_rwlock) { struct netdev_saved_flags *sf; struct dp_netdev_port *port; struct netdev *netdev; - struct netdev_rx *rx; enum netdev_flags flags; const char *open_type; int error; + int i; /* XXX reject devices already in some dp_netdev. 
*/ @@ -651,28 +665,41 @@ do_add_port(struct dp_netdev *dp, const char *devname, const char *type, return EINVAL; } - error = netdev_rx_open(netdev, &rx); - if (error - && !(error == EOPNOTSUPP && dpif_netdev_class_is_dummy(dp->class))) { - VLOG_ERR("%s: cannot receive packets on this network device (%s)", - devname, ovs_strerror(errno)); - netdev_close(netdev); - return error; + port = xzalloc(sizeof *port); + port->port_no = port_no; + port->md = PKT_METADATA_INITIALIZER(port->port_no); + port->netdev = netdev; + port->rx = xmalloc(sizeof *port->rx * netdev_nr_rx(netdev)); + port->type = xstrdup(type); + for (i = 0; i < netdev_nr_rx(netdev); i++) { + error = netdev_rx_open(netdev, &port->rx[i], i); + if (error + && !(error == EOPNOTSUPP && dpif_netdev_class_is_dummy(dp->class))) { + VLOG_ERR("%s: cannot receive packets on this network device (%s)", + devname, ovs_strerror(errno)); + netdev_close(netdev); + return error; + } } error = netdev_turn_flags_on(netdev, NETDEV_PROMISC, &sf); if (error) { - netdev_rx_close(rx); + for (i = 0; i < netdev_nr_rx(netdev); i++) { + netdev_rx_close(port->rx[i]); + } netdev_close(netdev); + free(port->rx); + free(port); return error; } - - port = xmalloc(sizeof *port); - port->port_no = port_no; - port->netdev = netdev; port->sf = sf; - port->rx = rx; - port->type = xstrdup(type); + + if (netdev_is_pmd(netdev)) { + dp->pmd_count++; + dp_netdev_set_pmd_threads(dp, NR_THREADS); + dp_netdev_reload_pmd_threads(dp); + } + ovs_refcount_init(&port->ref_cnt); hmap_insert(&dp->ports, &port->node, hash_int(odp_to_u32(port_no), 0)); seq_change(dp->port_seq); @@ -772,6 +799,32 @@ get_port_by_name(struct dp_netdev *dp, return ENOENT; } +static void +port_ref(struct dp_netdev_port *port) +{ + if (port) { + ovs_refcount_ref(&port->ref_cnt); + } +} + +static void +port_unref(struct dp_netdev_port *port) +{ + if (port && ovs_refcount_unref(&port->ref_cnt) == 1) { + int i; + + netdev_restore_flags(port->sf); + for (i = 0; i < 
netdev_nr_rx(port->netdev); i++) { + netdev_rx_close(port->rx[i]); + } + free(port->rx); + netdev_close(port->netdev); + free(port->type); + ovs_refcount_destroy(&port->ref_cnt); + free(port); + } +} + static int do_del_port(struct dp_netdev *dp, odp_port_t port_no) OVS_REQ_WRLOCK(dp->port_rwlock) @@ -783,16 +836,13 @@ do_del_port(struct dp_netdev *dp, odp_port_t port_no) if (error) { return error; } - hmap_remove(&dp->ports, &port->node); seq_change(dp->port_seq); + if (netdev_is_pmd(port->netdev)) { + dp_netdev_reload_pmd_threads(dp); + } - netdev_close(port->netdev); - netdev_restore_flags(port->sf); - netdev_rx_close(port->rx); - free(port->type); - free(port); - + port_unref(port); return 0; } @@ -1543,123 +1593,215 @@ dp_netdev_actions_unref(struct dp_netdev_actions *actions) } } -static void * -dp_forwarder_main(void *f_) + +static void +dp_netdev_process_rx_port(struct dp_netdev *dp, + struct dp_netdev_port *port, + struct netdev_rx *queue) +{ + struct ofpbuf *packet; + struct pkt_metadata *md = &port->md; + int error, c; + + error = netdev_rx_recv(queue, &packet, &c); + if (!error) { + int i; + + for (i = 0; i < c; i++) { + dp_netdev_port_input(dp, packet, md); + packet++; + } + + } else if (error != EAGAIN && error != EOPNOTSUPP) { + static struct vlog_rate_limit rl + = VLOG_RATE_LIMIT_INIT(1, 5); + + VLOG_ERR_RL(&rl, "error receiving data from %s: %s", + netdev_get_name(port->netdev), + ovs_strerror(error)); + } +} + +static void +dpif_netdev_run(struct dpif *dpif) +{ + struct dp_netdev_port *port; + struct dp_netdev *dp = get_dp_netdev(dpif); + + ovs_rwlock_rdlock(&dp->port_rwlock); + + HMAP_FOR_EACH (port, node, &dp->ports) { + if (port->rx[0] && !netdev_is_pmd(port->netdev)) { + dp_netdev_process_rx_port(dp, port, port->rx[0]); + } + } + + ovs_rwlock_unlock(&dp->port_rwlock); +} + +static void +dpif_netdev_wait(struct dpif *dpif) +{ + struct dp_netdev_port *port; + struct dp_netdev *dp = get_dp_netdev(dpif); + + ovs_rwlock_rdlock(&dp->port_rwlock); + 
+ HMAP_FOR_EACH (port, node, &dp->ports) { + if (port->rx[0] && !netdev_is_pmd(port->netdev)) { + netdev_rx_wait(port->rx[0]); + } + } + ovs_rwlock_unlock(&dp->port_rwlock); +} + +struct rx_poll { + struct dp_netdev_port *port; + struct netdev_rx *rx; +}; + +static int +pmd_load_queues(struct pmd_thread *f, + struct rx_poll **ppoll_list, int poll_cnt) { - struct dp_forwarder *f = f_; struct dp_netdev *dp = f->dp; - struct ofpbuf packet; + struct rx_poll *poll_list = *ppoll_list; + struct dp_netdev_port *port; + int qid = f->qid; + int index; + int i; - f->name = xasprintf("forwarder_%u", ovsthread_id_self()); - set_subprogram_name("%s", f->name); + /* Simple scheduler for netdev rx polling. */ + ovs_rwlock_rdlock(&dp->port_rwlock); + for (i = 0; i < poll_cnt; i++) { + port_unref(poll_list[i].port); + } - ofpbuf_init(&packet, 0); - while (!latch_is_set(&dp->exit_latch)) { - bool received_anything; - int i; + free(poll_list); + poll_cnt = 0; + index = 0; + + HMAP_FOR_EACH (port, node, &f->dp->ports) { + if (netdev_is_pmd(port->netdev)) { + for (i = 0; i < netdev_nr_rx(port->netdev); i++) { - ovs_rwlock_rdlock(&dp->port_rwlock); - for (i = 0; i < 50; i++) { - struct dp_netdev_port *port; - - received_anything = false; - HMAP_FOR_EACH (port, node, &f->dp->ports) { - if (port->rx - && port->node.hash >= f->min_hash - && port->node.hash <= f->max_hash) { - int buf_size; - int error; - int mtu; - - if (netdev_get_mtu(port->netdev, &mtu)) { - mtu = ETH_PAYLOAD_MAX; - } - buf_size = DP_NETDEV_HEADROOM + VLAN_ETH_HEADER_LEN + mtu; - - ofpbuf_clear(&packet); - ofpbuf_reserve_with_tailroom(&packet, DP_NETDEV_HEADROOM, - buf_size); - - error = netdev_rx_recv(port->rx, &packet); - if (!error) { - struct pkt_metadata md - = PKT_METADATA_INITIALIZER(port->port_no); - dp_netdev_port_input(dp, &packet, &md); - - received_anything = true; - } else if (error != EAGAIN && error != EOPNOTSUPP) { - static struct vlog_rate_limit rl - = VLOG_RATE_LIMIT_INIT(1, 5); - - VLOG_ERR_RL(&rl, 
"error receiving data from %s: %s", - netdev_get_name(port->netdev), - ovs_strerror(error)); - } + if ((index % dp->n_pmd_threads) == qid) { + port_ref(port); + poll_cnt++; } + index++; } + } + } - if (!received_anything) { - break; + poll_list = xzalloc(sizeof *poll_list * poll_cnt); + poll_cnt = 0; + index = 0; + + HMAP_FOR_EACH (port, node, &f->dp->ports) { + if (netdev_is_pmd(port->netdev)) { + for (i = 0; i < netdev_nr_rx(port->netdev); i++) { + + if ((index % dp->n_pmd_threads) == qid) { + poll_list[poll_cnt].port = port; + poll_list[poll_cnt].rx = port->rx[i]; + poll_cnt++; + VLOG_INFO("poll_cnt %d port = %d i = %d",poll_cnt,port->port_no, i); + } + index++; } } + } - if (received_anything) { - poll_immediate_wake(); - } else { - struct dp_netdev_port *port; + ovs_rwlock_unlock(&dp->port_rwlock); - HMAP_FOR_EACH (port, node, &f->dp->ports) - if (port->rx - && port->node.hash >= f->min_hash - && port->node.hash <= f->max_hash) { - netdev_rx_wait(port->rx); - } - seq_wait(dp->port_seq, seq_read(dp->port_seq)); - latch_wait(&dp->exit_latch); + *ppoll_list = poll_list; + return poll_cnt; +} + +static void * +pmd_thread_main(void *f_) +{ + struct pmd_thread *f = f_; + struct dp_netdev *dp = f->dp; + unsigned long lc = 0; + struct rx_poll *poll_list; + unsigned int port_seq; + int poll_cnt; + + f->name = xasprintf("pmd_%u", ovsthread_id_self()); + set_subprogram_name("%s", f->name); + netdev_setup_thread(f->qid); + poll_cnt = 0; + poll_list = NULL; + +reload: + poll_cnt = pmd_load_queues(f, &poll_list, poll_cnt); + atomic_read(&f->change_seq, &port_seq); + + while (1) { + unsigned int c_port_seq; + int i; + + for (i = 0; i < poll_cnt; i++) { + dp_netdev_process_rx_port(dp, poll_list[i].port, poll_list[i].rx); + } + + if (lc++ > (64 * 1024 * 1024)) { + /* TODO: need completely userspace based signaling method. + * to keep this thread entirely in userspace. + * For now using atomic counter. 
*/ + lc = 0; + atomic_read(&f->change_seq, &c_port_seq); + if (c_port_seq != port_seq) { + break; + } } - ovs_rwlock_unlock(&dp->port_rwlock); + } - poll_block(); + if (!latch_is_set(&f->dp->exit_latch)){ + goto reload; } - ofpbuf_uninit(&packet); + free(poll_list); free(f->name); - return NULL; } static void -dp_netdev_set_threads(struct dp_netdev *dp, int n) +dp_netdev_set_pmd_threads(struct dp_netdev *dp, int n) { int i; - if (n == dp->n_forwarders) { + if (n == dp->n_pmd_threads) { return; } /* Stop existing threads. */ latch_set(&dp->exit_latch); - for (i = 0; i < dp->n_forwarders; i++) { - struct dp_forwarder *f = &dp->forwarders[i]; + dp_netdev_reload_pmd_threads(dp); + for (i = 0; i < dp->n_pmd_threads; i++) { + struct pmd_thread *f = &dp->pmd_threads[i]; xpthread_join(f->thread, NULL); } latch_poll(&dp->exit_latch); - free(dp->forwarders); + free(dp->pmd_threads); /* Start new threads. */ - dp->forwarders = xmalloc(n * sizeof *dp->forwarders); - dp->n_forwarders = n; + dp->pmd_threads = xmalloc(n * sizeof *dp->pmd_threads); + dp->n_pmd_threads = n; + for (i = 0; i < n; i++) { - struct dp_forwarder *f = &dp->forwarders[i]; + struct pmd_thread *f = &dp->pmd_threads[i]; f->dp = dp; - f->min_hash = UINT32_MAX / n * i; - f->max_hash = UINT32_MAX / n * (i + 1) - 1; - if (i == n - 1) { - f->max_hash = UINT32_MAX; - } - xpthread_create(&f->thread, NULL, dp_forwarder_main, f); + f->qid = i; + atomic_store(&f->change_seq, 1); + + /* Each thread will distribute all devices rx-queues among + * themselves. */ + xpthread_create(&f->thread, NULL, pmd_thread_main, f); } } @@ -1683,6 +1825,7 @@ dp_netdev_port_input(struct dp_netdev *dp, struct ofpbuf *packet, struct flow key; if (packet->size < ETH_HEADER_LEN) { + VLOG_ERR("%s small pkt %d\n",__func__,(int) packet->size); return; } flow_extract(packet, md->skb_priority, md->pkt_mark, &md->tunnel, @@ -1743,9 +1886,11 @@ dp_netdev_output_userspace(struct dp_netdev *dp, struct ofpbuf *packet, } /* Steal packet data. 
*/ - ovs_assert(packet->source == OFPBUF_MALLOC); - upcall->packet = *packet; - ofpbuf_use(packet, NULL, 0); + ofpbuf_init(&upcall->packet,0); + ofpbuf_reserve_with_tailroom(&upcall->packet, + DP_NETDEV_HEADROOM, packet->size); + memcpy(upcall->packet.data, packet->data, packet->size); + upcall->packet.size = packet->size; seq_change(dp->queue_seq); @@ -1778,7 +1923,7 @@ dp_execute_cb(void *aux_, struct ofpbuf *packet, case OVS_ACTION_ATTR_OUTPUT: p = dp_netdev_lookup_port(aux->dp, u32_to_odp(nl_attr_get_u32(a))); if (p) { - netdev_send(p->netdev, packet); + netdev_send(p->netdev, packet, may_steal); } break; @@ -1828,8 +1973,8 @@ const struct dpif_class dpif_netdev_class = { dpif_netdev_open, dpif_netdev_close, dpif_netdev_destroy, - NULL, /* run */ - NULL, /* wait */ + dpif_netdev_run, + dpif_netdev_wait, dpif_netdev_get_stats, dpif_netdev_port_add, dpif_netdev_port_del, diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c new file mode 100644 index 0000000..06de08c --- /dev/null +++ b/lib/netdev-dpdk.c @@ -0,0 +1,1152 @@ +/* + * Copyright (c) 2014 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#define _GNU_SOURCE + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "list.h" +#include "netdev-provider.h" +#include "netdev-vport.h" +#include "netdev-dpdk.h" +#include "odp-util.h" +#include "ofp-print.h" +#include "ofpbuf.h" +#include "ovs-thread.h" +#include "packets.h" +#include "shash.h" +#include "sset.h" +#include "unaligned.h" +#include "timeval.h" +#include "unixctl.h" +#include "vlog.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +VLOG_DEFINE_THIS_MODULE(dpdk); + +#define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE +#define OVS_VPORT_DPDK "ovs_dpdk" + +/* + * need to reserve tons of extra space in the mbufs so we can align the + * DMA addresses to 4KB. + */ + +#define MTU_TO_MAX_LEN(mtu) ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN) +#define MBUF_SIZE(mtu) (MTU_TO_MAX_LEN(mtu) + (512) + \ + sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM) + +/* TODO: mempool size should be based on system resources. */ +#define NB_MBUF (4096 * 64) +#define MP_CACHE_SZ (256 * 2) +#define SOCKET0 0 + +/* TODO: number device queue, need to make this configurable at run time. */ +#define NR_QUEUE 8 + +/* TODO: Needs per NIC value for these constants. */ +#define RX_PTHRESH 16 /* Default values of RX prefetch threshold reg. */ +#define RX_HTHRESH 16 /* Default values of RX host threshold reg. */ +#define RX_WTHRESH 8 /* Default values of RX write-back threshold reg. */ + +#define TX_PTHRESH 36 /* Default values of TX prefetch threshold reg. */ +#define TX_HTHRESH 0 /* Default values of TX host threshold reg. */ +#define TX_WTHRESH 0 /* Default values of TX write-back threshold reg. 
*/ + +static const struct rte_eth_conf port_conf = { + .rxmode = { + .mq_mode = ETH_MQ_RX_RSS, + .split_hdr_size = 0, + .header_split = 0, /* Header Split disabled */ + .hw_ip_checksum = 0, /* IP checksum offload enabled */ + .hw_vlan_filter = 0, /* VLAN filtering disabled */ + .jumbo_frame = 0, /* Jumbo Frame Support disabled */ + .hw_strip_crc = 0, /* CRC stripped by hardware */ + }, + .rx_adv_conf = { + .rss_conf = { + .rss_key = NULL, + .rss_hf = ETH_RSS_IPV4_TCP | ETH_RSS_IPV4 | ETH_RSS_IPV6, + }, + }, + .txmode = { + .mq_mode = ETH_MQ_TX_NONE, + }, +}; + +static const struct rte_eth_rxconf rx_conf = { + .rx_thresh = { + .pthresh = RX_PTHRESH, + .hthresh = RX_HTHRESH, + .wthresh = RX_WTHRESH, + }, +}; + +static const struct rte_eth_txconf tx_conf = { + .tx_thresh = { + .pthresh = TX_PTHRESH, + .hthresh = TX_HTHRESH, + .wthresh = TX_WTHRESH, + }, + .tx_free_thresh = 0, /* Use PMD default values */ + .tx_rs_thresh = 0, /* Use PMD default values */ +}; + +enum { MAX_RX_QUEUE_LEN = 64 }; + +static int rte_eal_init_ret = ENODEV; + +static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER; + +/* Contains all 'struct dpdk_dev's. */ +static struct list dpdk_list OVS_GUARDED_BY(dpdk_mutex) + = LIST_INITIALIZER(&dpdk_list); + +static struct list dpdk_mp_list; + +struct dpdk_mp { + struct rte_mempool *mp; + int mtu; + int socket_id; + int refcount; + struct list list_node OVS_GUARDED_BY(mp_list); +}; + +struct netdev_dpdk { + struct netdev up; + int port_id; + int max_packet_len; + rte_spinlock_t tx_lock; + + /* Protects all members below. */ + struct ovs_mutex mutex OVS_ACQ_AFTER(mutex); + + struct dpdk_mp *dpdk_mp; + int mtu OVS_GUARDED; + int socket_id; + int buf_size; + struct netdev_stats stats_offset OVS_GUARDED; + + uint8_t hwaddr[ETH_ADDR_LEN] OVS_GUARDED; + enum netdev_flags flags OVS_GUARDED; + + rte_spinlock_t lsi_lock; + struct rte_eth_link link; + int link_reset_cnt; + + /* In dpdk_list. 
*/ + struct list list_node OVS_GUARDED_BY(mutex); +}; + +struct netdev_rx_dpdk { + struct netdev_rx up; + eth_rx_burst_t drv_rx; + void *rx_queues; + int port_id; + int queue_id; + int ofpbuf_cnt; + struct ofpbuf ofpbuf[MAX_RX_QUEUE_LEN]; +}; + +static int netdev_dpdk_construct(struct netdev *); +static bool +is_dpdk_class(const struct netdev_class *class) +{ + return class->construct == netdev_dpdk_construct; +} + +/* TODO: use dpdk malloc for entire OVS. infact huge page shld be used + * for all other sengments data, bss and text. */ + +static void *dpdk_rte_mzalloc(size_t sz) +{ + void *ptr; + + ptr = rte_zmalloc(OVS_VPORT_DPDK, sz, OVS_CACHE_LINE_SIZE); + if (ptr == NULL) { + out_of_memory(); + } + return ptr; +} + +static struct dpdk_mp * +dpdk_mp_get(int socket_id, int mtu) +{ + struct dpdk_mp *dmp = NULL; + char mp_name[RTE_MEMPOOL_NAMESIZE]; + + LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) { + if (dmp->socket_id == socket_id && dmp->mtu == mtu) { + dmp->refcount++; + return dmp; + } + } + + dmp = dpdk_rte_mzalloc(sizeof *dmp); + dmp->socket_id = socket_id; + dmp->mtu = mtu; + dmp->refcount = 1; + + snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d", dmp->mtu); + dmp->mp = rte_mempool_create(mp_name, NB_MBUF, MBUF_SIZE(mtu), + MP_CACHE_SZ, + sizeof(struct rte_pktmbuf_pool_private), + rte_pktmbuf_pool_init, NULL, + rte_pktmbuf_init, NULL, + socket_id, 0); + + if (dmp->mp == NULL) { + return NULL; + } + + list_push_back(&dpdk_mp_list, &dmp->list_node); + return dmp; +} + +static void +dpdk_mp_put(struct dpdk_mp *dmp) +{ + + if (!dmp) { + return; + } + + dmp->refcount--; + ovs_assert(dmp->refcount >= 0); + +#if 0 + /* I could not find any API to destroy mp. */ + if (dmp->refcount == 0) { + list_delete(dmp->list_node); + /* destroy mp-pool. 
*/
+    }
+#endif
+}
+
+static void
+lsi_event_callback(uint8_t port_id, enum rte_eth_event_type type, void *param)
+{
+    struct netdev_dpdk *dev = (struct netdev_dpdk *) param;
+
+    VLOG_DBG("Event type: %s\n", type == RTE_ETH_EVENT_INTR_LSC ? "LSC interrupt" : "unknown event");
+
+    rte_spinlock_lock(&dev->lsi_lock);
+    rte_eth_link_get_nowait(port_id, &dev->link);
+    dev->link_reset_cnt++;
+    rte_spinlock_unlock(&dev->lsi_lock);
+
+    if (dev->link.link_status) {
+        VLOG_DBG("Port %d Link Up - speed %u Mbps - %s\n",
+                 port_id, (unsigned)dev->link.link_speed,
+                 (dev->link.link_duplex == ETH_LINK_FULL_DUPLEX) ?
+                 ("full-duplex") : ("half-duplex"));
+    } else {
+        VLOG_DBG("Port %d Link Down\n\n", port_id);
+    }
+}
+
+static int
+dpdk_eth_dev_init(struct netdev_dpdk *dev)
+{
+    struct rte_pktmbuf_pool_private *mbp_priv;
+    struct ether_addr eth_addr;
+    int diag;
+    int i;
+
+    if (dev->port_id < 0 || dev->port_id >= rte_eth_dev_count()) {
+        return -ENODEV;
+    }
+
+    diag = rte_eth_dev_configure(dev->port_id, NR_QUEUE, NR_QUEUE + 1, &port_conf);
+    if (diag) {
+        VLOG_ERR("eth dev config error %d\n", diag);
+        return diag;
+    }
+
+    for (i = 0; i < (NR_QUEUE + 1); i++) {
+        diag = rte_eth_tx_queue_setup(dev->port_id, i, 64, 0, &tx_conf);
+        if (diag) {
+            VLOG_ERR("eth dev tx queue setup error %d\n", diag);
+            return diag;
+        }
+    }
+
+    for (i = 0; i < NR_QUEUE; i++) {
+        /* DO NOT CHANGE NUMBER OF RX DESCRIPTORS */
+        diag = rte_eth_rx_queue_setup(dev->port_id, i, 64, 0, &rx_conf, dev->dpdk_mp->mp);
+        if (diag) {
+            VLOG_ERR("eth dev rx queue setup error %d\n", diag);
+            return diag;
+        }
+    }
+
+    rte_eth_dev_callback_register(dev->port_id, RTE_ETH_EVENT_INTR_LSC,
+                                  lsi_event_callback, dev);
+
+    diag = rte_eth_dev_start(dev->port_id);
+    if (diag) {
+        VLOG_ERR("eth dev start error %d\n", diag);
+        return diag;
+    }
+
+    rte_eth_promiscuous_enable(dev->port_id);
+    rte_eth_allmulticast_enable(dev->port_id);
+
+    memset(&eth_addr, 0x0, sizeof(eth_addr));
+    rte_eth_macaddr_get(dev->port_id, &eth_addr);
+    VLOG_INFO("Port %d: %02X:%02X:%02X:%02X:%02X:%02X\n", dev->port_id,
+              eth_addr.addr_bytes[0],
+              eth_addr.addr_bytes[1],
+              eth_addr.addr_bytes[2],
+              eth_addr.addr_bytes[3],
+              eth_addr.addr_bytes[4],
+              eth_addr.addr_bytes[5]);
+
+    memcpy(dev->hwaddr, eth_addr.addr_bytes, ETH_ADDR_LEN);
+    rte_eth_link_get_nowait(dev->port_id, &dev->link);
+
+    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
+    dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
+
+    dev->flags = NETDEV_UP | NETDEV_PROMISC;
+    return 0;
+}
+
+static struct netdev_dpdk *
+netdev_dpdk_cast(const struct netdev *netdev)
+{
+    return CONTAINER_OF(netdev, struct netdev_dpdk, up);
+}
+
+static struct netdev *
+netdev_dpdk_alloc(void)
+{
+    struct netdev_dpdk *netdev = dpdk_rte_mzalloc(sizeof *netdev);
+    return &netdev->up;
+}
+
+static int
+netdev_dpdk_construct(struct netdev *netdev_)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    unsigned int port_no;
+    char *cport;
+    int err;
+
+    if (rte_eal_init_ret) {
+        return rte_eal_init_ret;
+    }
+
+    ovs_mutex_lock(&dpdk_mutex);
+    cport = netdev_->name + 4;  /* Names always start with "dpdk". */
+
+    if (strncmp(netdev_->name, "dpdk", 4)) {
+        err = ENODEV;
+        goto unlock_dpdk;
+    }
+
+    port_no = strtol(cport, 0, 0);  /* string must be null terminated */
+
+    rte_spinlock_init(&netdev->lsi_lock);
+    rte_spinlock_init(&netdev->tx_lock);
+    ovs_mutex_init(&netdev->mutex);
+
+    ovs_mutex_lock(&netdev->mutex);
+    netdev->flags = 0;
+
+    netdev->mtu = ETHER_MTU;
+    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
+
+    /* TODO: need to discover device node at run time.
*/ + netdev->socket_id = SOCKET0; + netdev->port_id = port_no; + + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu); + if (!netdev->dpdk_mp) { + err = ENOMEM; + goto unlock_dev; + } + + err = dpdk_eth_dev_init(netdev); + if (err) { + goto unlock_dev; + } + netdev_->nr_rx = NR_QUEUE; + + list_push_back(&dpdk_list, &netdev->list_node); + +unlock_dev: + ovs_mutex_unlock(&netdev->mutex); +unlock_dpdk: + ovs_mutex_unlock(&dpdk_mutex); + return err; +} + +static void +netdev_dpdk_destruct(struct netdev *netdev_) +{ + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_); + + ovs_mutex_lock(&dev->mutex); + rte_eth_dev_stop(dev->port_id); + rte_eth_dev_callback_unregister(dev->port_id, RTE_ETH_EVENT_INTR_LSC, + lsi_event_callback, NULL); + + ovs_mutex_unlock(&dev->mutex); + + ovs_mutex_lock(&dpdk_mutex); + list_remove(&dev->list_node); + dpdk_mp_put(dev->dpdk_mp); + ovs_mutex_unlock(&dpdk_mutex); + + ovs_mutex_destroy(&dev->mutex); +} + +static void +netdev_dpdk_dealloc(struct netdev *netdev_) +{ + struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_); + + rte_free(netdev); +} + +static int +netdev_dpdk_get_config(const struct netdev *netdev_, struct smap *args) +{ + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_); + struct rte_eth_dev_info dev_info; + + ovs_mutex_lock(&dev->mutex); + rte_eth_dev_info_get(dev->port_id, &dev_info); + ovs_mutex_unlock(&dev->mutex); + + smap_add_format(args, "ifindex", "%d", dev->port_id); + smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id)); + smap_add_format(args, "driver_name", "%s", dev_info.driver_name); + smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize); + smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen); + smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues); + smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues); + smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs); + smap_add_format(args, 
"max_hash_mac_addrs", "%u", dev_info.max_hash_mac_addrs); + smap_add_format(args, "max_vfs", "%u", dev_info.max_vfs); + smap_add_format(args, "max_vmdq_pools", "%u", dev_info.max_vmdq_pools); + + smap_add_format(args, "pci-vendor_id", "0x%u", dev_info.pci_dev->id.vendor_id); + smap_add_format(args, "pci-device_id", "0x%x", dev_info.pci_dev->id.device_id); + + return 0; +} + +static struct netdev_rx * +netdev_dpdk_rx_alloc(int id) +{ + struct netdev_rx_dpdk *rx = dpdk_rte_mzalloc(sizeof *rx); + + rx->queue_id = id; + ovs_assert(id < NR_QUEUE); + + return &rx->up; +} + +static struct netdev_rx_dpdk * +netdev_rx_dpdk_cast(const struct netdev_rx *rx) +{ + return CONTAINER_OF(rx, struct netdev_rx_dpdk, up); +} + +static int +netdev_dpdk_rx_construct(struct netdev_rx *rx_) +{ + struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_); + struct netdev_dpdk *netdev = netdev_dpdk_cast(rx->up.netdev); + struct rte_eth_dev *eth_dev; + int i; + + ovs_mutex_lock(&netdev->mutex); + for (i = 0; i < MAX_RX_QUEUE_LEN; i++) { + ofpbuf_init(&rx->ofpbuf[i], 0); + rx->ofpbuf[i].allocated = netdev->buf_size; + rx->ofpbuf[i].source = OFPBUF_DPDK; + } + rx->ofpbuf_cnt = 0; + rx->port_id = netdev->port_id; + + eth_dev = &rte_eth_devices[rx->port_id]; + rx->drv_rx = eth_dev->rx_pkt_burst; + rx->rx_queues = eth_dev->data->rx_queues[rx->queue_id]; + ovs_mutex_unlock(&netdev->mutex); + + return 0; +} + +static void +netdev_dpdk_rx_destruct(struct netdev_rx *rx_ OVS_UNUSED) +{ +} + +static void +netdev_dpdk_rx_dealloc(struct netdev_rx *rx_) +{ + struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_); + + rte_free(rx); +} + +static void +build_ofpbuf(struct netdev_rx_dpdk *rx, struct ofpbuf *b, struct rte_mbuf *pkt) +{ + if (b->private_p) { + struct netdev *netdev = rx->up.netdev; + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); + + rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **) &b->private_p, 1); + } + + b->private_p = pkt; + if (!pkt) { + return; + } + + b->data = pkt->pkt.data; + b->base = 
(char *)b->data - DP_NETDEV_HEADROOM - VLAN_ETH_HEADER_LEN; + packet_set_size(b, rte_pktmbuf_data_len(pkt)); +} + +static int +netdev_dpdk_rx_recv(struct netdev_rx *rx_, struct ofpbuf **rpacket, int *c) +{ + struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_); + struct rte_mbuf *burst_pkts[MAX_RX_QUEUE_LEN]; + int nb_rx; + int i; + + nb_rx = (*rx->drv_rx)(rx->rx_queues, burst_pkts, MAX_RX_QUEUE_LEN); + if (!nb_rx) { + for (i = 0; i < rx->ofpbuf_cnt; i++) { + build_ofpbuf(rx, &rx->ofpbuf[i], NULL); + } + rx->ofpbuf_cnt = 0; + return EAGAIN; + } + + i = 0; + do { + build_ofpbuf(rx, &rx->ofpbuf[i], burst_pkts[i]); + + i++; + } while (i < nb_rx); + + for (; i < rx->ofpbuf_cnt; i++) { + build_ofpbuf(rx, &rx->ofpbuf[i], NULL); + } + rx->ofpbuf_cnt = nb_rx; + *rpacket = rx->ofpbuf; + *c = nb_rx; + + return 0; +} + +static int +netdev_dpdk_rx_drain(struct netdev_rx *rx_) +{ + struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_); + int pending; + int i; + + pending = rx->ofpbuf_cnt; + if (pending) { + for (i = 0; i < pending; i++) { + build_ofpbuf(rx, &rx->ofpbuf[i], NULL); + } + rx->ofpbuf_cnt = 0; + return 0; + } + + return 0; +} + +/* Tx function. 
Transmit packets indefinitely */ +static int +dpdk_do_tx_copy(struct netdev *netdev, char *buf, int size) +{ + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); + struct rte_mbuf *pkt; + uint32_t nb_tx = 0; + + pkt = rte_pktmbuf_alloc(dev->dpdk_mp->mp); + if (!pkt) { + return 0; + } + + /* We have to do a copy for now */ + memcpy(pkt->pkt.data, buf, size); + + rte_pktmbuf_data_len(pkt) = size; + rte_pktmbuf_pkt_len(pkt) = size; + + rte_spinlock_lock(&dev->tx_lock); + nb_tx = rte_eth_tx_burst(dev->port_id, NR_QUEUE, &pkt, 1); + rte_spinlock_unlock(&dev->tx_lock); + + if (nb_tx != 1) { + /* free buffers if we couldn't transmit packets */ + rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **)&pkt, 1); + } + return nb_tx; +} + +static int +netdev_dpdk_send(struct netdev *netdev, + struct ofpbuf *ofpbuf, bool may_steal) +{ + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); + + if (ofpbuf->size > dev->max_packet_len) { + VLOG_ERR("2big size %d max_packet_len %d", + (int)ofpbuf->size , dev->max_packet_len); + return E2BIG; + } + + rte_prefetch0(&ofpbuf->private_p); + if (!may_steal || + !ofpbuf->private_p || ofpbuf->source != OFPBUF_DPDK) { + dpdk_do_tx_copy(netdev, (char *) ofpbuf->data, ofpbuf->size); + } else { + struct rte_mbuf *pkt; + uint32_t nb_tx; + int qid; + + pkt = ofpbuf->private_p; + ofpbuf->private_p = NULL; + rte_pktmbuf_data_len(pkt) = ofpbuf->size; + rte_pktmbuf_pkt_len(pkt) = ofpbuf->size; + + /* TODO: TX batching. 
*/
+        qid = rte_lcore_id() % NR_QUEUE;
+        nb_tx = rte_eth_tx_burst(dev->port_id, qid, &pkt, 1);
+        if (nb_tx != 1) {
+            /* Free the buffer if it could not be transmitted. */
+            rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **)&pkt, 1);
+            VLOG_ERR("TX error, zero packets sent");
+        }
+    }
+    return 0;
+}
+
+static int
+netdev_dpdk_set_etheraddr(struct netdev *netdev,
+                          const uint8_t mac[ETH_ADDR_LEN])
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    if (!eth_addr_equals(dev->hwaddr, mac)) {
+        memcpy(dev->hwaddr, mac, ETH_ADDR_LEN);
+    }
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_get_etheraddr(const struct netdev *netdev,
+                          uint8_t mac[ETH_ADDR_LEN])
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    memcpy(mac, dev->hwaddr, ETH_ADDR_LEN);
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    *mtup = dev->mtu;
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int old_mtu, err;
+    struct dpdk_mp *old_mp;
+    struct dpdk_mp *mp;
+
+    ovs_mutex_lock(&dpdk_mutex);
+    ovs_mutex_lock(&dev->mutex);
+    if (dev->mtu == mtu) {
+        err = 0;
+        goto out;
+    }
+
+    mp = dpdk_mp_get(dev->socket_id, mtu);
+    if (!mp) {
+        err = ENOMEM;
+        goto out;
+    }
+
+    rte_eth_dev_stop(dev->port_id);
+
+    old_mtu = dev->mtu;
+    old_mp = dev->dpdk_mp;
+    dev->dpdk_mp = mp;
+    dev->mtu = mtu;
+    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
+
+    err = dpdk_eth_dev_init(dev);
+    if (err) {
+        dpdk_mp_put(mp);
+        dev->mtu = old_mtu;
+        dev->dpdk_mp = old_mp;
+        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
+        dpdk_eth_dev_init(dev);
+        goto out;
+    }
+
+    dpdk_mp_put(old_mp);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+
ovs_mutex_unlock(&dpdk_mutex); + return err; +} + +static int +netdev_dpdk_get_stats(const struct netdev *netdev, struct netdev_stats *stats) +{ + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); + struct rte_eth_stats rte_stats; + + ovs_mutex_lock(&dev->mutex); + rte_eth_stats_get(dev->port_id, &rte_stats); + ovs_mutex_unlock(&dev->mutex); + + *stats = dev->stats_offset; + + stats->rx_packets += rte_stats.ipackets; + stats->tx_packets += rte_stats.opackets; + stats->rx_bytes += rte_stats.ibytes; + stats->tx_bytes += rte_stats.obytes; + stats->rx_errors += rte_stats.ierrors; + stats->tx_errors += rte_stats.oerrors; + stats->multicast += rte_stats.imcasts; + + return 0; +} + +static int +netdev_dpdk_set_stats(struct netdev *netdev, const struct netdev_stats *stats) +{ + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); + + ovs_mutex_lock(&dev->mutex); + dev->stats_offset = *stats; + ovs_mutex_unlock(&dev->mutex); + + return 0; +} + +static int +netdev_dpdk_get_features(const struct netdev *netdev_, + enum netdev_features *current, + enum netdev_features *advertised OVS_UNUSED, + enum netdev_features *supported OVS_UNUSED, + enum netdev_features *peer OVS_UNUSED) +{ + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_); + struct rte_eth_link link; + + rte_spinlock_lock(&dev->lsi_lock); + link = dev->link; + rte_spinlock_unlock(&dev->lsi_lock); + + if (link.link_duplex == ETH_LINK_AUTONEG_DUPLEX) { + if (link.link_speed == ETH_LINK_SPEED_AUTONEG) { + *current = NETDEV_F_AUTONEG; + } + } else if (link.link_duplex == ETH_LINK_HALF_DUPLEX) { + if (link.link_speed == ETH_LINK_SPEED_10) { + *current = NETDEV_F_10MB_HD; + } + if (link.link_speed == ETH_LINK_SPEED_100) { + *current = NETDEV_F_100MB_HD; + } + if (link.link_speed == ETH_LINK_SPEED_1000) { + *current = NETDEV_F_1GB_HD; + } + } else if (link.link_duplex == ETH_LINK_FULL_DUPLEX) { + if (link.link_speed == ETH_LINK_SPEED_10) { + *current = NETDEV_F_10MB_FD; + } + if (link.link_speed == ETH_LINK_SPEED_100) { + 
*current = NETDEV_F_100MB_FD; + } + if (link.link_speed == ETH_LINK_SPEED_1000) { + *current = NETDEV_F_1GB_FD; + } + if (link.link_speed == ETH_LINK_SPEED_10000) { + *current = NETDEV_F_10GB_FD; + } + } + + return 0; +} + +static int +netdev_dpdk_get_ifindex(const struct netdev *netdev) +{ + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); + int ifindex; + + ovs_mutex_lock(&dev->mutex); + ifindex = dev->port_id; + ovs_mutex_unlock(&dev->mutex); + + return ifindex; +} + +static int +netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier) +{ + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_); + + rte_spinlock_lock(&dev->lsi_lock); + *carrier = dev->link.link_status; + rte_spinlock_unlock(&dev->lsi_lock); + + return 0; +} + +static long long int +netdev_dpdk_get_carrier_resets(const struct netdev *netdev_) +{ + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_); + long long int carrier_resets; + + rte_spinlock_lock(&dev->lsi_lock); + carrier_resets = dev->link_reset_cnt; + rte_spinlock_unlock(&dev->lsi_lock); + + return carrier_resets; +} + +static int +netdev_dpdk_set_miimon(struct netdev *netdev_ OVS_UNUSED, + long long int interval OVS_UNUSED) +{ + return 0; +} + +static int +netdev_dpdk_update_flags__(struct netdev_dpdk *dev, + enum netdev_flags off, enum netdev_flags on, + enum netdev_flags *old_flagsp) + OVS_REQUIRES(dev->mutex) +{ + int err; + + if ((off | on) & ~(NETDEV_UP | NETDEV_PROMISC)) { + return EINVAL; + } + + *old_flagsp = dev->flags; + dev->flags |= on; + dev->flags &= ~off; + + if (dev->flags == *old_flagsp) { + return 0; + } + + rte_eth_dev_stop(dev->port_id); + + if (dev->flags & NETDEV_UP) { + err = rte_eth_dev_start(dev->port_id); + if (err) + return err; + } + + if (dev->flags & NETDEV_PROMISC) { + rte_eth_promiscuous_enable(dev->port_id); + rte_eth_allmulticast_enable(dev->port_id); + } + + return 0; +} + +static int +netdev_dpdk_update_flags(struct netdev *netdev_, + enum netdev_flags off, enum netdev_flags on, + enum 
netdev_flags *old_flagsp)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    int error;
+
+    ovs_mutex_lock(&netdev->mutex);
+    error = netdev_dpdk_update_flags__(netdev, off, on, old_flagsp);
+    ovs_mutex_unlock(&netdev->mutex);
+
+    return error;
+}
+
+static int
+netdev_dpdk_get_status(const struct netdev *netdev_, struct smap *smap)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+    struct rte_eth_dev_info dev_info;
+
+    if (dev->port_id < 0) {
+        return ENODEV;
+    }
+
+    ovs_mutex_lock(&dev->mutex);
+    rte_eth_dev_info_get(dev->port_id, &dev_info);
+    ovs_mutex_unlock(&dev->mutex);
+
+    smap_add_format(smap, "driver_name", "%s", dev_info.driver_name);
+    return 0;
+}
+
+
+/* Helper functions. */
+
+static void
+netdev_dpdk_set_admin_state__(struct netdev_dpdk *dev, bool admin_state)
+    OVS_REQUIRES(dev->mutex)
+{
+    enum netdev_flags old_flags;
+
+    if (admin_state) {
+        netdev_dpdk_update_flags__(dev, 0, NETDEV_UP, &old_flags);
+    } else {
+        netdev_dpdk_update_flags__(dev, NETDEV_UP, 0, &old_flags);
+    }
+}
+
+static void
+netdev_dpdk_set_admin_state(struct unixctl_conn *conn, int argc,
+                            const char *argv[], void *aux OVS_UNUSED)
+{
+    bool up;
+
+    if (!strcasecmp(argv[argc - 1], "up")) {
+        up = true;
+    } else if (!strcasecmp(argv[argc - 1], "down")) {
+        up = false;
+    } else {
+        unixctl_command_reply_error(conn, "Invalid Admin State");
+        return;
+    }
+
+    if (argc > 2) {
+        struct netdev *netdev = netdev_from_name(argv[1]);
+        if (netdev && is_dpdk_class(netdev->netdev_class)) {
+            struct netdev_dpdk *dpdk_dev = netdev_dpdk_cast(netdev);
+
+            ovs_mutex_lock(&dpdk_dev->mutex);
+            netdev_dpdk_set_admin_state__(dpdk_dev, up);
+            ovs_mutex_unlock(&dpdk_dev->mutex);
+
+            netdev_close(netdev);
+        } else {
+            unixctl_command_reply_error(conn, "Unknown DPDK Interface");
+            netdev_close(netdev);
+            return;
+        }
+    } else {
+        struct netdev_dpdk *netdev;
+
+        ovs_mutex_lock(&dpdk_mutex);
+        LIST_FOR_EACH (netdev, list_node, &dpdk_list) {
+            ovs_mutex_lock(&netdev->mutex);
+
netdev_dpdk_set_admin_state__(netdev, up); + ovs_mutex_unlock(&netdev->mutex); + } + ovs_mutex_unlock(&dpdk_mutex); + } + unixctl_command_reply(conn, "OK"); +} + +static int +dpdk_class_init(void) +{ + int result; + + if (rte_eal_init_ret) { + return 0; + } + + result = rte_pmd_init_all(); + if (result) { + VLOG_ERR("Cannot init xnic PMD\n"); + return result; + } + + result = rte_eal_pci_probe(); + if (result) { + VLOG_ERR("Cannot probe PCI\n"); + return result; + } + + if (rte_eth_dev_count() < 1) { + VLOG_ERR("No Ethernet devices found. Try assigning ports to UIO.\n"); + } + + VLOG_INFO("Ethernet Device Count: %d\n", (int)rte_eth_dev_count()); + + list_init(&dpdk_list); + list_init(&dpdk_mp_list); + + unixctl_command_register("netdev-dpdk/set-admin-state", + "[netdev] up|down", 1, 2, + netdev_dpdk_set_admin_state, NULL); + + return 0; +} + +static void +dpdk_class_setup_thread(int tid) +{ + cpu_set_t cpuset; + int err; + + /* Setup thread for DPDK library. */ + RTE_PER_LCORE(_lcore_id) = tid % NR_QUEUE; + + CPU_ZERO(&cpuset); + CPU_SET(rte_lcore_id(), &cpuset); + err = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset); + if (err) { + VLOG_ERR("thread affinity error %d\n",err); + } +} + +static struct netdev_class netdev_dpdk_class = { + "dpdk", + dpdk_class_init, /* init */ + NULL, /* netdev_dpdk_run */ + NULL, /* netdev_dpdk_wait */ + + netdev_dpdk_alloc, + netdev_dpdk_construct, + netdev_dpdk_destruct, + netdev_dpdk_dealloc, + dpdk_class_setup_thread, + netdev_dpdk_get_config, + NULL, /* netdev_dpdk_set_config */ + NULL, /* get_tunnel_config */ + + netdev_dpdk_send, /* send */ + NULL, /* send_wait */ + + netdev_dpdk_set_etheraddr, + netdev_dpdk_get_etheraddr, + netdev_dpdk_get_mtu, + netdev_dpdk_set_mtu, + netdev_dpdk_get_ifindex, + netdev_dpdk_get_carrier, + netdev_dpdk_get_carrier_resets, + netdev_dpdk_set_miimon, + netdev_dpdk_get_stats, + netdev_dpdk_set_stats, + netdev_dpdk_get_features, + NULL, /* set_advertisements */ + + NULL, /* 
set_policing */ + NULL, /* get_qos_types */ + NULL, /* get_qos_capabilities */ + NULL, /* get_qos */ + NULL, /* set_qos */ + NULL, /* get_queue */ + NULL, /* set_queue */ + NULL, /* delete_queue */ + NULL, /* get_queue_stats */ + NULL, /* queue_dump_start */ + NULL, /* queue_dump_next */ + NULL, /* queue_dump_done */ + NULL, /* dump_queue_stats */ + + NULL, /* get_in4 */ + NULL, /* set_in4 */ + NULL, /* get_in6 */ + NULL, /* add_router */ + NULL, /* get_next_hop */ + netdev_dpdk_get_status, + NULL, /* arp_lookup */ + + netdev_dpdk_update_flags, + + netdev_dpdk_rx_alloc, + netdev_dpdk_rx_construct, + netdev_dpdk_rx_destruct, + netdev_dpdk_rx_dealloc, + netdev_dpdk_rx_recv, + NULL, /* rx_wait */ + netdev_dpdk_rx_drain, +}; + +int +dpdk_init(int argc, char **argv) +{ + int result; + + if (strcmp(argv[1], "--dpdk")) + return 0; + argc--; + argv++; + /* Make sure things are initialized ... */ + if ((result=rte_eal_init(argc, argv)) < 0) + rte_panic("Cannot init EAL\n"); + rte_memzone_dump(); + rte_eal_init_ret = 0; + return result; +} + +void +netdev_dpdk_register(void) +{ + netdev_register_provider(&netdev_dpdk_class); +} diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h new file mode 100644 index 0000000..5cf5626 --- /dev/null +++ b/lib/netdev-dpdk.h @@ -0,0 +1,7 @@ +#ifndef __NETDEV_DPDK_H__ +#define __NETDEV_DPDK_H__ + +int dpdk_init(int argc, char **argv); +void netdev_dpdk_register(void); + +#endif diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c index 0f93363..2cb3c9b 100644 --- a/lib/netdev-dummy.c +++ b/lib/netdev-dummy.c @@ -103,6 +103,7 @@ struct netdev_dummy { FILE *tx_pcap, *rx_pcap OVS_GUARDED; struct list rxes OVS_GUARDED; /* List of child "netdev_rx_dummy"s. */ + struct ofpbuf buffer; }; /* Max 'recv_queue_len' in struct netdev_dummy. 
*/ @@ -695,7 +696,7 @@ netdev_dummy_set_config(struct netdev *netdev_, const struct smap *args) } static struct netdev_rx * -netdev_dummy_rx_alloc(void) +netdev_dummy_rx_alloc(int id OVS_UNUSED) { struct netdev_rx_dummy *rx = xzalloc(sizeof *rx); return &rx->up; @@ -739,12 +740,12 @@ netdev_dummy_rx_dealloc(struct netdev_rx *rx_) } static int -netdev_dummy_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer) +netdev_dummy_rx_recv(struct netdev_rx *rx_, struct ofpbuf **rpacket, int *c) { struct netdev_rx_dummy *rx = netdev_rx_dummy_cast(rx_); struct netdev_dummy *netdev = netdev_dummy_cast(rx->up.netdev); + struct ofpbuf *buffer = &netdev->buffer; struct ofpbuf *packet; - int retval; ovs_mutex_lock(&netdev->mutex); if (!list_is_empty(&rx->recv_queue)) { @@ -758,22 +759,19 @@ netdev_dummy_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer) if (!packet) { return EAGAIN; } + ovs_mutex_lock(&netdev->mutex); + netdev->stats.rx_packets++; + netdev->stats.rx_bytes += packet->size; + ovs_mutex_unlock(&netdev->mutex); - if (packet->size <= ofpbuf_tailroom(buffer)) { - memcpy(buffer->data, packet->data, packet->size); - buffer->size += packet->size; - retval = 0; - - ovs_mutex_lock(&netdev->mutex); - netdev->stats.rx_packets++; - netdev->stats.rx_bytes += packet->size; - ovs_mutex_unlock(&netdev->mutex); - } else { - retval = EMSGSIZE; - } - ofpbuf_delete(packet); + ofpbuf_clear(buffer); + ofpbuf_reserve_with_tailroom(buffer, DP_NETDEV_HEADROOM, packet->size); + memcpy(buffer->data, packet->data, packet->size); - return retval; + packet_set_size(packet, packet->size); + *rpacket = packet; + *c = 1; + return 0; } static void @@ -809,9 +807,12 @@ netdev_dummy_rx_drain(struct netdev_rx *rx_) } static int -netdev_dummy_send(struct netdev *netdev, const void *buffer, size_t size) +netdev_dummy_send(struct netdev *netdev, + struct ofpbuf *pkt, bool may_steal OVS_UNUSED) { struct netdev_dummy *dev = netdev_dummy_cast(netdev); + const void *buffer = pkt->data; + size_t size = 
pkt->size; if (size < ETH_HEADER_LEN) { return EMSGSIZE; @@ -987,6 +988,7 @@ static const struct netdev_class dummy_class = { netdev_dummy_construct, netdev_dummy_destruct, netdev_dummy_dealloc, + NULL, /* setup_thread */ netdev_dummy_get_config, netdev_dummy_set_config, NULL, /* get_tunnel_config */ diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c index e756d88..73ba2c2 100644 --- a/lib/netdev-linux.c +++ b/lib/netdev-linux.c @@ -426,6 +426,7 @@ struct netdev_linux { struct netdev_rx_linux { struct netdev_rx up; + struct ofpbuf pkt; bool is_tap; int fd; }; @@ -462,6 +463,7 @@ static int af_packet_sock(void); static bool netdev_linux_miimon_enabled(void); static void netdev_linux_miimon_run(void); static void netdev_linux_miimon_wait(void); +static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup); static bool is_netdev_linux_class(const struct netdev_class *netdev_class) @@ -773,7 +775,7 @@ netdev_linux_dealloc(struct netdev *netdev_) } static struct netdev_rx * -netdev_linux_rx_alloc(void) +netdev_linux_rx_alloc(int id OVS_UNUSED) { struct netdev_rx_linux *rx = xzalloc(sizeof *rx); return &rx->up; @@ -985,10 +987,24 @@ netdev_linux_rx_recv_tap(int fd, struct ofpbuf *buffer) } static int -netdev_linux_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer) +netdev_linux_rx_recv(struct netdev_rx *rx_, struct ofpbuf **rpacket, int *c) { struct netdev_rx_linux *rx = netdev_rx_linux_cast(rx_); - int retval; + struct netdev *netdev = rx->up.netdev; + struct ofpbuf *buffer; + ssize_t retval; + int mtu; + int buf_size; + + if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) { + mtu = ETH_PAYLOAD_MAX; + } + buf_size = DP_NETDEV_HEADROOM + VLAN_ETH_HEADER_LEN + mtu; + + buffer = &rx->pkt; + ofpbuf_clear(buffer); + + ofpbuf_reserve_with_tailroom(buffer, DP_NETDEV_HEADROOM, buf_size); retval = (rx->is_tap ? 
netdev_linux_rx_recv_tap(rx->fd, buffer) @@ -996,8 +1012,11 @@ netdev_linux_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer) if (retval && retval != EAGAIN && retval != EMSGSIZE) { VLOG_WARN_RL(&rl, "error receiving Ethernet packet on %s: %s", ovs_strerror(errno), netdev_rx_get_name(rx_)); + } else { + packet_set_size(buffer, buffer->size); + *rpacket = buffer; + *c = 1; } - return retval; } @@ -1036,8 +1055,11 @@ netdev_linux_rx_drain(struct netdev_rx *rx_) * The kernel maintains a packet transmission queue, so the caller is not * expected to do additional queuing of packets. */ static int -netdev_linux_send(struct netdev *netdev_, const void *data, size_t size) +netdev_linux_send(struct netdev *netdev_, struct ofpbuf *pkt, bool may_steal OVS_UNUSED) { + const void *data = pkt->data; + size_t size = pkt->size; + for (;;) { ssize_t retval; @@ -2677,6 +2699,7 @@ netdev_linux_update_flags(struct netdev *netdev_, enum netdev_flags off, CONSTRUCT, \ netdev_linux_destruct, \ netdev_linux_dealloc, \ + NULL, /* setup_thread */ \ NULL, /* get_config */ \ NULL, /* set_config */ \ NULL, /* get_tunnel_config */ \ diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h index 673d3ab..0c9f347 100644 --- a/lib/netdev-provider.h +++ b/lib/netdev-provider.h @@ -33,11 +33,11 @@ extern "C" { * Network device implementations may read these members but should not modify * them. */ struct netdev { + int nr_rx; /* The following do not change during the lifetime of a struct netdev. */ char *name; /* Name of network device. */ const struct netdev_class *netdev_class; /* Functions to control this device. */ - /* The following are protected by 'netdev_mutex' (internal to netdev.c). */ int ref_cnt; /* Times this devices was opened. */ struct shash_node *node; /* Pointer to element in global map. */ @@ -203,6 +203,10 @@ struct netdev_class { void (*destruct)(struct netdev *); void (*dealloc)(struct netdev *); + /* + * Some platform need to setup thread state. 
*/ + void (*setup_thread)(int thread_id); + /* Fetches the device 'netdev''s configuration, storing it in 'args'. * The caller owns 'args' and pre-initializes it to an empty smap. * @@ -241,7 +245,7 @@ struct netdev_class { * network device from being usefully used by the netdev-based "userspace * datapath". It will also prevent the OVS implementation of bonding from * working properly over 'netdev'.) */ - int (*send)(struct netdev *netdev, const void *buffer, size_t size); + int (*send)(struct netdev *, struct ofpbuf *buffer, bool may_steal); /* Registers with the poll loop to wake up from the next call to * poll_block() when the packet transmission queue for 'netdev' has @@ -629,7 +633,7 @@ struct netdev_class { /* Life-cycle functions for a netdev_rx. See the large comment above on * struct netdev_class. */ - struct netdev_rx *(*rx_alloc)(void); + struct netdev_rx *(*rx_alloc)(int id); int (*rx_construct)(struct netdev_rx *); void (*rx_destruct)(struct netdev_rx *); void (*rx_dealloc)(struct netdev_rx *); @@ -655,7 +659,7 @@ struct netdev_class { * * This function may be set to null if it would always return EOPNOTSUPP * anyhow. 
*/ - int (*rx_recv)(struct netdev_rx *rx, struct ofpbuf *buffer); + int (*rx_recv)(struct netdev_rx *rx, struct ofpbuf **pkt, int *c); /* Registers with the poll loop to wake up from the next call to * poll_block() when a packet is ready to be received with netdev_rx_recv() @@ -672,6 +676,7 @@ int netdev_unregister_provider(const char *type); extern const struct netdev_class netdev_linux_class; extern const struct netdev_class netdev_internal_class; extern const struct netdev_class netdev_tap_class; +extern const struct netdev_class netdev_pdk_class; #if defined(__FreeBSD__) || defined(__NetBSD__) extern const struct netdev_class netdev_bsd_class; #endif diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c index 165c1c6..ad9d2a5 100644 --- a/lib/netdev-vport.c +++ b/lib/netdev-vport.c @@ -686,6 +686,7 @@ get_stats(const struct netdev *netdev, struct netdev_stats *stats) netdev_vport_construct, \ netdev_vport_destruct, \ netdev_vport_dealloc, \ + NULL, /* setup_thread */ \ GET_CONFIG, \ SET_CONFIG, \ GET_TUNNEL_CONFIG, \ diff --git a/lib/netdev.c b/lib/netdev.c index 8e62421..f688c5c 100644 --- a/lib/netdev.c +++ b/lib/netdev.c @@ -91,6 +91,11 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); static void restore_all_flags(void *aux OVS_UNUSED); void update_device_args(struct netdev *, const struct shash *args); +int netdev_nr_rx(const struct netdev *netdev) +{ + return netdev->nr_rx; +} + static void netdev_initialize(void) OVS_EXCLUDED(netdev_class_rwlock, netdev_mutex) @@ -107,6 +112,9 @@ netdev_initialize(void) netdev_register_provider(&netdev_tap_class); netdev_vport_tunnel_register(); #endif +#ifdef DPDK_NETDEV + netdev_dpdk_register(); +#endif #if defined(__FreeBSD__) || defined(__NetBSD__) netdev_register_provider(&netdev_tap_class); netdev_register_provider(&netdev_bsd_class); @@ -326,6 +334,7 @@ netdev_open(const char *name, const char *type, struct netdev **netdevp) memset(netdev, 0, sizeof *netdev); netdev->netdev_class = rc->class; 
netdev->name = xstrdup(name); + netdev->nr_rx = 1; netdev->node = shash_add(&netdev_shash, name, netdev); list_init(&netdev->saved_flags_list); @@ -481,6 +490,20 @@ netdev_close(struct netdev *netdev) } } +void +netdev_setup_thread(int id) +{ + struct netdev_registered_class *rc; + + ovs_rwlock_rdlock(&netdev_class_rwlock); + HMAP_FOR_EACH (rc, hmap_node, &netdev_classes) { + if (rc->class->setup_thread) { + rc->class->setup_thread(id); + } + } + ovs_rwlock_unlock(&netdev_class_rwlock); +} + /* Parses 'netdev_name_', which is of the form [type@]name into its component * pieces. 'name' and 'type' must be freed by the caller. */ void @@ -508,13 +531,13 @@ netdev_parse_name(const char *netdev_name_, char **name, char **type) * Some kinds of network devices might not support receiving packets. This * function returns EOPNOTSUPP in that case.*/ int -netdev_rx_open(struct netdev *netdev, struct netdev_rx **rxp) +netdev_rx_open(struct netdev *netdev, struct netdev_rx **rxp, int id) OVS_EXCLUDED(netdev_mutex) { int error; if (netdev->netdev_class->rx_alloc) { - struct netdev_rx *rx = netdev->netdev_class->rx_alloc(); + struct netdev_rx *rx = netdev->netdev_class->rx_alloc(id); if (rx) { rx->netdev = netdev; error = netdev->netdev_class->rx_construct(rx); @@ -575,23 +598,18 @@ netdev_rx_close(struct netdev_rx *rx) * This function may be set to null if it would always return EOPNOTSUPP * anyhow. 
 */
 int
-netdev_rx_recv(struct netdev_rx *rx, struct ofpbuf *buffer)
+netdev_rx_recv(struct netdev_rx *rx, struct ofpbuf **buffer, int *c)
 {
     int retval;
 
-    ovs_assert(buffer->size == 0);
-    ovs_assert(ofpbuf_tailroom(buffer) >= ETH_TOTAL_MIN);
+    retval = rx->netdev->netdev_class->rx_recv(rx, buffer, c);
+    return retval;
+}
 
-    retval = rx->netdev->netdev_class->rx_recv(rx, buffer);
-    if (!retval) {
-        COVERAGE_INC(netdev_received);
-        if (buffer->size < ETH_TOTAL_MIN) {
-            ofpbuf_put_zeros(buffer, ETH_TOTAL_MIN - buffer->size);
-        }
-        return 0;
-    } else {
-        return retval;
-    }
+bool
+netdev_is_pmd(const struct netdev *netdev)
+{
+    return !strcmp(netdev->netdev_class->type, "dpdk");
 }
 
 /* Arranges for poll_block() to wake up when a packet is ready to be received
@@ -624,12 +642,12 @@ netdev_rx_drain(struct netdev_rx *rx)
  * Some network devices may not implement support for this function.  In such
  * cases this function will always return EOPNOTSUPP. */
 int
-netdev_send(struct netdev *netdev, const struct ofpbuf *buffer)
+netdev_send(struct netdev *netdev, struct ofpbuf *buffer, bool may_steal)
 {
     int error;
 
     error = (netdev->netdev_class->send
-             ? netdev->netdev_class->send(netdev, buffer->data, buffer->size)
+             ? netdev->netdev_class->send(netdev, buffer, may_steal)
              : EOPNOTSUPP);
     if (!error) {
         COVERAGE_INC(netdev_sent);
diff --git a/lib/netdev.h b/lib/netdev.h
index 410c35b..d5a7793 100644
--- a/lib/netdev.h
+++ b/lib/netdev.h
@@ -21,6 +21,7 @@
 #include <stdbool.h>
 #include <stddef.h>
 #include "openvswitch/types.h"
+#include "packets.h"
 
 #ifdef  __cplusplus
 extern "C" {
@@ -138,6 +139,7 @@ bool netdev_is_reserved_name(const char *name);
 int netdev_open(const char *name, const char *type, struct netdev **);
 struct netdev *netdev_ref(const struct netdev *);
 void netdev_close(struct netdev *);
+void netdev_setup_thread(int id);
 
 void netdev_parse_name(const char *netdev_name, char **name, char **type);
 
@@ -156,17 +158,18 @@ int netdev_set_mtu(const struct netdev *, int mtu);
 int netdev_get_ifindex(const struct netdev *);
 
 /* Packet reception. */
-int netdev_rx_open(struct netdev *, struct netdev_rx **);
+int netdev_rx_open(struct netdev *, struct netdev_rx **, int id);
 void netdev_rx_close(struct netdev_rx *);
 
 const char *netdev_rx_get_name(const struct netdev_rx *);
 
-int netdev_rx_recv(struct netdev_rx *, struct ofpbuf *);
+bool netdev_is_pmd(const struct netdev *netdev);
+int netdev_rx_recv(struct netdev_rx *, struct ofpbuf **, int *);
 void netdev_rx_wait(struct netdev_rx *);
 int netdev_rx_drain(struct netdev_rx *);
-
+int netdev_nr_rx(const struct netdev *netdev);
 /* Packet transmission. */
-int netdev_send(struct netdev *, const struct ofpbuf *);
+int netdev_send(struct netdev *, struct ofpbuf *, bool may_steal);
 void netdev_send_wait(struct netdev *);
 
 /* Hardware address. */
@@ -198,6 +201,10 @@ enum netdev_features {
     NETDEV_F_PAUSE_ASYM = 1 << 15, /* Asymmetric pause. */
 };
 
+/* Enough headroom to add a vlan tag, plus an extra 2 bytes to allow IP
+ * headers to be aligned on a 4-byte boundary. */
+enum { DP_NETDEV_HEADROOM = 2 + VLAN_HEADER_LEN };
+
 int netdev_get_features(const struct netdev *,
                         enum netdev_features *current,
                         enum netdev_features *advertised,
diff --git a/lib/ofpbuf.c b/lib/ofpbuf.c
index 0eed428..249fbaa 100644
--- a/lib/ofpbuf.c
+++ b/lib/ofpbuf.c
@@ -265,6 +265,9 @@ ofpbuf_resize__(struct ofpbuf *b, size_t new_headroom, size_t new_tailroom)
     new_allocated = new_headroom + b->size + new_tailroom;
 
     switch (b->source) {
+    case OFPBUF_DPDK:
+        OVS_NOT_REACHED();
+
     case OFPBUF_MALLOC:
         if (new_headroom == ofpbuf_headroom(b)) {
             new_base = xrealloc(b->base, new_allocated);
@@ -343,7 +346,7 @@ ofpbuf_prealloc_headroom(struct ofpbuf *b, size_t size)
 void
 ofpbuf_trim(struct ofpbuf *b)
 {
-    if (b->source == OFPBUF_MALLOC
+    if ((b->source == OFPBUF_MALLOC || b->source == OFPBUF_DPDK)
         && (ofpbuf_headroom(b) || ofpbuf_tailroom(b))) {
         ofpbuf_resize__(b, 0, 0);
     }
@@ -562,6 +565,8 @@ void *
ofpbuf_steal_data(struct ofpbuf *b)
 {
     void *p;
+    ovs_assert(b->source != OFPBUF_DPDK);
+
     if (b->source == OFPBUF_MALLOC && b->data == b->base) {
         p = b->data;
     } else {
diff --git a/lib/ofpbuf.h b/lib/ofpbuf.h
index 7407d8b..1f7f276 100644
--- a/lib/ofpbuf.h
+++ b/lib/ofpbuf.h
@@ -20,6 +20,7 @@
 #include <stddef.h>
 #include <stdint.h>
 #include "list.h"
+#include "packets.h"
 #include "util.h"
 
 #ifdef  __cplusplus
@@ -29,18 +30,18 @@ extern "C" {
 enum ofpbuf_source {
     OFPBUF_MALLOC,              /* Obtained via malloc(). */
     OFPBUF_STACK,               /* Un-movable stack space or static buffer. */
-    OFPBUF_STUB                 /* Starts on stack, may expand into heap. */
+    OFPBUF_STUB,                /* Starts on stack, may expand into heap. */
+    OFPBUF_DPDK,
 };
 
 /* Buffer for holding arbitrary data.  An ofpbuf is automatically reallocated
  * as necessary if it grows too large for the available memory. */
 struct ofpbuf {
     void *base;                 /* First byte of allocated space. */
-    size_t allocated;           /* Number of bytes allocated. */
-    enum ofpbuf_source source;  /* Source of memory allocated as 'base'. */
     void *data;                 /* First byte actually in use. */
+    void *private_p;            /* Private pointer for use by owner. */
     size_t size;                /* Number of bytes in use. */
+    size_t allocated;           /* Number of bytes allocated. */
 
     void *l2;                   /* Link-level header. */
     void *l2_5;                 /* MPLS label stack */
@@ -49,10 +50,10 @@ struct ofpbuf {
     void *l7;                   /* Application data. */
 
     struct list list_node;      /* Private list element for use by owner. */
-    void *private_p;            /* Private pointer for use by owner. */
+    enum ofpbuf_source source;  /* Source of memory allocated as 'base'. */
 };
 
-void ofpbuf_use(struct ofpbuf *, void *, size_t);
+void ofpbuf_use_same(struct ofpbuf *b, void *base, size_t allocated);
 void ofpbuf_use_stack(struct ofpbuf *, void *, size_t);
 void ofpbuf_use_stub(struct ofpbuf *, void *, size_t);
 void ofpbuf_use_const(struct ofpbuf *, const void *, size_t);
diff --git a/lib/packets.c b/lib/packets.c
index 0d63841..525c084 100644
--- a/lib/packets.c
+++ b/lib/packets.c
@@ -990,3 +990,12 @@ packet_format_tcp_flags(struct ds *s, uint16_t tcp_flags)
         ds_put_cstr(s, "[800]");
     }
 }
+
+void
+packet_set_size(struct ofpbuf *b, int size)
+{
+    b->size = size;
+    if (b->size < ETH_TOTAL_MIN) {
+        ofpbuf_put_zeros(b, ETH_TOTAL_MIN - b->size);
+    }
+}
diff --git a/lib/packets.h b/lib/packets.h
index 8e21fa8..dcf3c3d 100644
--- a/lib/packets.h
+++ b/lib/packets.h
@@ -656,4 +656,5 @@ uint16_t packet_get_tcp_flags(const struct ofpbuf *, const struct flow *);
 void packet_format_tcp_flags(struct ds *, uint16_t);
 const char *packet_tcp_flag_to_string(uint32_t flag);
+void packet_set_size(struct ofpbuf *b, int size);
 
 #endif /* packets.h */
diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
index 990e58f..9bedd6c 100644
--- a/vswitchd/ovs-vswitchd.c
+++ b/vswitchd/ovs-vswitchd.c
@@ -49,6 +49,7 @@
 #include "vconn.h"
 #include "vlog.h"
 #include "lib/vswitch-idl.h"
+#include "lib/netdev-dpdk.h"
 
 VLOG_DEFINE_THIS_MODULE(vswitchd);
 
@@ -71,6 +72,12 @@ main(int argc, char *argv[])
     bool exiting;
     int retval;
 
+#ifdef DPDK_NETDEV
+    retval = dpdk_init(argc,argv);
+    argc -= retval;
+    argv += retval;
+#endif
+
     proctitle_init(argc, argv);
     set_program_name(argv[0]);
     remote = parse_options(argc, argv, &unixctl_path);
@@ -145,7 +152,8 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         OPT_BOOTSTRAP_CA_CERT,
         OPT_ENABLE_DUMMY,
         OPT_DISABLE_SYSTEM,
-        DAEMON_OPTION_ENUMS
+        DAEMON_OPTION_ENUMS,
+        OPT_DPDK,
     };
     static const struct option long_options[] = {
         {"help", no_argument, NULL, 'h'},
@@ -159,6 +167,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         {"bootstrap-ca-cert", required_argument, NULL, OPT_BOOTSTRAP_CA_CERT},
         {"enable-dummy", optional_argument, NULL, OPT_ENABLE_DUMMY},
         {"disable-system", no_argument, NULL, OPT_DISABLE_SYSTEM},
+        {"dpdk", required_argument, NULL, OPT_DPDK},
         {NULL, 0, NULL, 0},
     };
     char *short_options = long_options_to_short_options(long_options);
@@ -210,6 +219,9 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         case '?':
             exit(EXIT_FAILURE);
 
+        case OPT_DPDK:
+            break;
+
         default:
             abort();
         }
-- 
1.7.9.5