[dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

DPDK patches and discussions

* [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
@ 2014-01-28  1:48 pshelar
       [not found] ` <20140128044950.GA4545@nicira.com>
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: pshelar @ 2014-01-28  1:48 UTC (permalink / raw)
  To: dev, dev, dpdk-ovs; +Cc: Gerald Rogers

From: Pravin B Shelar <pshelar@nicira.com>

Following patch adds DPDK netdev-class to userspace datapath.
Approach taken in this patch differs from Intel® DPDK vSwitch
where DPDK datapath switching is done in a separate process.  This
patch adds support for DPDK type port and uses OVS userspace
datapath for switching.  Therefore all DPDK processing and flow
miss handling is done in single process.  This also avoids code
duplication by reusing OVS userspace datapath switching and
therefore it supports all flow matching and actions that
user-space datapath supports.  Refer to INSTALL.DPDK doc for
further info.

With this patch I got similar performance for netperf TCP_STREAM
tests compared to kernel datapath.

This is based on a patch from Gerald Rogers.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
CC: "Gerald Rogers" <gerald.rogers@intel.com>
---
This patch is tested on latest OVS master (commit 9d0581fdf22bec79).
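
A note on how pmd_load_queues() spreads rx queues over the polling
threads: it walks every rx queue of every PMD port with one running
index, and thread 'qid' takes the queues for which
index % n_pmd_threads == qid.  The following is only a minimal
standalone sketch of that rule, not the patch code; struct sketch_port
and sketch_count_queues() are hypothetical names made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a port: only the rx queue count matters. */
struct sketch_port {
    int n_rxq;
};

/* Counts the rx queues that polling thread 'qid' would pick up when
 * 'n_threads' threads share all queues of 'ports' round-robin. */
static int
sketch_count_queues(const struct sketch_port *ports, size_t n_ports,
                    int n_threads, int qid)
{
    int index = 0;      /* Running index over all queues of all ports. */
    int count = 0;
    size_t p;
    int q;

    assert(n_threads > 0 && qid >= 0 && qid < n_threads);
    for (p = 0; p < n_ports; p++) {
        for (q = 0; q < ports[p].n_rxq; q++) {
            if (index % n_threads == qid) {
                count++;
            }
            index++;
        }
    }
    return count;
}
```

With two ports of 8 rx queues each and 8 threads, every thread ends up
with 2 queues under this rule.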
---
 INSTALL                 |    1 +
 INSTALL.DPDK            |   85 ++++
 Makefile.am             |    1 +
 acinclude.m4            |   40 ++
 configure.ac            |    1 +
 lib/automake.mk         |    6 +
 lib/dpif-netdev.c       |  393 +++++++++++-----
 lib/netdev-dpdk.c       | 1152 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/netdev-dpdk.h       |    7 +
 lib/netdev-dummy.c      |   38 +-
 lib/netdev-linux.c      |   33 +-
 lib/netdev-provider.h   |   13 +-
 lib/netdev-vport.c      |    1 +
 lib/netdev.c            |   52 ++-
 lib/netdev.h            |   15 +-
 lib/ofpbuf.c            |    7 +-
 lib/ofpbuf.h            |   13 +-
 lib/packets.c           |    9 +
 lib/packets.h           |    1 +
 vswitchd/ovs-vswitchd.c |   14 +-
 20 files changed, 1702 insertions(+), 180 deletions(-)
 create mode 100644 INSTALL.DPDK
 create mode 100644 lib/netdev-dpdk.c
 create mode 100644 lib/netdev-dpdk.h
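
A note on the reload signaling: dp_netdev_reload_pmd_threads() bumps
each thread's change_seq, and pmd_thread_main() compares it with its
saved copy only once every so many loop iterations, so the polling loop
itself makes no system calls.  A rough standalone sketch of that
handshake follows.  It assumes C11 <stdatomic.h> rather than the OVS
atomic wrappers used in the patch, and the sketch_* names and the check
interval are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical: how many polling iterations pass between checks. */
enum { SKETCH_CHECK_INTERVAL = 1024 };

struct sketch_pmd {
    atomic_uint change_seq;   /* Bumped by the control path. */
    unsigned int seen_seq;    /* Last value the polling loop acted on. */
    unsigned long lc;         /* Iterations since the last check. */
};

static void
sketch_pmd_init(struct sketch_pmd *pmd)
{
    atomic_init(&pmd->change_seq, 1);
    pmd->seen_seq = 1;
    pmd->lc = 0;
}

/* Control path: ask the polling thread to rebuild its poll list. */
static void
sketch_pmd_request_reload(struct sketch_pmd *pmd)
{
    atomic_fetch_add(&pmd->change_seq, 1);
}

/* Polling thread: call once per loop iteration.  Returns true when the
 * sequence number changed since the last check, which happens at most
 * once per SKETCH_CHECK_INTERVAL iterations. */
static bool
sketch_pmd_should_reload(struct sketch_pmd *pmd)
{
    unsigned int cur;

    if (++pmd->lc < SKETCH_CHECK_INTERVAL) {
        return false;
    }
    pmd->lc = 0;
    cur = atomic_load(&pmd->change_seq);
    if (cur == pmd->seen_seq) {
        return false;
    }
    pmd->seen_seq = cur;
    return true;
}
```

The trade-off is latency: a port add or delete is noticed only at the
next check, so a larger interval means cheaper polling but slower
reaction to configuration changes.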

diff --git a/INSTALL b/INSTALL
index 001d3cb..74cd278 100644
--- a/INSTALL
+++ b/INSTALL
@@ -10,6 +10,7 @@ on a specific platform, please see one of these files:
     - INSTALL.RHEL
     - INSTALL.XenServer
     - INSTALL.NetBSD
+    - INSTALL.DPDK
 
 Build Requirements
 ------------------
diff --git a/INSTALL.DPDK b/INSTALL.DPDK
new file mode 100644
index 0000000..1c95104
--- /dev/null
+++ b/INSTALL.DPDK
@@ -0,0 +1,85 @@
+                   Using Open vSwitch with DPDK
+                   ============================
+
+Open vSwitch can use Intel(R) DPDK lib to operate entirely in
+userspace. This file explains how to install and use Open vSwitch in
+such a mode.
+
+The DPDK support of Open vSwitch is considered experimental.
+It has not been thoroughly tested.
+
+This version of Open vSwitch should be built manually with "configure"
+and "make".
+
+Building and Installing:
+------------------------
+
+DPDK:
+cd DPDK
+make install T=x86_64-default-linuxapp-gcc
+Refer to http://dpdk.org/ for details of requirements.
+
+Linux kernel:
+Refer to intel-dpdk-getting-started-guide.pdf to understand the
+DPDK kernel requirements.
+
+OVS:
+cd $(OVS_DIR)/openvswitch
+./boot.sh
+./configure --with-dpdk=$(DPDK_BUILD)
+make
+
+Refer to INSTALL.userspace for general requirements of building
+userspace OVS.
+
+Using the DPDK with ovs-vswitchd:
+---------------------------------
+
+First set up DPDK devices:
+  - insert igb_uio.ko
+    e.g. insmod DPDK/x86_64-default-linuxapp-gcc/kmod/igb_uio.ko
+  - mount hugefs
+    e.g. mount -t hugetlbfs -o pagesize=1G none /mnt/huge/
+  - Bind network device to igb_uio.
+    e.g. DPDK/tools/pci_unbind.py --bind=igb_uio eth1
+
+Refer to http://www.dpdk.org/doc/quick-start for verifying DPDK setup.
+
+Start vswitchd:
+DPDK configuration arguments can be passed to vswitchd via `--dpdk`
+argument.
+   e.g.
+   ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK  --pidfile --detach
+
+To use ovs-vswitchd with DPDK, create a bridge with datapath_type
+"netdev" in the configuration database.  For example:
+
+    ovs-vsctl add-br br0
+    ovs-vsctl set bridge br0 datapath_type=netdev
+
+Now you can add dpdk devices. OVS expects DPDK device names to start with dpdk
+and end with the port id. vswitchd should print the number of dpdk devices found.
+
+    ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
+
+Once the first DPDK port is added, vswitchd creates a polling thread and
+polls the dpdk device in a continuous loop. Therefore CPU utilization
+for that thread is always 100%.
+
+Restrictions:
+-------------
+
+  - This support is for physical NICs. I have tested with Intel NICs only.
+  - vswitchd userspace datapath does set polling thread affinity, but it is
+    assumed that devices are on NUMA node 0. Therefore, if a device is
+    attached to a non-zero NUMA node, switching performance would be
+    suboptimal.
+  - There is a fixed number of polling threads and a fixed number of
+    per-device queues configured.
+  - Works with 1500 MTU only; the DPDK lib needs a few changes to fix this issue.
+  - Currently DPDK ports do not make use of any offload functionality.
+
+Bug Reporting:
+--------------
+
+Please report problems to bugs@openvswitch.org.
diff --git a/Makefile.am b/Makefile.am
index 32775cc..4d53dd4 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -58,6 +58,7 @@ EXTRA_DIST = \
 	FAQ \
 	INSTALL \
 	INSTALL.Debian \
+	INSTALL.DPDK \
 	INSTALL.Fedora \
 	INSTALL.KVM \
 	INSTALL.Libvirt \
diff --git a/acinclude.m4 b/acinclude.m4
index 8ff5828..01d39bf 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -157,6 +157,46 @@ AC_DEFUN([OVS_CHECK_LINUX], [
   AM_CONDITIONAL(LINUX_ENABLED, test -n "$KBUILD")
 ])
 
+dnl OVS_CHECK_DPDK
+dnl
+dnl Configure DPDK source tree
+AC_DEFUN([OVS_CHECK_DPDK], [
+  AC_ARG_WITH([dpdk],
+              [AC_HELP_STRING([--with-dpdk=/path/to/dpdk],
+                              [Specify the DPDK build directory])])
+
+  if test X"$with_dpdk" != X; then
+    RTE_SDK=$with_dpdk
+
+    DPDK_INCLUDE=$RTE_SDK/include
+    DPDK_LIB_DIR=$RTE_SDK/lib
+    DPDK_LIBS="$DPDK_LIB_DIR/libethdev.a \
+              $DPDK_LIB_DIR/librte_cmdline.a \
+              $DPDK_LIB_DIR/librte_hash.a \
+              $DPDK_LIB_DIR/librte_lpm.a \
+              $DPDK_LIB_DIR/librte_mbuf.a \
+              $DPDK_LIB_DIR/librte_mempool.a \
+              $DPDK_LIB_DIR/librte_eal.a \
+              $DPDK_LIB_DIR/librte_pmd_ring.a \
+              $DPDK_LIB_DIR/librte_malloc.a \
+              $DPDK_LIB_DIR/librte_pmd_ixgbe.a \
+              $DPDK_LIB_DIR/librte_pmd_e1000.a \
+              $DPDK_LIB_DIR/librte_pmd_virtio.a \
+              $DPDK_LIB_DIR/librte_ring.a"
+
+    LIBS="$DPDK_LIBS $LIBS"
+    CPPFLAGS="-I$DPDK_INCLUDE $CPPFLAGS"
+    SLICE_SIZE="4194304"
+    SLICE_SIZE_MAX="1073741824"
+    LDFLAGS="$LDFLAGS -Wl,-hugetlbfs-align,-zcommon-page-size=$SLICE_SIZE,-zmax-page-size=$SLICE_SIZE"
+    AC_DEFINE([DPDK_NETDEV], [1], [System uses the DPDK module.])
+  else
+    RTE_SDK=
+  fi
+
+  AM_CONDITIONAL(DPDK_NETDEV, test -n "$RTE_SDK")
+])
+
 dnl OVS_GREP_IFELSE(FILE, REGEX, [IF-MATCH], [IF-NO-MATCH])
 dnl
 dnl Greps FILE for REGEX.  If it matches, runs IF-MATCH, otherwise IF-NO-MATCH.
diff --git a/configure.ac b/configure.ac
index 19c095e..30dbe39 100644
--- a/configure.ac
+++ b/configure.ac
@@ -119,6 +119,7 @@ OVS_ENABLE_SPARSE
 AC_ARG_VAR(KARCH, [Kernel Architecture String])
 AC_SUBST(KARCH)
 OVS_CHECK_LINUX
+OVS_CHECK_DPDK
 
 AC_CONFIG_FILES(Makefile)
 AC_CONFIG_FILES(datapath/Makefile)
diff --git a/lib/automake.mk b/lib/automake.mk
index 2ef806e..ffbecdb 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -289,6 +289,12 @@ lib_libopenvswitch_la_SOURCES += \
 	lib/route-table.h
 endif
 
+if DPDK_NETDEV
+lib_libopenvswitch_la_SOURCES += \
+	lib/netdev-dpdk.c \
+	lib/netdev-dpdk.h
+endif
+
 if HAVE_POSIX_AIO
 lib_libopenvswitch_la_SOURCES += lib/async-append-aio.c
 else
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index cb64bdc..f55732b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -44,6 +44,7 @@
 #include "meta-flow.h"
 #include "netdev.h"
 #include "netdev-vport.h"
+#include "netdev-dpdk.h"
 #include "netlink.h"
 #include "odp-execute.h"
 #include "odp-util.h"
@@ -64,14 +65,12 @@ VLOG_DEFINE_THIS_MODULE(dpif_netdev);
 
 /* By default, choose a priority in the middle. */
 #define NETDEV_RULE_PRIORITY 0x8000
+/* TODO: number of threads should be configurable. */
+#define NR_THREADS 8
 
 /* Configuration parameters. */
 enum { MAX_FLOWS = 65536 };     /* Maximum number of flows in flow table. */
 
-/* Enough headroom to add a vlan tag, plus an extra 2 bytes to allow IP
- * headers to be aligned on a 4-byte boundary.  */
-enum { DP_NETDEV_HEADROOM = 2 + VLAN_HEADER_LEN };
-
 /* Queues. */
 enum { N_QUEUES = 2 };          /* Number of queues for dpif_recv(). */
 enum { MAX_QUEUE_LEN = 128 };   /* Maximum number of packets per queue. */
@@ -162,8 +161,9 @@ struct dp_netdev {
 
     /* Forwarding threads. */
     struct latch exit_latch;
-    struct dp_forwarder *forwarders;
-    size_t n_forwarders;
+    struct pmd_thread *pmd_threads;
+    size_t n_pmd_threads;
+    int pmd_count;
 };
 
 static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,
@@ -172,12 +172,14 @@ static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,
 
 /* A port in a netdev-based datapath. */
 struct dp_netdev_port {
-    struct hmap_node node;      /* Node in dp_netdev's 'ports'. */
-    odp_port_t port_no;
+    struct pkt_metadata md;
+    struct netdev_rx **rx;
     struct netdev *netdev;
+    odp_port_t port_no;
     struct netdev_saved_flags *sf;
-    struct netdev_rx *rx;
     char *type;                 /* Port type as requested by user. */
+    struct ovs_refcount ref_cnt;
+    struct hmap_node node;      /* Node in dp_netdev's 'ports'. */
 };
 
 /* A flow in dp_netdev's 'flow_table'.
@@ -289,11 +291,12 @@ void dp_netdev_actions_unref(struct dp_netdev_actions *);
 
 /* A thread that receives packets from some ports, looks them up in the flow
  * table, and executes the actions it finds. */
-struct dp_forwarder {
+struct pmd_thread {
     struct dp_netdev *dp;
-    pthread_t thread;
+    int qid;
+    atomic_uint change_seq;
     char *name;
-    uint32_t min_hash, max_hash;
+    pthread_t thread;
 };
 
 /* Interface to netdev-based datapath. */
@@ -332,7 +335,7 @@ static void dp_netdev_execute_actions(struct dp_netdev *dp,
 static void dp_netdev_port_input(struct dp_netdev *dp, struct ofpbuf *packet,
                                  struct pkt_metadata *)
     OVS_REQ_RDLOCK(dp->port_rwlock);
-static void dp_netdev_set_threads(struct dp_netdev *, int n);
+static void dp_netdev_set_pmd_threads(struct dp_netdev *, int n);
 
 static struct dpif_netdev *
 dpif_netdev_cast(const struct dpif *dpif)
@@ -478,7 +481,6 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
         dp_netdev_free(dp);
         return error;
     }
-    dp_netdev_set_threads(dp, 2);
 
     *dpp = dp;
     return 0;
@@ -536,8 +538,8 @@ dp_netdev_free(struct dp_netdev *dp)
 
     shash_find_and_delete(&dp_netdevs, dp->name);
 
-    dp_netdev_set_threads(dp, 0);
-    free(dp->forwarders);
+    dp_netdev_set_pmd_threads(dp, 0);
+    free(dp->pmd_threads);
 
     dp_netdev_flow_flush(dp);
     ovs_rwlock_wrlock(&dp->port_rwlock);
@@ -621,18 +623,30 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
     return 0;
 }
 
+static void
+dp_netdev_reload_pmd_threads(struct dp_netdev *dp)
+{
+    int i;
+
+    for (i = 0; i < dp->n_pmd_threads; i++) {
+        struct pmd_thread *f = &dp->pmd_threads[i];
+        int id;
+
+        atomic_add(&f->change_seq, 1, &id);
+   }
+}
+
 static int
 do_add_port(struct dp_netdev *dp, const char *devname, const char *type,
             odp_port_t port_no)
-    OVS_REQ_WRLOCK(dp->port_rwlock)
 {
     struct netdev_saved_flags *sf;
     struct dp_netdev_port *port;
     struct netdev *netdev;
-    struct netdev_rx *rx;
     enum netdev_flags flags;
     const char *open_type;
     int error;
+    int i;
 
     /* XXX reject devices already in some dp_netdev. */
 
@@ -651,28 +665,41 @@ do_add_port(struct dp_netdev *dp, const char *devname, const char *type,
         return EINVAL;
     }
 
-    error = netdev_rx_open(netdev, &rx);
-    if (error
-        && !(error == EOPNOTSUPP && dpif_netdev_class_is_dummy(dp->class))) {
-        VLOG_ERR("%s: cannot receive packets on this network device (%s)",
-                 devname, ovs_strerror(errno));
-        netdev_close(netdev);
-        return error;
+    port = xzalloc(sizeof *port);
+    port->port_no = port_no;
+    port->md = PKT_METADATA_INITIALIZER(port->port_no);
+    port->netdev = netdev;
+    port->rx = xmalloc(sizeof *port->rx * netdev_nr_rx(netdev));
+    port->type = xstrdup(type);
+    for (i = 0; i < netdev_nr_rx(netdev); i++) {
+        error = netdev_rx_open(netdev, &port->rx[i], i);
+        if (error
+            && !(error == EOPNOTSUPP && dpif_netdev_class_is_dummy(dp->class))) {
+            VLOG_ERR("%s: cannot receive packets on this network device (%s)",
+                     devname, ovs_strerror(error));
+            netdev_close(netdev);
+            return error;
+        }
     }
 
     error = netdev_turn_flags_on(netdev, NETDEV_PROMISC, &sf);
     if (error) {
-        netdev_rx_close(rx);
+        for (i = 0; i < netdev_nr_rx(netdev); i++) {
+            netdev_rx_close(port->rx[i]);
+        }
         netdev_close(netdev);
+        free(port->rx);
+        free(port);
         return error;
     }
-
-    port = xmalloc(sizeof *port);
-    port->port_no = port_no;
-    port->netdev = netdev;
     port->sf = sf;
-    port->rx = rx;
-    port->type = xstrdup(type);
+
+    if (netdev_is_pmd(netdev)) {
+        dp->pmd_count++;
+        dp_netdev_set_pmd_threads(dp, NR_THREADS);
+        dp_netdev_reload_pmd_threads(dp);
+    }
+    ovs_refcount_init(&port->ref_cnt);
 
     hmap_insert(&dp->ports, &port->node, hash_int(odp_to_u32(port_no), 0));
     seq_change(dp->port_seq);
@@ -772,6 +799,32 @@ get_port_by_name(struct dp_netdev *dp,
     return ENOENT;
 }
 
+static void
+port_ref(struct dp_netdev_port *port)
+{
+    if (port) {
+        ovs_refcount_ref(&port->ref_cnt);
+    }
+}
+
+static void
+port_unref(struct dp_netdev_port *port)
+{
+    if (port && ovs_refcount_unref(&port->ref_cnt) == 1) {
+        int i;
+
+        netdev_restore_flags(port->sf);
+        for (i = 0; i < netdev_nr_rx(port->netdev); i++) {
+            netdev_rx_close(port->rx[i]);
+        }
+        free(port->rx);
+        netdev_close(port->netdev);
+        free(port->type);
+        ovs_refcount_destroy(&port->ref_cnt);
+        free(port);
+    }
+}
+
 static int
 do_del_port(struct dp_netdev *dp, odp_port_t port_no)
     OVS_REQ_WRLOCK(dp->port_rwlock)
@@ -783,16 +836,13 @@ do_del_port(struct dp_netdev *dp, odp_port_t port_no)
     if (error) {
         return error;
     }
-
     hmap_remove(&dp->ports, &port->node);
     seq_change(dp->port_seq);
+    if (netdev_is_pmd(port->netdev)) {
+        dp_netdev_reload_pmd_threads(dp);
+    }
 
-    netdev_close(port->netdev);
-    netdev_restore_flags(port->sf);
-    netdev_rx_close(port->rx);
-    free(port->type);
-    free(port);
-
+    port_unref(port);
     return 0;
 }
 
@@ -1543,123 +1593,215 @@ dp_netdev_actions_unref(struct dp_netdev_actions *actions)
     }
 }
 \f
-static void *
-dp_forwarder_main(void *f_)
+
+static void
+dp_netdev_process_rx_port(struct dp_netdev *dp,
+                          struct dp_netdev_port *port,
+                          struct netdev_rx *queue)
+{
+    struct ofpbuf *packet;
+    struct pkt_metadata *md = &port->md;
+    int error, c;
+
+    error = netdev_rx_recv(queue, &packet, &c);
+    if (!error) {
+        int i;
+
+        for (i = 0; i < c; i++) {
+            dp_netdev_port_input(dp, packet, md);
+            packet++;
+        }
+
+    } else if (error != EAGAIN && error != EOPNOTSUPP) {
+        static struct vlog_rate_limit rl
+            = VLOG_RATE_LIMIT_INIT(1, 5);
+
+        VLOG_ERR_RL(&rl, "error receiving data from %s: %s",
+                    netdev_get_name(port->netdev),
+                    ovs_strerror(error));
+    }
+}
+
+static void
+dpif_netdev_run(struct dpif *dpif)
+{
+    struct dp_netdev_port *port;
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+
+    ovs_rwlock_rdlock(&dp->port_rwlock);
+
+    HMAP_FOR_EACH (port, node, &dp->ports) {
+        if (port->rx[0] && !netdev_is_pmd(port->netdev)) {
+            dp_netdev_process_rx_port(dp, port, port->rx[0]);
+        }
+    }
+
+    ovs_rwlock_unlock(&dp->port_rwlock);
+}
+
+static void
+dpif_netdev_wait(struct dpif *dpif)
+{
+    struct dp_netdev_port *port;
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+
+    ovs_rwlock_rdlock(&dp->port_rwlock);
+
+    HMAP_FOR_EACH (port, node, &dp->ports) {
+        if (port->rx[0] && !netdev_is_pmd(port->netdev)) {
+            netdev_rx_wait(port->rx[0]);
+        }
+    }
+    ovs_rwlock_unlock(&dp->port_rwlock);
+}
+
+struct rx_poll {
+    struct dp_netdev_port *port;
+    struct netdev_rx *rx;
+};
+
+static int
+pmd_load_queues(struct pmd_thread *f,
+                struct rx_poll **ppoll_list, int poll_cnt)
 {
-    struct dp_forwarder *f = f_;
     struct dp_netdev *dp = f->dp;
-    struct ofpbuf packet;
+    struct rx_poll *poll_list = *ppoll_list;
+    struct dp_netdev_port *port;
+    int qid = f->qid;
+    int index;
+    int i;
 
-    f->name = xasprintf("forwarder_%u", ovsthread_id_self());
-    set_subprogram_name("%s", f->name);
+    /* Simple scheduler for netdev rx polling. */
+    ovs_rwlock_rdlock(&dp->port_rwlock);
+    for (i = 0; i < poll_cnt; i++) {
+         port_unref(poll_list[i].port);
+    }
 
-    ofpbuf_init(&packet, 0);
-    while (!latch_is_set(&dp->exit_latch)) {
-        bool received_anything;
-        int i;
+    free(poll_list);
+    poll_cnt = 0;
+    index = 0;
+
+    HMAP_FOR_EACH (port, node, &f->dp->ports) {
+        if (netdev_is_pmd(port->netdev)) {
+            for (i = 0; i < netdev_nr_rx(port->netdev); i++) {
 
-        ovs_rwlock_rdlock(&dp->port_rwlock);
-        for (i = 0; i < 50; i++) {
-            struct dp_netdev_port *port;
-
-            received_anything = false;
-            HMAP_FOR_EACH (port, node, &f->dp->ports) {
-                if (port->rx
-                    && port->node.hash >= f->min_hash
-                    && port->node.hash <= f->max_hash) {
-                    int buf_size;
-                    int error;
-                    int mtu;
-
-                    if (netdev_get_mtu(port->netdev, &mtu)) {
-                        mtu = ETH_PAYLOAD_MAX;
-                    }
-                    buf_size = DP_NETDEV_HEADROOM + VLAN_ETH_HEADER_LEN + mtu;
-
-                    ofpbuf_clear(&packet);
-                    ofpbuf_reserve_with_tailroom(&packet, DP_NETDEV_HEADROOM,
-                                                 buf_size);
-
-                    error = netdev_rx_recv(port->rx, &packet);
-                    if (!error) {
-                        struct pkt_metadata md
-                            = PKT_METADATA_INITIALIZER(port->port_no);
-                        dp_netdev_port_input(dp, &packet, &md);
-
-                        received_anything = true;
-                    } else if (error != EAGAIN && error != EOPNOTSUPP) {
-                        static struct vlog_rate_limit rl
-                            = VLOG_RATE_LIMIT_INIT(1, 5);
-
-                        VLOG_ERR_RL(&rl, "error receiving data from %s: %s",
-                                    netdev_get_name(port->netdev),
-                                    ovs_strerror(error));
-                    }
+                if ((index % dp->n_pmd_threads) == qid) {
+                    port_ref(port);
+                    poll_cnt++;
                 }
+                index++;
             }
+        }
+    }
 
-            if (!received_anything) {
-                break;
+    poll_list = xzalloc(sizeof *poll_list * poll_cnt);
+    poll_cnt = 0;
+    index = 0;
+
+    HMAP_FOR_EACH (port, node, &f->dp->ports) {
+        if (netdev_is_pmd(port->netdev)) {
+            for (i = 0; i < netdev_nr_rx(port->netdev); i++) {
+
+                if ((index % dp->n_pmd_threads) == qid) {
+                    poll_list[poll_cnt].port = port;
+                    poll_list[poll_cnt].rx = port->rx[i];
+                    poll_cnt++;
+                    VLOG_INFO("poll_cnt %d port = %d i = %d",poll_cnt,port->port_no, i);
+                }
+                index++;
             }
         }
+    }
 
-        if (received_anything) {
-            poll_immediate_wake();
-        } else {
-            struct dp_netdev_port *port;
+    ovs_rwlock_unlock(&dp->port_rwlock);
 
-            HMAP_FOR_EACH (port, node, &f->dp->ports)
-                if (port->rx
-                    && port->node.hash >= f->min_hash
-                    && port->node.hash <= f->max_hash) {
-                    netdev_rx_wait(port->rx);
-                }
-            seq_wait(dp->port_seq, seq_read(dp->port_seq));
-            latch_wait(&dp->exit_latch);
+    *ppoll_list = poll_list;
+    return poll_cnt;
+}
+
+static void *
+pmd_thread_main(void *f_)
+{
+    struct pmd_thread *f = f_;
+    struct dp_netdev *dp = f->dp;
+    unsigned long lc = 0;
+    struct rx_poll *poll_list;
+    unsigned int port_seq;
+    int poll_cnt;
+
+    f->name = xasprintf("pmd_%u", ovsthread_id_self());
+    set_subprogram_name("%s", f->name);
+    netdev_setup_thread(f->qid);
+    poll_cnt = 0;
+    poll_list = NULL;
+
+reload:
+    poll_cnt = pmd_load_queues(f, &poll_list, poll_cnt);
+    atomic_read(&f->change_seq, &port_seq);
+
+    while (1) {
+        unsigned int c_port_seq;
+        int i;
+
+        for (i = 0; i < poll_cnt; i++) {
+            dp_netdev_process_rx_port(dp,  poll_list[i].port, poll_list[i].rx);
+        }
+
+        if (lc++ > (64 * 1024 * 1024)) {
+            /* TODO: need a completely userspace-based signaling method
+             * to keep this thread entirely in userspace.
+             * For now, use an atomic counter. */
+            lc = 0;
+            atomic_read(&f->change_seq, &c_port_seq);
+            if (c_port_seq != port_seq) {
+                break;
+            }
         }
-        ovs_rwlock_unlock(&dp->port_rwlock);
+    }
 
-        poll_block();
+    if (!latch_is_set(&f->dp->exit_latch)){
+        goto reload;
     }
-    ofpbuf_uninit(&packet);
 
+    free(poll_list);
     free(f->name);
-
     return NULL;
 }
 
 static void
-dp_netdev_set_threads(struct dp_netdev *dp, int n)
+dp_netdev_set_pmd_threads(struct dp_netdev *dp, int n)
 {
     int i;
 
-    if (n == dp->n_forwarders) {
+    if (n == dp->n_pmd_threads) {
         return;
     }
 
     /* Stop existing threads. */
     latch_set(&dp->exit_latch);
-    for (i = 0; i < dp->n_forwarders; i++) {
-        struct dp_forwarder *f = &dp->forwarders[i];
+    dp_netdev_reload_pmd_threads(dp);
+    for (i = 0; i < dp->n_pmd_threads; i++) {
+        struct pmd_thread *f = &dp->pmd_threads[i];
 
         xpthread_join(f->thread, NULL);
     }
     latch_poll(&dp->exit_latch);
-    free(dp->forwarders);
+    free(dp->pmd_threads);
 
     /* Start new threads. */
-    dp->forwarders = xmalloc(n * sizeof *dp->forwarders);
-    dp->n_forwarders = n;
+    dp->pmd_threads = xmalloc(n * sizeof *dp->pmd_threads);
+    dp->n_pmd_threads = n;
+
     for (i = 0; i < n; i++) {
-        struct dp_forwarder *f = &dp->forwarders[i];
+        struct pmd_thread *f = &dp->pmd_threads[i];
 
         f->dp = dp;
-        f->min_hash = UINT32_MAX / n * i;
-        f->max_hash = UINT32_MAX / n * (i + 1) - 1;
-        if (i == n - 1) {
-            f->max_hash = UINT32_MAX;
-        }
-        xpthread_create(&f->thread, NULL, dp_forwarder_main, f);
+        f->qid = i;
+        atomic_store(&f->change_seq, 1);
+
+        /* The threads distribute the rx queues of all devices among
+         * themselves. */
+        xpthread_create(&f->thread, NULL, pmd_thread_main, f);
     }
 }
 \f
@@ -1683,6 +1825,7 @@ dp_netdev_port_input(struct dp_netdev *dp, struct ofpbuf *packet,
     struct flow key;
 
     if (packet->size < ETH_HEADER_LEN) {
+        VLOG_ERR("%s small pkt %d\n",__func__,(int) packet->size);
         return;
     }
     flow_extract(packet, md->skb_priority, md->pkt_mark, &md->tunnel,
@@ -1743,9 +1886,11 @@ dp_netdev_output_userspace(struct dp_netdev *dp, struct ofpbuf *packet,
         }
 
         /* Steal packet data. */
-        ovs_assert(packet->source == OFPBUF_MALLOC);
-        upcall->packet = *packet;
-        ofpbuf_use(packet, NULL, 0);
+        ofpbuf_init(&upcall->packet,0);
+        ofpbuf_reserve_with_tailroom(&upcall->packet,
+                                     DP_NETDEV_HEADROOM, packet->size);
+        memcpy(upcall->packet.data, packet->data, packet->size);
+        upcall->packet.size = packet->size;
 
         seq_change(dp->queue_seq);
 
@@ -1778,7 +1923,7 @@ dp_execute_cb(void *aux_, struct ofpbuf *packet,
     case OVS_ACTION_ATTR_OUTPUT:
         p = dp_netdev_lookup_port(aux->dp, u32_to_odp(nl_attr_get_u32(a)));
         if (p) {
-            netdev_send(p->netdev, packet);
+            netdev_send(p->netdev, packet, may_steal);
         }
         break;
 
@@ -1828,8 +1973,8 @@ const struct dpif_class dpif_netdev_class = {
     dpif_netdev_open,
     dpif_netdev_close,
     dpif_netdev_destroy,
-    NULL,                       /* run */
-    NULL,                       /* wait */
+    dpif_netdev_run,
+    dpif_netdev_wait,
     dpif_netdev_get_stats,
     dpif_netdev_port_add,
     dpif_netdev_port_del,
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
new file mode 100644
index 0000000..06de08c
--- /dev/null
+++ b/lib/netdev-dpdk.c
@@ -0,0 +1,1152 @@
+/*
+ * Copyright (c) 2014 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <pthread.h>
+#include <config.h>
+#include <errno.h>
+#include <sched.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdio.h>
+
+#include "list.h"
+#include "netdev-provider.h"
+#include "netdev-vport.h"
+#include "netdev-dpdk.h"
+#include "odp-util.h"
+#include "ofp-print.h"
+#include "ofpbuf.h"
+#include "ovs-thread.h"
+#include "packets.h"
+#include "shash.h"
+#include "sset.h"
+#include "unaligned.h"
+#include "timeval.h"
+#include "unixctl.h"
+#include "vlog.h"
+
+#include <rte_config.h>
+#include <rte_eal.h>
+#include <rte_debug.h>
+#include <rte_ethdev.h>
+#include <rte_errno.h>
+#include <rte_memzone.h>
+#include <rte_memcpy.h>
+#include <rte_cycles.h>
+#include <rte_spinlock.h>
+#include <rte_launch.h>
+#include <rte_malloc.h>
+
+VLOG_DEFINE_THIS_MODULE(dpdk);
+
+#define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
+#define OVS_VPORT_DPDK "ovs_dpdk"
+
+/*
+ * need to reserve tons of extra space in the mbufs so we can align the
+ * DMA addresses to 4KB.
+ */
+
+#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
+#define MBUF_SIZE(mtu)       (MTU_TO_MAX_LEN(mtu) + (512) + \
+                             sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM)
+
+/* TODO: mempool size should be based on system resources. */
+#define NB_MBUF              (4096 * 64)
+#define MP_CACHE_SZ          (256 * 2)
+#define SOCKET0              0
+
+/* TODO: number of device queues; make this configurable at run time. */
+#define NR_QUEUE             8
+
+/* TODO: Needs per NIC value for these constants. */
+#define RX_PTHRESH 16 /* Default values of RX prefetch threshold reg. */
+#define RX_HTHRESH 16 /* Default values of RX host threshold reg. */
+#define RX_WTHRESH 8 /* Default values of RX write-back threshold reg. */
+
+#define TX_PTHRESH 36 /* Default values of TX prefetch threshold reg. */
+#define TX_HTHRESH 0  /* Default values of TX host threshold reg. */
+#define TX_WTHRESH 0  /* Default values of TX write-back threshold reg. */
+
+static const struct rte_eth_conf port_conf = {
+        .rxmode = {
+                .mq_mode = ETH_MQ_RX_RSS,
+                .split_hdr_size = 0,
+                .header_split   = 0, /* Header Split disabled */
+                .hw_ip_checksum = 0, /* IP checksum offload disabled */
+                .hw_vlan_filter = 0, /* VLAN filtering disabled */
+                .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
+                .hw_strip_crc   = 0, /* CRC not stripped by hardware */
+        },
+        .rx_adv_conf = {
+                .rss_conf = {
+                        .rss_key = NULL,
+                        .rss_hf = ETH_RSS_IPV4_TCP | ETH_RSS_IPV4 | ETH_RSS_IPV6,
+                },
+        },
+        .txmode = {
+                .mq_mode = ETH_MQ_TX_NONE,
+        },
+};
+
+static const struct rte_eth_rxconf rx_conf = {
+        .rx_thresh = {
+                .pthresh = RX_PTHRESH,
+                .hthresh = RX_HTHRESH,
+                .wthresh = RX_WTHRESH,
+        },
+};
+
+static const struct rte_eth_txconf tx_conf = {
+        .tx_thresh = {
+                .pthresh = TX_PTHRESH,
+                .hthresh = TX_HTHRESH,
+                .wthresh = TX_WTHRESH,
+        },
+        .tx_free_thresh = 0, /* Use PMD default values */
+        .tx_rs_thresh = 0, /* Use PMD default values */
+};
+
+enum { MAX_RX_QUEUE_LEN = 64 };
+
+static int rte_eal_init_ret = ENODEV;
+
+static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
+
+/* Contains all 'struct dpdk_dev's. */
+static struct list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
+    = LIST_INITIALIZER(&dpdk_list);
+
+static struct list dpdk_mp_list;
+
+struct dpdk_mp {
+    struct rte_mempool *mp;
+    int mtu;
+    int socket_id;
+    int refcount;
+    struct list list_node OVS_GUARDED_BY(mp_list);
+};
+
+struct netdev_dpdk {
+    struct netdev up;
+    int port_id;
+    int max_packet_len;
+    rte_spinlock_t tx_lock;
+
+    /* Protects all members below. */
+    struct ovs_mutex mutex OVS_ACQ_AFTER(mutex);
+
+    struct dpdk_mp *dpdk_mp;
+    int mtu OVS_GUARDED;
+    int socket_id;
+    int buf_size;
+    struct netdev_stats stats_offset OVS_GUARDED;
+
+    uint8_t hwaddr[ETH_ADDR_LEN] OVS_GUARDED;
+    enum netdev_flags flags OVS_GUARDED;
+
+    rte_spinlock_t lsi_lock;
+    struct rte_eth_link link;
+    int link_reset_cnt;
+
+    /* In dpdk_list. */
+    struct list list_node OVS_GUARDED_BY(mutex);
+};
+
+struct netdev_rx_dpdk {
+    struct netdev_rx up;
+    eth_rx_burst_t drv_rx;
+    void *rx_queues;
+    int port_id;
+    int queue_id;
+    int ofpbuf_cnt;
+    struct ofpbuf ofpbuf[MAX_RX_QUEUE_LEN];
+};
+
+static int netdev_dpdk_construct(struct netdev *);
+static bool
+is_dpdk_class(const struct netdev_class *class)
+{
+    return class->construct == netdev_dpdk_construct;
+}
+
+/* TODO: use dpdk malloc for entire OVS. In fact, huge pages should be used
+ * for all other segments: data, bss and text. */
+
+static void *dpdk_rte_mzalloc(size_t sz)
+{
+    void *ptr;
+
+    ptr = rte_zmalloc(OVS_VPORT_DPDK, sz, OVS_CACHE_LINE_SIZE);
+    if (ptr == NULL) {
+        out_of_memory();
+    }
+    return ptr;
+}
+
+static struct dpdk_mp *
+dpdk_mp_get(int socket_id, int mtu)
+{
+    struct dpdk_mp *dmp = NULL;
+    char mp_name[RTE_MEMPOOL_NAMESIZE];
+
+    LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
+        if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
+            dmp->refcount++;
+            return dmp;
+        }
+    }
+
+    dmp = dpdk_rte_mzalloc(sizeof *dmp);
+    dmp->socket_id = socket_id;
+    dmp->mtu = mtu;
+    dmp->refcount = 1;
+
+    snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d", dmp->mtu);
+    dmp->mp = rte_mempool_create(mp_name, NB_MBUF, MBUF_SIZE(mtu),
+                                 MP_CACHE_SZ,
+                                 sizeof(struct rte_pktmbuf_pool_private),
+                                 rte_pktmbuf_pool_init, NULL,
+                                 rte_pktmbuf_init, NULL,
+                                 socket_id, 0);
+
+    if (dmp->mp == NULL) {
+        return NULL;
+    }
+
+    list_push_back(&dpdk_mp_list, &dmp->list_node);
+    return dmp;
+}
+
+static void
+dpdk_mp_put(struct dpdk_mp *dmp)
+{
+
+    if (!dmp) {
+        return;
+    }
+
+    dmp->refcount--;
+    ovs_assert(dmp->refcount >= 0);
+
+#if 0
+    /* I could not find any API to destroy mp. */
+    if (dmp->refcount == 0) {
+        list_delete(dmp->list_node);
+        /* destroy mp-pool. */
+    }
+#endif
+}
+
+static void
+lsi_event_callback(uint8_t port_id, enum rte_eth_event_type type, void *param)
+{
+    struct netdev_dpdk *dev = (struct netdev_dpdk *) param;
+
+    VLOG_DBG("Event type: %s\n", type == RTE_ETH_EVENT_INTR_LSC ? "LSC interrupt" : "unknown event");
+
+    rte_spinlock_lock(&dev->lsi_lock);
+    rte_eth_link_get_nowait(port_id, &dev->link);
+    dev->link_reset_cnt++;
+    rte_spinlock_unlock(&dev->lsi_lock);
+
+    if (dev->link.link_status) {
+        VLOG_DBG("Port %d Link Up - speed %u Mbps - %s\n",
+                 port_id, (unsigned)dev->link.link_speed,
+                          (dev->link.link_duplex == ETH_LINK_FULL_DUPLEX) ?
+                          ("full-duplex") : ("half-duplex"));
+    } else {
+        VLOG_DBG("Port %d Link Down\n", port_id);
+    }
+}
+
+static int
+dpdk_eth_dev_init(struct netdev_dpdk *dev)
+{
+    struct rte_pktmbuf_pool_private *mbp_priv;
+    struct ether_addr eth_addr;
+    int diag;
+    int i;
+
+    if (dev->port_id < 0 || dev->port_id >= rte_eth_dev_count()) {
+        return -ENODEV;
+    }
+
+    diag = rte_eth_dev_configure(dev->port_id, NR_QUEUE, NR_QUEUE + 1, &port_conf);
+    if (diag) {
+        VLOG_ERR("eth dev config error %d\n", diag);
+        return diag;
+    }
+
+    for (i = 0; i < (NR_QUEUE + 1); i++) {
+        diag = rte_eth_tx_queue_setup(dev->port_id, i, 64, 0, &tx_conf);
+        if (diag) {
+            VLOG_ERR("eth dev tx queue setup error %d\n", diag);
+            return diag;
+        }
+    }
+
+    for (i = 0; i < NR_QUEUE; i++) {
+        /* DO NOT CHANGE NUMBER OF RX DESCRIPTORS */
+        diag = rte_eth_rx_queue_setup(dev->port_id, i, 64, 0, &rx_conf, dev->dpdk_mp->mp);
+        if (diag) {
+            VLOG_ERR("eth dev rx queue setup error %d\n", diag);
+            return diag;
+        }
+    }
+
+    rte_eth_dev_callback_register(dev->port_id, RTE_ETH_EVENT_INTR_LSC,
+                                  lsi_event_callback, dev);
+
+    diag = rte_eth_dev_start(dev->port_id);
+    if (diag) {
+        VLOG_ERR("eth dev start error %d\n", diag);
+        return diag;
+    }
+
+    rte_eth_promiscuous_enable(dev->port_id);
+    rte_eth_allmulticast_enable(dev->port_id);
+
+    memset(&eth_addr, 0x0, sizeof(eth_addr));
+    rte_eth_macaddr_get(dev->port_id, &eth_addr);
+    VLOG_INFO("Port %d: %02X:%02X:%02X:%02X:%02X:%02X\n", dev->port_id,
+              eth_addr.addr_bytes[0],
+              eth_addr.addr_bytes[1],
+              eth_addr.addr_bytes[2],
+              eth_addr.addr_bytes[3],
+              eth_addr.addr_bytes[4],
+              eth_addr.addr_bytes[5]);
+
+    memcpy(dev->hwaddr, eth_addr.addr_bytes, ETH_ADDR_LEN);
+    rte_eth_link_get_nowait(dev->port_id, &dev->link);
+
+    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
+    dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
+
+    dev->flags = NETDEV_UP | NETDEV_PROMISC;
+    return 0;
+}
+
+static struct netdev_dpdk *
+netdev_dpdk_cast(const struct netdev *netdev)
+{
+    return CONTAINER_OF(netdev, struct netdev_dpdk, up);
+}
+
+static struct netdev *
+netdev_dpdk_alloc(void)
+{
+    struct netdev_dpdk *netdev = dpdk_rte_mzalloc(sizeof *netdev);
+    return &netdev->up;
+}
+
+static int
+netdev_dpdk_construct(struct netdev *netdev_)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    unsigned int port_no;
+    char *cport;
+    int err;
+
+    if (rte_eal_init_ret) {
+        return rte_eal_init_ret;
+    }
+
+    ovs_mutex_lock(&dpdk_mutex);
+    cport = netdev_->name + 4; /* Names always start with "dpdk" */
+
+    if (strncmp(netdev_->name, "dpdk", 4)) {
+        err = ENODEV;
+        goto unlock_dpdk;
+    }
+
+    port_no = strtol(cport, 0, 0); /* string must be null terminated */
+
+    rte_spinlock_init(&netdev->lsi_lock);
+    rte_spinlock_init(&netdev->tx_lock);
+    ovs_mutex_init(&netdev->mutex);
+
+    ovs_mutex_lock(&netdev->mutex);
+    netdev->flags = 0;
+
+    netdev->mtu = ETHER_MTU;
+    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
+
+    /* TODO: need to discover device node at run time. */
+    netdev->socket_id = SOCKET0;
+    netdev->port_id = port_no;
+
+    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
+    if (!netdev->dpdk_mp) {
+        err = ENOMEM;
+        goto unlock_dev;
+    }
+
+    err = dpdk_eth_dev_init(netdev);
+    if (err) {
+        goto unlock_dev;
+    }
+    netdev_->nr_rx = NR_QUEUE;
+
+    list_push_back(&dpdk_list, &netdev->list_node);
+
+unlock_dev:
+    ovs_mutex_unlock(&netdev->mutex);
+unlock_dpdk:
+    ovs_mutex_unlock(&dpdk_mutex);
+    return err;
+}
+
+static void
+netdev_dpdk_destruct(struct netdev *netdev_)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+
+    ovs_mutex_lock(&dev->mutex);
+    rte_eth_dev_stop(dev->port_id);
+    rte_eth_dev_callback_unregister(dev->port_id, RTE_ETH_EVENT_INTR_LSC,
+                                    lsi_event_callback, dev);
+
+    ovs_mutex_unlock(&dev->mutex);
+
+    ovs_mutex_lock(&dpdk_mutex);
+    list_remove(&dev->list_node);
+    dpdk_mp_put(dev->dpdk_mp);
+    ovs_mutex_unlock(&dpdk_mutex);
+
+    ovs_mutex_destroy(&dev->mutex);
+}
+
+static void
+netdev_dpdk_dealloc(struct netdev *netdev_)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+
+    rte_free(netdev);
+}
+
+static int
+netdev_dpdk_get_config(const struct netdev *netdev_, struct smap *args)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+    struct rte_eth_dev_info dev_info;
+
+    ovs_mutex_lock(&dev->mutex);
+    rte_eth_dev_info_get(dev->port_id, &dev_info);
+    ovs_mutex_unlock(&dev->mutex);
+
+    smap_add_format(args, "ifindex", "%d", dev->port_id);
+    smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id));
+    smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
+    smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);
+    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
+    smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
+    smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
+    smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
+    smap_add_format(args, "max_hash_mac_addrs", "%u", dev_info.max_hash_mac_addrs);
+    smap_add_format(args, "max_vfs", "%u", dev_info.max_vfs);
+    smap_add_format(args, "max_vmdq_pools", "%u", dev_info.max_vmdq_pools);
+
+    smap_add_format(args, "pci-vendor_id", "0x%x", dev_info.pci_dev->id.vendor_id);
+    smap_add_format(args, "pci-device_id", "0x%x", dev_info.pci_dev->id.device_id);
+
+    return 0;
+}
+
+static struct netdev_rx *
+netdev_dpdk_rx_alloc(int id)
+{
+    struct netdev_rx_dpdk *rx = dpdk_rte_mzalloc(sizeof *rx);
+
+    rx->queue_id = id;
+    ovs_assert(id < NR_QUEUE);
+
+    return &rx->up;
+}
+
+static struct netdev_rx_dpdk *
+netdev_rx_dpdk_cast(const struct netdev_rx *rx)
+{
+    return CONTAINER_OF(rx, struct netdev_rx_dpdk, up);
+}
+
+static int
+netdev_dpdk_rx_construct(struct netdev_rx *rx_)
+{
+    struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(rx->up.netdev);
+    struct rte_eth_dev *eth_dev;
+    int i;
+
+    ovs_mutex_lock(&netdev->mutex);
+    for (i = 0; i < MAX_RX_QUEUE_LEN; i++) {
+        ofpbuf_init(&rx->ofpbuf[i], 0);
+        rx->ofpbuf[i].allocated = netdev->buf_size;
+        rx->ofpbuf[i].source = OFPBUF_DPDK;
+    }
+    rx->ofpbuf_cnt = 0;
+    rx->port_id = netdev->port_id;
+
+    eth_dev = &rte_eth_devices[rx->port_id];
+    rx->drv_rx = eth_dev->rx_pkt_burst;
+    rx->rx_queues = eth_dev->data->rx_queues[rx->queue_id];
+    ovs_mutex_unlock(&netdev->mutex);
+
+    return 0;
+}
+
+static void
+netdev_dpdk_rx_destruct(struct netdev_rx *rx_ OVS_UNUSED)
+{
+}
+
+static void
+netdev_dpdk_rx_dealloc(struct netdev_rx *rx_)
+{
+    struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
+
+    rte_free(rx);
+}
+
+static void
+build_ofpbuf(struct netdev_rx_dpdk *rx, struct ofpbuf *b, struct rte_mbuf *pkt)
+{
+    if (b->private_p) {
+        struct netdev *netdev = rx->up.netdev;
+        struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+        rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **) &b->private_p, 1);
+    }
+
+    b->private_p = pkt;
+    if (!pkt) {
+        return;
+    }
+
+    b->data = pkt->pkt.data;
+    b->base = (char *)b->data - DP_NETDEV_HEADROOM - VLAN_ETH_HEADER_LEN;
+    packet_set_size(b, rte_pktmbuf_data_len(pkt));
+}
+
+static int
+netdev_dpdk_rx_recv(struct netdev_rx *rx_, struct ofpbuf **rpacket, int *c)
+{
+    struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
+    struct rte_mbuf *burst_pkts[MAX_RX_QUEUE_LEN];
+    int nb_rx;
+    int i;
+
+    nb_rx = (*rx->drv_rx)(rx->rx_queues, burst_pkts, MAX_RX_QUEUE_LEN);
+    if (!nb_rx) {
+        for (i = 0; i < rx->ofpbuf_cnt; i++) {
+            build_ofpbuf(rx, &rx->ofpbuf[i], NULL);
+        }
+        rx->ofpbuf_cnt = 0;
+        return EAGAIN;
+    }
+
+    i = 0;
+    do {
+        build_ofpbuf(rx, &rx->ofpbuf[i], burst_pkts[i]);
+
+        i++;
+    } while (i < nb_rx);
+
+    for (; i < rx->ofpbuf_cnt; i++) {
+        build_ofpbuf(rx, &rx->ofpbuf[i], NULL);
+    }
+    rx->ofpbuf_cnt = nb_rx;
+    *rpacket = rx->ofpbuf;
+    *c = nb_rx;
+
+    return 0;
+}
+
+static int
+netdev_dpdk_rx_drain(struct netdev_rx *rx_)
+{
+    struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
+    int pending;
+    int i;
+
+    pending = rx->ofpbuf_cnt;
+    if (pending) {
+        for (i = 0; i < pending; i++) {
+            build_ofpbuf(rx, &rx->ofpbuf[i], NULL);
+        }
+        rx->ofpbuf_cnt = 0;
+        return 0;
+    }
+
+    return 0;
+}
+
+/* Tx function.  Copies 'buf' into a newly allocated mbuf and transmits it. */
+static int
+dpdk_do_tx_copy(struct netdev *netdev, char *buf, int size)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    struct rte_mbuf *pkt;
+    uint32_t nb_tx = 0;
+
+    pkt = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
+    if (!pkt) {
+        return 0;
+    }
+
+    /* We have to do a copy for now */
+    memcpy(pkt->pkt.data, buf, size);
+
+    rte_pktmbuf_data_len(pkt) = size;
+    rte_pktmbuf_pkt_len(pkt) = size;
+
+    rte_spinlock_lock(&dev->tx_lock);
+    nb_tx = rte_eth_tx_burst(dev->port_id, NR_QUEUE, &pkt, 1);
+    rte_spinlock_unlock(&dev->tx_lock);
+
+    if (nb_tx != 1) {
+        /* free buffers if we couldn't transmit packets */
+        rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **)&pkt, 1);
+    }
+    return nb_tx;
+}
+
+static int
+netdev_dpdk_send(struct netdev *netdev,
+                 struct ofpbuf *ofpbuf, bool may_steal)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    if (ofpbuf->size > dev->max_packet_len) {
+        VLOG_ERR("Packet too big: size %d, max_packet_len %d",
+                 (int) ofpbuf->size, dev->max_packet_len);
+        return E2BIG;
+    }
+
+    rte_prefetch0(&ofpbuf->private_p);
+    if (!may_steal ||
+        !ofpbuf->private_p || ofpbuf->source != OFPBUF_DPDK) {
+        dpdk_do_tx_copy(netdev, (char *) ofpbuf->data, ofpbuf->size);
+    } else {
+        struct rte_mbuf *pkt;
+        uint32_t nb_tx;
+        int qid;
+
+        pkt = ofpbuf->private_p;
+        ofpbuf->private_p = NULL;
+        rte_pktmbuf_data_len(pkt) = ofpbuf->size;
+        rte_pktmbuf_pkt_len(pkt) = ofpbuf->size;
+
+        /* TODO: TX batching. */
+        qid = rte_lcore_id() % NR_QUEUE;
+        nb_tx = rte_eth_tx_burst(dev->port_id, qid, &pkt, 1);
+        if (nb_tx != 1) {
+            struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+            rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **)&pkt, 1);
+            VLOG_ERR("TX error, zero packets sent");
+        }
+    }
+    return 0;
+}
+
+static int
+netdev_dpdk_set_etheraddr(struct netdev *netdev,
+                          const uint8_t mac[ETH_ADDR_LEN])
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    if (!eth_addr_equals(dev->hwaddr, mac)) {
+        memcpy(dev->hwaddr, mac, ETH_ADDR_LEN);
+    }
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_get_etheraddr(const struct netdev *netdev,
+                          uint8_t mac[ETH_ADDR_LEN])
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    memcpy(mac, dev->hwaddr, ETH_ADDR_LEN);
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    *mtup = dev->mtu;
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int old_mtu, err;
+    struct dpdk_mp *old_mp;
+    struct dpdk_mp *mp;
+
+    ovs_mutex_lock(&dpdk_mutex);
+    ovs_mutex_lock(&dev->mutex);
+    if (dev->mtu == mtu) {
+        err = 0;
+        goto out;
+    }
+
+    mp = dpdk_mp_get(dev->socket_id, mtu);
+    if (!mp) {
+        err = ENOMEM;
+        goto out;
+    }
+
+    rte_eth_dev_stop(dev->port_id);
+
+    old_mtu = dev->mtu;
+    old_mp = dev->dpdk_mp;
+    dev->dpdk_mp = mp;
+    dev->mtu = mtu;
+    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
+
+    err = dpdk_eth_dev_init(dev);
+    if (err) {
+
+        dpdk_mp_put(mp);
+        dev->mtu = old_mtu;
+        dev->dpdk_mp = old_mp;
+        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
+        dpdk_eth_dev_init(dev);
+        goto out;
+    }
+
+    dpdk_mp_put(old_mp);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    ovs_mutex_unlock(&dpdk_mutex);
+    return err;
+}
+
+static int
+netdev_dpdk_get_stats(const struct netdev *netdev, struct netdev_stats *stats)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    struct rte_eth_stats rte_stats;
+
+    ovs_mutex_lock(&dev->mutex);
+    rte_eth_stats_get(dev->port_id, &rte_stats);
+    /* 'stats_offset' is guarded by 'mutex', so copy it before unlocking. */
+    *stats = dev->stats_offset;
+    ovs_mutex_unlock(&dev->mutex);
+
+    stats->rx_packets += rte_stats.ipackets;
+    stats->tx_packets += rte_stats.opackets;
+    stats->rx_bytes += rte_stats.ibytes;
+    stats->tx_bytes += rte_stats.obytes;
+    stats->rx_errors += rte_stats.ierrors;
+    stats->tx_errors += rte_stats.oerrors;
+    stats->multicast += rte_stats.imcasts;
+
+    return 0;
+}
+
+static int
+netdev_dpdk_set_stats(struct netdev *netdev, const struct netdev_stats *stats)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    dev->stats_offset = *stats;
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_get_features(const struct netdev *netdev_,
+                         enum netdev_features *current,
+                         enum netdev_features *advertised OVS_UNUSED,
+                         enum netdev_features *supported OVS_UNUSED,
+                         enum netdev_features *peer OVS_UNUSED)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+    struct rte_eth_link link;
+
+    rte_spinlock_lock(&dev->lsi_lock);
+    link = dev->link;
+    rte_spinlock_unlock(&dev->lsi_lock);
+
+    if (link.link_duplex == ETH_LINK_AUTONEG_DUPLEX) {
+        if (link.link_speed == ETH_LINK_SPEED_AUTONEG) {
+            *current = NETDEV_F_AUTONEG;
+        }
+    } else if (link.link_duplex == ETH_LINK_HALF_DUPLEX) {
+        if (link.link_speed == ETH_LINK_SPEED_10) {
+            *current = NETDEV_F_10MB_HD;
+        }
+        if (link.link_speed == ETH_LINK_SPEED_100) {
+            *current = NETDEV_F_100MB_HD;
+        }
+        if (link.link_speed == ETH_LINK_SPEED_1000) {
+            *current = NETDEV_F_1GB_HD;
+        }
+    } else if (link.link_duplex == ETH_LINK_FULL_DUPLEX) {
+        if (link.link_speed == ETH_LINK_SPEED_10) {
+            *current = NETDEV_F_10MB_FD;
+        }
+        if (link.link_speed == ETH_LINK_SPEED_100) {
+            *current = NETDEV_F_100MB_FD;
+        }
+        if (link.link_speed == ETH_LINK_SPEED_1000) {
+            *current = NETDEV_F_1GB_FD;
+        }
+        if (link.link_speed == ETH_LINK_SPEED_10000) {
+            *current = NETDEV_F_10GB_FD;
+        }
+    }
+
+    return 0;
+}
+
+static int
+netdev_dpdk_get_ifindex(const struct netdev *netdev)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int ifindex;
+
+    ovs_mutex_lock(&dev->mutex);
+    ifindex = dev->port_id;
+    ovs_mutex_unlock(&dev->mutex);
+
+    return ifindex;
+}
+
+static int
+netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+
+    rte_spinlock_lock(&dev->lsi_lock);
+    *carrier = dev->link.link_status;
+    rte_spinlock_unlock(&dev->lsi_lock);
+
+    return 0;
+}
+
+static long long int
+netdev_dpdk_get_carrier_resets(const struct netdev *netdev_)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+    long long int carrier_resets;
+
+    rte_spinlock_lock(&dev->lsi_lock);
+    carrier_resets = dev->link_reset_cnt;
+    rte_spinlock_unlock(&dev->lsi_lock);
+
+    return carrier_resets;
+}
+
+static int
+netdev_dpdk_set_miimon(struct netdev *netdev_ OVS_UNUSED,
+                       long long int interval OVS_UNUSED)
+{
+    return 0;
+}
+
+static int
+netdev_dpdk_update_flags__(struct netdev_dpdk *dev,
+                           enum netdev_flags off, enum netdev_flags on,
+                           enum netdev_flags *old_flagsp)
+    OVS_REQUIRES(dev->mutex)
+{
+    int err;
+
+    if ((off | on) & ~(NETDEV_UP | NETDEV_PROMISC)) {
+        return EINVAL;
+    }
+
+    *old_flagsp = dev->flags;
+    dev->flags |= on;
+    dev->flags &= ~off;
+
+    if (dev->flags == *old_flagsp) {
+        return 0;
+    }
+
+    rte_eth_dev_stop(dev->port_id);
+
+    if (dev->flags & NETDEV_UP) {
+        err = rte_eth_dev_start(dev->port_id);
+        if (err)
+            return err;
+    }
+
+    if (dev->flags & NETDEV_PROMISC) {
+        rte_eth_promiscuous_enable(dev->port_id);
+        rte_eth_allmulticast_enable(dev->port_id);
+    }
+
+    return 0;
+}
+
+static int
+netdev_dpdk_update_flags(struct netdev *netdev_,
+                         enum netdev_flags off, enum netdev_flags on,
+                         enum netdev_flags *old_flagsp)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    int error;
+
+    ovs_mutex_lock(&netdev->mutex);
+    error = netdev_dpdk_update_flags__(netdev, off, on, old_flagsp);
+    ovs_mutex_unlock(&netdev->mutex);
+
+    return error;
+}
+
+static int
+netdev_dpdk_get_status(const struct netdev *netdev_, struct smap *smap)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+    struct rte_eth_dev_info dev_info;
+
+    if (dev->port_id < 0)
+        return ENODEV;
+
+    ovs_mutex_lock(&dev->mutex);
+    rte_eth_dev_info_get(dev->port_id, &dev_info);
+    ovs_mutex_unlock(&dev->mutex);
+
+    smap_add_format(smap, "driver_name", "%s", dev_info.driver_name);
+    return 0;
+}
+
+\f
+/* Helper functions. */
+
+static void
+netdev_dpdk_set_admin_state__(struct netdev_dpdk *dev, bool admin_state)
+    OVS_REQUIRES(dev->mutex)
+{
+    enum netdev_flags old_flags;
+
+    if (admin_state) {
+        netdev_dpdk_update_flags__(dev, 0, NETDEV_UP, &old_flags);
+    } else {
+        netdev_dpdk_update_flags__(dev, NETDEV_UP, 0, &old_flags);
+    }
+}
+
+static void
+netdev_dpdk_set_admin_state(struct unixctl_conn *conn, int argc,
+                            const char *argv[], void *aux OVS_UNUSED)
+{
+    bool up;
+
+    if (!strcasecmp(argv[argc - 1], "up")) {
+        up = true;
+    } else if (!strcasecmp(argv[argc - 1], "down")) {
+        up = false;
+    } else {
+        unixctl_command_reply_error(conn, "Invalid Admin State");
+        return;
+    }
+
+    if (argc > 2) {
+        struct netdev *netdev = netdev_from_name(argv[1]);
+        if (netdev && is_dpdk_class(netdev->netdev_class)) {
+            struct netdev_dpdk *dpdk_dev = netdev_dpdk_cast(netdev);
+
+            ovs_mutex_lock(&dpdk_dev->mutex);
+            netdev_dpdk_set_admin_state__(dpdk_dev, up);
+            ovs_mutex_unlock(&dpdk_dev->mutex);
+
+            netdev_close(netdev);
+        } else {
+            unixctl_command_reply_error(conn, "Unknown DPDK Interface");
+            netdev_close(netdev);
+            return;
+        }
+    } else {
+        struct netdev_dpdk *netdev;
+
+        ovs_mutex_lock(&dpdk_mutex);
+        LIST_FOR_EACH (netdev, list_node, &dpdk_list) {
+            ovs_mutex_lock(&netdev->mutex);
+            netdev_dpdk_set_admin_state__(netdev, up);
+            ovs_mutex_unlock(&netdev->mutex);
+        }
+        ovs_mutex_unlock(&dpdk_mutex);
+    }
+    unixctl_command_reply(conn, "OK");
+}
+
+static int
+dpdk_class_init(void)
+{
+    int result;
+
+    if (rte_eal_init_ret) {
+        return 0;
+    }
+
+    result = rte_pmd_init_all();
+    if (result) {
+        VLOG_ERR("Cannot init xnic PMD\n");
+        return result;
+    }
+
+    result = rte_eal_pci_probe();
+    if (result) {
+        VLOG_ERR("Cannot probe PCI\n");
+        return result;
+    }
+
+    if (rte_eth_dev_count() < 1) {
+        VLOG_ERR("No Ethernet devices found. Try assigning ports to UIO.\n");
+    }
+
+    VLOG_INFO("Ethernet Device Count: %d\n", (int)rte_eth_dev_count());
+
+    list_init(&dpdk_list);
+    list_init(&dpdk_mp_list);
+
+    unixctl_command_register("netdev-dpdk/set-admin-state",
+                             "[netdev] up|down", 1, 2,
+                             netdev_dpdk_set_admin_state, NULL);
+
+    return 0;
+}
+
+static void
+dpdk_class_setup_thread(int tid)
+{
+    cpu_set_t cpuset;
+    int err;
+
+    /* Setup thread for DPDK library. */
+    RTE_PER_LCORE(_lcore_id) = tid % NR_QUEUE;
+
+    CPU_ZERO(&cpuset);
+    CPU_SET(rte_lcore_id(), &cpuset);
+    err = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
+    if (err) {
+        VLOG_ERR("thread affinity error %d\n", err);
+    }
+}
+
+static struct netdev_class netdev_dpdk_class = {
+    "dpdk",
+    dpdk_class_init,            /* init */
+    NULL,                       /* netdev_dpdk_run */
+    NULL,                       /* netdev_dpdk_wait */
+
+    netdev_dpdk_alloc,
+    netdev_dpdk_construct,
+    netdev_dpdk_destruct,
+    netdev_dpdk_dealloc,
+    dpdk_class_setup_thread,
+    netdev_dpdk_get_config,
+    NULL,                       /* netdev_dpdk_set_config */
+    NULL,                       /* get_tunnel_config */
+
+    netdev_dpdk_send,           /* send */
+    NULL,                       /* send_wait */
+
+    netdev_dpdk_set_etheraddr,
+    netdev_dpdk_get_etheraddr,
+    netdev_dpdk_get_mtu,
+    netdev_dpdk_set_mtu,
+    netdev_dpdk_get_ifindex,
+    netdev_dpdk_get_carrier,
+    netdev_dpdk_get_carrier_resets,
+    netdev_dpdk_set_miimon,
+    netdev_dpdk_get_stats,
+    netdev_dpdk_set_stats,
+    netdev_dpdk_get_features,
+    NULL,                       /* set_advertisements */
+
+    NULL,                       /* set_policing */
+    NULL,                       /* get_qos_types */
+    NULL,                       /* get_qos_capabilities */
+    NULL,                       /* get_qos */
+    NULL,                       /* set_qos */
+    NULL,                       /* get_queue */
+    NULL,                       /* set_queue */
+    NULL,                       /* delete_queue */
+    NULL,                       /* get_queue_stats */
+    NULL,                       /* queue_dump_start */
+    NULL,                       /* queue_dump_next */
+    NULL,                       /* queue_dump_done */
+    NULL,                       /* dump_queue_stats */
+
+    NULL,                       /* get_in4 */
+    NULL,                       /* set_in4 */
+    NULL,                       /* get_in6 */
+    NULL,                       /* add_router */
+    NULL,                       /* get_next_hop */
+    netdev_dpdk_get_status,
+    NULL,                       /* arp_lookup */
+
+    netdev_dpdk_update_flags,
+
+    netdev_dpdk_rx_alloc,
+    netdev_dpdk_rx_construct,
+    netdev_dpdk_rx_destruct,
+    netdev_dpdk_rx_dealloc,
+    netdev_dpdk_rx_recv,
+    NULL,                       /* rx_wait */
+    netdev_dpdk_rx_drain,
+};
+
+int
+dpdk_init(int argc, char **argv)
+{
+    int result;
+
+    if (argc < 2 || strcmp(argv[1], "--dpdk"))
+        return 0;
+    argc--;
+    argv++;
+    /* Make sure things are initialized ... */
+    if ((result = rte_eal_init(argc, argv)) < 0)
+        rte_panic("Cannot init EAL\n");
+    rte_memzone_dump();
+    rte_eal_init_ret = 0;
+    return result;
+}
+
+void
+netdev_dpdk_register(void)
+{
+    netdev_register_provider(&netdev_dpdk_class);
+}
diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
new file mode 100644
index 0000000..5cf5626
--- /dev/null
+++ b/lib/netdev-dpdk.h
@@ -0,0 +1,7 @@
+#ifndef NETDEV_DPDK_H
+#define NETDEV_DPDK_H
+
+int dpdk_init(int argc, char **argv);
+void netdev_dpdk_register(void);
+
+#endif
diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c
index 0f93363..2cb3c9b 100644
--- a/lib/netdev-dummy.c
+++ b/lib/netdev-dummy.c
@@ -103,6 +103,7 @@ struct netdev_dummy {
     FILE *tx_pcap, *rx_pcap OVS_GUARDED;
 
     struct list rxes OVS_GUARDED; /* List of child "netdev_rx_dummy"s. */
+    struct ofpbuf buffer;
 };
 
 /* Max 'recv_queue_len' in struct netdev_dummy. */
@@ -695,7 +696,7 @@ netdev_dummy_set_config(struct netdev *netdev_, const struct smap *args)
 }
 
 static struct netdev_rx *
-netdev_dummy_rx_alloc(void)
+netdev_dummy_rx_alloc(int id OVS_UNUSED)
 {
     struct netdev_rx_dummy *rx = xzalloc(sizeof *rx);
     return &rx->up;
@@ -739,12 +740,12 @@ netdev_dummy_rx_dealloc(struct netdev_rx *rx_)
 }
 
 static int
-netdev_dummy_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer)
+netdev_dummy_rx_recv(struct netdev_rx *rx_, struct ofpbuf **rpacket, int *c)
 {
     struct netdev_rx_dummy *rx = netdev_rx_dummy_cast(rx_);
     struct netdev_dummy *netdev = netdev_dummy_cast(rx->up.netdev);
+    struct ofpbuf *buffer = &netdev->buffer;
     struct ofpbuf *packet;
-    int retval;
 
     ovs_mutex_lock(&netdev->mutex);
     if (!list_is_empty(&rx->recv_queue)) {
@@ -758,22 +759,20 @@ netdev_dummy_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer)
     if (!packet) {
         return EAGAIN;
     }
+    ovs_mutex_lock(&netdev->mutex);
+    netdev->stats.rx_packets++;
+    netdev->stats.rx_bytes += packet->size;
+    ovs_mutex_unlock(&netdev->mutex);
 
-    if (packet->size <= ofpbuf_tailroom(buffer)) {
-        memcpy(buffer->data, packet->data, packet->size);
-        buffer->size += packet->size;
-        retval = 0;
-
-        ovs_mutex_lock(&netdev->mutex);
-        netdev->stats.rx_packets++;
-        netdev->stats.rx_bytes += packet->size;
-        ovs_mutex_unlock(&netdev->mutex);
-    } else {
-        retval = EMSGSIZE;
-    }
-    ofpbuf_delete(packet);
+    ofpbuf_clear(buffer);
+    ofpbuf_reserve_with_tailroom(buffer, DP_NETDEV_HEADROOM, packet->size);
+    memcpy(buffer->data, packet->data, packet->size);
+    packet_set_size(buffer, packet->size);
+    ofpbuf_delete(packet);
 
-    return retval;
+    *rpacket = buffer;
+    *c = 1;
+    return 0;
 }
 
 static void
@@ -809,9 +807,12 @@ netdev_dummy_rx_drain(struct netdev_rx *rx_)
 }
 
 static int
-netdev_dummy_send(struct netdev *netdev, const void *buffer, size_t size)
+netdev_dummy_send(struct netdev *netdev,
+                  struct ofpbuf *pkt, bool may_steal OVS_UNUSED)
 {
     struct netdev_dummy *dev = netdev_dummy_cast(netdev);
+    const void *buffer = pkt->data;
+    size_t size = pkt->size;
 
     if (size < ETH_HEADER_LEN) {
         return EMSGSIZE;
@@ -987,6 +988,7 @@ static const struct netdev_class dummy_class = {
     netdev_dummy_construct,
     netdev_dummy_destruct,
     netdev_dummy_dealloc,
+    NULL,                       /* setup_thread */
     netdev_dummy_get_config,
     netdev_dummy_set_config,
     NULL,                       /* get_tunnel_config */
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index e756d88..73ba2c2 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -426,6 +426,7 @@ struct netdev_linux {
 
 struct netdev_rx_linux {
     struct netdev_rx up;
+    struct ofpbuf pkt;
     bool is_tap;
     int fd;
 };
@@ -462,6 +463,7 @@ static int af_packet_sock(void);
 static bool netdev_linux_miimon_enabled(void);
 static void netdev_linux_miimon_run(void);
 static void netdev_linux_miimon_wait(void);
+static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
 
 static bool
 is_netdev_linux_class(const struct netdev_class *netdev_class)
@@ -773,7 +775,7 @@ netdev_linux_dealloc(struct netdev *netdev_)
 }
 
 static struct netdev_rx *
-netdev_linux_rx_alloc(void)
+netdev_linux_rx_alloc(int id OVS_UNUSED)
 {
     struct netdev_rx_linux *rx = xzalloc(sizeof *rx);
     return &rx->up;
@@ -985,10 +987,24 @@ netdev_linux_rx_recv_tap(int fd, struct ofpbuf *buffer)
 }
 
 static int
-netdev_linux_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer)
+netdev_linux_rx_recv(struct netdev_rx *rx_, struct ofpbuf **rpacket, int *c)
 {
     struct netdev_rx_linux *rx = netdev_rx_linux_cast(rx_);
-    int retval;
+    struct netdev *netdev = rx->up.netdev;
+    struct ofpbuf *buffer;
+    ssize_t retval;
+    int mtu;
+    int buf_size;
+
+    if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
+        mtu = ETH_PAYLOAD_MAX;
+    }
+    buf_size = DP_NETDEV_HEADROOM + VLAN_ETH_HEADER_LEN + mtu;
+
+    buffer = &rx->pkt;
+    ofpbuf_clear(buffer);
+
+    ofpbuf_reserve_with_tailroom(buffer, DP_NETDEV_HEADROOM, buf_size);
 
     retval = (rx->is_tap
               ? netdev_linux_rx_recv_tap(rx->fd, buffer)
@@ -996,8 +1012,11 @@ netdev_linux_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer)
     if (retval && retval != EAGAIN && retval != EMSGSIZE) {
         VLOG_WARN_RL(&rl, "error receiving Ethernet packet on %s: %s",
                      ovs_strerror(errno), netdev_rx_get_name(rx_));
+    } else {
+        packet_set_size(buffer, buffer->size);
+        *rpacket = buffer;
+        *c = 1;
     }
-
     return retval;
 }
 
@@ -1036,8 +1055,11 @@ netdev_linux_rx_drain(struct netdev_rx *rx_)
  * The kernel maintains a packet transmission queue, so the caller is not
  * expected to do additional queuing of packets. */
 static int
-netdev_linux_send(struct netdev *netdev_, const void *data, size_t size)
+netdev_linux_send(struct netdev *netdev_, struct ofpbuf *pkt, bool may_steal OVS_UNUSED)
 {
+    const void *data = pkt->data;
+    size_t size = pkt->size;
+
     for (;;) {
         ssize_t retval;
 
@@ -2677,6 +2699,7 @@ netdev_linux_update_flags(struct netdev *netdev_, enum netdev_flags off,
     CONSTRUCT,                                                  \
     netdev_linux_destruct,                                      \
     netdev_linux_dealloc,                                       \
+    NULL,                       /* setup_thread */              \
     NULL,                       /* get_config */                \
     NULL,                       /* set_config */                \
     NULL,                       /* get_tunnel_config */         \
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index 673d3ab..0c9f347 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -33,11 +33,11 @@ extern "C" {
  * Network device implementations may read these members but should not modify
  * them. */
 struct netdev {
+    int nr_rx;
     /* The following do not change during the lifetime of a struct netdev. */
     char *name;                         /* Name of network device. */
     const struct netdev_class *netdev_class; /* Functions to control
                                                 this device. */
-
     /* The following are protected by 'netdev_mutex' (internal to netdev.c). */
     int ref_cnt;                        /* Times this devices was opened. */
     struct shash_node *node;            /* Pointer to element in global map. */
@@ -203,6 +203,10 @@ struct netdev_class {
     void (*destruct)(struct netdev *);
     void (*dealloc)(struct netdev *);
 
+    /*
+     * Some platforms need to set up per-thread state. */
+    void (*setup_thread)(int thread_id);
+
     /* Fetches the device 'netdev''s configuration, storing it in 'args'.
      * The caller owns 'args' and pre-initializes it to an empty smap.
      *
@@ -241,7 +245,7 @@ struct netdev_class {
      * network device from being usefully used by the netdev-based "userspace
      * datapath".  It will also prevent the OVS implementation of bonding from
      * working properly over 'netdev'.) */
-    int (*send)(struct netdev *netdev, const void *buffer, size_t size);
+    int (*send)(struct netdev *, struct ofpbuf *buffer, bool may_steal);
 
     /* Registers with the poll loop to wake up from the next call to
      * poll_block() when the packet transmission queue for 'netdev' has
@@ -629,7 +633,7 @@ struct netdev_class {
 
     /* Life-cycle functions for a netdev_rx.  See the large comment above on
      * struct netdev_class. */
-    struct netdev_rx *(*rx_alloc)(void);
+    struct netdev_rx *(*rx_alloc)(int id);
     int (*rx_construct)(struct netdev_rx *);
     void (*rx_destruct)(struct netdev_rx *);
     void (*rx_dealloc)(struct netdev_rx *);
@@ -655,7 +659,7 @@ struct netdev_class {
      *
      * This function may be set to null if it would always return EOPNOTSUPP
      * anyhow. */
-    int (*rx_recv)(struct netdev_rx *rx, struct ofpbuf *buffer);
+    int (*rx_recv)(struct netdev_rx *rx, struct ofpbuf **pkt, int *c);
 
     /* Registers with the poll loop to wake up from the next call to
      * poll_block() when a packet is ready to be received with netdev_rx_recv()
@@ -672,6 +676,7 @@ int netdev_unregister_provider(const char *type);
 extern const struct netdev_class netdev_linux_class;
 extern const struct netdev_class netdev_internal_class;
 extern const struct netdev_class netdev_tap_class;
+extern const struct netdev_class netdev_pdk_class;
 #if defined(__FreeBSD__) || defined(__NetBSD__)
 extern const struct netdev_class netdev_bsd_class;
 #endif
diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
index 165c1c6..ad9d2a5 100644
--- a/lib/netdev-vport.c
+++ b/lib/netdev-vport.c
@@ -686,6 +686,7 @@ get_stats(const struct netdev *netdev, struct netdev_stats *stats)
     netdev_vport_construct,                                 \
     netdev_vport_destruct,                                  \
     netdev_vport_dealloc,                                   \
+    NULL,                       /* setup_thread */          \
     GET_CONFIG,                                             \
     SET_CONFIG,                                             \
     GET_TUNNEL_CONFIG,                                      \
diff --git a/lib/netdev.c b/lib/netdev.c
index 8e62421..f688c5c 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -91,6 +91,11 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 static void restore_all_flags(void *aux OVS_UNUSED);
 void update_device_args(struct netdev *, const struct shash *args);
 
+int netdev_nr_rx(const struct netdev *netdev)
+{
+    return netdev->nr_rx;
+}
+
 static void
 netdev_initialize(void)
     OVS_EXCLUDED(netdev_class_rwlock, netdev_mutex)
@@ -107,6 +112,9 @@ netdev_initialize(void)
         netdev_register_provider(&netdev_tap_class);
         netdev_vport_tunnel_register();
 #endif
+#ifdef DPDK_NETDEV
+        netdev_dpdk_register();
+#endif
 #if defined(__FreeBSD__) || defined(__NetBSD__)
         netdev_register_provider(&netdev_tap_class);
         netdev_register_provider(&netdev_bsd_class);
@@ -326,6 +334,7 @@ netdev_open(const char *name, const char *type, struct netdev **netdevp)
                 memset(netdev, 0, sizeof *netdev);
                 netdev->netdev_class = rc->class;
                 netdev->name = xstrdup(name);
+                netdev->nr_rx = 1;
                 netdev->node = shash_add(&netdev_shash, name, netdev);
                 list_init(&netdev->saved_flags_list);
 
@@ -481,6 +490,20 @@ netdev_close(struct netdev *netdev)
     }
 }
 
+void
+netdev_setup_thread(int id)
+{
+    struct netdev_registered_class *rc;
+
+    ovs_rwlock_rdlock(&netdev_class_rwlock);
+    HMAP_FOR_EACH (rc, hmap_node, &netdev_classes) {
+        if (rc->class->setup_thread) {
+            rc->class->setup_thread(id);
+        }
+    }
+    ovs_rwlock_unlock(&netdev_class_rwlock);
+}
+
 /* Parses 'netdev_name_', which is of the form [type@]name into its component
  * pieces.  'name' and 'type' must be freed by the caller. */
 void
@@ -508,13 +531,13 @@ netdev_parse_name(const char *netdev_name_, char **name, char **type)
  * Some kinds of network devices might not support receiving packets.  This
  * function returns EOPNOTSUPP in that case.*/
 int
-netdev_rx_open(struct netdev *netdev, struct netdev_rx **rxp)
+netdev_rx_open(struct netdev *netdev, struct netdev_rx **rxp, int id)
     OVS_EXCLUDED(netdev_mutex)
 {
     int error;
 
     if (netdev->netdev_class->rx_alloc) {
-        struct netdev_rx *rx = netdev->netdev_class->rx_alloc();
+        struct netdev_rx *rx = netdev->netdev_class->rx_alloc(id);
         if (rx) {
             rx->netdev = netdev;
             error = netdev->netdev_class->rx_construct(rx);
@@ -575,23 +598,18 @@ netdev_rx_close(struct netdev_rx *rx)
  * This function may be set to null if it would always return EOPNOTSUPP
  * anyhow. */
 int
-netdev_rx_recv(struct netdev_rx *rx, struct ofpbuf *buffer)
+netdev_rx_recv(struct netdev_rx *rx, struct ofpbuf **buffer, int *c)
 {
     int retval;
 
-    ovs_assert(buffer->size == 0);
-    ovs_assert(ofpbuf_tailroom(buffer) >= ETH_TOTAL_MIN);
+    retval = rx->netdev->netdev_class->rx_recv(rx, buffer, c);
+    return retval;
+}
 
-    retval = rx->netdev->netdev_class->rx_recv(rx, buffer);
-    if (!retval) {
-        COVERAGE_INC(netdev_received);
-        if (buffer->size < ETH_TOTAL_MIN) {
-            ofpbuf_put_zeros(buffer, ETH_TOTAL_MIN - buffer->size);
-        }
-        return 0;
-    } else {
-        return retval;
-    }
+bool
+netdev_is_pmd(const struct netdev *netdev)
+{
+    return !strcmp(netdev->netdev_class->type, "dpdk");
 }
 
 /* Arranges for poll_block() to wake up when a packet is ready to be received
@@ -624,12 +642,12 @@ netdev_rx_drain(struct netdev_rx *rx)
  * Some network devices may not implement support for this function.  In such
  * cases this function will always return EOPNOTSUPP. */
 int
-netdev_send(struct netdev *netdev, const struct ofpbuf *buffer)
+netdev_send(struct netdev *netdev, struct ofpbuf *buffer, bool may_steal)
 {
     int error;
 
     error = (netdev->netdev_class->send
-             ? netdev->netdev_class->send(netdev, buffer->data, buffer->size)
+             ? netdev->netdev_class->send(netdev, buffer, may_steal)
              : EOPNOTSUPP);
     if (!error) {
         COVERAGE_INC(netdev_sent);
diff --git a/lib/netdev.h b/lib/netdev.h
index 410c35b..d5a7793 100644
--- a/lib/netdev.h
+++ b/lib/netdev.h
@@ -21,6 +21,7 @@
 #include <stddef.h>
 #include <stdint.h>
 #include "openvswitch/types.h"
+#include "packets.h"
 
 #ifdef  __cplusplus
 extern "C" {
@@ -138,6 +139,7 @@ bool netdev_is_reserved_name(const char *name);
 int netdev_open(const char *name, const char *type, struct netdev **);
 struct netdev *netdev_ref(const struct netdev *);
 void netdev_close(struct netdev *);
+void netdev_setup_thread(int id);
 
 void netdev_parse_name(const char *netdev_name, char **name, char **type);
 
@@ -156,17 +158,18 @@ int netdev_set_mtu(const struct netdev *, int mtu);
 int netdev_get_ifindex(const struct netdev *);
 
 /* Packet reception. */
-int netdev_rx_open(struct netdev *, struct netdev_rx **);
+int netdev_rx_open(struct netdev *, struct netdev_rx **, int id);
 void netdev_rx_close(struct netdev_rx *);
 
 const char *netdev_rx_get_name(const struct netdev_rx *);
 
-int netdev_rx_recv(struct netdev_rx *, struct ofpbuf *);
+bool netdev_is_pmd(const struct netdev *netdev);
+int netdev_rx_recv(struct netdev_rx *, struct ofpbuf **, int *);
 void netdev_rx_wait(struct netdev_rx *);
 int netdev_rx_drain(struct netdev_rx *);
-
+int netdev_nr_rx(const struct netdev *netdev);
 /* Packet transmission. */
-int netdev_send(struct netdev *, const struct ofpbuf *);
+int netdev_send(struct netdev *, struct ofpbuf *, bool may_steal);
 void netdev_send_wait(struct netdev *);
 
 /* Hardware address. */
@@ -198,6 +201,10 @@ enum netdev_features {
     NETDEV_F_PAUSE_ASYM = 1 << 15, /* Asymmetric pause. */
 };
 
+/* Enough headroom to add a vlan tag, plus an extra 2 bytes to allow IP
+ * headers to be aligned on a 4-byte boundary.  */
+enum { DP_NETDEV_HEADROOM = 2 + VLAN_HEADER_LEN };
+
 int netdev_get_features(const struct netdev *,
                         enum netdev_features *current,
                         enum netdev_features *advertised,
diff --git a/lib/ofpbuf.c b/lib/ofpbuf.c
index 0eed428..249fbaa 100644
--- a/lib/ofpbuf.c
+++ b/lib/ofpbuf.c
@@ -265,6 +265,9 @@ ofpbuf_resize__(struct ofpbuf *b, size_t new_headroom, size_t new_tailroom)
     new_allocated = new_headroom + b->size + new_tailroom;
 
     switch (b->source) {
+    case OFPBUF_DPDK:
+        OVS_NOT_REACHED();
+
     case OFPBUF_MALLOC:
         if (new_headroom == ofpbuf_headroom(b)) {
             new_base = xrealloc(b->base, new_allocated);
@@ -343,7 +346,7 @@ ofpbuf_prealloc_headroom(struct ofpbuf *b, size_t size)
 void
 ofpbuf_trim(struct ofpbuf *b)
 {
-    if (b->source == OFPBUF_MALLOC
+    if ((b->source == OFPBUF_MALLOC || b->source == OFPBUF_DPDK)
         && (ofpbuf_headroom(b) || ofpbuf_tailroom(b))) {
         ofpbuf_resize__(b, 0, 0);
     }
@@ -562,6 +565,8 @@ void *
 ofpbuf_steal_data(struct ofpbuf *b)
 {
     void *p;
+    ovs_assert(b->source != OFPBUF_DPDK);
+
     if (b->source == OFPBUF_MALLOC && b->data == b->base) {
         p = b->data;
     } else {
diff --git a/lib/ofpbuf.h b/lib/ofpbuf.h
index 7407d8b..1f7f276 100644
--- a/lib/ofpbuf.h
+++ b/lib/ofpbuf.h
@@ -20,6 +20,7 @@
 #include <stddef.h>
 #include <stdint.h>
 #include "list.h"
+#include "packets.h"
 #include "util.h"
 
 #ifdef  __cplusplus
@@ -29,18 +30,18 @@ extern "C" {
 enum ofpbuf_source {
     OFPBUF_MALLOC,              /* Obtained via malloc(). */
     OFPBUF_STACK,               /* Un-movable stack space or static buffer. */
-    OFPBUF_STUB                 /* Starts on stack, may expand into heap. */
+    OFPBUF_STUB,                /* Starts on stack, may expand into heap. */
+    OFPBUF_DPDK,
 };
 
 /* Buffer for holding arbitrary data.  An ofpbuf is automatically reallocated
  * as necessary if it grows too large for the available memory. */
 struct ofpbuf {
     void *base;                 /* First byte of allocated space. */
-    size_t allocated;           /* Number of bytes allocated. */
-    enum ofpbuf_source source;  /* Source of memory allocated as 'base'. */
-
     void *data;                 /* First byte actually in use. */
+    void *private_p;            /* Private pointer for use by owner. */
     size_t size;                /* Number of bytes in use. */
+    size_t allocated;           /* Number of bytes allocated. */
 
     void *l2;                   /* Link-level header. */
     void *l2_5;                 /* MPLS label stack */
@@ -49,10 +50,10 @@ struct ofpbuf {
     void *l7;                   /* Application data. */
 
     struct list list_node;      /* Private list element for use by owner. */
-    void *private_p;            /* Private pointer for use by owner. */
+    enum ofpbuf_source source;  /* Source of memory allocated as 'base'. */
 };
 
-void ofpbuf_use(struct ofpbuf *, void *, size_t);
+void ofpbuf_use_same(struct ofpbuf *b, void *base, size_t allocated);
 void ofpbuf_use_stack(struct ofpbuf *, void *, size_t);
 void ofpbuf_use_stub(struct ofpbuf *, void *, size_t);
 void ofpbuf_use_const(struct ofpbuf *, const void *, size_t);
diff --git a/lib/packets.c b/lib/packets.c
index 0d63841..525c084 100644
--- a/lib/packets.c
+++ b/lib/packets.c
@@ -990,3 +990,12 @@ packet_format_tcp_flags(struct ds *s, uint16_t tcp_flags)
         ds_put_cstr(s, "[800]");
     }
 }
+
+void
+packet_set_size(struct ofpbuf *b, int size)
+{
+    b->size = size;
+    if (b->size < ETH_TOTAL_MIN) {
+        ofpbuf_put_zeros(b, ETH_TOTAL_MIN - b->size);
+    }
+}
diff --git a/lib/packets.h b/lib/packets.h
index 8e21fa8..dcf3c3d 100644
--- a/lib/packets.h
+++ b/lib/packets.h
@@ -656,4 +656,5 @@ uint16_t packet_get_tcp_flags(const struct ofpbuf *, const struct flow *);
 void packet_format_tcp_flags(struct ds *, uint16_t);
 const char *packet_tcp_flag_to_string(uint32_t flag);
 
+void packet_set_size(struct ofpbuf *b, int size);
 #endif /* packets.h */
diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
index 990e58f..9bedd6c 100644
--- a/vswitchd/ovs-vswitchd.c
+++ b/vswitchd/ovs-vswitchd.c
@@ -49,6 +49,7 @@
 #include "vconn.h"
 #include "vlog.h"
 #include "lib/vswitch-idl.h"
+#include "lib/netdev-dpdk.h"
 
 VLOG_DEFINE_THIS_MODULE(vswitchd);
 
@@ -71,6 +72,12 @@ main(int argc, char *argv[])
     bool exiting;
     int retval;
 
+#ifdef DPDK_NETDEV
+    retval = dpdk_init(argc, argv);
+    argc -= retval;
+    argv += retval;
+#endif
+
     proctitle_init(argc, argv);
     set_program_name(argv[0]);
     remote = parse_options(argc, argv, &unixctl_path);
@@ -145,7 +152,8 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         OPT_BOOTSTRAP_CA_CERT,
         OPT_ENABLE_DUMMY,
         OPT_DISABLE_SYSTEM,
-        DAEMON_OPTION_ENUMS
+        DAEMON_OPTION_ENUMS,
+        OPT_DPDK,
     };
     static const struct option long_options[] = {
         {"help",        no_argument, NULL, 'h'},
@@ -159,6 +167,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         {"bootstrap-ca-cert", required_argument, NULL, OPT_BOOTSTRAP_CA_CERT},
         {"enable-dummy", optional_argument, NULL, OPT_ENABLE_DUMMY},
         {"disable-system", no_argument, NULL, OPT_DISABLE_SYSTEM},
+        {"dpdk", required_argument, NULL, OPT_DPDK},
         {NULL, 0, NULL, 0},
     };
     char *short_options = long_options_to_short_options(long_options);
@@ -210,6 +219,9 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         case '?':
             exit(EXIT_FAILURE);
 
+        case OPT_DPDK:
+            break;
+
         default:
             abort();
         }
-- 
1.7.9.5
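
A minimal sketch of the arithmetic behind the DP_NETDEV_HEADROOM comment
added to lib/netdev.h above. It assumes a 14-byte Ethernet header and a
4-byte VLAN tag; the helper ip_header_offset() is hypothetical and not part
of the patch. With 2 + VLAN_HEADER_LEN bytes of headroom the IP header lands
on a 4-byte boundary, and it stays there when a tag is pushed into the
headroom:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes: untagged Ethernet header and one 802.1Q VLAN tag. */
enum { ETH_HEADER_LEN = 14, VLAN_HEADER_LEN = 4 };

/* Same expression as the definition added to lib/netdev.h. */
enum { DP_NETDEV_HEADROOM = 2 + VLAN_HEADER_LEN };

/* Hypothetical helper: offset of the IP header from the buffer base when
 * the frame starts 'data_ofs' bytes into the buffer and carries 'n_tags'
 * VLAN tags. */
static size_t
ip_header_offset(size_t data_ofs, int n_tags)
{
    return data_ofs + ETH_HEADER_LEN + (size_t) n_tags * VLAN_HEADER_LEN;
}
```

Under these assumptions an untagged frame starts at offset 6 and a frame
with one tag pushed starts at offset 2; in both cases the IP header begins
at offset 20.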

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
       [not found] ` <20140128044950.GA4545@nicira.com>
@ 2014-01-28  5:28   ` Pravin Shelar
  2014-01-28 14:47     ` [dpdk-dev] " Vincent JARDIN
  0 siblings, 1 reply; 23+ messages in thread
From: Pravin Shelar @ 2014-01-28  5:28 UTC (permalink / raw)
  To: Ben Pfaff; +Cc: dev, dev, Gerald Rogers, dpdk-ovs

On Mon, Jan 27, 2014 at 8:49 PM, Ben Pfaff <blp@nicira.com> wrote:
> On Mon, Jan 27, 2014 at 05:48:35PM -0800, pshelar@nicira.com wrote:
>> From: Pravin B Shelar <pshelar@nicira.com>
>>
>> Following patch adds DPDK netdev-class to userspace datapath.
>> Approach taken in this patch differs from Intel® DPDK vSwitch
>> where DPDK datapath switching is done in saparate process.  This
>> patch adds support for DPDK type port and uses OVS userspace
>> datapath for switching.  Therefore all DPDK processing and flow
>> miss handling is done in single process.  This also avoids code
>> duplication by reusing OVS userspace datapath switching and
>> therefore it supports all flow matching and actions that
>> user-space datapath supports.  Refer to INSTALL.DPDK doc for
>> further info.
>>
>> With this patch I got similar performance for netperf TCP_STREAM
>> tests compared to kernel datapath.
>>
>> This is based a patch from Gerald Rogers.
>>
>> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
>> CC: "Gerald Rogers" <gerald.rogers@intel.com>
>
> I haven't looked at the patch yet (it does sound awesome) but if it's
> based on Gerald's code then I'd expect to get his Signed-off-by too.

Right.

Gerald's patch did not have a Signed-off-by. If he sends me one now,
I will update the commit message.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-28  5:28   ` [dpdk-dev] [ovs-dev] " Pravin Shelar
@ 2014-01-28 14:47     ` Vincent JARDIN
  2014-01-28 17:56       ` Pravin Shelar
       [not found]       ` <52E7D2A8.400@redhat.com>
  0 siblings, 2 replies; 23+ messages in thread
From: Vincent JARDIN @ 2014-01-28 14:47 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: dev, Ben Pfaff, dev, dpdk-ovs

Hi Pravin,

Yes, it is a good integration with http://dpdk.org

A few feature questions:
   - what about vNIC support (toward the guests)?
   - what about IPsec support (VxLAN over IPsec, for instance)?
I do not understand how your patch will solve those 2 cases.

>>> This is based a patch from Gerald Rogers.

Which patch, please? I cannot find it in the archives.

>>>
>>> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
>>> CC: "Gerald Rogers" <gerald.rogers@intel.com>
>>
>> I haven't looked at the patch yet (it does sound awesome) but if it's
>> based on Gerald's code then I'd expect to get his Signed-off-by too.
>
> Right.
>
> Gerald's Patch did not had any signed-off. So If he send me signed-off
> now, I will update the commit msg.

Best regards,
   Vincent

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-28 14:47     ` [dpdk-dev] " Vincent JARDIN
@ 2014-01-28 17:56       ` Pravin Shelar
  2014-01-29  0:15         ` Vincent JARDIN
       [not found]       ` <52E7D2A8.400@redhat.com>
  1 sibling, 1 reply; 23+ messages in thread
From: Pravin Shelar @ 2014-01-28 17:56 UTC (permalink / raw)
  To: Vincent JARDIN; +Cc: dev, Ben Pfaff, Jesse Gross, dev, dpdk-ovs

Hi Vincent,

On Tue, Jan 28, 2014 at 6:47 AM, Vincent JARDIN
<vincent.jardin@6wind.com> wrote:
> Hi Pravin,
>
> Yes, it is a good integration with http://dpdk.org
>
> Few feature questions:
>   - what's about the vNIC supports (toward the guests)?
>   - what's about IPsec support (VxLAN over IPsec for instance)?
> I do not understand how your patch will solve those 2 cases.
>
At this point I wanted to get basic DPDK support into OVS. Once that is
done, we can add support for vNIC.
IPsec, VXLAN, or any other L3 tunneling requires an IP stack in userspace
and needs more design work.

>
>>>> This is based a patch from Gerald Rogers.
>
>
> Please which patch? I cannot find it into the archives.
>
It was sent directly to Jesse at Nicira. If you want, I can send it out for reference.

Thanks,
Pravin.

>
>>>>
>>>> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
>>>> CC: "Gerald Rogers" <gerald.rogers@intel.com>
>>>
>>>
>>> I haven't looked at the patch yet (it does sound awesome) but if it's
>>> based on Gerald's code then I'd expect to get his Signed-off-by too.
>>
>>
>> Right.
>>
>> Gerald's Patch did not had any signed-off. So If he send me signed-off
>> now, I will update the commit msg.
>
>
> Best regards,
>   Vincent

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
       [not found] ` <52E7D13B.9020404@redhat.com>
@ 2014-01-28 18:17   ` Pravin Shelar
  2014-01-29  8:15     ` Thomas Graf
  0 siblings, 1 reply; 23+ messages in thread
From: Pravin Shelar @ 2014-01-28 18:17 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs

On Tue, Jan 28, 2014 at 7:48 AM, Thomas Graf <tgraf@redhat.com> wrote:
> On 01/28/2014 02:48 AM, pshelar@nicira.com wrote:
>>
>> From: Pravin B Shelar <pshelar@nicira.com>
>>
>> Following patch adds DPDK netdev-class to userspace datapath.
>> Approach taken in this patch differs from Intel® DPDK vSwitch
>> where DPDK datapath switching is done in saparate process.  This
>> patch adds support for DPDK type port and uses OVS userspace
>> datapath for switching.  Therefore all DPDK processing and flow
>> miss handling is done in single process.  This also avoids code
>> duplication by reusing OVS userspace datapath switching and
>> therefore it supports all flow matching and actions that
>> user-space datapath supports.  Refer to INSTALL.DPDK doc for
>> further info.
>>
>> With this patch I got similar performance for netperf TCP_STREAM
>> tests compared to kernel datapath.
>
>
> I'm happy to see this happen!
>
>
>
>> +static const struct rte_eth_conf port_conf = {
>> +        .rxmode = {
>> +                .mq_mode = ETH_MQ_RX_RSS,
>> +                .split_hdr_size = 0,
>> +                .header_split   = 0, /* Header Split disabled */
>> +                .hw_ip_checksum = 0, /* IP checksum offload disabled */
>> +                .hw_vlan_filter = 0, /* VLAN filtering disabled */
>> +                .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>> +                .hw_strip_crc   = 0, /* CRC stripping by hardware disabled */
>> +        },
>> +        .rx_adv_conf = {
>> +                .rss_conf = {
>> +                        .rss_key = NULL,
>> +                        .rss_hf = ETH_RSS_IPV4_TCP | ETH_RSS_IPV4 | ETH_RSS_IPV6,
>> +                },
>> +        },
>> +        .txmode = {
>> +                .mq_mode = ETH_MQ_TX_NONE,
>> +        },
>> +};
>
>
> I realize this is an RFC patch but I will ask anyway:
>
> What are the plans on managing runtime dependencies of a DPDK enabled OVS
> and DPDK itself? Will a OVS built against DPDK 1.5.2 work with
> drivers written for 1.5.3?
>
> Based on the above use of struct rte_eth_conf it would seem that once
> released, rte_eth_conf cannot be extended anymore without breaking
> ABI compatibility. The same applies to many of the other user
> structures. I see various structures changes between minor releases,
> for example dpdk.org ed2c69c3ef7 between 1.5.1 and 1.5.2.
>

Right, a version mismatch will not work. The API provided by DPDK is not
stable, so for now OVS has to be built separately for each DPDK release.

I do not see how we can fix this from the OVS side. DPDK needs to
standardize its API. OVS also needs more API from DPDK, such as DPDK
initialization, mempool destroy, etc.
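
To illustrate the ABI concern, here is a minimal sketch using two versions
of a hypothetical configuration struct. The names conf_v1, conf_v2 and
conf_layout_unchanged() are made up for this example and are not the real
rte_eth_conf layout. Adding a field in the middle moves every later field,
so an application compiled against one layout cannot safely pass the struct
to a library compiled against the other:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct as an application saw it at build time. */
struct conf_v1 {
    uint16_t mq_mode;
    uint16_t split_hdr_size;
    uint32_t flags;
};

/* The same struct in a later release, with one field added in the middle. */
struct conf_v2 {
    uint16_t mq_mode;
    uint32_t max_rx_pkt_len;    /* New field (hypothetical). */
    uint16_t split_hdr_size;
    uint32_t flags;
};

/* Returns nonzero only if both releases agree on the struct size and on
 * where 'flags' lives, i.e. if the change kept the binary layout. */
static int
conf_layout_unchanged(void)
{
    return sizeof(struct conf_v1) == sizeof(struct conf_v2)
           && offsetof(struct conf_v1, flags)
              == offsetof(struct conf_v2, flags);
}
```

Here conf_layout_unchanged() returns 0, which is why a rebuild per release
is needed until the library commits to a stable layout.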

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
       [not found]       ` <52E7D2A8.400@redhat.com>
@ 2014-01-28 18:20         ` Pravin Shelar
  0 siblings, 0 replies; 23+ messages in thread
From: Pravin Shelar @ 2014-01-28 18:20 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev, dev, dpdk-ovs

On Tue, Jan 28, 2014 at 7:54 AM, Thomas Graf <tgraf@redhat.com> wrote:
> On 01/28/2014 03:47 PM, Vincent JARDIN wrote:
>>
>> Hi Pravin,
>>
>> Yes, it is a good integration with http://dpdk.org
>>
>> Few feature questions:
>>    - what's about the vNIC supports (toward the guests)?
>>    - what's about IPsec support (VxLAN over IPsec for instance)?
>
>
> I would like to extend this question to all previously implemented
> kernel data path features such as QoS. What is the plan?

I see there is QoS support in DPDK that we can use in OVS. I will
look at it once this patch is reviewed.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-28 17:56       ` Pravin Shelar
@ 2014-01-29  0:15         ` Vincent JARDIN
  2014-01-29 19:32           ` Pravin Shelar
  0 siblings, 1 reply; 23+ messages in thread
From: Vincent JARDIN @ 2014-01-29  0:15 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: dev, Ben Pfaff, Jesse Gross, dev, dpdk-ovs

Hi Pravin,

>> Few feature questions:
>>    - what's about the vNIC supports (toward the guests)?
>>    - what's about IPsec support (VxLAN over IPsec for instance)?
>> I do not understand how your patch will solve those 2 cases.
>>
> At this point I wanted to get basic DPDK support in OVS, once that is
> done we can add support for vNIC.

For vNIC, did you notice:
   http://dpdk.org/browse/memnic/
?

> IPsec and vxlan or any L3 tunneling requires IP stack in userspace and
> needs more design work.

OK, understood.

>>>>> This is based a patch from Gerald Rogers.
>>
>> Please which patch? I cannot find it into the archives.
>>
> It was directly sent to Jesse at Nicira. If you want I can send it out for ref.

Yes, please.

Thank you,
   Vincent

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-28 18:17   ` Pravin Shelar
@ 2014-01-29  8:15     ` Thomas Graf
  2014-01-29 10:26       ` Vincent JARDIN
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Graf @ 2014-01-29  8:15 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: dev, dev, Gerald Rogers, dpdk-ovs

On 01/28/2014 07:17 PM, Pravin Shelar wrote:
> Right, version mismatch will not work. API provided by DPDK are not
> stable, So OVS has to be built for different releases for now.
>
> I do not see how we can fix it from OVS side. DPDK needs to
> standardize API, Actually OVS also needs more API, like DPDK
> initialization, mempool destroy, etc.

Agreed. It's not fixable from the OVS side. I also don't want to
object to including this. I'm just raising awareness of the issue
as this will become essential for distribution.

The obvious and usual best practise would be for DPDK to guarantee
ABI stability between minor releases.

Since dpdk-dev is copied as well, any comments?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-28  1:48 [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports pshelar
       [not found] ` <20140128044950.GA4545@nicira.com>
       [not found] ` <52E7D13B.9020404@redhat.com>
@ 2014-01-29  8:56 ` Prashant Upadhyaya
  2014-01-29 21:29   ` Pravin Shelar
  2014-01-29 10:01 ` [dpdk-dev] [ovs-dev] " Thomas Graf
  3 siblings, 1 reply; 23+ messages in thread
From: Prashant Upadhyaya @ 2014-01-29  8:56 UTC (permalink / raw)
  To: pshelar, dev, dev, dpdk-ovs; +Cc: Gerald Rogers

Hi Pravin,

I think your stuff is on the brink of creating a mini revolution :)

Some questions inline below --
+    ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
What do you mean by portid here? Do you mean the physical interface, such as eth0, which I have now bound to igb_uio?
If I have bound multiple interfaces to igb_uio, e.g. eth0, eth1, eth2, what is the id mapping for those?

If I have VMs running, how do I typically connect those VMs to this userspace OVS? Do I use the same classical 'tap' interface and add it to the OVS bridge above?
What path does the data actually take from the VM all the way to the switch? Wouldn't it be hypervisor to kernel to the userspace OVS switch to the other VM or network?
I think if we can solve VM-to-OVS port connectivity while remaining in userspace only, then we have a great thing on our hands. Kindly comment on this.
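
For reference, the naming rule quoted from INSTALL.DPDK (the name starts with
"dpdk" and ends with the port id) could be parsed roughly as follows. This is
only a sketch: dpdk_name_to_port_id() is a hypothetical helper, not a function
from the patch, and the upper bound of 255 is an assumption. How DPDK assigns
port ids to the devices bound to igb_uio is not stated in the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parses a name of the form "dpdk<N>" and stores N in
 * '*port_id'.  Returns 0 on success, EINVAL if 'name' has any other form. */
static int
dpdk_name_to_port_id(const char *name, unsigned int *port_id)
{
    static const char prefix[] = "dpdk";
    const size_t prefix_len = sizeof prefix - 1;
    unsigned long n;
    char *end;

    if (strncmp(name, prefix, prefix_len)) {
        return EINVAL;          /* Does not start with "dpdk". */
    }
    name += prefix_len;
    if (*name < '0' || *name > '9') {
        return EINVAL;          /* Rejects "dpdk", "dpdk-1", "dpdk 1". */
    }

    errno = 0;
    n = strtoul(name, &end, 10);
    if (errno || *end != '\0' || n > 255) {
        return EINVAL;          /* Trailing junk or out of assumed range. */
    }
    *port_id = (unsigned int) n;
    return 0;
}
```

With this sketch "dpdk0" maps to port id 0 and "dpdk12" to port id 12, while
names such as "eth0" or "dpdk1x" are rejected.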

Regards
-Prashant



-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of pshelar@nicira.com
Sent: Tuesday, January 28, 2014 7:19 AM
To: dev@openvswitch.org; dev@dpdk.org; dpdk-ovs@lists.01.org
Cc: Gerald Rogers
Subject: [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

From: Pravin B Shelar <pshelar@nicira.com>

Following patch adds DPDK netdev-class to userspace datapath.
Approach taken in this patch differs from Intel® DPDK vSwitch
where DPDK datapath switching is done in saparate process.  This
patch adds support for DPDK type port and uses OVS userspace
datapath for switching.  Therefore all DPDK processing and flow
miss handling is done in single process.  This also avoids code
duplication by reusing OVS userspace datapath switching and
therefore it supports all flow matching and actions that
user-space datapath supports.  Refer to INSTALL.DPDK doc for
further info.

With this patch I got similar performance for netperf TCP_STREAM
tests compared to kernel datapath.

This is based a patch from Gerald Rogers.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
CC: "Gerald Rogers" <gerald.rogers@intel.com>
---
This patch is tested on latest OVS master (commit 9d0581fdf22bec79).
---
 INSTALL                 |    1 +
 INSTALL.DPDK            |   85 ++++
 Makefile.am             |    1 +
 acinclude.m4            |   40 ++
 configure.ac            |    1 +
 lib/automake.mk         |    6 +
 lib/dpif-netdev.c       |  393 +++++++++++-----
 lib/netdev-dpdk.c       | 1152 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/netdev-dpdk.h       |    7 +
 lib/netdev-dummy.c      |   38 +-
 lib/netdev-linux.c      |   33 +-
 lib/netdev-provider.h   |   13 +-
 lib/netdev-vport.c      |    1 +
 lib/netdev.c            |   52 ++-
 lib/netdev.h            |   15 +-
 lib/ofpbuf.c            |    7 +-
 lib/ofpbuf.h            |   13 +-
 lib/packets.c           |    9 +
 lib/packets.h           |    1 +
 vswitchd/ovs-vswitchd.c |   14 +-
 20 files changed, 1702 insertions(+), 180 deletions(-)
 create mode 100644 INSTALL.DPDK
 create mode 100644 lib/netdev-dpdk.c
 create mode 100644 lib/netdev-dpdk.h

diff --git a/INSTALL b/INSTALL
index 001d3cb..74cd278 100644
--- a/INSTALL
+++ b/INSTALL
@@ -10,6 +10,7 @@ on a specific platform, please see one of these files:
     - INSTALL.RHEL
     - INSTALL.XenServer
     - INSTALL.NetBSD
+    - INSTALL.DPDK

 Build Requirements
 ------------------
diff --git a/INSTALL.DPDK b/INSTALL.DPDK
new file mode 100644
index 0000000..1c95104
--- /dev/null
+++ b/INSTALL.DPDK
@@ -0,0 +1,85 @@
+                   Using Open vSwitch with DPDK
+                   ============================
+
+Open vSwitch can use Intel(R) DPDK lib to operate entirely in
+userspace. This file explains how to install and use Open vSwitch in
+such a mode.
+
+The DPDK support of Open vSwitch is considered experimental.
+It has not been thoroughly tested.
+
+This version of Open vSwitch should be built manually with "configure"
+and "make".
+
+Building and Installing:
+------------------------
+
+DPDK:
+cd DPDK
+make install T=x86_64-default-linuxapp-gcc
+Refer to http://dpdk.org/ for details of the requirements.
+
+Linux kernel:
+Refer to intel-dpdk-getting-started-guide.pdf for understanding
+DPDK kernel requirement.
+
+OVS:
+cd $(OVS_DIR)/openvswitch
+./boot.sh
+./configure --with-dpdk=$(DPDK_BUILD)
+make
+
+Refer to INSTALL.userspace for general requirements of building
+userspace OVS.
+
+Using the DPDK with ovs-vswitchd:
+---------------------------------
+
+First set up the DPDK devices:
+  - Insert igb_uio.ko.
+    e.g. insmod DPDK/x86_64-default-linuxapp-gcc/kmod/igb_uio.ko
+  - Mount hugetlbfs.
+    e.g. mount -t hugetlbfs -o pagesize=1G none /mnt/huge/
+  - Bind the network device to igb_uio.
+    e.g. DPDK/tools/pci_unbind.py --bind=igb_uio eth1
+
+Refer to http://www.dpdk.org/doc/quick-start to verify the DPDK setup.
+
+Start vswitchd:
+DPDK configuration arguments can be passed to vswitchd via the `--dpdk`
+argument.
+   e.g.
+   ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK  --pidfile --detach
+
+To use ovs-vswitchd with DPDK, create a bridge with datapath_type
+"netdev" in the configuration database.  For example:
+
+    ovs-vsctl add-br br0
+    ovs-vsctl set bridge br0 datapath_type=netdev
+
+Now you can add DPDK devices.  OVS expects a DPDK device name to start with
+"dpdk" and end with the port ID.  vswitchd should print the number found.
+
+    ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
+
+Once the first DPDK port is added, vswitchd creates polling threads that
+poll the DPDK devices in a continuous loop.  Therefore CPU utilization
+for those threads is always 100%.
+
+Restrictions:
+-------------
+
+  - Only physical NICs are supported.  I have tested with Intel NICs only.
+  - The vswitchd userspace datapath sets CPU affinity for the polling
+    threads, but it assumes that devices are on NUMA node 0.  Therefore, if
+    a device is attached to a nonzero NUMA node, switching performance will
+    be suboptimal.
+  - A fixed number of polling threads and a fixed number of per-device
+    queues are configured.
+  - Only a 1500-byte MTU works; fixing this needs changes in the DPDK lib.
+  - Currently DPDK ports do not make use of any offload functionality.
+
+Bug Reporting:
+--------------
+
+Please report problems to bugs@openvswitch.org.
diff --git a/Makefile.am b/Makefile.am
index 32775cc..4d53dd4 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -58,6 +58,7 @@ EXTRA_DIST = \
        FAQ \
        INSTALL \
        INSTALL.Debian \
+       INSTALL.DPDK \
        INSTALL.Fedora \
        INSTALL.KVM \
        INSTALL.Libvirt \
diff --git a/acinclude.m4 b/acinclude.m4
index 8ff5828..01d39bf 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -157,6 +157,46 @@ AC_DEFUN([OVS_CHECK_LINUX], [
   AM_CONDITIONAL(LINUX_ENABLED, test -n "$KBUILD")
 ])

+dnl OVS_CHECK_DPDK
+dnl
+dnl Configure DPDK source tree
+AC_DEFUN([OVS_CHECK_DPDK], [
+  AC_ARG_WITH([dpdk],
+              [AC_HELP_STRING([--with-dpdk=/path/to/dpdk],
+                              [Specify the DPDK build directory])])
+
+  if test X"$with_dpdk" != X; then
+    RTE_SDK=$with_dpdk
+
+    DPDK_INCLUDE=$RTE_SDK/include
+    DPDK_LIB_DIR=$RTE_SDK/lib
+    DPDK_LIBS="$DPDK_LIB_DIR/libethdev.a \
+              $DPDK_LIB_DIR/librte_cmdline.a \
+              $DPDK_LIB_DIR/librte_hash.a \
+              $DPDK_LIB_DIR/librte_lpm.a \
+              $DPDK_LIB_DIR/librte_mbuf.a \
+              $DPDK_LIB_DIR/librte_mempool.a \
+              $DPDK_LIB_DIR/librte_eal.a \
+              $DPDK_LIB_DIR/librte_pmd_ring.a \
+              $DPDK_LIB_DIR/librte_malloc.a \
+              $DPDK_LIB_DIR/librte_pmd_ixgbe.a \
+              $DPDK_LIB_DIR/librte_pmd_e1000.a \
+              $DPDK_LIB_DIR/librte_pmd_virtio.a \
+              $DPDK_LIB_DIR/librte_ring.a"
+
+    LIBS="$DPDK_LIBS $LIBS"
+    CPPFLAGS="-I$DPDK_INCLUDE $CPPFLAGS"
+    SLICE_SIZE="4194304"
+    SLICE_SIZE_MAX="1073741824"
+    LDFLAGS="$LDFLAGS -Wl,-hugetlbfs-align,-zcommon-page-size=$SLICE_SIZE,-zmax-page-size=$SLICE_SIZE"
+    AC_DEFINE([DPDK_NETDEV], [1], [System uses the DPDK module.])
+  else
+    RTE_SDK=
+  fi
+
+  AM_CONDITIONAL(DPDK_NETDEV, test -n "$RTE_SDK")
+])
+
 dnl OVS_GREP_IFELSE(FILE, REGEX, [IF-MATCH], [IF-NO-MATCH])
 dnl
 dnl Greps FILE for REGEX.  If it matches, runs IF-MATCH, otherwise IF-NO-MATCH.
diff --git a/configure.ac b/configure.ac
index 19c095e..30dbe39 100644
--- a/configure.ac
+++ b/configure.ac
@@ -119,6 +119,7 @@ OVS_ENABLE_SPARSE
 AC_ARG_VAR(KARCH, [Kernel Architecture String])
 AC_SUBST(KARCH)
 OVS_CHECK_LINUX
+OVS_CHECK_DPDK

 AC_CONFIG_FILES(Makefile)
 AC_CONFIG_FILES(datapath/Makefile)
diff --git a/lib/automake.mk b/lib/automake.mk
index 2ef806e..ffbecdb 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -289,6 +289,12 @@ lib_libopenvswitch_la_SOURCES += \
        lib/route-table.h
 endif

+if DPDK_NETDEV
+lib_libopenvswitch_la_SOURCES += \
+       lib/netdev-dpdk.c \
+       lib/netdev-dpdk.h
+endif
+
 if HAVE_POSIX_AIO
 lib_libopenvswitch_la_SOURCES += lib/async-append-aio.c
 else
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index cb64bdc..f55732b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -44,6 +44,7 @@
 #include "meta-flow.h"
 #include "netdev.h"
 #include "netdev-vport.h"
+#include "netdev-dpdk.h"
 #include "netlink.h"
 #include "odp-execute.h"
 #include "odp-util.h"
@@ -64,14 +65,12 @@ VLOG_DEFINE_THIS_MODULE(dpif_netdev);

 /* By default, choose a priority in the middle. */
 #define NETDEV_RULE_PRIORITY 0x8000
+/* TODO: the number of threads should be configurable. */
+#define NR_THREADS 8

 /* Configuration parameters. */
 enum { MAX_FLOWS = 65536 };     /* Maximum number of flows in flow table. */

-/* Enough headroom to add a vlan tag, plus an extra 2 bytes to allow IP
- * headers to be aligned on a 4-byte boundary.  */
-enum { DP_NETDEV_HEADROOM = 2 + VLAN_HEADER_LEN };
-
 /* Queues. */
 enum { N_QUEUES = 2 };          /* Number of queues for dpif_recv(). */
 enum { MAX_QUEUE_LEN = 128 };   /* Maximum number of packets per queue. */
@@ -162,8 +161,9 @@ struct dp_netdev {

     /* Forwarding threads. */
     struct latch exit_latch;
-    struct dp_forwarder *forwarders;
-    size_t n_forwarders;
+    struct pmd_thread *pmd_threads;
+    size_t n_pmd_threads;
+    int pmd_count;
 };

 static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,
@@ -172,12 +172,14 @@ static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,

 /* A port in a netdev-based datapath. */
 struct dp_netdev_port {
-    struct hmap_node node;      /* Node in dp_netdev's 'ports'. */
-    odp_port_t port_no;
+    struct pkt_metadata md;
+    struct netdev_rx **rx;
     struct netdev *netdev;
+    odp_port_t port_no;
     struct netdev_saved_flags *sf;
-    struct netdev_rx *rx;
     char *type;                 /* Port type as requested by user. */
+    struct ovs_refcount ref_cnt;
+    struct hmap_node node;      /* Node in dp_netdev's 'ports'. */
 };

 /* A flow in dp_netdev's 'flow_table'.
@@ -289,11 +291,12 @@ void dp_netdev_actions_unref(struct dp_netdev_actions *);

 /* A thread that receives packets from some ports, looks them up in the flow
  * table, and executes the actions it finds. */
-struct dp_forwarder {
+struct pmd_thread {
     struct dp_netdev *dp;
-    pthread_t thread;
+    int qid;
+    atomic_uint change_seq;
     char *name;
-    uint32_t min_hash, max_hash;
+    pthread_t thread;
 };

 /* Interface to netdev-based datapath. */
@@ -332,7 +335,7 @@ static void dp_netdev_execute_actions(struct dp_netdev *dp,
 static void dp_netdev_port_input(struct dp_netdev *dp, struct ofpbuf *packet,
                                  struct pkt_metadata *)
     OVS_REQ_RDLOCK(dp->port_rwlock);
-static void dp_netdev_set_threads(struct dp_netdev *, int n);
+static void dp_netdev_set_pmd_threads(struct dp_netdev *, int n);

 static struct dpif_netdev *
 dpif_netdev_cast(const struct dpif *dpif)
@@ -478,7 +481,6 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
         dp_netdev_free(dp);
         return error;
     }
-    dp_netdev_set_threads(dp, 2);

     *dpp = dp;
     return 0;
@@ -536,8 +538,8 @@ dp_netdev_free(struct dp_netdev *dp)

     shash_find_and_delete(&dp_netdevs, dp->name);

-    dp_netdev_set_threads(dp, 0);
-    free(dp->forwarders);
+    dp_netdev_set_pmd_threads(dp, 0);
+    free(dp->pmd_threads);

     dp_netdev_flow_flush(dp);
     ovs_rwlock_wrlock(&dp->port_rwlock);
@@ -621,18 +623,30 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
     return 0;
 }

+static void
+dp_netdev_reload_pmd_threads(struct dp_netdev *dp)
+{
+    int i;
+
+    for (i = 0; i < dp->n_pmd_threads; i++) {
+        struct pmd_thread *f = &dp->pmd_threads[i];
+        int id;
+
+        atomic_add(&f->change_seq, 1, &id);
+    }
+}
+
 static int
 do_add_port(struct dp_netdev *dp, const char *devname, const char *type,
             odp_port_t port_no)
-    OVS_REQ_WRLOCK(dp->port_rwlock)
 {
     struct netdev_saved_flags *sf;
     struct dp_netdev_port *port;
     struct netdev *netdev;
-    struct netdev_rx *rx;
     enum netdev_flags flags;
     const char *open_type;
     int error;
+    int i;

     /* XXX reject devices already in some dp_netdev. */

@@ -651,28 +665,41 @@ do_add_port(struct dp_netdev *dp, const char *devname, const char *type,
         return EINVAL;
     }

-    error = netdev_rx_open(netdev, &rx);
-    if (error
-        && !(error == EOPNOTSUPP && dpif_netdev_class_is_dummy(dp->class))) {
-        VLOG_ERR("%s: cannot receive packets on this network device (%s)",
-                 devname, ovs_strerror(errno));
-        netdev_close(netdev);
-        return error;
+    port = xzalloc(sizeof *port);
+    port->port_no = port_no;
+    port->md = PKT_METADATA_INITIALIZER(port->port_no);
+    port->netdev = netdev;
+    port->rx = xmalloc(sizeof *port->rx * netdev_nr_rx(netdev));
+    port->type = xstrdup(type);
+    for (i = 0; i < netdev_nr_rx(netdev); i++) {
+        error = netdev_rx_open(netdev, &port->rx[i], i);
+        if (error
+            && !(error == EOPNOTSUPP && dpif_netdev_class_is_dummy(dp->class))) {
+            VLOG_ERR("%s: cannot receive packets on this network device (%s)",
+                     devname, ovs_strerror(error));
+            netdev_close(netdev);
+            return error;
+        }
     }

     error = netdev_turn_flags_on(netdev, NETDEV_PROMISC, &sf);
     if (error) {
-        netdev_rx_close(rx);
+        for (i = 0; i < netdev_nr_rx(netdev); i++) {
+            netdev_rx_close(port->rx[i]);
+        }
         netdev_close(netdev);
+        free(port->rx);
+        free(port);
         return error;
     }
-
-    port = xmalloc(sizeof *port);
-    port->port_no = port_no;
-    port->netdev = netdev;
     port->sf = sf;
-    port->rx = rx;
-    port->type = xstrdup(type);
+
+    if (netdev_is_pmd(netdev)) {
+        dp->pmd_count++;
+        dp_netdev_set_pmd_threads(dp, NR_THREADS);
+        dp_netdev_reload_pmd_threads(dp);
+    }
+    ovs_refcount_init(&port->ref_cnt);

     hmap_insert(&dp->ports, &port->node, hash_int(odp_to_u32(port_no), 0));
     seq_change(dp->port_seq);
@@ -772,6 +799,32 @@ get_port_by_name(struct dp_netdev *dp,
     return ENOENT;
 }

+static void
+port_ref(struct dp_netdev_port *port)
+{
+    if (port) {
+        ovs_refcount_ref(&port->ref_cnt);
+    }
+}
+
+static void
+port_unref(struct dp_netdev_port *port)
+{
+    if (port && ovs_refcount_unref(&port->ref_cnt) == 1) {
+        int i;
+
+        netdev_restore_flags(port->sf);
+        for (i = 0; i < netdev_nr_rx(port->netdev); i++) {
+            netdev_rx_close(port->rx[i]);
+        }
+        free(port->rx);
+        netdev_close(port->netdev);
+        free(port->type);
+        ovs_refcount_destroy(&port->ref_cnt);
+        free(port);
+    }
+}
+
 static int
 do_del_port(struct dp_netdev *dp, odp_port_t port_no)
     OVS_REQ_WRLOCK(dp->port_rwlock)
@@ -783,16 +836,13 @@ do_del_port(struct dp_netdev *dp, odp_port_t port_no)
     if (error) {
         return error;
     }
-
     hmap_remove(&dp->ports, &port->node);
     seq_change(dp->port_seq);
+    if (netdev_is_pmd(port->netdev)) {
+        dp_netdev_reload_pmd_threads(dp);
+    }

-    netdev_close(port->netdev);
-    netdev_restore_flags(port->sf);
-    netdev_rx_close(port->rx);
-    free(port->type);
-    free(port);
-
+    port_unref(port);
     return 0;
 }

@@ -1543,123 +1593,215 @@ dp_netdev_actions_unref(struct dp_netdev_actions *actions)
     }
 }


-static void *
-dp_forwarder_main(void *f_)
+
+static void
+dp_netdev_process_rx_port(struct dp_netdev *dp,
+                          struct dp_netdev_port *port,
+                          struct netdev_rx *queue)
+{
+    struct ofpbuf *packet;
+    struct pkt_metadata *md = &port->md;
+    int error, c;
+
+    error = netdev_rx_recv(queue, &packet, &c);
+    if (!error) {
+        int i;
+
+        for (i = 0; i < c; i++) {
+            dp_netdev_port_input(dp, packet, md);
+            packet++;
+        }
+
+    } else if (error != EAGAIN && error != EOPNOTSUPP) {
+        static struct vlog_rate_limit rl
+            = VLOG_RATE_LIMIT_INIT(1, 5);
+
+        VLOG_ERR_RL(&rl, "error receiving data from %s: %s",
+                    netdev_get_name(port->netdev),
+                    ovs_strerror(error));
+    }
+}
+
+static void
+dpif_netdev_run(struct dpif *dpif)
+{
+    struct dp_netdev_port *port;
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+
+    ovs_rwlock_rdlock(&dp->port_rwlock);
+
+    HMAP_FOR_EACH (port, node, &dp->ports) {
+        if (port->rx[0] && !netdev_is_pmd(port->netdev)) {
+            dp_netdev_process_rx_port(dp, port, port->rx[0]);
+        }
+    }
+
+    ovs_rwlock_unlock(&dp->port_rwlock);
+}
+
+static void
+dpif_netdev_wait(struct dpif *dpif)
+{
+    struct dp_netdev_port *port;
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+
+    ovs_rwlock_rdlock(&dp->port_rwlock);
+
+    HMAP_FOR_EACH (port, node, &dp->ports) {
+        if (port->rx[0] && !netdev_is_pmd(port->netdev)) {
+            netdev_rx_wait(port->rx[0]);
+        }
+    }
+    ovs_rwlock_unlock(&dp->port_rwlock);
+}
+
+struct rx_poll {
+    struct dp_netdev_port *port;
+    struct netdev_rx *rx;
+};
+
+static int
+pmd_load_queues(struct pmd_thread *f,
+                struct rx_poll **ppoll_list, int poll_cnt)
 {
-    struct dp_forwarder *f = f_;
     struct dp_netdev *dp = f->dp;
-    struct ofpbuf packet;
+    struct rx_poll *poll_list = *ppoll_list;
+    struct dp_netdev_port *port;
+    int qid = f->qid;
+    int index;
+    int i;

-    f->name = xasprintf("forwarder_%u", ovsthread_id_self());
-    set_subprogram_name("%s", f->name);
+    /* Simple scheduler for netdev rx polling. */
+    ovs_rwlock_rdlock(&dp->port_rwlock);
+    for (i = 0; i < poll_cnt; i++) {
+         port_unref(poll_list[i].port);
+    }

-    ofpbuf_init(&packet, 0);
-    while (!latch_is_set(&dp->exit_latch)) {
-        bool received_anything;
-        int i;
+    free(poll_list);
+    poll_cnt = 0;
+    index = 0;
+
+    HMAP_FOR_EACH (port, node, &f->dp->ports) {
+        if (netdev_is_pmd(port->netdev)) {
+            for (i = 0; i < netdev_nr_rx(port->netdev); i++) {

-        ovs_rwlock_rdlock(&dp->port_rwlock);
-        for (i = 0; i < 50; i++) {
-            struct dp_netdev_port *port;
-
-            received_anything = false;
-            HMAP_FOR_EACH (port, node, &f->dp->ports) {
-                if (port->rx
-                    && port->node.hash >= f->min_hash
-                    && port->node.hash <= f->max_hash) {
-                    int buf_size;
-                    int error;
-                    int mtu;
-
-                    if (netdev_get_mtu(port->netdev, &mtu)) {
-                        mtu = ETH_PAYLOAD_MAX;
-                    }
-                    buf_size = DP_NETDEV_HEADROOM + VLAN_ETH_HEADER_LEN + mtu;
-
-                    ofpbuf_clear(&packet);
-                    ofpbuf_reserve_with_tailroom(&packet, DP_NETDEV_HEADROOM,
-                                                 buf_size);
-
-                    error = netdev_rx_recv(port->rx, &packet);
-                    if (!error) {
-                        struct pkt_metadata md
-                            = PKT_METADATA_INITIALIZER(port->port_no);
-                        dp_netdev_port_input(dp, &packet, &md);
-
-                        received_anything = true;
-                    } else if (error != EAGAIN && error != EOPNOTSUPP) {
-                        static struct vlog_rate_limit rl
-                            = VLOG_RATE_LIMIT_INIT(1, 5);
-
-                        VLOG_ERR_RL(&rl, "error receiving data from %s: %s",
-                                    netdev_get_name(port->netdev),
-                                    ovs_strerror(error));
-                    }
+                if ((index % dp->n_pmd_threads) == qid) {
+                    port_ref(port);
+                    poll_cnt++;
                 }
+                index++;
             }
+        }
+    }

-            if (!received_anything) {
-                break;
+    poll_list = xzalloc(sizeof *poll_list * poll_cnt);
+    poll_cnt = 0;
+    index = 0;
+
+    HMAP_FOR_EACH (port, node, &f->dp->ports) {
+        if (netdev_is_pmd(port->netdev)) {
+            for (i = 0; i < netdev_nr_rx(port->netdev); i++) {
+
+                if ((index % dp->n_pmd_threads) == qid) {
+                    poll_list[poll_cnt].port = port;
+                    poll_list[poll_cnt].rx = port->rx[i];
+                    poll_cnt++;
+                    VLOG_INFO("poll_cnt %d port = %d i = %d", poll_cnt, port->port_no, i);
+                }
+                index++;
             }
         }
+    }

-        if (received_anything) {
-            poll_immediate_wake();
-        } else {
-            struct dp_netdev_port *port;
+    ovs_rwlock_unlock(&dp->port_rwlock);

-            HMAP_FOR_EACH (port, node, &f->dp->ports)
-                if (port->rx
-                    && port->node.hash >= f->min_hash
-                    && port->node.hash <= f->max_hash) {
-                    netdev_rx_wait(port->rx);
-                }
-            seq_wait(dp->port_seq, seq_read(dp->port_seq));
-            latch_wait(&dp->exit_latch);
+    *ppoll_list = poll_list;
+    return poll_cnt;
+}
+
+static void *
+pmd_thread_main(void *f_)
+{
+    struct pmd_thread *f = f_;
+    struct dp_netdev *dp = f->dp;
+    unsigned long lc = 0;
+    struct rx_poll *poll_list;
+    unsigned int port_seq;
+    int poll_cnt;
+
+    f->name = xasprintf("pmd_%u", ovsthread_id_self());
+    set_subprogram_name("%s", f->name);
+    netdev_setup_thread(f->qid);
+    poll_cnt = 0;
+    poll_list = NULL;
+
+reload:
+    poll_cnt = pmd_load_queues(f, &poll_list, poll_cnt);
+    atomic_read(&f->change_seq, &port_seq);
+
+    while (1) {
+        unsigned int c_port_seq;
+        int i;
+
+        for (i = 0; i < poll_cnt; i++) {
+            dp_netdev_process_rx_port(dp,  poll_list[i].port, poll_list[i].rx);
+        }
+
+        if (lc++ > (64 * 1024 * 1024)) {
+            /* TODO: need a signaling method that is completely userspace
+             * based, to keep this thread entirely in userspace.
+             * For now, use an atomic counter. */
+            lc = 0;
+            atomic_read(&f->change_seq, &c_port_seq);
+            if (c_port_seq != port_seq) {
+                break;
+            }
         }
-        ovs_rwlock_unlock(&dp->port_rwlock);
+    }

-        poll_block();
+    if (!latch_is_set(&f->dp->exit_latch)) {
+        goto reload;
     }
-    ofpbuf_uninit(&packet);

+    free(poll_list);
     free(f->name);
-
     return NULL;
 }

 static void
-dp_netdev_set_threads(struct dp_netdev *dp, int n)
+dp_netdev_set_pmd_threads(struct dp_netdev *dp, int n)
 {
     int i;

-    if (n == dp->n_forwarders) {
+    if (n == dp->n_pmd_threads) {
         return;
     }

     /* Stop existing threads. */
     latch_set(&dp->exit_latch);
-    for (i = 0; i < dp->n_forwarders; i++) {
-        struct dp_forwarder *f = &dp->forwarders[i];
+    dp_netdev_reload_pmd_threads(dp);
+    for (i = 0; i < dp->n_pmd_threads; i++) {
+        struct pmd_thread *f = &dp->pmd_threads[i];

         xpthread_join(f->thread, NULL);
     }
     latch_poll(&dp->exit_latch);
-    free(dp->forwarders);
+    free(dp->pmd_threads);

     /* Start new threads. */
-    dp->forwarders = xmalloc(n * sizeof *dp->forwarders);
-    dp->n_forwarders = n;
+    dp->pmd_threads = xmalloc(n * sizeof *dp->pmd_threads);
+    dp->n_pmd_threads = n;
+
     for (i = 0; i < n; i++) {
-        struct dp_forwarder *f = &dp->forwarders[i];
+        struct pmd_thread *f = &dp->pmd_threads[i];

         f->dp = dp;
-        f->min_hash = UINT32_MAX / n * i;
-        f->max_hash = UINT32_MAX / n * (i + 1) - 1;
-        if (i == n - 1) {
-            f->max_hash = UINT32_MAX;
-        }
-        xpthread_create(&f->thread, NULL, dp_forwarder_main, f);
+        f->qid = i;
+        atomic_store(&f->change_seq, 1);
+
+        /* The threads distribute all devices' rx queues among
+         * themselves. */
+        xpthread_create(&f->thread, NULL, pmd_thread_main, f);
     }
 }


@@ -1683,6 +1825,7 @@ dp_netdev_port_input(struct dp_netdev *dp, struct ofpbuf *packet,
     struct flow key;

     if (packet->size < ETH_HEADER_LEN) {
+        VLOG_ERR("%s small pkt %d\n",__func__,(int) packet->size);
         return;
     }
     flow_extract(packet, md->skb_priority, md->pkt_mark, &md->tunnel,
@@ -1743,9 +1886,11 @@ dp_netdev_output_userspace(struct dp_netdev *dp, struct ofpbuf *packet,
         }

         /* Steal packet data. */
-        ovs_assert(packet->source == OFPBUF_MALLOC);
-        upcall->packet = *packet;
-        ofpbuf_use(packet, NULL, 0);
+        ofpbuf_init(&upcall->packet, 0);
+        ofpbuf_reserve_with_tailroom(&upcall->packet,
+                                     DP_NETDEV_HEADROOM, packet->size);
+        memcpy(upcall->packet.data, packet->data, packet->size);
+        upcall->packet.size = packet->size;

         seq_change(dp->queue_seq);

@@ -1778,7 +1923,7 @@ dp_execute_cb(void *aux_, struct ofpbuf *packet,
     case OVS_ACTION_ATTR_OUTPUT:
         p = dp_netdev_lookup_port(aux->dp, u32_to_odp(nl_attr_get_u32(a)));
         if (p) {
-            netdev_send(p->netdev, packet);
+            netdev_send(p->netdev, packet, may_steal);
         }
         break;

@@ -1828,8 +1973,8 @@ const struct dpif_class dpif_netdev_class = {
     dpif_netdev_open,
     dpif_netdev_close,
     dpif_netdev_destroy,
-    NULL,                       /* run */
-    NULL,                       /* wait */
+    dpif_netdev_run,
+    dpif_netdev_wait,
     dpif_netdev_get_stats,
     dpif_netdev_port_add,
     dpif_netdev_port_del,
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
new file mode 100644
index 0000000..06de08c
--- /dev/null
+++ b/lib/netdev-dpdk.c
@@ -0,0 +1,1152 @@
+/*
+ * Copyright (c) 2014 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <pthread.h>
+#include <config.h>
+#include <errno.h>
+#include <sched.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdio.h>
+
+#include "list.h"
+#include "netdev-provider.h"
+#include "netdev-vport.h"
+#include "netdev-dpdk.h"
+#include "odp-util.h"
+#include "ofp-print.h"
+#include "ofpbuf.h"
+#include "ovs-thread.h"
+#include "packets.h"
+#include "shash.h"
+#include "sset.h"
+#include "unaligned.h"
+#include "timeval.h"
+#include "unixctl.h"
+#include "vlog.h"
+
+#include <rte_config.h>
+#include <rte_eal.h>
+#include <rte_debug.h>
+#include <rte_ethdev.h>
+#include <rte_errno.h>
+#include <rte_memzone.h>
+#include <rte_memcpy.h>
+#include <rte_cycles.h>
+#include <rte_spinlock.h>
+#include <rte_launch.h>
+#include <rte_malloc.h>
+
+VLOG_DEFINE_THIS_MODULE(dpdk);
+
+#define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
+#define OVS_VPORT_DPDK "ovs_dpdk"
+
+/*
+ * need to reserve tons of extra space in the mbufs so we can align the
+ * DMA addresses to 4KB.
+ */
+
+#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
+#define MBUF_SIZE(mtu)       (MTU_TO_MAX_LEN(mtu) + (512) + \
+                             sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM)
+
+/* TODO: mempool size should be based on system resources. */
+#define NB_MBUF              (4096 * 64)
+#define MP_CACHE_SZ          (256 * 2)
+#define SOCKET0              0
+
+/* TODO: number of device queues; needs to be configurable at run time. */
+#define NR_QUEUE             8
+
+/* TODO: these constants need per-NIC values. */
+#define RX_PTHRESH 16 /* Default values of RX prefetch threshold reg. */
+#define RX_HTHRESH 16 /* Default values of RX host threshold reg. */
+#define RX_WTHRESH 8 /* Default values of RX write-back threshold reg. */
+
+#define TX_PTHRESH 36 /* Default values of TX prefetch threshold reg. */
+#define TX_HTHRESH 0  /* Default values of TX host threshold reg. */
+#define TX_WTHRESH 0  /* Default values of TX write-back threshold reg. */
+
+static const struct rte_eth_conf port_conf = {
+        .rxmode = {
+                .mq_mode = ETH_MQ_RX_RSS,
+                .split_hdr_size = 0,
+                .header_split   = 0, /* Header Split disabled */
+                .hw_ip_checksum = 0, /* IP checksum offload disabled */
+                .hw_vlan_filter = 0, /* VLAN filtering disabled */
+                .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
+                .hw_strip_crc   = 0, /* CRC stripping by hardware disabled */
+        },
+        .rx_adv_conf = {
+                .rss_conf = {
+                        .rss_key = NULL,
+                        .rss_hf = ETH_RSS_IPV4_TCP | ETH_RSS_IPV4 | ETH_RSS_IPV6,
+                },
+        },
+        .txmode = {
+                .mq_mode = ETH_MQ_TX_NONE,
+        },
+};
+
+static const struct rte_eth_rxconf rx_conf = {
+        .rx_thresh = {
+                .pthresh = RX_PTHRESH,
+                .hthresh = RX_HTHRESH,
+                .wthresh = RX_WTHRESH,
+        },
+};
+
+static const struct rte_eth_txconf tx_conf = {
+        .tx_thresh = {
+                .pthresh = TX_PTHRESH,
+                .hthresh = TX_HTHRESH,
+                .wthresh = TX_WTHRESH,
+        },
+        .tx_free_thresh = 0, /* Use PMD default values */
+        .tx_rs_thresh = 0, /* Use PMD default values */
+};
+
+enum { MAX_RX_QUEUE_LEN = 64 };
+
+static int rte_eal_init_ret = ENODEV;
+
+static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
+
+/* Contains all 'struct netdev_dpdk's. */
+static struct list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
+    = LIST_INITIALIZER(&dpdk_list);
+
+static struct list dpdk_mp_list;
+
+struct dpdk_mp {
+    struct rte_mempool *mp;
+    int mtu;
+    int socket_id;
+    int refcount;
+    struct list list_node OVS_GUARDED_BY(mp_list);
+};
+
+struct netdev_dpdk {
+    struct netdev up;
+    int port_id;
+    int max_packet_len;
+    rte_spinlock_t tx_lock;
+
+    /* Protects all members below. */
+    struct ovs_mutex mutex OVS_ACQ_AFTER(mutex);
+
+    struct dpdk_mp *dpdk_mp;
+    int mtu OVS_GUARDED;
+    int socket_id;
+    int buf_size;
+    struct netdev_stats stats_offset OVS_GUARDED;
+
+    uint8_t hwaddr[ETH_ADDR_LEN] OVS_GUARDED;
+    enum netdev_flags flags OVS_GUARDED;
+
+    rte_spinlock_t lsi_lock;
+    struct rte_eth_link link;
+    int link_reset_cnt;
+
+    /* In dpdk_list. */
+    struct list list_node OVS_GUARDED_BY(mutex);
+};
+
+struct netdev_rx_dpdk {
+    struct netdev_rx up;
+    eth_rx_burst_t drv_rx;
+    void *rx_queues;
+    int port_id;
+    int queue_id;
+    int ofpbuf_cnt;
+    struct ofpbuf ofpbuf[MAX_RX_QUEUE_LEN];
+};
+
+static int netdev_dpdk_construct(struct netdev *);
+static bool
+is_dpdk_class(const struct netdev_class *class)
+{
+    return class->construct == netdev_dpdk_construct;
+}
+
+/* TODO: use DPDK malloc for all of OVS.  In fact, huge pages should be used
+ * for all other segments: data, bss and text. */
+
+static void *dpdk_rte_mzalloc(size_t sz)
+{
+    void *ptr;
+
+    ptr = rte_zmalloc(OVS_VPORT_DPDK, sz, OVS_CACHE_LINE_SIZE);
+    if (ptr == NULL) {
+        out_of_memory();
+    }
+    return ptr;
+}
+
+static struct dpdk_mp *
+dpdk_mp_get(int socket_id, int mtu)
+{
+    struct dpdk_mp *dmp = NULL;
+    char mp_name[RTE_MEMPOOL_NAMESIZE];
+
+    LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
+        if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
+            dmp->refcount++;
+            return dmp;
+        }
+    }
+
+    dmp = dpdk_rte_mzalloc(sizeof *dmp);
+    dmp->socket_id = socket_id;
+    dmp->mtu = mtu;
+    dmp->refcount = 1;
+
+    snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d", dmp->mtu);
+    dmp->mp = rte_mempool_create(mp_name, NB_MBUF, MBUF_SIZE(mtu),
+                                 MP_CACHE_SZ,
+                                 sizeof(struct rte_pktmbuf_pool_private),
+                                 rte_pktmbuf_pool_init, NULL,
+                                 rte_pktmbuf_init, NULL,
+                                 socket_id, 0);
+
+    if (dmp->mp == NULL) {
+        return NULL;
+    }
+
+    list_push_back(&dpdk_mp_list, &dmp->list_node);
+    return dmp;
+}
+
+static void
+dpdk_mp_put(struct dpdk_mp *dmp)
+{
+
+    if (!dmp) {
+        return;
+    }
+
+    dmp->refcount--;
+    ovs_assert(dmp->refcount >= 0);
+
+#if 0
+    /* I could not find any API to destroy mp. */
+    if (dmp->refcount == 0) {
+        list_delete(dmp->list_node);
+        /* destroy mp-pool. */
+    }
+#endif
+}
+
+static void
+lsi_event_callback(uint8_t port_id, enum rte_eth_event_type type, void *param)
+{
+    struct netdev_dpdk *dev = (struct netdev_dpdk *) param;
+
+    VLOG_DBG("Event type: %s\n", type == RTE_ETH_EVENT_INTR_LSC ? "LSC interrupt" : "unknown event");
+
+    rte_spinlock_lock(&dev->lsi_lock);
+    rte_eth_link_get_nowait(port_id, &dev->link);
+    dev->link_reset_cnt++;
+    rte_spinlock_unlock(&dev->lsi_lock);
+
+    if (dev->link.link_status) {
+        VLOG_DBG("Port %d Link Up - speed %u Mbps - %s\n",
+                 port_id, (unsigned)dev->link.link_speed,
+                          (dev->link.link_duplex == ETH_LINK_FULL_DUPLEX) ?
+                          ("full-duplex") : ("half-duplex"));
+    } else {
+        VLOG_DBG("Port %d Link Down\n\n", port_id);
+    }
+}
+
+static int
+dpdk_eth_dev_init(struct netdev_dpdk *dev)
+{
+    struct rte_pktmbuf_pool_private *mbp_priv;
+    struct ether_addr eth_addr;
+    int diag;
+    int i;
+
+    if (dev->port_id < 0 || dev->port_id >= rte_eth_dev_count()) {
+        return -ENODEV;
+    }
+
+    diag = rte_eth_dev_configure(dev->port_id, NR_QUEUE, NR_QUEUE + 1, &port_conf);
+    if (diag) {
+        VLOG_ERR("eth dev config error %d\n",diag);
+        return diag;
+    }
+
+    for (i = 0; i < (NR_QUEUE + 1); i++) {
+        diag = rte_eth_tx_queue_setup(dev->port_id, i, 64, 0, &tx_conf);
+        if (diag) {
+            VLOG_ERR("eth dev tx queue setup error %d\n",diag);
+            return diag;
+        }
+    }
+
+    for (i = 0; i < NR_QUEUE; i++) {
+        /* DO NOT CHANGE NUMBER OF RX DESCRIPTORS */
+        diag = rte_eth_rx_queue_setup(dev->port_id, i, 64, 0, &rx_conf, dev->dpdk_mp->mp);
+        if (diag) {
+            VLOG_ERR("eth dev rx queue setup error %d\n",diag);
+            return diag;
+        }
+    }
+
+    rte_eth_dev_callback_register(dev->port_id, RTE_ETH_EVENT_INTR_LSC,
+                                  lsi_event_callback, dev);
+
+    diag = rte_eth_dev_start(dev->port_id);
+    if (diag) {
+        VLOG_ERR("eth dev start error %d\n",diag);
+        return diag;
+    }
+
+    rte_eth_promiscuous_enable(dev->port_id);
+    rte_eth_allmulticast_enable(dev->port_id);
+
+    memset(&eth_addr, 0x0, sizeof(eth_addr));
+    rte_eth_macaddr_get(dev->port_id, &eth_addr);
+    VLOG_INFO("Port %d: %02X:%02X:%02X:%02X:%02X:%02X\n",dev->port_id,
+              eth_addr.addr_bytes[0],
+              eth_addr.addr_bytes[1],
+              eth_addr.addr_bytes[2],
+              eth_addr.addr_bytes[3],
+              eth_addr.addr_bytes[4],
+              eth_addr.addr_bytes[5]);
+
+    memcpy(dev->hwaddr, eth_addr.addr_bytes, ETH_ADDR_LEN);
+    rte_eth_link_get_nowait(dev->port_id, &dev->link);
+
+    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
+    dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
+
+    dev->flags = NETDEV_UP | NETDEV_PROMISC;
+    return 0;
+}
+
+static struct netdev_dpdk *
+netdev_dpdk_cast(const struct netdev *netdev)
+{
+    return CONTAINER_OF(netdev, struct netdev_dpdk, up);
+}
+
+static struct netdev *
+netdev_dpdk_alloc(void)
+{
+    struct netdev_dpdk *netdev = dpdk_rte_mzalloc(sizeof *netdev);
+    return &netdev->up;
+}
+
+static int
+netdev_dpdk_construct(struct netdev *netdev_)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    unsigned int port_no;
+    char *cport;
+    int err;
+
+    if (rte_eal_init_ret) {
+        return rte_eal_init_ret;
+    }
+
+    ovs_mutex_lock(&dpdk_mutex);
+    cport = netdev_->name + 4; /* Names always start with "dpdk" */
+
+    if (strncmp(netdev_->name, "dpdk", 4)) {
+        err = ENODEV;
+        goto unlock_dpdk;
+    }
+
+    port_no = strtol(cport, 0, 0); /* string must be null terminated */
+
+    rte_spinlock_init(&netdev->lsi_lock);
+    rte_spinlock_init(&netdev->tx_lock);
+    ovs_mutex_init(&netdev->mutex);
+
+    ovs_mutex_lock(&netdev->mutex);
+    netdev->flags = 0;
+
+    netdev->mtu = ETHER_MTU;
+    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
+
+    /* TODO: need to discover device node at run time. */
+    netdev->socket_id = SOCKET0;
+    netdev->port_id = port_no;
+
+    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
+    if (!netdev->dpdk_mp) {
+        err = ENOMEM;
+        goto unlock_dev;
+    }
+
+    err = dpdk_eth_dev_init(netdev);
+    if (err) {
+        goto unlock_dev;
+    }
+    netdev_->nr_rx = NR_QUEUE;
+
+    list_push_back(&dpdk_list, &netdev->list_node);
+
+unlock_dev:
+    ovs_mutex_unlock(&netdev->mutex);
+unlock_dpdk:
+    ovs_mutex_unlock(&dpdk_mutex);
+    return err;
+}
+
+static void
+netdev_dpdk_destruct(struct netdev *netdev_)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+
+    ovs_mutex_lock(&dev->mutex);
+    rte_eth_dev_stop(dev->port_id);
+    rte_eth_dev_callback_unregister(dev->port_id, RTE_ETH_EVENT_INTR_LSC,
+                                    lsi_event_callback, dev);
+
+    ovs_mutex_unlock(&dev->mutex);
+
+    ovs_mutex_lock(&dpdk_mutex);
+    list_remove(&dev->list_node);
+    dpdk_mp_put(dev->dpdk_mp);
+    ovs_mutex_unlock(&dpdk_mutex);
+
+    ovs_mutex_destroy(&dev->mutex);
+}
+
+static void
+netdev_dpdk_dealloc(struct netdev *netdev_)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+
+    rte_free(netdev);
+}
+
+static int
+netdev_dpdk_get_config(const struct netdev *netdev_, struct smap *args)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+    struct rte_eth_dev_info dev_info;
+
+    ovs_mutex_lock(&dev->mutex);
+    rte_eth_dev_info_get(dev->port_id, &dev_info);
+    ovs_mutex_unlock(&dev->mutex);
+
+    smap_add_format(args, "ifindex", "%d", dev->port_id);
+    smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id));
+    smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
+    smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);
+    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
+    smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
+    smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
+    smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
+    smap_add_format(args, "max_hash_mac_addrs", "%u", dev_info.max_hash_mac_addrs);
+    smap_add_format(args, "max_vfs", "%u", dev_info.max_vfs);
+    smap_add_format(args, "max_vmdq_pools", "%u", dev_info.max_vmdq_pools);
+
+    smap_add_format(args, "pci-vendor_id", "0x%x", dev_info.pci_dev->id.vendor_id);
+    smap_add_format(args, "pci-device_id", "0x%x", dev_info.pci_dev->id.device_id);
+
+    return 0;
+}
+
+static struct netdev_rx *
+netdev_dpdk_rx_alloc(int id)
+{
+    struct netdev_rx_dpdk *rx = dpdk_rte_mzalloc(sizeof *rx);
+
+    rx->queue_id = id;
+    ovs_assert(id < NR_QUEUE);
+
+    return &rx->up;
+}
+
+static struct netdev_rx_dpdk *
+netdev_rx_dpdk_cast(const struct netdev_rx *rx)
+{
+    return CONTAINER_OF(rx, struct netdev_rx_dpdk, up);
+}
+
+static int
+netdev_dpdk_rx_construct(struct netdev_rx *rx_)
+{
+    struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(rx->up.netdev);
+    struct rte_eth_dev *eth_dev;
+    int i;
+
+    ovs_mutex_lock(&netdev->mutex);
+    for (i = 0; i < MAX_RX_QUEUE_LEN; i++) {
+        ofpbuf_init(&rx->ofpbuf[i], 0);
+        rx->ofpbuf[i].allocated = netdev->buf_size;
+        rx->ofpbuf[i].source = OFPBUF_DPDK;
+    }
+    rx->ofpbuf_cnt = 0;
+    rx->port_id = netdev->port_id;
+
+    eth_dev = &rte_eth_devices[rx->port_id];
+    rx->drv_rx = eth_dev->rx_pkt_burst;
+    rx->rx_queues = eth_dev->data->rx_queues[rx->queue_id];
+    ovs_mutex_unlock(&netdev->mutex);
+
+    return 0;
+}
+
+static void
+netdev_dpdk_rx_destruct(struct netdev_rx *rx_ OVS_UNUSED)
+{
+}
+
+static void
+netdev_dpdk_rx_dealloc(struct netdev_rx *rx_)
+{
+    struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
+
+    rte_free(rx);
+}
+
+static void
+build_ofpbuf(struct netdev_rx_dpdk *rx, struct ofpbuf *b, struct rte_mbuf *pkt)
+{
+    if (b->private_p) {
+        struct netdev *netdev = rx->up.netdev;
+        struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+        rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **) &b->private_p, 1);
+    }
+
+    b->private_p = pkt;
+    if (!pkt) {
+        return;
+    }
+
+    b->data = pkt->pkt.data;
+    b->base = (char *)b->data - DP_NETDEV_HEADROOM - VLAN_ETH_HEADER_LEN;
+    packet_set_size(b, rte_pktmbuf_data_len(pkt));
+}
+
+static int
+netdev_dpdk_rx_recv(struct netdev_rx *rx_, struct ofpbuf **rpacket, int *c)
+{
+    struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
+    struct rte_mbuf *burst_pkts[MAX_RX_QUEUE_LEN];
+    int nb_rx;
+    int i;
+
+    nb_rx = (*rx->drv_rx)(rx->rx_queues, burst_pkts, MAX_RX_QUEUE_LEN);
+    if (!nb_rx) {
+        for (i = 0; i < rx->ofpbuf_cnt; i++) {
+             build_ofpbuf(rx, &rx->ofpbuf[i], NULL);
+        }
+        rx->ofpbuf_cnt = 0;
+        return EAGAIN;
+    }
+
+    i = 0;
+    do {
+        build_ofpbuf(rx, &rx->ofpbuf[i], burst_pkts[i]);
+
+        i++;
+    } while (i < nb_rx);
+
+    for (; i < rx->ofpbuf_cnt; i++) {
+         build_ofpbuf(rx, &rx->ofpbuf[i], NULL);
+    }
+    rx->ofpbuf_cnt = nb_rx;
+    *rpacket = rx->ofpbuf;
+    *c = nb_rx;
+
+    return 0;
+}
+
+static int
+netdev_dpdk_rx_drain(struct netdev_rx *rx_)
+{
+    struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
+    int pending;
+    int i;
+
+    pending = rx->ofpbuf_cnt;
+    if (pending) {
+        for (i = 0; i < pending; i++) {
+             build_ofpbuf(rx, &rx->ofpbuf[i], NULL);
+        }
+        rx->ofpbuf_cnt = 0;
+        return 0;
+    }
+
+    return 0;
+}
+
+/* Tx function.  Copies 'buf' into a freshly allocated mbuf and sends it. */
+static int
+dpdk_do_tx_copy(struct netdev *netdev, char *buf, int size)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    struct rte_mbuf *pkt;
+    uint32_t nb_tx = 0;
+
+    pkt = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
+    if (!pkt) {
+        return 0;
+    }
+
+    /* We have to do a copy for now */
+    memcpy(pkt->pkt.data, buf, size);
+
+    rte_pktmbuf_data_len(pkt) = size;
+    rte_pktmbuf_pkt_len(pkt) = size;
+
+    rte_spinlock_lock(&dev->tx_lock);
+    nb_tx = rte_eth_tx_burst(dev->port_id, NR_QUEUE, &pkt, 1);
+    rte_spinlock_unlock(&dev->tx_lock);
+
+    if (nb_tx != 1) {
+        /* free buffers if we couldn't transmit packets */
+        rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **)&pkt, 1);
+    }
+    return nb_tx;
+}
+
+static int
+netdev_dpdk_send(struct netdev *netdev,
+                 struct ofpbuf *ofpbuf, bool may_steal)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    if (ofpbuf->size > dev->max_packet_len) {
+        VLOG_ERR("packet too big: size %d, max_packet_len %d",
+                  (int)ofpbuf->size , dev->max_packet_len);
+        return E2BIG;
+    }
+
+    rte_prefetch0(&ofpbuf->private_p);
+    if (!may_steal ||
+        !ofpbuf->private_p || ofpbuf->source != OFPBUF_DPDK) {
+        dpdk_do_tx_copy(netdev, (char *) ofpbuf->data, ofpbuf->size);
+    } else {
+        struct rte_mbuf *pkt;
+        uint32_t nb_tx;
+        int qid;
+
+        pkt = ofpbuf->private_p;
+        ofpbuf->private_p = NULL;
+        rte_pktmbuf_data_len(pkt) = ofpbuf->size;
+        rte_pktmbuf_pkt_len(pkt) = ofpbuf->size;
+
+        /* TODO: TX batching. */
+        qid = rte_lcore_id() % NR_QUEUE;
+        nb_tx = rte_eth_tx_burst(dev->port_id, qid, &pkt, 1);
+        if (nb_tx != 1) {
+            struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+            rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **)&pkt, 1);
+            VLOG_ERR("TX error, zero packets sent");
+       }
+    }
+    return 0;
+}
+
+static int
+netdev_dpdk_set_etheraddr(struct netdev *netdev,
+                          const uint8_t mac[ETH_ADDR_LEN])
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    if (!eth_addr_equals(dev->hwaddr, mac)) {
+        memcpy(dev->hwaddr, mac, ETH_ADDR_LEN);
+    }
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_get_etheraddr(const struct netdev *netdev,
+                          uint8_t mac[ETH_ADDR_LEN])
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    memcpy(mac, dev->hwaddr, ETH_ADDR_LEN);
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    *mtup = dev->mtu;
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int old_mtu, err;
+    struct dpdk_mp *old_mp;
+    struct dpdk_mp *mp;
+
+    ovs_mutex_lock(&dpdk_mutex);
+    ovs_mutex_lock(&dev->mutex);
+    if (dev->mtu == mtu) {
+        err = 0;
+        goto out;
+    }
+
+    mp = dpdk_mp_get(dev->socket_id, mtu);
+    if (!mp) {
+        err = ENOMEM;
+        goto out;
+    }
+
+    rte_eth_dev_stop(dev->port_id);
+
+    old_mtu = dev->mtu;
+    old_mp = dev->dpdk_mp;
+    dev->dpdk_mp = mp;
+    dev->mtu = mtu;
+    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
+
+    err = dpdk_eth_dev_init(dev);
+    if (err) {
+
+        dpdk_mp_put(mp);
+        dev->mtu = old_mtu;
+        dev->dpdk_mp = old_mp;
+        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
+        dpdk_eth_dev_init(dev);
+        goto out;
+    }
+
+    dpdk_mp_put(old_mp);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    ovs_mutex_unlock(&dpdk_mutex);
+    return err;
+}
+
+static int
+netdev_dpdk_get_stats(const struct netdev *netdev, struct netdev_stats *stats)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    struct rte_eth_stats rte_stats;
+
+    ovs_mutex_lock(&dev->mutex);
+    rte_eth_stats_get(dev->port_id, &rte_stats);
+    ovs_mutex_unlock(&dev->mutex);
+
+    *stats = dev->stats_offset;
+
+    stats->rx_packets += rte_stats.ipackets;
+    stats->tx_packets += rte_stats.opackets;
+    stats->rx_bytes += rte_stats.ibytes;
+    stats->tx_bytes += rte_stats.obytes;
+    stats->rx_errors += rte_stats.ierrors;
+    stats->tx_errors += rte_stats.oerrors;
+    stats->multicast += rte_stats.imcasts;
+
+    return 0;
+}
+
+static int
+netdev_dpdk_set_stats(struct netdev *netdev, const struct netdev_stats *stats)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    dev->stats_offset = *stats;
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_get_features(const struct netdev *netdev_,
+                         enum netdev_features *current,
+                         enum netdev_features *advertised OVS_UNUSED,
+                         enum netdev_features *supported OVS_UNUSED,
+                         enum netdev_features *peer OVS_UNUSED)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+    struct rte_eth_link link;
+
+    rte_spinlock_lock(&dev->lsi_lock);
+    link = dev->link;
+    rte_spinlock_unlock(&dev->lsi_lock);
+
+    if (link.link_duplex == ETH_LINK_AUTONEG_DUPLEX) {
+        if (link.link_speed == ETH_LINK_SPEED_AUTONEG) {
+            *current = NETDEV_F_AUTONEG;
+        }
+    } else if (link.link_duplex == ETH_LINK_HALF_DUPLEX) {
+        if (link.link_speed == ETH_LINK_SPEED_10) {
+            *current = NETDEV_F_10MB_HD;
+        }
+        if (link.link_speed == ETH_LINK_SPEED_100) {
+            *current = NETDEV_F_100MB_HD;
+        }
+        if (link.link_speed == ETH_LINK_SPEED_1000) {
+            *current = NETDEV_F_1GB_HD;
+        }
+    } else if (link.link_duplex == ETH_LINK_FULL_DUPLEX) {
+        if (link.link_speed == ETH_LINK_SPEED_10) {
+            *current = NETDEV_F_10MB_FD;
+        }
+        if (link.link_speed == ETH_LINK_SPEED_100) {
+            *current = NETDEV_F_100MB_FD;
+        }
+        if (link.link_speed == ETH_LINK_SPEED_1000) {
+            *current = NETDEV_F_1GB_FD;
+        }
+        if (link.link_speed == ETH_LINK_SPEED_10000) {
+            *current = NETDEV_F_10GB_FD;
+        }
+    }
+
+    return 0;
+}
+
+static int
+netdev_dpdk_get_ifindex(const struct netdev *netdev)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int ifindex;
+
+    ovs_mutex_lock(&dev->mutex);
+    ifindex = dev->port_id;
+    ovs_mutex_unlock(&dev->mutex);
+
+    return ifindex;
+}
+
+static int
+netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+
+    rte_spinlock_lock(&dev->lsi_lock);
+    *carrier = dev->link.link_status;
+    rte_spinlock_unlock(&dev->lsi_lock);
+
+    return 0;
+}
+
+static long long int
+netdev_dpdk_get_carrier_resets(const struct netdev *netdev_)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+    long long int carrier_resets;
+
+    rte_spinlock_lock(&dev->lsi_lock);
+    carrier_resets = dev->link_reset_cnt;
+    rte_spinlock_unlock(&dev->lsi_lock);
+
+    return carrier_resets;
+}
+
+static int
+netdev_dpdk_set_miimon(struct netdev *netdev_ OVS_UNUSED,
+                       long long int interval OVS_UNUSED)
+{
+    return 0;
+}
+
+static int
+netdev_dpdk_update_flags__(struct netdev_dpdk *dev,
+                           enum netdev_flags off, enum netdev_flags on,
+                           enum netdev_flags *old_flagsp)
+    OVS_REQUIRES(dev->mutex)
+{
+    int err;
+
+    if ((off | on) & ~(NETDEV_UP | NETDEV_PROMISC)) {
+        return EINVAL;
+    }
+
+    *old_flagsp = dev->flags;
+    dev->flags |= on;
+    dev->flags &= ~off;
+
+    if (dev->flags == *old_flagsp) {
+        return 0;
+    }
+
+    rte_eth_dev_stop(dev->port_id);
+
+    if (dev->flags & NETDEV_UP) {
+        err = rte_eth_dev_start(dev->port_id);
+        if (err)
+            return err;
+    }
+
+    if (dev->flags & NETDEV_PROMISC) {
+        rte_eth_promiscuous_enable(dev->port_id);
+        rte_eth_allmulticast_enable(dev->port_id);
+    }
+
+    return 0;
+}
+
+static int
+netdev_dpdk_update_flags(struct netdev *netdev_,
+                         enum netdev_flags off, enum netdev_flags on,
+                         enum netdev_flags *old_flagsp)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    int error;
+
+    ovs_mutex_lock(&netdev->mutex);
+    error = netdev_dpdk_update_flags__(netdev, off, on, old_flagsp);
+    ovs_mutex_unlock(&netdev->mutex);
+
+    return error;
+}
+
+static int
+netdev_dpdk_get_status(const struct netdev *netdev_, struct smap *smap)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev_);
+    struct rte_eth_dev_info dev_info;
+
+    if (dev->port_id < 0)
+        return ENODEV;
+
+    ovs_mutex_lock(&dev->mutex);
+    rte_eth_dev_info_get(dev->port_id, &dev_info);
+    ovs_mutex_unlock(&dev->mutex);
+
+    smap_add_format(smap, "driver_name", "%s", dev_info.driver_name);
+    return 0;
+}
+
+

+/* Helper functions. */
+
+static void
+netdev_dpdk_set_admin_state__(struct netdev_dpdk *dev, bool admin_state)
+    OVS_REQUIRES(dev->mutex)
+{
+    enum netdev_flags old_flags;
+
+    if (admin_state) {
+        netdev_dpdk_update_flags__(dev, 0, NETDEV_UP, &old_flags);
+    } else {
+        netdev_dpdk_update_flags__(dev, NETDEV_UP, 0, &old_flags);
+    }
+}
+
+static void
+netdev_dpdk_set_admin_state(struct unixctl_conn *conn, int argc,
+                            const char *argv[], void *aux OVS_UNUSED)
+{
+    bool up;
+
+    if (!strcasecmp(argv[argc - 1], "up")) {
+        up = true;
+    } else if ( !strcasecmp(argv[argc - 1], "down")) {
+        up = false;
+    } else {
+        unixctl_command_reply_error(conn, "Invalid Admin State");
+        return;
+    }
+
+    if (argc > 2) {
+        struct netdev *netdev = netdev_from_name(argv[1]);
+        if (netdev && is_dpdk_class(netdev->netdev_class)) {
+            struct netdev_dpdk *dpdk_dev = netdev_dpdk_cast(netdev);
+
+            ovs_mutex_lock(&dpdk_dev->mutex);
+            netdev_dpdk_set_admin_state__(dpdk_dev, up);
+            ovs_mutex_unlock(&dpdk_dev->mutex);
+
+            netdev_close(netdev);
+        } else {
+            unixctl_command_reply_error(conn, "Unknown DPDK Interface");
+            netdev_close(netdev);
+            return;
+        }
+    } else {
+        struct netdev_dpdk *netdev;
+
+        ovs_mutex_lock(&dpdk_mutex);
+        LIST_FOR_EACH (netdev, list_node, &dpdk_list) {
+            ovs_mutex_lock(&netdev->mutex);
+            netdev_dpdk_set_admin_state__(netdev, up);
+            ovs_mutex_unlock(&netdev->mutex);
+        }
+        ovs_mutex_unlock(&dpdk_mutex);
+    }
+    unixctl_command_reply(conn, "OK");
+}
+
+static int
+dpdk_class_init(void)
+{
+    int result;
+
+    if (rte_eal_init_ret) {
+        return 0;
+    }
+
+    result = rte_pmd_init_all();
+    if (result) {
+        VLOG_ERR("Cannot init xnic PMD\n");
+        return result;
+    }
+
+    result = rte_eal_pci_probe();
+    if (result) {
+        VLOG_ERR("Cannot probe PCI\n");
+        return result;
+    }
+
+    if (rte_eth_dev_count() < 1) {
+        VLOG_ERR("No Ethernet devices found. Try assigning ports to UIO.\n");
+    }
+
+    VLOG_INFO("Ethernet Device Count: %d\n", (int)rte_eth_dev_count());
+
+    list_init(&dpdk_list);
+    list_init(&dpdk_mp_list);
+
+    unixctl_command_register("netdev-dpdk/set-admin-state",
+                             "[netdev] up|down", 1, 2,
+                             netdev_dpdk_set_admin_state, NULL);
+
+    return 0;
+}
+
+static void
+dpdk_class_setup_thread(int tid)
+{
+    cpu_set_t cpuset;
+    int err;
+
+    /* Setup thread for DPDK library. */
+    RTE_PER_LCORE(_lcore_id) = tid % NR_QUEUE;
+
+    CPU_ZERO(&cpuset);
+    CPU_SET(rte_lcore_id(), &cpuset);
+    err = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
+    if (err) {
+        VLOG_ERR("thread affinity error %d\n",err);
+    }
+}
+
+static struct netdev_class netdev_dpdk_class = {
+    "dpdk",
+    dpdk_class_init,            /* init */
+    NULL,                       /* netdev_dpdk_run */
+    NULL,                       /* netdev_dpdk_wait */
+
+    netdev_dpdk_alloc,
+    netdev_dpdk_construct,
+    netdev_dpdk_destruct,
+    netdev_dpdk_dealloc,
+    dpdk_class_setup_thread,
+    netdev_dpdk_get_config,
+    NULL,                       /* netdev_dpdk_set_config */
+    NULL,                       /* get_tunnel_config */
+
+    netdev_dpdk_send,           /* send */
+    NULL,                       /* send_wait */
+
+    netdev_dpdk_set_etheraddr,
+    netdev_dpdk_get_etheraddr,
+    netdev_dpdk_get_mtu,
+    netdev_dpdk_set_mtu,
+    netdev_dpdk_get_ifindex,
+    netdev_dpdk_get_carrier,
+    netdev_dpdk_get_carrier_resets,
+    netdev_dpdk_set_miimon,
+    netdev_dpdk_get_stats,
+    netdev_dpdk_set_stats,
+    netdev_dpdk_get_features,
+    NULL,                       /* set_advertisements */
+
+    NULL,                       /* set_policing */
+    NULL,                       /* get_qos_types */
+    NULL,                       /* get_qos_capabilities */
+    NULL,                       /* get_qos */
+    NULL,                       /* set_qos */
+    NULL,                       /* get_queue */
+    NULL,                       /* set_queue */
+    NULL,                       /* delete_queue */
+    NULL,                       /* get_queue_stats */
+    NULL,                       /* queue_dump_start */
+    NULL,                       /* queue_dump_next */
+    NULL,                       /* queue_dump_done */
+    NULL,                       /* dump_queue_stats */
+
+    NULL,                       /* get_in4 */
+    NULL,                       /* set_in4 */
+    NULL,                       /* get_in6 */
+    NULL,                       /* add_router */
+    NULL,                       /* get_next_hop */
+    netdev_dpdk_get_status,
+    NULL,                       /* arp_lookup */
+
+    netdev_dpdk_update_flags,
+
+    netdev_dpdk_rx_alloc,
+    netdev_dpdk_rx_construct,
+    netdev_dpdk_rx_destruct,
+    netdev_dpdk_rx_dealloc,
+    netdev_dpdk_rx_recv,
+    NULL,                       /* rx_wait */
+    netdev_dpdk_rx_drain,
+};
+
+int
+dpdk_init(int argc, char **argv)
+{
+    int result;
+
+    if (argc < 2 || strcmp(argv[1], "--dpdk"))
+        return 0;
+    argc--;
+    argv++;
+    /* Make sure things are initialized ... */
+    if ((result=rte_eal_init(argc, argv)) < 0)
+        rte_panic("Cannot init EAL\n");
+    rte_memzone_dump();
+    rte_eal_init_ret = 0;
+    return result;
+}
+
+void
+netdev_dpdk_register(void)
+{
+    netdev_register_provider(&netdev_dpdk_class);
+}
diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
new file mode 100644
index 0000000..5cf5626
--- /dev/null
+++ b/lib/netdev-dpdk.h
@@ -0,0 +1,7 @@
+#ifndef __NETDEV_DPDK_H__
+#define __NETDEV_DPDK_H__
+
+int dpdk_init(int argc, char **argv);
+void netdev_dpdk_register(void);
+
+#endif
diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c
index 0f93363..2cb3c9b 100644
--- a/lib/netdev-dummy.c
+++ b/lib/netdev-dummy.c
@@ -103,6 +103,7 @@ struct netdev_dummy {
     FILE *tx_pcap, *rx_pcap OVS_GUARDED;

     struct list rxes OVS_GUARDED; /* List of child "netdev_rx_dummy"s. */
+    struct ofpbuf buffer;
 };

 /* Max 'recv_queue_len' in struct netdev_dummy. */
@@ -695,7 +696,7 @@ netdev_dummy_set_config(struct netdev *netdev_, const struct smap *args)
 }

 static struct netdev_rx *
-netdev_dummy_rx_alloc(void)
+netdev_dummy_rx_alloc(int id OVS_UNUSED)
 {
     struct netdev_rx_dummy *rx = xzalloc(sizeof *rx);
     return &rx->up;
@@ -739,12 +740,12 @@ netdev_dummy_rx_dealloc(struct netdev_rx *rx_)
 }

 static int
-netdev_dummy_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer)
+netdev_dummy_rx_recv(struct netdev_rx *rx_, struct ofpbuf **rpacket, int *c)
 {
     struct netdev_rx_dummy *rx = netdev_rx_dummy_cast(rx_);
     struct netdev_dummy *netdev = netdev_dummy_cast(rx->up.netdev);
+    struct ofpbuf *buffer = &netdev->buffer;
     struct ofpbuf *packet;
-    int retval;

     ovs_mutex_lock(&netdev->mutex);
     if (!list_is_empty(&rx->recv_queue)) {
@@ -758,22 +759,19 @@ netdev_dummy_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer)
     if (!packet) {
         return EAGAIN;
     }
+    ovs_mutex_lock(&netdev->mutex);
+    netdev->stats.rx_packets++;
+    netdev->stats.rx_bytes += packet->size;
+    ovs_mutex_unlock(&netdev->mutex);

-    if (packet->size <= ofpbuf_tailroom(buffer)) {
-        memcpy(buffer->data, packet->data, packet->size);
-        buffer->size += packet->size;
-        retval = 0;
-
-        ovs_mutex_lock(&netdev->mutex);
-        netdev->stats.rx_packets++;
-        netdev->stats.rx_bytes += packet->size;
-        ovs_mutex_unlock(&netdev->mutex);
-    } else {
-        retval = EMSGSIZE;
-    }
-    ofpbuf_delete(packet);
+    ofpbuf_clear(buffer);
+    ofpbuf_reserve_with_tailroom(buffer, DP_NETDEV_HEADROOM, packet->size);
+    memcpy(buffer->data, packet->data, packet->size);

-    return retval;
+    packet_set_size(packet, packet->size);
+    *rpacket = packet;
+    *c = 1;
+    return 0;
 }

 static void
@@ -809,9 +807,12 @@ netdev_dummy_rx_drain(struct netdev_rx *rx_)
 }

 static int
-netdev_dummy_send(struct netdev *netdev, const void *buffer, size_t size)
+netdev_dummy_send(struct netdev *netdev,
+                  struct ofpbuf *pkt, bool may_steal OVS_UNUSED)
 {
     struct netdev_dummy *dev = netdev_dummy_cast(netdev);
+    const void *buffer = pkt->data;
+    size_t size = pkt->size;

     if (size < ETH_HEADER_LEN) {
         return EMSGSIZE;
@@ -987,6 +988,7 @@ static const struct netdev_class dummy_class = {
     netdev_dummy_construct,
     netdev_dummy_destruct,
     netdev_dummy_dealloc,
+    NULL,                       /* setup_thread */
     netdev_dummy_get_config,
     netdev_dummy_set_config,
     NULL,                       /* get_tunnel_config */
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index e756d88..73ba2c2 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -426,6 +426,7 @@ struct netdev_linux {

 struct netdev_rx_linux {
     struct netdev_rx up;
+    struct ofpbuf pkt;
     bool is_tap;
     int fd;
 };
@@ -462,6 +463,7 @@ static int af_packet_sock(void);
 static bool netdev_linux_miimon_enabled(void);
 static void netdev_linux_miimon_run(void);
 static void netdev_linux_miimon_wait(void);
+static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);

 static bool
 is_netdev_linux_class(const struct netdev_class *netdev_class)
@@ -773,7 +775,7 @@ netdev_linux_dealloc(struct netdev *netdev_)
 }

 static struct netdev_rx *
-netdev_linux_rx_alloc(void)
+netdev_linux_rx_alloc(int id OVS_UNUSED)
 {
     struct netdev_rx_linux *rx = xzalloc(sizeof *rx);
     return &rx->up;
@@ -985,10 +987,24 @@ netdev_linux_rx_recv_tap(int fd, struct ofpbuf *buffer)
 }

 static int
-netdev_linux_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer)
+netdev_linux_rx_recv(struct netdev_rx *rx_, struct ofpbuf **rpacket, int *c)
 {
     struct netdev_rx_linux *rx = netdev_rx_linux_cast(rx_);
-    int retval;
+    struct netdev *netdev = rx->up.netdev;
+    struct ofpbuf *buffer;
+    ssize_t retval;
+    int mtu;
+    int buf_size;
+
+    if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
+        mtu = ETH_PAYLOAD_MAX;
+    }
+    buf_size = DP_NETDEV_HEADROOM + VLAN_ETH_HEADER_LEN + mtu;
+
+    buffer = &rx->pkt;
+    ofpbuf_clear(buffer);
+
+    ofpbuf_reserve_with_tailroom(buffer, DP_NETDEV_HEADROOM, buf_size);

     retval = (rx->is_tap
               ? netdev_linux_rx_recv_tap(rx->fd, buffer)
@@ -996,8 +1012,11 @@ netdev_linux_rx_recv(struct netdev_rx *rx_, struct ofpbuf *buffer)
     if (retval && retval != EAGAIN && retval != EMSGSIZE) {
         VLOG_WARN_RL(&rl, "error receiving Ethernet packet on %s: %s",
                      ovs_strerror(errno), netdev_rx_get_name(rx_));
+    } else {
+        packet_set_size(buffer, buffer->size);
+        *rpacket = buffer;
+        *c = 1;
     }
-
     return retval;
 }

@@ -1036,8 +1055,11 @@ netdev_linux_rx_drain(struct netdev_rx *rx_)
  * The kernel maintains a packet transmission queue, so the caller is not
  * expected to do additional queuing of packets. */
 static int
-netdev_linux_send(struct netdev *netdev_, const void *data, size_t size)
+netdev_linux_send(struct netdev *netdev_, struct ofpbuf *pkt, bool may_steal OVS_UNUSED)
 {
+    const void *data = pkt->data;
+    size_t size = pkt->size;
+
     for (;;) {
         ssize_t retval;

@@ -2677,6 +2699,7 @@ netdev_linux_update_flags(struct netdev *netdev_, enum netdev_flags off,
     CONSTRUCT,                                                  \
     netdev_linux_destruct,                                      \
     netdev_linux_dealloc,                                       \
+    NULL,                       /* setup_thread */              \
     NULL,                       /* get_config */                \
     NULL,                       /* set_config */                \
     NULL,                       /* get_tunnel_config */         \
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index 673d3ab..0c9f347 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -33,11 +33,11 @@ extern "C" {
  * Network device implementations may read these members but should not modify
  * them. */
 struct netdev {
+    int nr_rx;
     /* The following do not change during the lifetime of a struct netdev. */
     char *name;                         /* Name of network device. */
     const struct netdev_class *netdev_class; /* Functions to control
                                                 this device. */
-
     /* The following are protected by 'netdev_mutex' (internal to netdev.c). */
     int ref_cnt;                        /* Times this devices was opened. */
     struct shash_node *node;            /* Pointer to element in global map. */
@@ -203,6 +203,10 @@ struct netdev_class {
     void (*destruct)(struct netdev *);
     void (*dealloc)(struct netdev *);

+    /* Some platforms need to set up per-thread state before a thread
+     * uses the netdev (e.g. the DPDK lcore id and CPU affinity). */
+    void (*setup_thread)(int thread_id);
+
     /* Fetches the device 'netdev''s configuration, storing it in 'args'.
      * The caller owns 'args' and pre-initializes it to an empty smap.
      *
@@ -241,7 +245,7 @@ struct netdev_class {
      * network device from being usefully used by the netdev-based "userspace
      * datapath".  It will also prevent the OVS implementation of bonding from
      * working properly over 'netdev'.) */
-    int (*send)(struct netdev *netdev, const void *buffer, size_t size);
+    int (*send)(struct netdev *, struct ofpbuf *buffer, bool may_steal);

     /* Registers with the poll loop to wake up from the next call to
      * poll_block() when the packet transmission queue for 'netdev' has
@@ -629,7 +633,7 @@ struct netdev_class {

     /* Life-cycle functions for a netdev_rx.  See the large comment above on
      * struct netdev_class. */
-    struct netdev_rx *(*rx_alloc)(void);
+    struct netdev_rx *(*rx_alloc)(int id);
     int (*rx_construct)(struct netdev_rx *);
     void (*rx_destruct)(struct netdev_rx *);
     void (*rx_dealloc)(struct netdev_rx *);
@@ -655,7 +659,7 @@ struct netdev_class {
      *
      * This function may be set to null if it would always return EOPNOTSUPP
      * anyhow. */
-    int (*rx_recv)(struct netdev_rx *rx, struct ofpbuf *buffer);
+    int (*rx_recv)(struct netdev_rx *rx, struct ofpbuf **pkt, int *c);

     /* Registers with the poll loop to wake up from the next call to
      * poll_block() when a packet is ready to be received with netdev_rx_recv()
@@ -672,6 +676,7 @@ int netdev_unregister_provider(const char *type);
 extern const struct netdev_class netdev_linux_class;
 extern const struct netdev_class netdev_internal_class;
 extern const struct netdev_class netdev_tap_class;
+extern const struct netdev_class netdev_pdk_class;
 #if defined(__FreeBSD__) || defined(__NetBSD__)
 extern const struct netdev_class netdev_bsd_class;
 #endif
diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
index 165c1c6..ad9d2a5 100644
--- a/lib/netdev-vport.c
+++ b/lib/netdev-vport.c
@@ -686,6 +686,7 @@ get_stats(const struct netdev *netdev, struct netdev_stats *stats)
     netdev_vport_construct,                                 \
     netdev_vport_destruct,                                  \
     netdev_vport_dealloc,                                   \
+    NULL,                       /* setup_thread */          \
     GET_CONFIG,                                             \
     SET_CONFIG,                                             \
     GET_TUNNEL_CONFIG,                                      \
diff --git a/lib/netdev.c b/lib/netdev.c
index 8e62421..f688c5c 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -91,6 +91,11 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 static void restore_all_flags(void *aux OVS_UNUSED);
 void update_device_args(struct netdev *, const struct shash *args);

+int netdev_nr_rx(const struct netdev *netdev)
+{
+    return netdev->nr_rx;
+}
+
 static void
 netdev_initialize(void)
     OVS_EXCLUDED(netdev_class_rwlock, netdev_mutex)
@@ -107,6 +112,9 @@ netdev_initialize(void)
         netdev_register_provider(&netdev_tap_class);
         netdev_vport_tunnel_register();
 #endif
+#ifdef DPDK_NETDEV
+        netdev_dpdk_register();
+#endif
 #if defined(__FreeBSD__) || defined(__NetBSD__)
         netdev_register_provider(&netdev_tap_class);
         netdev_register_provider(&netdev_bsd_class);
@@ -326,6 +334,7 @@ netdev_open(const char *name, const char *type, struct netdev **netdevp)
                 memset(netdev, 0, sizeof *netdev);
                 netdev->netdev_class = rc->class;
                 netdev->name = xstrdup(name);
+                netdev->nr_rx = 1;
                 netdev->node = shash_add(&netdev_shash, name, netdev);
                 list_init(&netdev->saved_flags_list);

@@ -481,6 +490,20 @@ netdev_close(struct netdev *netdev)
     }
 }

+void
+netdev_setup_thread(int id)
+{
+    struct netdev_registered_class *rc;
+
+    ovs_rwlock_rdlock(&netdev_class_rwlock);
+    HMAP_FOR_EACH (rc, hmap_node, &netdev_classes) {
+        if (rc->class->setup_thread) {
+            rc->class->setup_thread(id);
+        }
+    }
+    ovs_rwlock_unlock(&netdev_class_rwlock);
+}
+
 /* Parses 'netdev_name_', which is of the form [type@]name into its component
  * pieces.  'name' and 'type' must be freed by the caller. */
 void
@@ -508,13 +531,13 @@ netdev_parse_name(const char *netdev_name_, char **name, char **type)
  * Some kinds of network devices might not support receiving packets.  This
  * function returns EOPNOTSUPP in that case.*/
 int
-netdev_rx_open(struct netdev *netdev, struct netdev_rx **rxp)
+netdev_rx_open(struct netdev *netdev, struct netdev_rx **rxp, int id)
     OVS_EXCLUDED(netdev_mutex)
 {
     int error;

     if (netdev->netdev_class->rx_alloc) {
-        struct netdev_rx *rx = netdev->netdev_class->rx_alloc();
+        struct netdev_rx *rx = netdev->netdev_class->rx_alloc(id);
         if (rx) {
             rx->netdev = netdev;
             error = netdev->netdev_class->rx_construct(rx);
@@ -575,23 +598,18 @@ netdev_rx_close(struct netdev_rx *rx)
  * This function may be set to null if it would always return EOPNOTSUPP
  * anyhow. */
 int
-netdev_rx_recv(struct netdev_rx *rx, struct ofpbuf *buffer)
+netdev_rx_recv(struct netdev_rx *rx, struct ofpbuf **buffer, int *c)
 {
     int retval;

-    ovs_assert(buffer->size == 0);
-    ovs_assert(ofpbuf_tailroom(buffer) >= ETH_TOTAL_MIN);
+    retval = rx->netdev->netdev_class->rx_recv(rx, buffer, c);
+    return retval;
+}

-    retval = rx->netdev->netdev_class->rx_recv(rx, buffer);
-    if (!retval) {
-        COVERAGE_INC(netdev_received);
-        if (buffer->size < ETH_TOTAL_MIN) {
-            ofpbuf_put_zeros(buffer, ETH_TOTAL_MIN - buffer->size);
-        }
-        return 0;
-    } else {
-        return retval;
-    }
+bool
+netdev_is_pmd(const struct netdev *netdev)
+{
+    return !strcmp(netdev->netdev_class->type, "dpdk");
 }

 /* Arranges for poll_block() to wake up when a packet is ready to be received
@@ -624,12 +642,12 @@ netdev_rx_drain(struct netdev_rx *rx)
  * Some network devices may not implement support for this function.  In such
  * cases this function will always return EOPNOTSUPP. */
 int
-netdev_send(struct netdev *netdev, const struct ofpbuf *buffer)
+netdev_send(struct netdev *netdev, struct ofpbuf *buffer, bool may_steal)
 {
     int error;

     error = (netdev->netdev_class->send
-             ? netdev->netdev_class->send(netdev, buffer->data, buffer->size)
+             ? netdev->netdev_class->send(netdev, buffer, may_steal)
              : EOPNOTSUPP);
     if (!error) {
         COVERAGE_INC(netdev_sent);
diff --git a/lib/netdev.h b/lib/netdev.h
index 410c35b..d5a7793 100644
--- a/lib/netdev.h
+++ b/lib/netdev.h
@@ -21,6 +21,7 @@
 #include <stddef.h>
 #include <stdint.h>
 #include "openvswitch/types.h"
+#include "packets.h"

 #ifdef  __cplusplus
 extern "C" {
@@ -138,6 +139,7 @@ bool netdev_is_reserved_name(const char *name);
 int netdev_open(const char *name, const char *type, struct netdev **);
 struct netdev *netdev_ref(const struct netdev *);
 void netdev_close(struct netdev *);
+void netdev_setup_thread(int id);

 void netdev_parse_name(const char *netdev_name, char **name, char **type);

@@ -156,17 +158,18 @@ int netdev_set_mtu(const struct netdev *, int mtu);
 int netdev_get_ifindex(const struct netdev *);

 /* Packet reception. */
-int netdev_rx_open(struct netdev *, struct netdev_rx **);
+int netdev_rx_open(struct netdev *, struct netdev_rx **, int id);
 void netdev_rx_close(struct netdev_rx *);

 const char *netdev_rx_get_name(const struct netdev_rx *);

-int netdev_rx_recv(struct netdev_rx *, struct ofpbuf *);
+bool netdev_is_pmd(const struct netdev *netdev);
+int netdev_rx_recv(struct netdev_rx *, struct ofpbuf **, int *);
 void netdev_rx_wait(struct netdev_rx *);
 int netdev_rx_drain(struct netdev_rx *);
-
+int netdev_nr_rx(const struct netdev *netdev);
 /* Packet transmission. */
-int netdev_send(struct netdev *, const struct ofpbuf *);
+int netdev_send(struct netdev *, struct ofpbuf *, bool may_steal);
 void netdev_send_wait(struct netdev *);

 /* Hardware address. */
@@ -198,6 +201,10 @@ enum netdev_features {
     NETDEV_F_PAUSE_ASYM = 1 << 15, /* Asymmetric pause. */
 };

+/* Enough headroom to add a vlan tag, plus an extra 2 bytes to allow IP
+ * headers to be aligned on a 4-byte boundary.  */
+enum { DP_NETDEV_HEADROOM = 2 + VLAN_HEADER_LEN };
+
 int netdev_get_features(const struct netdev *,
                         enum netdev_features *current,
                         enum netdev_features *advertised,
diff --git a/lib/ofpbuf.c b/lib/ofpbuf.c
index 0eed428..249fbaa 100644
--- a/lib/ofpbuf.c
+++ b/lib/ofpbuf.c
@@ -265,6 +265,9 @@ ofpbuf_resize__(struct ofpbuf *b, size_t new_headroom, size_t new_tailroom)
     new_allocated = new_headroom + b->size + new_tailroom;

     switch (b->source) {
+    case OFPBUF_DPDK:
+        OVS_NOT_REACHED();
+
     case OFPBUF_MALLOC:
         if (new_headroom == ofpbuf_headroom(b)) {
             new_base = xrealloc(b->base, new_allocated);
@@ -343,7 +346,7 @@ ofpbuf_prealloc_headroom(struct ofpbuf *b, size_t size)
 void
 ofpbuf_trim(struct ofpbuf *b)
 {
-    if (b->source == OFPBUF_MALLOC
+    if ((b->source == OFPBUF_MALLOC || b->source == OFPBUF_DPDK)
         && (ofpbuf_headroom(b) || ofpbuf_tailroom(b))) {
         ofpbuf_resize__(b, 0, 0);
     }
@@ -562,6 +565,8 @@ void *
 ofpbuf_steal_data(struct ofpbuf *b)
 {
     void *p;
+    ovs_assert(b->source != OFPBUF_DPDK);
+
     if (b->source == OFPBUF_MALLOC && b->data == b->base) {
         p = b->data;
     } else {
diff --git a/lib/ofpbuf.h b/lib/ofpbuf.h
index 7407d8b..1f7f276 100644
--- a/lib/ofpbuf.h
+++ b/lib/ofpbuf.h
@@ -20,6 +20,7 @@
 #include <stddef.h>
 #include <stdint.h>
 #include "list.h"
+#include "packets.h"
 #include "util.h"

 #ifdef  __cplusplus
@@ -29,18 +30,18 @@ extern "C" {
 enum ofpbuf_source {
     OFPBUF_MALLOC,              /* Obtained via malloc(). */
     OFPBUF_STACK,               /* Un-movable stack space or static buffer. */
-    OFPBUF_STUB                 /* Starts on stack, may expand into heap. */
+    OFPBUF_STUB,                /* Starts on stack, may expand into heap. */
+    OFPBUF_DPDK,                /* Data is backed by a DPDK rte_mbuf. */
 };

 /* Buffer for holding arbitrary data.  An ofpbuf is automatically reallocated
  * as necessary if it grows too large for the available memory. */
 struct ofpbuf {
     void *base;                 /* First byte of allocated space. */
-    size_t allocated;           /* Number of bytes allocated. */
-    enum ofpbuf_source source;  /* Source of memory allocated as 'base'. */
-
     void *data;                 /* First byte actually in use. */
+    void *private_p;            /* Private pointer for use by owner. */
     size_t size;                /* Number of bytes in use. */
+    size_t allocated;           /* Number of bytes allocated. */

     void *l2;                   /* Link-level header. */
     void *l2_5;                 /* MPLS label stack */
@@ -49,10 +50,10 @@ struct ofpbuf {
     void *l7;                   /* Application data. */

     struct list list_node;      /* Private list element for use by owner. */
-    void *private_p;            /* Private pointer for use by owner. */
+    enum ofpbuf_source source;  /* Source of memory allocated as 'base'. */
 };

-void ofpbuf_use(struct ofpbuf *, void *, size_t);
+void ofpbuf_use_same(struct ofpbuf *b, void *base, size_t allocated);
 void ofpbuf_use_stack(struct ofpbuf *, void *, size_t);
 void ofpbuf_use_stub(struct ofpbuf *, void *, size_t);
 void ofpbuf_use_const(struct ofpbuf *, const void *, size_t);
diff --git a/lib/packets.c b/lib/packets.c
index 0d63841..525c084 100644
--- a/lib/packets.c
+++ b/lib/packets.c
@@ -990,3 +990,12 @@ packet_format_tcp_flags(struct ds *s, uint16_t tcp_flags)
         ds_put_cstr(s, "[800]");
     }
 }
+
+void
+packet_set_size(struct ofpbuf *b, int size)
+{
+    b->size = size;
+    if (b->size < ETH_TOTAL_MIN) {
+        ofpbuf_put_zeros(b, ETH_TOTAL_MIN - b->size);
+    }
+}
diff --git a/lib/packets.h b/lib/packets.h
index 8e21fa8..dcf3c3d 100644
--- a/lib/packets.h
+++ b/lib/packets.h
@@ -656,4 +656,5 @@ uint16_t packet_get_tcp_flags(const struct ofpbuf *, const struct flow *);
 void packet_format_tcp_flags(struct ds *, uint16_t);
 const char *packet_tcp_flag_to_string(uint32_t flag);

+void packet_set_size(struct ofpbuf *b, int size);
 #endif /* packets.h */
diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
index 990e58f..9bedd6c 100644
--- a/vswitchd/ovs-vswitchd.c
+++ b/vswitchd/ovs-vswitchd.c
@@ -49,6 +49,7 @@
 #include "vconn.h"
 #include "vlog.h"
 #include "lib/vswitch-idl.h"
+#include "lib/netdev-dpdk.h"

 VLOG_DEFINE_THIS_MODULE(vswitchd);

@@ -71,6 +72,12 @@ main(int argc, char *argv[])
     bool exiting;
     int retval;

+#ifdef DPDK_NETDEV
+    retval = dpdk_init(argc, argv);
+    argc -= retval;
+    argv += retval;
+#endif
+
     proctitle_init(argc, argv);
     set_program_name(argv[0]);
     remote = parse_options(argc, argv, &unixctl_path);
@@ -145,7 +152,8 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         OPT_BOOTSTRAP_CA_CERT,
         OPT_ENABLE_DUMMY,
         OPT_DISABLE_SYSTEM,
-        DAEMON_OPTION_ENUMS
+        DAEMON_OPTION_ENUMS,
+        OPT_DPDK,
     };
     static const struct option long_options[] = {
         {"help",        no_argument, NULL, 'h'},
@@ -159,6 +167,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         {"bootstrap-ca-cert", required_argument, NULL, OPT_BOOTSTRAP_CA_CERT},
         {"enable-dummy", optional_argument, NULL, OPT_ENABLE_DUMMY},
         {"disable-system", no_argument, NULL, OPT_DISABLE_SYSTEM},
+        {"dpdk", required_argument, NULL, OPT_DPDK},
         {NULL, 0, NULL, 0},
     };
     char *short_options = long_options_to_short_options(long_options);
@@ -210,6 +219,9 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         case '?':
             exit(EXIT_FAILURE);

+        case OPT_DPDK:
+            break;
+
         default:
             abort();
         }
--
1.7.9.5






^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-28  1:48 [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports pshelar
                   ` (2 preceding siblings ...)
  2014-01-29  8:56 ` [dpdk-dev] " Prashant Upadhyaya
@ 2014-01-29 10:01 ` Thomas Graf
  2014-01-29 21:49   ` Pravin Shelar
  3 siblings, 1 reply; 23+ messages in thread
From: Thomas Graf @ 2014-01-29 10:01 UTC (permalink / raw)
  To: pshelar, dev, dev, dpdk-ovs; +Cc: Gerald Rogers

On 01/28/2014 02:48 AM, pshelar@nicira.com wrote:
> From: Pravin B Shelar <pshelar@nicira.com>
>
> Following patch adds DPDK netdev-class to userspace datapath.
> Approach taken in this patch differs from Intel® DPDK vSwitch
> where DPDK datapath switching is done in saparate process.  This
> patch adds support for DPDK type port and uses OVS userspace
> datapath for switching.  Therefore all DPDK processing and flow
> miss handling is done in single process.  This also avoids code
> duplication by reusing OVS userspace datapath switching and
> therefore it supports all flow matching and actions that
> user-space datapath supports.  Refer to INSTALL.DPDK doc for
> further info.
>
> With this patch I got similar performance for netperf TCP_STREAM
> tests compared to kernel datapath.
>
> This is based a patch from Gerald Rogers.
>
> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
> CC: "Gerald Rogers" <gerald.rogers@intel.com>

Pravin,

Some initial comments below. I will provide more after deeper
digging.

Do you have any ideas on how to implement the TX batching yet?

> +
> +static int
> +netdev_dpdk_rx_drain(struct netdev_rx *rx_)
> +{
> +    struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
> +    int pending;
> +    int i;
> +
> +    pending = rx->ofpbuf_cnt;
> +    if (pending) {

This conditional seems unneeded.

> +        for (i = 0; i < pending; i++) {
> +             build_ofpbuf(rx, &rx->ofpbuf[i], NULL);
> +        }
> +        rx->ofpbuf_cnt = 0;
> +        return 0;
> +    }
> +
> +    return 0;
> +}
> +
> +/* Tx function. Transmit packets indefinitely */
> +static int
> +dpdk_do_tx_copy(struct netdev *netdev, char *buf, int size)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    struct rte_mbuf *pkt;
> +    uint32_t nb_tx = 0;
> +
> +    pkt = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> +    if (!pkt) {
> +        return 0;

Silent drop? ;-) Shouldn't these drops be accounted for somehow?
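
To illustrate what accounting for these drops could look like, here is a
minimal sketch in plain C with no DPDK calls. It assumes a per-device
counter that a stats callback could report later. struct fake_dev,
fake_alloc() and fake_tx_copy() are hypothetical stand-ins and not names
from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct netdev_dpdk; only the fields needed here. */
struct fake_dev {
    uint64_t tx_packets;   /* Packets handed to the (imaginary) NIC. */
    uint64_t tx_dropped;   /* Packets dropped before reaching the NIC. */
    int pool_free;         /* Buffers left in the imaginary pool. */
};

/* Stand-in for the buffer allocator: fails once the pool is exhausted. */
static void *
fake_alloc(struct fake_dev *dev, size_t size)
{
    if (dev->pool_free <= 0) {
        return NULL;
    }
    dev->pool_free--;
    return malloc(size);
}

/* Copies 'size' bytes of 'buf' into a fresh buffer for transmission.
 * Returns 1 if the packet was sent, 0 if it was dropped (and counted). */
int
fake_tx_copy(struct fake_dev *dev, const char *buf, size_t size)
{
    void *pkt;

    assert(dev != NULL && buf != NULL);
    pkt = fake_alloc(dev, size);
    if (!pkt) {
        dev->tx_dropped++;      /* The drop is no longer silent. */
        return 0;
    }
    memcpy(pkt, buf, size);
    dev->tx_packets++;
    free(pkt);                  /* A real driver would queue it instead. */
    return 1;
}
```

In the patch itself the counter would presumably live in struct
netdev_dpdk and be folded into whatever get_stats reports, but that is
an assumption on my side.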

> +    }
> +
> +    /* We have to do a copy for now */
> +    memcpy(pkt->pkt.data, buf, size);
> +
> +    rte_pktmbuf_data_len(pkt) = size;
> +    rte_pktmbuf_pkt_len(pkt) = size;
> +
> +    rte_spinlock_lock(&dev->tx_lock);

What is the purpose of tx_lock here? Are multiple threads writing to
the same queue? The lock is not acquired for the zero-copy path below.

> +    nb_tx = rte_eth_tx_burst(dev->port_id, NR_QUEUE, &pkt, 1);
> +    rte_spinlock_unlock(&dev->tx_lock);
> +
> +    if (nb_tx != 1) {
> +        /* free buffers if we couldn't transmit packets */
> +        rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **)&pkt, 1);
> +    }
> +    return nb_tx;
> +}
> +
> +static int
> +netdev_dpdk_send(struct netdev *netdev,
> +                 struct ofpbuf *ofpbuf, bool may_steal)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +
> +    if (ofpbuf->size > dev->max_packet_len) {
> +        VLOG_ERR("2big size %d max_packet_len %d",
> +                  (int)ofpbuf->size , dev->max_packet_len);

This should probably be rate limited, e.g. with VLOG_RATE_LIMIT_INIT
and VLOG_ERR_RL.

> +        return E2BIG;
> +    }
> +
> +    rte_prefetch0(&ofpbuf->private_p);
> +    if (!may_steal ||
> +        !ofpbuf->private_p || ofpbuf->source != OFPBUF_DPDK) {
> +        dpdk_do_tx_copy(netdev, (char *) ofpbuf->data, ofpbuf->size);
> +    } else {
> +        struct rte_mbuf *pkt;
> +        uint32_t nb_tx;
> +        int qid;
> +
> +        pkt = ofpbuf->private_p;
> +        ofpbuf->private_p = NULL;
> +        rte_pktmbuf_data_len(pkt) = ofpbuf->size;
> +        rte_pktmbuf_pkt_len(pkt) = ofpbuf->size;
> +
> +        /* TODO: TX batching. */
> +        qid = rte_lcore_id() % NR_QUEUE;
> +        nb_tx = rte_eth_tx_burst(dev->port_id, qid, &pkt, 1);
> +        if (nb_tx != 1) {
> +            struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +
> +            rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **)&pkt, 1);
> +            VLOG_ERR("TX error, zero packets sent");

Same here

> +       }
> +    }
> +    return 0;
> +}

> +static int
> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    int old_mtu, err;
> +    struct dpdk_mp *old_mp;
> +    struct dpdk_mp *mp;
> +
> +    ovs_mutex_lock(&dpdk_mutex);
> +    ovs_mutex_lock(&dev->mutex);
> +    if (dev->mtu == mtu) {
> +        err = 0;
> +        goto out;
> +    }
> +
> +    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
> +    if (!mp) {
> +        err = ENOMEM;
> +        goto out;
> +    }
> +
> +    rte_eth_dev_stop(dev->port_id);
> +
> +    old_mtu = dev->mtu;
> +    old_mp = dev->dpdk_mp;
> +    dev->dpdk_mp = mp;
> +    dev->mtu = mtu;
> +    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> +
> +    err = dpdk_eth_dev_init(dev);
> +    if (err) {
> +
> +        dpdk_mp_put(mp);
> +        dev->mtu = old_mtu;
> +        dev->dpdk_mp = old_mp;
> +        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> +        dpdk_eth_dev_init(dev);

It would be nice if we didn't need these constructs and DPDK
provided an all-or-nothing init method.

> +        goto out;
> +    }
> +
> +    dpdk_mp_put(old_mp);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    ovs_mutex_unlock(&dpdk_mutex);
> +    return err;
> +}
> +

> +static int
> +netdev_dpdk_update_flags__(struct netdev_dpdk *dev,
> +                           enum netdev_flags off, enum netdev_flags on,
> +                           enum netdev_flags *old_flagsp)
> +    OVS_REQUIRES(dev->mutex)
> +{
> +    int err;
> +
> +    if ((off | on) & ~(NETDEV_UP | NETDEV_PROMISC)) {
> +        return EINVAL;
> +    }
> +
> +    *old_flagsp = dev->flags;
> +    dev->flags |= on;
> +    dev->flags &= ~off;
> +
> +    if (dev->flags == *old_flagsp) {
> +        return 0;
> +    }
> +
> +    rte_eth_dev_stop(dev->port_id);
> +
> +    if (dev->flags & NETDEV_UP) {
> +        err = rte_eth_dev_start(dev->port_id);
> +        if (err)
> +            return err;
> +    }

I'm not a DPDK expert but is it required to restart the device
to change promisc settings or could we conditionally start and
stop based on the previous flags state?

> +
> +    if (dev->flags & NETDEV_PROMISC) {
> +        rte_eth_promiscuous_enable(dev->port_id);
> +        rte_eth_allmulticast_enable(dev->port_id);
> +    }
> +
> +    return 0;
> +}
>
> +
> +static void
> +netdev_dpdk_set_admin_state(struct unixctl_conn *conn, int argc,
> +                            const char *argv[], void *aux OVS_UNUSED)
> +{
> +    bool up;
> +
> +    if (!strcasecmp(argv[argc - 1], "up")) {
> +        up = true;
> +    } else if ( !strcasecmp(argv[argc - 1], "down")) {
> +        up = false;
> +    } else {
> +        unixctl_command_reply_error(conn, "Invalid Admin State");
> +        return;
> +    }
> +
> +    if (argc > 2) {
> +        struct netdev *netdev = netdev_from_name(argv[1]);

For future refinement: usability would improve if either exactly one
interface argument were enforced or multiple interface names could be
passed in, e.g. set-admin-state dpdk0 dpdk1 up
or set-admin-state dpdk0 up dpdk1 up

As of now, dpdk1 is silently ignored which is not nice.
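
A sketch of the first form (several names followed by one state),
assuming the state keyword is always the last argument.
parse_admin_state_args() is a hypothetical helper, not part of the
patch; it only validates the arguments and reports how many interface
names precede the state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>

/* Parses "cmd [NAME...] up|down".  On success stores the requested state in
 * '*up' and returns the number of interface names, found in
 * argv[1] .. argv[argc - 2]; zero names means "apply to all devices".
 * Returns -1 if the state keyword is missing or invalid. */
int
parse_admin_state_args(int argc, const char *argv[], bool *up)
{
    const char *state;

    assert(argv != NULL && up != NULL);
    if (argc < 2) {
        return -1;
    }

    state = argv[argc - 1];
    if (!strcasecmp(state, "up")) {
        *up = true;
    } else if (!strcasecmp(state, "down")) {
        *up = false;
    } else {
        return -1;
    }
    return argc - 2;
}
```

The caller would then loop over the returned number of names and reply
with an error for the first one that is not a DPDK device, so that no
name is dropped silently.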

> +        if (netdev && is_dpdk_class(netdev->netdev_class)) {
> +            struct netdev_dpdk *dpdk_dev = netdev_dpdk_cast(netdev);
> +
> +            ovs_mutex_lock(&dpdk_dev->mutex);
> +            netdev_dpdk_set_admin_state__(dpdk_dev, up);
> +            ovs_mutex_unlock(&dpdk_dev->mutex);
> +
> +            netdev_close(netdev);
> +        } else {
> +            unixctl_command_reply_error(conn, "Unknown Dummy Interface");

I think this should read "Not a DPDK Interface" or something similar.


> +            netdev_close(netdev);
> +            return;
> +        }
> +    } else {
> +        struct netdev_dpdk *netdev;
> +
> +        ovs_mutex_lock(&dpdk_mutex);
> +        LIST_FOR_EACH (netdev, list_node, &dpdk_list) {
> +            ovs_mutex_lock(&netdev->mutex);
> +            netdev_dpdk_set_admin_state__(netdev, up);
> +            ovs_mutex_unlock(&netdev->mutex);
> +        }
> +        ovs_mutex_unlock(&dpdk_mutex);
> +    }
> +    unixctl_command_reply(conn, "OK");
> +}
> +
> +
> -    retval = rx->netdev->netdev_class->rx_recv(rx, buffer);
> -    if (!retval) {
> -        COVERAGE_INC(netdev_received);

Are you removing the netdev_received counter on purpose here?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29  8:15     ` Thomas Graf
@ 2014-01-29 10:26       ` Vincent JARDIN
  2014-01-29 11:14         ` Thomas Graf
  0 siblings, 1 reply; 23+ messages in thread
From: Vincent JARDIN @ 2014-01-29 10:26 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs

Hi Thomas,

On 29/01/2014 09:15, Thomas Graf wrote:

 > The obvious and usual best practise would be for DPDK to guarantee
 > ABI stability between minor releases.
 >
 > Since dpdk-dev is copied as well, any comments?

DPDK's ABIs are not the kernel's ABIs; they are not POSIX, and there is no
standard. Currently, there is no plan to have a stable ABI, since we
need to keep the freedom to chase CPU cycles rather than freeze the ABI. For
instance, some applications on top of the DPDK process packets in
less than 150 CPU cycles (have a look at testpmd:
   http://dpdk.org/browse/dpdk/tree/app/test-pmd )

I agree that some areas could be improved since they are not in the
critical datapath of packets, but other areas remain very
CPU-constrained. For instance:
http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
is bad:
    struct eth_dev_ops
is churned, there are no comments, and an #ifdef changes the structure
depending on compile-time options!

If an application uses the librte libraries of the DPDK:
   - you can use RTE_VERSION and RTE_VERSION_NUM :
http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
   - you can write your own wrapper (with CPU overhead) in order to have
a stable ABI; that wrapper should be tied to the versions of the librte
=> the overhead is part of your application instead of the DPDK,
   - *otherwise recompile your software, it is opensource, what's the
issue?*

We are open to any suggestion for a stable ABI, but it should never
remove the options for fast, efficient compilation and CPU execution.

Best regards,
   Vincent

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 10:26       ` Vincent JARDIN
@ 2014-01-29 11:14         ` Thomas Graf
  2014-01-29 16:34           ` Vincent JARDIN
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Graf @ 2014-01-29 11:14 UTC (permalink / raw)
  To: Vincent JARDIN; +Cc: dev, dev, Gerald Rogers, dpdk-ovs

Vincent,

On 01/29/2014 11:26 AM, Vincent JARDIN wrote:
> DPDK's ABIs are not the kernel's ABIs; they are not POSIX, and there is no
> standard. Currently, there is no plan to have a stable ABI, since we
> need to keep the freedom to chase CPU cycles rather than freeze the ABI. For
> instance, some applications on top of the DPDK process packets in
> less than 150 CPU cycles (have a look at testpmd:
>    http://dpdk.org/browse/dpdk/tree/app/test-pmd )

I understand the requirement to not introduce overhead with wrappers
or shim layers. No problem with that. I believe this is mainly a policy
and release process issue.

Without a concept of stable interfaces, it will be difficult to
package and distribute RTE libraries, PMD, and DPDK applications. Right
now, the obvious path would include packaging the PMD bits together
with each DPDK application depending on the version of DPDK the binary
was compiled against. This is clearly not ideal.

> I agree that some areas could be improved since they are not in the
> critical datapath of packets, but other areas remain very
> CPU-constrained. For instance:
> http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
>
> is bad:
>     struct eth_dev_ops
> is churned, there are no comments, and an #ifdef changes the structure
> depending on compile-time options!

This is a very good example as it outlines the difference between
control structures and the fast path. We have this exact same trade-off
in the kernel a lot, where we have highly optimized internal APIs
towards modules and drivers but want to provide binary compatibility to
a certain extent.

As for the specific example you mention, it is relatively trivial to
make eth_dev_ops backwards compatible by appending appropriate padding
to the struct before a new major release and ensuring that new members
are added by replacing the padding accordingly. Obviously no ifdefs
would be allowed anymore.
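
To make the padding idea concrete, here is a minimal sketch. The
structs below are hypothetical and far smaller than the real
eth_dev_ops; the names fake_ops_v1, fake_ops_v1_1 and the slot count
are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical driver operation type; not DPDK's real signatures. */
typedef int (*fake_op_fn)(int port_id);

enum { FAKE_OPS_SLOTS = 8 };    /* Total slots fixed at the major release. */

/* Layout at the (imaginary) major release: 3 ops plus reserved padding. */
struct fake_ops_v1 {
    fake_op_fn dev_start;
    fake_op_fn dev_stop;
    fake_op_fn link_update;
    fake_op_fn reserved[FAKE_OPS_SLOTS - 3];
};

/* Layout at the next minor release: the new op consumes one reserved slot,
 * so the size and the offsets of existing members stay the same. */
struct fake_ops_v1_1 {
    fake_op_fn dev_start;
    fake_op_fn dev_stop;
    fake_op_fn link_update;
    fake_op_fn stats_reset;     /* New in the minor release. */
    fake_op_fn reserved[FAKE_OPS_SLOTS - 4];
};

/* Returns 1 if both layouts agree on size and on existing member offsets. */
int
fake_ops_layout_compatible(void)
{
    return sizeof(struct fake_ops_v1) == sizeof(struct fake_ops_v1_1)
           && offsetof(struct fake_ops_v1, link_update)
              == offsetof(struct fake_ops_v1_1, link_update)
           && offsetof(struct fake_ops_v1, reserved)
              == offsetof(struct fake_ops_v1_1, stats_reset);
}

/* Calls the new op only if the driver filled it in.  A driver built against
 * the older layout leaves the slot NULL, provided it zeroes the table.
 * Returns the op's result, or -1 if the op is not available. */
int
fake_stats_reset(const struct fake_ops_v1_1 *ops, int port_id)
{
    assert(ops != NULL);
    return ops->stats_reset ? ops->stats_reset(port_id) : -1;
}
```

This only works if drivers zero-initialize the reserved slots and the
library checks new members for NULL before calling them, so both rules
would have to be part of the policy.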

> If an application uses the librte libraries of the DPDK:
>    - you can use RTE_VERSION and RTE_VERSION_NUM :
> http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9

Right. This would be more or less identical to requiring a specific
DPDK version in OVS_CHECK_DPDK. It's not ideal to require applications to
clutter their code with #ifdefs all over for every new minor release
though.

>    - you can write your own wrapper (with CPU overhead) in order to have
> a stable ABI; that wrapper should be tied to the versions of the librte
> => the overhead is part of your application instead of the DPDK,
>    - *otherwise recompile your software, it is opensource, what's the
> issue?*
>
> We are open to any suggestion for a stable ABI, but it should never
> remove the options for fast, efficient compilation and CPU execution.

Absolutely agreed. We also don't want to add tons of abstraction and
overcomplicate everything. Still, I strongly believe that the definition
of stable interfaces towards applications and especially PMD is
essential.

I'm not proposing to standardize all the APIs towards applications on
the level of POSIX. DPDK is in early stages and disruptive changes will
come along. What I would propose on an abstract level is:

1. Extend but not break API between minor releases. Postpone API
    breakages to the next major release. High cadence of major
    releases initially, lower cadence as DPDK matures.

2. Define ABI stability towards PMD for minor releases to allow
    isolated packaging of PMD by padding control structures and keeping
    functions ABI stable.

I realize that this might be less trivial than it seems without
sacrificing performance but I consider it effort well spent.

Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 11:14         ` Thomas Graf
@ 2014-01-29 16:34           ` Vincent JARDIN
  2014-01-29 17:14             ` Thomas Graf
  0 siblings, 1 reply; 23+ messages in thread
From: Vincent JARDIN @ 2014-01-29 16:34 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs

Thomas,

First and easy answer: it is open source, so anyone can recompile. So, 
what's the issue?

> Without a concept of stable interfaces, it will be difficult to
> package and distribute RTE libraries, PMD, and DPDK applications. Right
> now, the obvious path would include packaging the PMD bits together
> with each DPDK application depending on the version of DPDK the binary
> was compiled against. This is clearly not ideal.

>
>> I agree that some areas could be improved since they are not in the
>> critical datapath of packets, but other areas remain very
>> CPU-constrained. For instance:
>> http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
>>
>> is bad:
>>     struct eth_dev_ops
>> is churned, there are no comments, and an #ifdef changes the structure
>> depending on compile-time options!
>
> This is a very good example as it outlines the difference between
> control structures and the fast path. We have this exact same trade-off
> in the kernel a lot, where we have highly optimized internal APIs
> towards modules and drivers but want to provide binary compatibility to
> a certain extent.

As long as we agree on this limited scope, we'll think about it and
provide a proposal on the dev@dpdk.org mailing list.

> As for the specific example you mention, it is relatively trivial to
> make eth_dev_ops backwards compatible by appending appropriate padding
> to the struct before a new major release and ensuring that new members
> are added by replacing the padding accordingly. Obviously no ifdefs
> would be allowed anymore.

Of course, it is basic C!

>> If an application uses the librte libraries of the DPDK:
>>    - you can use RTE_VERSION and RTE_VERSION_NUM :
>> http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
>
> Right. This would be more or less identical to requiring a specific
> DPDK version in OVS_CHECK_DPDK. It's not ideal to require applications to
> clutter their code with #ifdefs all over for every new minor release
> though.
>
>>    - you can write your own wrapper (with CPU overhead) in order to have
>> a stable ABI; that wrapper should be tied to the versions of the librte
>> => the overhead is part of your application instead of the DPDK,
>>    - *otherwise recompile your software, it is opensource, what's the
>> issue?*
>>
>> We are open to any suggestion for a stable ABI, but it should never
>> remove the options for fast, efficient compilation and CPU execution.
>
> Absolutely agreed. We also don't want to add tons of abstraction and
> overcomplicate everything. Still, I strongly believe that the definition
> of stable interfaces towards applications and especially PMD is
> essential.
>
> I'm not proposing to standardize all the APIs towards applications on
> the level of POSIX. DPDK is in early stages and disruptive changes will
> come along. What I would propose on an abstract level is:
>
> 1. Extend but not break API between minor releases. Postpone API
>     breakages to the next major release. High cadence of major
>     releases initially, lower cadence as DPDK matures.
>
> 2. Define ABI stability towards PMD for minor releases to allow
>     isolated packaging of PMD by padding control structures and keeping
>     functions ABI stable.

I am lost: do you mean ABI + API towards the PMDs, or towards the
applications using the librte?

Best regards,
   Vincent

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 16:34           ` Vincent JARDIN
@ 2014-01-29 17:14             ` Thomas Graf
  2014-01-29 18:42               ` Stephen Hemminger
  2014-01-29 20:47               ` François-Frédéric Ozog
  0 siblings, 2 replies; 23+ messages in thread
From: Thomas Graf @ 2014-01-29 17:14 UTC (permalink / raw)
  To: Vincent JARDIN; +Cc: dev, dev, Gerald Rogers, dpdk-ovs

On 01/29/2014 05:34 PM, Vincent JARDIN wrote:
> Thomas,
>
> First and easy answer: it is open source, so anyone can recompile. So,
> what's the issue?

I'm talking from a pure distribution perspective here: Requiring to
recompile all DPDK based applications to distribute a bugfix or to
add support for a new PMD is not ideal.

So ideally OVS would have the possibility to link against the shared
library long term.

> I get lost: do you mean ABI + API toward the PMDs or towards the
> applications using the librte ?

Towards the PMDs is more straight forward at first so it seems logical
to focus on that first.

A stable API and ABI for librte seems required as well long term as
DPDK does offer shared libraries but I realize that this is a stretch
goal in the initial phase.
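
The "padding control structures" idea from the proposal can be sketched as follows. The structure and field names (pmd_ctrl_v1, pmd_ctrl_v2, rx_offloads) are hypothetical and not taken from DPDK; the sketch only shows that reserved space lets a minor release add a field without changing the size or the offsets a previously built PMD relies on.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "Release N" layout: two real fields plus reserved space. */
struct pmd_ctrl_v1 {
    uint16_t port_id;
    uint16_t nb_queues;
    uint32_t reserved[7];   /* room for future fields */
};

/* "Release N+1" layout: one reserved slot becomes a real field. */
struct pmd_ctrl_v2 {
    uint16_t port_id;
    uint16_t nb_queues;
    uint32_t rx_offloads;   /* hypothetical new field */
    uint32_t reserved[6];
};

/* Fail the build if the extension changed size or moved old fields. */
_Static_assert(sizeof(struct pmd_ctrl_v1) == sizeof(struct pmd_ctrl_v2),
               "structure size changed");
_Static_assert(offsetof(struct pmd_ctrl_v1, nb_queues) ==
               offsetof(struct pmd_ctrl_v2, nb_queues),
               "existing field moved");

/* What a driver built against the v1 layout would read from an object
 * created by a newer library. */
uint16_t read_queues_as_v1(const void *ctrl)
{
    struct pmd_ctrl_v1 old_view;

    memcpy(&old_view, ctrl, sizeof old_view);
    return old_view.nb_queues;
}
```

The cost is a few unused bytes per structure and the discipline of only ever consuming reserved slots between major releases.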

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 17:14             ` Thomas Graf
@ 2014-01-29 18:42               ` Stephen Hemminger
  2014-01-29 20:47               ` François-Frédéric Ozog
  1 sibling, 0 replies; 23+ messages in thread
From: Stephen Hemminger @ 2014-01-29 18:42 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs

On Wed, 29 Jan 2014 18:14:01 +0100
Thomas Graf <tgraf@redhat.com> wrote:

> On 01/29/2014 05:34 PM, Vincent JARDIN wrote:
> > Thomas,
> >
> > First and easy answer: it is open source, so anyone can recompile. So,
> > what's the issue?
> 
> I'm talking from a pure distribution perspective here: Requiring to
> recompile all DPDK based applications to distribute a bugfix or to
> add support for a new PMD is not ideal.
> 
> So ideally OVS would have the possibility to link against the shared
> library long term.
> 
> > I get lost: do you mean ABI + API toward the PMDs or towards the
> > applications using the librte ?
> 
> Towards the PMDs is more straight forward at first so it seems logical
> to focus on that first.
> 
> A stable API and ABI for librte seems required as well long term as
> DPDK does offer shared libraries but I realize that this is a stretch
> goal in the initial phase.

I would hate to see the API/ABI nailed down. We have lots of bug fixes
and new drivers that are ready to contribute, but most of them have some
change to existing ABI.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29  0:15         ` Vincent JARDIN
@ 2014-01-29 19:32           ` Pravin Shelar
  0 siblings, 0 replies; 23+ messages in thread
From: Pravin Shelar @ 2014-01-29 19:32 UTC (permalink / raw)
  To: Vincent JARDIN; +Cc: dev, Ben Pfaff, Jesse Gross, dev, dpdk-ovs

On Tue, Jan 28, 2014 at 4:15 PM, Vincent JARDIN
<vincent.jardin@6wind.com> wrote:
> Hi Pravin,
>
>
>>> Few feature questions:
>>>
>>>    - what's about the vNIC supports (toward the guests)?
>>>    - what's about IPsec support (VxLAN over IPsec for instance)?
>>> I do not understand how your patch will solve those 2 cases.
>>>
>> At this point I wanted to get basic DPDK support in OVS, once that is
>> done we can add support for vNIC.
>
>
> For vNIC, did you notice:
>   http://dpdk.org/browse/memnic/
>
> ?

AFAIU it introduces a backend driver for vNIC, which should work with this patch.

>
>> IPsec and vxlan or any L3 tunneling requires IP stack in userspace and
>> needs more design work.
>
>
> OK, understood.
>
>
>>>>>> This is based a patch from Gerald Rogers.
>>>
>>>
>>> Please which patch? I cannot find it into the archives.
>>>
>> It was directly sent to Jesse at Nicira. If you want I can send it out for
>> ref.
>
>
> Yes, please.

Attached is the patch based upon HEAD-309d9da.

Thanks,
Pravin.

>
> Thank you,
>   Vincent
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 17:14             ` Thomas Graf
  2014-01-29 18:42               ` Stephen Hemminger
@ 2014-01-29 20:47               ` François-Frédéric Ozog
  2014-01-29 23:15                 ` Thomas Graf
  2014-03-13  7:37                 ` David Nyström
  1 sibling, 2 replies; 23+ messages in thread
From: François-Frédéric Ozog @ 2014-01-29 20:47 UTC (permalink / raw)
  To: 'Thomas Graf', 'Vincent JARDIN'
  Cc: dev, dev, 'Gerald Rogers', dpdk-ovs

> > First and easy answer: it is open source, so anyone can recompile. So,
> > what's the issue?
> 
> I'm talking from a pure distribution perspective here: Requiring to
> recompile all DPDK based applications to distribute a bugfix or to add
> support for a new PMD is not ideal.

> 
> So ideally OVS would have the possibility to link against the shared
> library long term.

I agree that distribution of DPDK apps is not covered properly at present.
Identifying the proper scheme requires a specific analysis based on the
constraints of the Telecom/Cloud/Networking markets.

In the telecom world, if you fix the underlying framework of an app, you
will still have to validate the solution, ie app/framework. In addition, the
idea of shared libraries introduces the implied requirement to validate apps
against diverse versions of DPDK shared libraries. This translates into
development and support costs.

I also expect many DPDK applications to tackle core networking features,
with sub-microsecond packet handling delays, and even lower than 200ns
(NAT64...). The lazy binding based on the ELF PLT represents quite a cost, not
to mention that optimization stops at shared library boundaries (gcc
whole program optimization can be very effective...). Microsoft DLL linkage
is an order of magnitude faster. If Linux were to provide that, I would
probably revise my judgment. (I haven't checked Linux dynamic linking
implementation for some time so my understanding of Linux dynamic linking
may be outdated).

> 
> > I get lost: do you mean ABI + API toward the PMDs or towards the
> > applications using the librte ?
> 
> Towards the PMDs is more straight forward at first so it seems logical to
> focus on that first.

I don't think it is so straightforward. Many recent cards such as Chelsio
and Myricom have a very different "packet memory layout" that does not fit
so easily into the current DPDK architecture.

1) "traditional" architecture: the driver reserves X buffers and provides the
card with descriptors of those buffers. Each packet is DMA'ed into exactly
one buffer. Typically you have 2K buffers, and a 64 byte packet consumes exactly
one buffer.

2) "alternative" new architecture: the driver reserves a memory zone, say
4MB, without any structure, and provides a single zone description and a
ring buffer to the card (there are no individual buffer descriptors any more).
The card fills the memory zone with packets, one next to the other, and
specifies where the packets are by updating the supplied ring. Out of the
many issues in fitting this scheme into DPDK: you cannot free a single mbuf;
you have to maintain a ref count on the memory zone so that, when all mbufs
have been "released", the memory zone can be freed.
That's quite a stretch from the current paradigm.
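
A minimal, single-threaded sketch of the zone reference counting described in 2). All names (pkt_zone, zone_pkt_get, zone_put) are hypothetical and not part of DPDK. It assumes the driver holds one reference of its own, which it drops once the card has stopped writing into the zone.

```c
#include <stdlib.h>

/* One large receive zone shared by many packets (hypothetical type). */
struct pkt_zone {
    unsigned char *mem;   /* backing memory, e.g. 4 MB */
    size_t size;
    unsigned refcnt;      /* driver reference + packets still in use */
    int freed;            /* set once the backing memory is released */
};

/* Allocate the zone; the driver starts out holding one reference.
 * Returns 0 on success, -1 on allocation failure. */
int zone_init(struct pkt_zone *z, size_t size)
{
    z->mem = malloc(size);
    z->size = size;
    z->refcnt = 1;
    z->freed = 0;
    return z->mem ? 0 : -1;
}

/* Called when the ring reports a new packet at 'offset' in the zone. */
unsigned char *zone_pkt_get(struct pkt_zone *z, size_t offset)
{
    z->refcnt++;
    return z->mem + offset;
}

/* Drop one reference (a packet, or the driver's own). The memory is
 * released only when the last reference is gone. */
void zone_put(struct pkt_zone *z)
{
    if (--z->refcnt == 0) {
        free(z->mem);
        z->mem = NULL;
        z->freed = 1;
    }
}
```

Contrast this with the per-buffer model in 1), where each mbuf can go back to its pool on its own.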

Apart from this aspect, managing RSS is too tied to Intel's flow director
concepts and cannot directly accommodate smarter or dumber RSS mechanisms.

That said, I fully agree PMD API should be revisited.

Cordially,

François-Frédéric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29  8:56 ` [dpdk-dev] " Prashant Upadhyaya
@ 2014-01-29 21:29   ` Pravin Shelar
  2014-01-30 10:15     ` Prashant Upadhyaya
  0 siblings, 1 reply; 23+ messages in thread
From: Pravin Shelar @ 2014-01-29 21:29 UTC (permalink / raw)
  To: Prashant Upadhyaya; +Cc: dev, dev, Gerald Rogers, dpdk-ovs

On Wed, Jan 29, 2014 at 12:56 AM, Prashant Upadhyaya
<prashant.upadhyaya@aricent.com> wrote:
> Hi Pravin,
>
> I think your stuff is on the brink of a creating a mini revolution :)
>
> Some questions inline below --
> +    ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
> What do you mean by portid here, do you mean the physical interface id like eth0 which I have bound to igb_uio now ?
> If I have multiple interfaces I have assigned igb_uio to, eg. eth0, eth1, eth2 etc., what is the id mapping for those ?
>
The port id is the id assigned by DPDK. The DPDK interface takes this port id as
an argument. Currently you need to look at the PCI id to figure out the
mapping from device to port id. I know it is not clean and I am exploring a
better interface so that we can specify device names to ovs-vsctl.

> If I have VM's running, then typically how to interface those VM's to this OVS in user space now, do I use the same classical 'tap' interface and add it to the OVS above.

A tap device will work, but you would not get good performance, primarily due
to scheduling delay and memcopy.
DPDK has multiple drivers to create interfaces with KVM guest OSes;
those should perform better. I have not tried them yet.

> What is the actual path the data takes from the VM now all the way to the switch, wouldn't it be hypervisor to kernel to OVS switch in user space to other VM/Network ?

Depends on the method you use, e.g. Memnic bypasses the hypervisor and host
kernel entirely.

> I think if we can solve the VM to OVS port connectivity remaining in userspace only, then we have a great thing at our hand. Kindly comment on this.
>
Right, performance looks pretty good. Still, DPDK needs constant
polling, which consumes more power. The RFC ovs-dpdk patch has simple
polling, which needs tweaking for better power usage.

Thanks,
Pravin.



> Regards
> -Prashant
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 10:01 ` [dpdk-dev] [ovs-dev] " Thomas Graf
@ 2014-01-29 21:49   ` Pravin Shelar
  0 siblings, 0 replies; 23+ messages in thread
From: Pravin Shelar @ 2014-01-29 21:49 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs

On Wed, Jan 29, 2014 at 2:01 AM, Thomas Graf <tgraf@redhat.com> wrote:
> On 01/28/2014 02:48 AM, pshelar@nicira.com wrote:
>>
>> From: Pravin B Shelar <pshelar@nicira.com>
>>
>> Following patch adds DPDK netdev-class to userspace datapath.
>> Approach taken in this patch differs from Intel® DPDK vSwitch
>> where DPDK datapath switching is done in saparate process.  This
>> patch adds support for DPDK type port and uses OVS userspace
>> datapath for switching.  Therefore all DPDK processing and flow
>> miss handling is done in single process.  This also avoids code
>> duplication by reusing OVS userspace datapath switching and
>> therefore it supports all flow matching and actions that
>> user-space datapath supports.  Refer to INSTALL.DPDK doc for
>> further info.
>>
>> With this patch I got similar performance for netperf TCP_STREAM
>> tests compared to kernel datapath.
>>
>> This is based a patch from Gerald Rogers.
>>
>> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
>> CC: "Gerald Rogers" <gerald.rogers@intel.com>
>
>
> Pravin,
>
> Some initial comments below. I will provide more after deeper
> digging.
>

> Do you have any ideas on how to implement the TX batching yet?
>
We can batch packets for some interval and then do a tx-burst. But I did
not see any performance improvement, as we have other bottlenecks in the
userspace datapath. Ben's RCU patch should help there.
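
A rough sketch of the batching idea, under stated assumptions: tx_batch, TX_BATCH_MAX and the burst callback are hypothetical names, and the callback stands in for a real burst call such as rte_eth_tx_burst(). Interval-based flushing is left out; the batch is flushed when it fills up or when the caller asks for it.

```c
#define TX_BATCH_MAX 32   /* hypothetical batch size */

/* Callback that sends 'n' packets and returns how many were accepted. */
typedef unsigned (*burst_fn)(void **pkts, unsigned n, void *aux);

struct tx_batch {
    void *pkts[TX_BATCH_MAX];
    unsigned count;       /* packets currently queued */
    unsigned dropped;     /* packets the burst call did not accept */
    burst_fn burst;
    void *aux;
};

/* Send everything queued so far and account for what was not sent.
 * Real code would also free the unsent buffers here. */
void tx_batch_flush(struct tx_batch *b)
{
    if (b->count) {
        unsigned sent = b->burst(b->pkts, b->count, b->aux);

        b->dropped += b->count - sent;
        b->count = 0;
    }
}

/* Queue one packet; flush automatically when the batch is full. */
void tx_batch_add(struct tx_batch *b, void *pkt)
{
    b->pkts[b->count++] = pkt;
    if (b->count == TX_BATCH_MAX) {
        tx_batch_flush(b);
    }
}

/* Stub burst function for experiments: accepts every packet and adds
 * the burst size to the counter passed in 'aux'. */
unsigned count_burst(void **pkts, unsigned n, void *aux)
{
    (void) pkts;
    *(unsigned *) aux += n;
    return n;
}
```

The 'dropped' counter also covers the drop-accounting point raised further down in the review.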

>
>> +
>> +static int
>> +netdev_dpdk_rx_drain(struct netdev_rx *rx_)
>> +{
>> +    struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
>> +    int pending;
>> +    int i;
>> +
>> +    pending = rx->ofpbuf_cnt;
>> +    if (pending) {
>
>
> This conditional seems unneeded.
>
Right.

>
>> +        for (i = 0; i < pending; i++) {
>> +             build_ofpbuf(rx, &rx->ofpbuf[i], NULL);
>> +        }
>> +        rx->ofpbuf_cnt = 0;
>> +        return 0;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/* Tx function. Transmit packets indefinitely */
>> +static int
>> +dpdk_do_tx_copy(struct netdev *netdev, char *buf, int size)
>> +{
>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +    struct rte_mbuf *pkt;
>> +    uint32_t nb_tx = 0;
>> +
>> +    pkt = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>> +    if (!pkt) {
>> +        return 0;
>
>
> Silent drop? ;-) Shouldn't these drops be accounted for somehow?
>
Ahh, I will account for it in netdev-dpdk.

>
>> +    }
>> +
>> +    /* We have to do a copy for now */
>> +    memcpy(pkt->pkt.data, buf, size);
>> +
>> +    rte_pktmbuf_data_len(pkt) = size;
>> +    rte_pktmbuf_pkt_len(pkt) = size;
>> +
>> +    rte_spinlock_lock(&dev->tx_lock);
>
>
> What is the purpose of tx_lock here? Multiple threads writing to
> the same Q? The lock is not acquired for the zerocopy path below.
>
There are PMD threads which have their own queue, so tx in these
threads is lockless. But vswitchd can also send packets from other threads;
all of those other threads send packets on a single queue, which is locked.
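
The queue scheme described above can be sketched like this. It is not the patch's code: tx_port, NR_PMD_QUEUES and the helper names are hypothetical, a C11 atomic_flag stands in for rte_spinlock, and "sending" is reduced to bumping a per-queue counter. It assumes there are at most NR_PMD_QUEUES PMD threads, so no two of them map to the same queue.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define NR_PMD_QUEUES 4   /* hypothetical; queue 4 is the shared one */

struct tx_port {
    atomic_flag shared_lock;            /* protects the shared queue only */
    unsigned sent[NR_PMD_QUEUES + 1];   /* per-queue packet counters */
};

void tx_port_init(struct tx_port *p)
{
    memset(p->sent, 0, sizeof p->sent);
    atomic_flag_clear(&p->shared_lock);
}

/* PMD threads map to their own queue; everyone else uses the last one. */
unsigned tx_pick_queue(bool is_pmd_thread, unsigned core_id)
{
    return is_pmd_thread ? core_id % NR_PMD_QUEUES : NR_PMD_QUEUES;
}

/* "Transmit" one packet and return the queue that was used. */
unsigned tx_one(struct tx_port *p, bool is_pmd_thread, unsigned core_id)
{
    unsigned qid = tx_pick_queue(is_pmd_thread, core_id);

    if (qid == NR_PMD_QUEUES) {
        while (atomic_flag_test_and_set_explicit(&p->shared_lock,
                                                 memory_order_acquire)) {
            /* spin until the other non-PMD sender is done */
        }
        p->sent[qid]++;
        atomic_flag_clear_explicit(&p->shared_lock, memory_order_release);
    } else {
        p->sent[qid]++;   /* lockless: only one thread uses this queue */
    }
    return qid;
}
```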

>
>> +    nb_tx = rte_eth_tx_burst(dev->port_id, NR_QUEUE, &pkt, 1);
>> +    rte_spinlock_unlock(&dev->tx_lock);
>> +
>> +    if (nb_tx != 1) {
>> +        /* free buffers if we couldn't transmit packets */
>> +        rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **)&pkt, 1);
>> +    }
>> +    return nb_tx;
>> +}
>> +
>> +static int
>> +netdev_dpdk_send(struct netdev *netdev,
>> +                 struct ofpbuf *ofpbuf, bool may_steal)
>> +{
>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +
>> +    if (ofpbuf->size > dev->max_packet_len) {
>> +        VLOG_ERR("2big size %d max_packet_len %d",
>> +                  (int)ofpbuf->size , dev->max_packet_len);
>
>
> Should probably use VLOG_RATE_LIMIT_INIT
>
ok.

>
>> +        return E2BIG;
>> +    }
>> +
>> +    rte_prefetch0(&ofpbuf->private_p);
>> +    if (!may_steal ||
>> +        !ofpbuf->private_p || ofpbuf->source != OFPBUF_DPDK) {
>> +        dpdk_do_tx_copy(netdev, (char *) ofpbuf->data, ofpbuf->size);
>> +    } else {
>> +        struct rte_mbuf *pkt;
>> +        uint32_t nb_tx;
>> +        int qid;
>> +
>> +        pkt = ofpbuf->private_p;
>> +        ofpbuf->private_p = NULL;
>> +        rte_pktmbuf_data_len(pkt) = ofpbuf->size;
>> +        rte_pktmbuf_pkt_len(pkt) = ofpbuf->size;
>> +
>> +        /* TODO: TX batching. */
>> +        qid = rte_lcore_id() % NR_QUEUE;
>> +        nb_tx = rte_eth_tx_burst(dev->port_id, qid, &pkt, 1);
>> +        if (nb_tx != 1) {
>> +            struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +
>> +            rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **)&pkt, 1);
>> +            VLOG_ERR("TX error, zero packets sent");
>
>
> Same here
>
ok

>
>> +       }
>> +    }
>> +    return 0;
>> +}
>
>
>> +static int
>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>> +{
>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +    int old_mtu, err;
>> +    struct dpdk_mp *old_mp;
>> +    struct dpdk_mp *mp;
>> +
>> +    ovs_mutex_lock(&dpdk_mutex);
>> +    ovs_mutex_lock(&dev->mutex);
>> +    if (dev->mtu == mtu) {
>> +        err = 0;
>> +        goto out;
>> +    }
>> +
>> +    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>> +    if (!mp) {
>> +        err = ENOMEM;
>> +        goto out;
>> +    }
>> +
>> +    rte_eth_dev_stop(dev->port_id);
>> +
>> +    old_mtu = dev->mtu;
>> +    old_mp = dev->dpdk_mp;
>> +    dev->dpdk_mp = mp;
>> +    dev->mtu = mtu;
>> +    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>> +
>> +    err = dpdk_eth_dev_init(dev);
>> +    if (err) {
>> +
>> +        dpdk_mp_put(mp);
>> +        dev->mtu = old_mtu;
>> +        dev->dpdk_mp = old_mp;
>> +        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>> +        dpdk_eth_dev_init(dev);
>
>
> Would be nice if we don't need these constructs and DPDK would
> provide an all or nothing init method.
>
Right, today we need to make a bunch of calls to set up a device.
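
Until such an init method exists, the rollback can at least be kept in one place. A minimal sketch, assuming a hypothetical dev_cfg structure and an init callback standing in for the multi-call device setup; ETH_OVERHEAD is an illustrative value and not the patch's MTU_TO_MAX_LEN definition.

```c
#include <errno.h>

struct dev_cfg {
    int mtu;
    int max_packet_len;
};

/* Hypothetical device init hook: returns 0 on success or an errno
 * value on failure. Real code would set up queues and restart. */
typedef int (*dev_init_fn)(const struct dev_cfg *cfg);

#define ETH_OVERHEAD 18   /* 14-byte header + 4-byte CRC, illustrative */

/* Try the new MTU; on failure restore the previous configuration so the
 * caller sees either the new state or the old one, never a mix. */
int set_mtu_all_or_nothing(struct dev_cfg *cfg, int mtu, dev_init_fn init)
{
    struct dev_cfg old = *cfg;
    int err;

    if (cfg->mtu == mtu) {
        return 0;
    }
    cfg->mtu = mtu;
    cfg->max_packet_len = mtu + ETH_OVERHEAD;

    err = init(cfg);
    if (err) {
        *cfg = old;
        init(cfg);   /* best effort: bring the device back up */
    }
    return err;
}

/* Stub init for experiments: rejects MTUs above 9000. */
int init_max_9000(const struct dev_cfg *cfg)
{
    return cfg->mtu <= 9000 ? 0 : EINVAL;
}
```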

>
>> +        goto out;
>> +    }
>> +
>> +    dpdk_mp_put(old_mp);
>> +out:
>> +    ovs_mutex_unlock(&dev->mutex);
>> +    ovs_mutex_unlock(&dpdk_mutex);
>> +    return err;
>> +}
>> +
>
>
>> +static int
>>
>> +netdev_dpdk_update_flags__(struct netdev_dpdk *dev,
>> +                           enum netdev_flags off, enum netdev_flags on,
>> +                           enum netdev_flags *old_flagsp)
>> +    OVS_REQUIRES(dev->mutex)
>> +{
>> +    int err;
>> +
>> +    if ((off | on) & ~(NETDEV_UP | NETDEV_PROMISC)) {
>> +        return EINVAL;
>> +    }
>> +
>> +    *old_flagsp = dev->flags;
>> +    dev->flags |= on;
>> +    dev->flags &= ~off;
>> +
>> +    if (dev->flags == *old_flagsp) {
>> +        return 0;
>> +    }
>> +
>> +    rte_eth_dev_stop(dev->port_id);
>> +
>> +    if (dev->flags & NETDEV_UP) {
>> +        err = rte_eth_dev_start(dev->port_id);
>> +        if (err)
>> +            return err;
>> +    }
>
>
> I'm not a DPDK expert but is it required to restart the device
> to change promisc settings or could we conditionally start and
> stop based on the previous flags state?
>
promiscuous-enable does not require a device reset, but I was too lazy to
write another case for the promiscuous flag change :)
I will update the code.
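
One possible shape for that update, as a sketch only: compute which operations a flag change actually needs, so the device is started or stopped only when the UP bit changes. FLAG_UP and FLAG_PROMISC are hypothetical stand-ins for NETDEV_UP and NETDEV_PROMISC, and the ACT_* values would be mapped to the corresponding rte_eth_* calls by the caller.

```c
enum { FLAG_UP = 1 << 0, FLAG_PROMISC = 1 << 1 };

enum {
    ACT_START       = 1 << 0,
    ACT_STOP        = 1 << 1,
    ACT_PROMISC_ON  = 1 << 2,
    ACT_PROMISC_OFF = 1 << 3
};

/* Apply 'on' then 'off' to old_flags (same order as the patch) and
 * return the set of device operations the change requires. */
unsigned plan_flag_change(unsigned old_flags, unsigned off, unsigned on,
                          unsigned *new_flagsp)
{
    unsigned new_flags = (old_flags | on) & ~off;
    unsigned changed = old_flags ^ new_flags;
    unsigned actions = 0;

    if (changed & FLAG_UP) {
        actions |= (new_flags & FLAG_UP) ? ACT_START : ACT_STOP;
    }
    if (changed & FLAG_PROMISC) {
        actions |= (new_flags & FLAG_PROMISC) ? ACT_PROMISC_ON
                                              : ACT_PROMISC_OFF;
    }
    *new_flagsp = new_flags;
    return actions;
}
```

With this split, turning promiscuous mode on for a running port yields only ACT_PROMISC_ON and no stop/start cycle.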

>> +
>> +    if (dev->flags & NETDEV_PROMISC) {
>> +        rte_eth_promiscuous_enable(dev->port_id);
>> +        rte_eth_allmulticast_enable(dev->port_id);
>> +    }
>> +
>> +    return 0;
>> +}
>>
>> +
>> +static void
>> +netdev_dpdk_set_admin_state(struct unixctl_conn *conn, int argc,
>> +                            const char *argv[], void *aux OVS_UNUSED)
>> +{
>> +    bool up;
>> +
>> +    if (!strcasecmp(argv[argc - 1], "up")) {
>> +        up = true;
>> +    } else if ( !strcasecmp(argv[argc - 1], "down")) {
>> +        up = false;
>> +    } else {
>> +        unixctl_command_reply_error(conn, "Invalid Admin State");
>> +        return;
>> +    }
>> +
>> +    if (argc > 2) {
>> +        struct netdev *netdev = netdev_from_name(argv[1]);
>
>
> For future refinement: Usability would be increased if either a
> strict one interface argument is enforced or multiple interface
> names could be passed in, e.g. set-admin-state dpdk0 dpdk1 up
> or set-admin-state dpdk0 up dpdk1 up
>
> As of now, dpdk1 is silently ignored which is not nice.
>
ok.
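
A sketch of the "set-admin-state dpdk0 dpdk1 up" form suggested above: the last argument is the state and everything between the command name and the state is an interface name. parse_admin_state is a hypothetical helper, not part of the OVS unixctl API.

```c
#include <strings.h>   /* strcasecmp (POSIX) */

/* Parse "cmd [ifname...] up|down".
 * Returns 1 for up, 0 for down, -1 for a missing or invalid state.
 * On success, argv[*first] .. argv[*first + *n_names - 1] are the
 * interface names; *n_names == 0 means "apply to all interfaces". */
int parse_admin_state(int argc, const char *argv[], int *first, int *n_names)
{
    int up;

    if (argc < 2) {
        return -1;
    }
    if (!strcasecmp(argv[argc - 1], "up")) {
        up = 1;
    } else if (!strcasecmp(argv[argc - 1], "down")) {
        up = 0;
    } else {
        return -1;
    }
    *first = 1;
    *n_names = argc - 2;
    return up;
}
```

The caller would then loop over the names and report an error for any that is not a DPDK interface, instead of silently ignoring it.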

>
>> +        if (netdev && is_dpdk_class(netdev->netdev_class)) {
>> +            struct netdev_dpdk *dpdk_dev = netdev_dpdk_cast(netdev);
>> +
>> +            ovs_mutex_lock(&dpdk_dev->mutex);
>> +            netdev_dpdk_set_admin_state__(dpdk_dev, up);
>> +            ovs_mutex_unlock(&dpdk_dev->mutex);
>> +
>> +            netdev_close(netdev);
>> +        } else {
>> +            unixctl_command_reply_error(conn, "Unknown Dummy Interface");
>
>
> I think this should read "Not a DPDK Interface" or something similar.
>
ok.

>
>> +            netdev_close(netdev);
>> +            return;
>> +        }
>> +    } else {
>> +        struct netdev_dpdk *netdev;
>> +
>> +        ovs_mutex_lock(&dpdk_mutex);
>> +        LIST_FOR_EACH (netdev, list_node, &dpdk_list) {
>> +            ovs_mutex_lock(&netdev->mutex);
>> +            netdev_dpdk_set_admin_state__(netdev, up);
>> +            ovs_mutex_unlock(&netdev->mutex);
>> +        }
>> +        ovs_mutex_unlock(&dpdk_mutex);
>> +    }
>> +    unixctl_command_reply(conn, "OK");
>> +}
>> +
>> +
>> -    retval = rx->netdev->netdev_class->rx_recv(rx, buffer);
>> -    if (!retval) {
>> -        COVERAGE_INC(netdev_received);
>
>
> Are you removing the netdev_receive counter on purpose here?

That is a mistake. I will add it back.

Thanks a lot for review.

Pravin.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 20:47               ` François-Frédéric Ozog
@ 2014-01-29 23:15                 ` Thomas Graf
  2014-03-13  7:37                 ` David Nyström
  1 sibling, 0 replies; 23+ messages in thread
From: Thomas Graf @ 2014-01-29 23:15 UTC (permalink / raw)
  To: François-Frédéric Ozog, 'Vincent JARDIN'
  Cc: dev, dev, 'Gerald Rogers', dpdk-ovs

On 01/29/2014 09:47 PM, François-Frédéric Ozog wrote:
> In the telecom world, if you fix the underlying framework of an app, you
> will still have to validate the solution, ie app/framework. In addition, the
> idea of shared libraries introduces the implied requirement to validate apps
> against diverse versions of DPDK shared libraries. This translates into
> development and support costs.
>
> I also expect many DPDK applications to tackle core networking features,
> with sub micro second packet handling delays  and even lower than 200ns
> (NAT64...). The lazy binding based on ELF PLT represent quite a cost, not
> mentioning that optimization stops are shared libraries boundaries (gcc
> whole program optimization can be very effective...). Microsoft DLL linkage
> are an order of magnitude faster. If Linux was to provide that, I would
> probably revise my judgment. (I haven't checked Linux dynamic linking
> implementation for some time so my understanding of Linux dynamic linking
> may be outdated).

All very valid points and I am not suggesting to stop offering the
static linking option in any way. Dynamic linking will by design result
in more cycles. My sole point is that for a core platform component
like OVS, the shared library benefits _might_ outweigh the performance
difference. In order for a shared library to be effective, some form of
ABI compatibility must be guaranteed though.

> I don't think it is so straight forward. Many recent cards such as Chelsio
> and Myricom have a very different "packet memory layout" that does not fit
> so easily into actual DPDK architecture.
>
> 1) "traditional" architecture: the driver reserves X buffers and provide the
> card with descriptors of those buffers. Each packet is DMA'ed into exactly
> one buffer. Typically you have 2K buffers, a 64 byte packet consumes exactly
> one buffer
>
> 2) "alternative" new architecture: the driver reserves a memory zone, say
> 4MB, without any structure, and provide a a single zone description and a
> ring buffer to the card. (there no individual buffer descriptors any more).
> The card fills the memory zone with packets, one next to the other and
> specifies where the packets are by updating the supplied ring. Out of the
> many issues fitting this scheme into DPDK, you cannot free a single mbuf:
> you have to maintain a ref count to the memory zone so that, when all mbufs
> have been "released", the memory zone can be freed.
> That's quite a stretch from actual paradigm.
>
> Apart from this aspect, managing RSS is two tied to Intel's flow director
> concepts and cannot accommodate directly smarter or dumber RSS mechanisms.
>
> That said, I fully agree PMD API should be revisited.

Fair enough. I don't see a reason why multiple interfaces could not
coexist in order to support multiple memory layouts. What I'm hearing
so far is that while there is no objection to bringing stability to the
APIs, it should not result in performance side effects and it is still too
early to nail down the still-fluid APIs.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 21:29   ` Pravin Shelar
@ 2014-01-30 10:15     ` Prashant Upadhyaya
  2014-01-30 16:27       ` Rogers, Gerald
  0 siblings, 1 reply; 23+ messages in thread
From: Prashant Upadhyaya @ 2014-01-30 10:15 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: dev, dev, Gerald Rogers, dpdk-ovs

Hi Pravin,

Request you to please validate at least one method to interface VMs with your innovative dpdk port on the OVS.
Preferably IVSHM.
Please do publish the steps for that too.

We really need the above for huge acceptance.

Regards
-Prashant







^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-30 10:15     ` Prashant Upadhyaya
@ 2014-01-30 16:27       ` Rogers, Gerald
  0 siblings, 0 replies; 23+ messages in thread
From: Rogers, Gerald @ 2014-01-30 16:27 UTC (permalink / raw)
  To: Prashant Upadhyaya, Pravin Shelar; +Cc: dev, dev, dpdk-ovs

Prashant,

IVShm is supported by the Intel DPDK client rings, and a patched QEMU/KVM
from the OVDK work on 01.org (https://01.org/packet-processing).  The main
part is the patched QEMU/KVM, which maps the Intel DPDK huge page tables
(with the release of Intel DPDK 1.6 the requirement to map 1 GB huge pages
has been removed, but is still supported) into the QEMU/KVM ivshm device
(the client rings are standard in Intel DPDK releases).  My suggestion for
adding this functionality is similar to the way different interfaces are
supported under Linux (tap, socket, etc.): basically, add another Intel
DPDK netdev type for client rings.  Once the rings are instantiated (upon
Intel DPDK initialization), it is simple enough to have the
Open vSwitch control initialize them into the polling method (just as
is done for the physical ports today).  Structures like MAC address, MTU,
etc. for them would need to be considered, since the client rings are not
really thought of as "Ethernet" ports within DPDK.  If you want to make
them virtual "Ethernet" ports, then assigning the Ethernet parameters a
static value upon initialization would be acceptable.

Thoughts from the community are much welcomed on this whole topic.

As Pravin mentioned in one of his previous e-mails, the support for
IVShmem, Intel DPDK QOS, vxLan etc. wasn't added (and in some cases
doesn't exist as an Intel DPDK library, ie. vxLan) to simplify the patch.
It will be worked on for subsequent patches.

Sincerely,

Gerald


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 20:47               ` François-Frédéric Ozog
  2014-01-29 23:15                 ` Thomas Graf
@ 2014-03-13  7:37                 ` David Nyström
  1 sibling, 0 replies; 23+ messages in thread
From: David Nyström @ 2014-03-13  7:37 UTC (permalink / raw)
  To: François-Frédéric Ozog, 'Thomas Graf',
	'Vincent JARDIN'
  Cc: dev, dev, dpdk-ovs

On 2014-01-29 21:47, François-Frédéric Ozog wrote:
>>> First and easy answer: it is open source, so anyone can recompile. So,
>>> what's the issue?
>>
>> I'm talking from a pure distribution perspective here: Requiring to
>> recompile all DPDK based applications to distribute a bugfix or to add
>> support for a new PMD is not ideal.
>
>>
>> So ideally OVS would have the possibility to link against the shared
>> library long term.
>
> I agree that distribution of DPDK apps is not covered properly at present.
> Identifying the proper scheme requires a specific analysis based on the
> constraints of the Telecom/Cloud/Networking markets.
>
> In the telecom world, if you fix the underlying framework of an app, you
> will still have to validate the solution, i.e. app plus framework. In
> addition, the idea of shared libraries introduces the implied requirement
> to validate apps against diverse versions of DPDK shared libraries. This
> translates into development and support costs.
>
> I also expect many DPDK applications to tackle core networking features,
> with sub-microsecond packet handling delays, and even lower than 200ns
> (NAT64...). The lazy binding based on the ELF PLT represents quite a cost,
> not to mention that optimization stops at shared library boundaries (gcc
> whole program optimization can be very effective...). Microsoft DLL linkage
> is an order of magnitude faster. If Linux were to provide that, I would
> probably revise my judgment. (I haven't checked the Linux dynamic linking
> implementation for some time, so my understanding of it may be outdated.)
>
>
>>
>>> I get lost: do you mean ABI + API toward the PMDs or towards the
>>> applications using the librte ?
>>
>> Towards the PMDs is more straightforward at first so it seems logical to
>> focus on that first.
>
> I don't think it is so straightforward. Many recent cards such as Chelsio
> and Myricom have a very different "packet memory layout" that does not fit
> so easily into the current DPDK architecture.
>
> 1) "traditional" architecture: the driver reserves X buffers and provides
> the card with descriptors of those buffers. Each packet is DMA'ed into
> exactly one buffer. Typically you have 2K buffers, and a 64 byte packet
> consumes exactly one buffer.
>
> 2) "alternative" new architecture: the driver reserves a memory zone, say
> 4MB, without any structure, and provides a single zone description and a
> ring buffer to the card. (There are no individual buffer descriptors any
> more.) The card fills the memory zone with packets, one next to the other,
> and specifies where the packets are by updating the supplied ring. Among
> the many issues in fitting this scheme into DPDK, you cannot free a single
> mbuf: you have to maintain a ref count on the memory zone so that, when all
> mbufs have been "released", the memory zone can be freed.
> That's quite a stretch from the current paradigm.
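
To make the ref-count point concrete, here is a minimal single-threaded
sketch of scheme 2. It is not DPDK code: pkt_zone, pkt_view and the helper
functions are hypothetical names, malloc() stands in for DMA-able memory,
and a real driver would need an atomic counter. The zone starts with one
reference held by the driver while the card may still be writing into it;
each packet handed to the application takes another.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical unstructured receive zone shared by many packets. */
struct pkt_zone {
    uint8_t *base;    /* start of the zone (stand-in for DMA memory) */
    size_t len;       /* zone size in bytes */
    uint32_t refcnt;  /* driver reference plus one per outstanding packet */
};

/* Hypothetical per-packet handle: a window into the zone, not a buffer. */
struct pkt_view {
    struct pkt_zone *zone;
    size_t off;
    size_t len;
};

static struct pkt_zone *
zone_create(size_t len)
{
    struct pkt_zone *z = malloc(sizeof *z);
    if (!z) {
        return NULL;
    }
    z->base = malloc(len);
    if (!z->base) {
        free(z);
        return NULL;
    }
    z->len = len;
    z->refcnt = 1;  /* held by the driver until the card is done with it */
    return z;
}

/* Drop one reference; returns true if this call freed the zone. */
static bool
zone_put(struct pkt_zone *z)
{
    assert(z->refcnt > 0);
    if (--z->refcnt > 0) {
        return false;
    }
    free(z->base);
    free(z);
    return true;
}

/* Hand out a packet located at [off, off + len) as reported by the ring. */
static bool
zone_attach(struct pkt_zone *z, size_t off, size_t len, struct pkt_view *v)
{
    if (off > z->len || len > z->len - off) {
        return false;  /* ring entry points outside the zone */
    }
    z->refcnt++;
    v->zone = z;
    v->off = off;
    v->len = len;
    return true;
}

/* Release one packet; the zone goes away only with the last reference. */
static bool
pkt_release(struct pkt_view *v)
{
    return zone_put(v->zone);
}
```

Releasing a packet here never returns its bytes to anyone; the whole zone
is reclaimed at once when the last reference is dropped, which is exactly
the mismatch with per-mbuf freeing described above.
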
>
> Apart from this aspect, managing RSS is too tied to Intel's flow director
> concepts and cannot directly accommodate smarter or dumber RSS mechanisms.
>
> That said, I fully agree PMD API should be revisited.

Hi,

Sorry for jumping in late.
Perhaps you are already aware of OpenDataPlane, which can use DPDK as
its southbound NIC interface.

>
> Cordially,
>
> François-Frédéric
>


Thread overview: 23+ messages
2014-01-28  1:48 [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports pshelar
     [not found] ` <20140128044950.GA4545@nicira.com>
2014-01-28  5:28   ` [dpdk-dev] [ovs-dev] " Pravin Shelar
2014-01-28 14:47     ` [dpdk-dev] " Vincent JARDIN
2014-01-28 17:56       ` Pravin Shelar
2014-01-29  0:15         ` Vincent JARDIN
2014-01-29 19:32           ` Pravin Shelar
     [not found]       ` <52E7D2A8.400@redhat.com>
2014-01-28 18:20         ` [dpdk-dev] [ovs-dev] " Pravin Shelar
     [not found] ` <52E7D13B.9020404@redhat.com>
2014-01-28 18:17   ` Pravin Shelar
2014-01-29  8:15     ` Thomas Graf
2014-01-29 10:26       ` Vincent JARDIN
2014-01-29 11:14         ` Thomas Graf
2014-01-29 16:34           ` Vincent JARDIN
2014-01-29 17:14             ` Thomas Graf
2014-01-29 18:42               ` Stephen Hemminger
2014-01-29 20:47               ` François-Frédéric Ozog
2014-01-29 23:15                 ` Thomas Graf
2014-03-13  7:37                 ` David Nyström
2014-01-29  8:56 ` [dpdk-dev] " Prashant Upadhyaya
2014-01-29 21:29   ` Pravin Shelar
2014-01-30 10:15     ` Prashant Upadhyaya
2014-01-30 16:27       ` Rogers, Gerald
2014-01-29 10:01 ` [dpdk-dev] [ovs-dev] " Thomas Graf
2014-01-29 21:49   ` Pravin Shelar
