* [dpdk-dev] [RFC PATCH 1/3] lib: introduce IF proxy library (API)
From: Andrzej Ostruszka @ 2020-01-14 14:25 UTC
To: dev
Cc: Jerin Jacob Kollanukkaran, Nithin Kumar Dabilpuram,
Pavan Nikhilesh Bhagavatula, Kiran Kumar Kokkilagadda,
Krzysztof Kanas
This library allows ports visible to the system (such as Tun/Tap or
KNI) to be designated as port representors serving as proxies for other
DPDK ports. When such a proxy is configured, the library first queries
the network configuration from the system and then monitors its changes.
The information gathered is passed to the application via a set of
user-registered callbacks. This way the user can configure DPDK ports
with normal network utilities (like those from the iproute2 suite).
Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com>
---
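Reviewer note (not part of the commit): the header below documents that
rte_ifpx_port_get() returns the number of bound ports regardless of the
buffer passed, so it can be called once with NULL/0 to size a buffer and
again to fill it. The following is a minimal standalone sketch of that
calling pattern. It does not use DPDK: MAX_PORTS, p2p[], port_get() and
bound_ports() are hypothetical stand-ins that only mimic the documented
contract.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_PORTS 32u /* hypothetical stand-in for RTE_MAX_ETHPORTS */

/* Hypothetical port-to-proxy table; MAX_PORTS means "not bound". */
static uint16_t p2p[MAX_PORTS];

static void p2p_reset(void)
{
	unsigned int p;

	for (p = 0; p < MAX_PORTS; ++p)
		p2p[p] = MAX_PORTS;
}

/* Stand-in following the documented contract of rte_ifpx_port_get():
 * store at most 'num' bound port ids, always return the total count.
 */
static unsigned int port_get(uint16_t proxy_id, uint16_t *ports,
			     unsigned int num)
{
	unsigned int p, cnt = 0;

	for (p = 0; p < MAX_PORTS; ++p) {
		if (p2p[p] != proxy_id)
			continue;
		++cnt;
		if (ports && num > 0) {
			*ports++ = (uint16_t)p;
			--num;
		}
	}
	return cnt;
}

/* Two-call usage: query the size, allocate, then fill.
 * On success the caller owns *out and must free() it.
 */
static unsigned int bound_ports(uint16_t proxy_id, uint16_t **out)
{
	unsigned int n = port_get(proxy_id, NULL, 0);

	*out = NULL;
	if (n == 0)
		return 0;
	*out = malloc(n * sizeof(**out));
	if (*out == NULL)
		return 0;
	return port_get(proxy_id, *out, n);
}
```

If bindings can change between the two calls, the second return value
should be compared with the allocated size before the array is trusted.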
lib/librte_if_proxy/rte_if_proxy.h | 364 +++++++++++++++++++++++++++++
1 file changed, 364 insertions(+)
create mode 100644 lib/librte_if_proxy/rte_if_proxy.h
diff --git a/lib/librte_if_proxy/rte_if_proxy.h b/lib/librte_if_proxy/rte_if_proxy.h
new file mode 100644
index 000000000..83895d8b7
--- /dev/null
+++ b/lib/librte_if_proxy/rte_if_proxy.h
@@ -0,0 +1,364 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2020 Marvell International Ltd.
+ */
+
+#ifndef _RTE_IF_PROXY_H_
+#define _RTE_IF_PROXY_H_
+
+/**
+ * @file
+ * RTE IF Proxy library
+ *
+ * The IF Proxy library allows for monitoring of system network configuration
+ * and configuration of DPDK ports by using usual system utilities (like the
+ * ones from iproute2 package).
+ *
+ * It is based on the notion of "proxy interface" which actually can be any DPDK
+ * port which is also visible to the system - that is it has non-zero 'if_index'
+ * field in 'rte_eth_dev_info' structure.
+ *
+ * If application doesn't have any such port (or doesn't want to use it for
+ * proxy) it can create one by calling:
+ *
+ * proxy_id = rte_ifpx_create(RTE_IFPX_DEFAULT);
+ *
+ * This function is just a wrapper that constructs valid 'devargs' string based
+ * on the proxy type chosen (currently Tap or KNI) and creates the interface by
+ * calling rte_ifpx_create_by_devarg().
+ *
+ * Once a DPDK port capable of being a proxy exists, a target DPDK port can be
+ * bound to it by calling:
+ *
+ * rte_ifpx_port_bind(port_id, proxy_id);
+ *
+ * This binding is a logical one - there is no automatic packet forwarding
+ * between port and its proxy since the library doesn't know the structure of
+ * application's packet processing. It remains application responsibility to
+ * forward the packets from/to proxy port (by calling the usual DPDK RX/TX burst
+ * API). However when the library notes some change to the proxy interface it
+ * will simply call appropriate callback with 'port_id' of the DPDK port that is
+ * bound to this proxy interface. The binding can be 1 to many - that is many
+ * ports can point to one proxy - in that case registered callbacks will be
+ * called for every bound port.
+ *
+ * The callbacks that are used for notifications are described by the
+ * 'rte_ifpx_callbacks' structure and they are registered by calling:
+ *
+ * rte_ifpx_callbacks_register(&cbs);
+ *
+ * Finally the application should call:
+ *
+ * rte_ifpx_listen();
+ *
+ * which will query system for present network configuration and start listening
+ * to its changes.
+ */
+
+#include <rte_eal.h>
+#include <rte_ethdev.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Enum naming the type of proxy to create.
+ *
+ * @see rte_ifpx_create()
+ */
+enum rte_ifpx_type {
+ RTE_IFPX_DEFAULT, /**< Use default proxy type for given arch. */
+ RTE_IFPX_TAP, /**< Use Tap based port for proxy. */
+ RTE_IFPX_KNI /**< Use KNI based port for proxy. */
+};
+
+/**
+ * Create DPDK port that can serve as an interface proxy.
+ *
+ * This function is just a wrapper around rte_ifpx_create_by_devarg() that
+ * constructs its 'devarg' argument based on type of proxy requested.
+ *
+ * @param type
+ * A type of proxy to create.
+ *
+ * @return
+ * DPDK port id on success, RTE_MAX_ETHPORTS otherwise.
+ *
+ * @see enum rte_ifpx_type
+ * @see rte_ifpx_create_by_devarg()
+ */
+__rte_experimental
+uint16_t rte_ifpx_create(enum rte_ifpx_type type);
+
+/**
+ * Create DPDK port that can serve as an interface proxy.
+ *
+ * @param devarg
+ * A string passed to rte_dev_probe() to create proxy port.
+ *
+ * @return
+ * DPDK port id on success, RTE_MAX_ETHPORTS otherwise.
+ */
+__rte_experimental
+uint16_t rte_ifpx_create_by_devarg(const char *devarg);
+
+/**
+ * Remove DPDK proxy port.
+ *
+ * In addition to removing the proxy port the bindings (if any) are cleared.
+ *
+ * @param proxy_id
+ * Port id of the proxy that should be removed.
+ *
+ * @return
+ * 0 on success, negative on error.
+ */
+__rte_experimental
+int rte_ifpx_destroy(uint16_t proxy_id);
+
+/**
+ * This structure groups the callbacks that may be called to notify about
+ * changes in network configuration. Not every platform implements all of
+ * them; availability can be queried with rte_ifpx_callbacks_available() by
+ * testing each bit of its result against the bit mask values defined in
+ * enum rte_ifpx_cb_bit.
+ * @see enum rte_ifpx_cb_bit
+ * @see rte_ifpx_callbacks_available()
+ * @see rte_ifpx_callbacks_register()
+ */
+struct rte_ifpx_callbacks {
+ void (*mac_change)(uint16_t port_id, const struct rte_ether_addr *mac);
+ /**< Callback for notification about MAC change of the proxy interface.
+ * This callback (as all other port related callbacks) is called for
+ * each port (with its port_id as a first argument) bound to the proxy
+ * interface for which change has been observed.
+ * @see RTE_IFPX_MAC_CHANGE
+ */
+ void (*mtu_change)(uint16_t port_id, uint16_t mtu);
+ /**< Callback for notification about MTU change.
+ * @see RTE_IFPX_MTU_CHANGE
+ */
+ void (*link_change)(uint16_t port_id, int is_up);
+ /**< Callback for notification about link going up/down.
+ * @see RTE_IFPX_LINK_CHANGE
+ */
+ /* All IPv4 addresses are in host order */
+ void (*addr_add)(uint16_t port_id, uint32_t ip);
+ /**< Callback for notification about IPv4 address being added.
+ * @see RTE_IFPX_ADDR_ADD
+ */
+ void (*addr_del)(uint16_t port_id, uint32_t ip);
+ /**< Callback for notification about IPv4 address removal.
+ * @see RTE_IFPX_ADDR_DEL
+ */
+ void (*addr6_add)(uint16_t port_id, const uint8_t *ip);
+ /**< Callback for notification about IPv6 address being added.
+ * @see RTE_IFPX_ADDR6_ADD
+ */
+ void (*addr6_del)(uint16_t port_id, const uint8_t *ip);
+ /**< Callback for notification about IPv6 address removal.
+ * @see RTE_IFPX_ADDR6_DEL
+ */
+ void (*route_add)(uint32_t ip, uint8_t depth);
+ /**< Callback for notification about IPv4 route being added.
+ * Note that "route" callbacks might be also called when user adds
+ * address to the interface (that is in addition to address related
+ * callbacks).
+ * @see RTE_IFPX_ROUTE_ADD
+ */
+ void (*route_del)(uint32_t ip, uint8_t depth);
+ /**< Callback for notification about IPv4 route removal.
+ * @see RTE_IFPX_ROUTE_DEL
+ */
+ void (*route6_add)(const uint8_t *ip, uint8_t depth);
+ /**< Callback for notification about IPv6 route being added.
+ * @see RTE_IFPX_ROUTE6_ADD
+ */
+ void (*route6_del)(const uint8_t *ip, uint8_t depth);
+ /**< Callback for notification about IPv6 route removal.
+ * @see RTE_IFPX_ROUTE6_DEL
+ */
+ void (*cfg_finished)(void);
+ /**< Lib specific callback - called when initial network configuration
+ * query is finished.
+ */
+};
+
+/**
+ * The rte_ifpx_cb_bit enum defines bit mask values to test against value
+ * returned by rte_ifpx_callbacks_available() to learn about type of callbacks
+ * implemented for this platform.
+ */
+enum rte_ifpx_cb_bit {
+ RTE_IFPX_MAC_CHANGE = 1ULL << 0, /**< @see mac_change callback */
+ RTE_IFPX_MTU_CHANGE = 1ULL << 1, /**< @see mtu_change callback */
+ RTE_IFPX_LINK_CHANGE = 1ULL << 2, /**< @see link_change callback */
+ RTE_IFPX_ADDR_ADD = 1ULL << 3, /**< @see addr_add callback */
+ RTE_IFPX_ADDR_DEL = 1ULL << 4, /**< @see addr_del callback */
+ RTE_IFPX_ADDR6_ADD = 1ULL << 5, /**< @see addr6_add callback */
+ RTE_IFPX_ADDR6_DEL = 1ULL << 6, /**< @see addr6_del callback */
+ RTE_IFPX_ROUTE_ADD = 1ULL << 7, /**< @see route_add callback */
+ RTE_IFPX_ROUTE_DEL = 1ULL << 8, /**< @see route_del callback */
+ RTE_IFPX_ROUTE6_ADD = 1ULL << 9, /**< @see route6_add callback */
+ RTE_IFPX_ROUTE6_DEL = 1ULL << 10, /**< @see route6_del callback */
+};
+/**
+ * Get the bit mask of implemented callbacks for this platform.
+ *
+ * @return
+ * Bit mask of callbacks implemented.
+ * @see enum rte_ifpx_cb_bit
+ */
+__rte_experimental
+uint64_t rte_ifpx_callbacks_available(void);
+
+/**
+ * Typedef naming type of value returned during callback registration.
+ *
+ * @see rte_ifpx_callbacks_register()
+ */
+typedef const void *rte_ifpx_cbs_hndl;
+
+/**
+ * Register proxy callbacks.
+ *
+ * This function registers callbacks to be called upon appropriate network
+ * event notification.
+ *
+ * @param cbs
+ * Set of callbacks that will be called. The library does not take any
+ * ownership of the pointer passed - the callbacks are stored internally.
+ *
+ * @return
+ * Non-NULL pointer upon successful registration - that pointer can be used
+ * as a handle to unregister callbacks (and nothing more). On failure NULL
+ * is returned.
+ */
+__rte_experimental
+rte_ifpx_cbs_hndl rte_ifpx_callbacks_register(const
+ struct rte_ifpx_callbacks *cbs);
+
+/**
+ * Unregister proxy callbacks.
+ *
+ * This function unregisters callbacks previously registered with
+ * rte_ifpx_callbacks_register().
+ *
+ * @param cbs
+ * Handle/pointer returned on previous callback registration.
+ *
+ * @return
+ * 0 on success, negative otherwise.
+ */
+__rte_experimental
+int rte_ifpx_callbacks_unregister(rte_ifpx_cbs_hndl cbs);
+
+/**
+ * Bind the port to its proxy.
+ *
+ * After calling this function all network configuration of the proxy (and its
+ * changes) will be passed to given port by calling registered callbacks with
+ * 'port_id' as an argument.
+ *
+ * Note: since both arguments are of the same type in order to not mix them and
+ * ease remembering the order the first one is kept the same for bind/unbind.
+ *
+ * @param port_id
+ * Id of the port to be bound.
+ * @param proxy_id
+ * Id of the proxy the port needs to be bound to.
+ * @return
+ * 0 on success, negative on error.
+ */
+__rte_experimental
+int rte_ifpx_port_bind(uint16_t port_id, uint16_t proxy_id);
+
+/**
+ * Unbind the port from its proxy.
+ *
+ * After calling this function registered callbacks will no longer be called for
+ * this port (but they might be called for other ports in one to many binding
+ * scenario).
+ *
+ * @param port_id
+ * Id of the port to unbind.
+ * @return
+ * 0 on success, negative on error.
+ */
+__rte_experimental
+int rte_ifpx_port_unbind(uint16_t port_id);
+
+/**
+ * Get the system network configuration and start listening to its changes.
+ *
+ * @return
+ * 0 on success, negative otherwise.
+ */
+__rte_experimental
+int rte_ifpx_listen(void);
+
+/**
+ * Remove all bindings/callbacks and stop listening to network configuration.
+ *
+ * @return
+ * 0 on success, negative otherwise.
+ */
+__rte_experimental
+int rte_ifpx_close(void);
+
+/**
+ * Get the id of the proxy the port is bound to.
+ *
+ * @param port_id
+ * Id of the port for which to get proxy.
+ * @return
+ * Port id of the proxy on success, RTE_MAX_ETHPORTS on error.
+ */
+__rte_experimental
+uint16_t rte_ifpx_proxy_get(uint16_t port_id);
+
+/**
+ * Get the ids of the ports bound to the proxy.
+ *
+ * @param proxy_id
+ * Id of the proxy for which to get ports.
+ * @param ports
+ * Array where to store the port ids.
+ * @param num
+ * Size of the 'ports' array.
+ * @return
+ * The number of ports bound to given proxy. Note that the return value of
+ * this function does not depend on the ports/num arguments - so you can call
+ * it first with NULL/0 to query for the size of the buffer to create or call
+ * it with the buffer you have and later check if it was large enough.
+ */
+__rte_experimental
+unsigned int rte_ifpx_port_get(uint16_t proxy_id,
+ uint16_t *ports, unsigned int num);
+
+/**
+ * The structure containing some properties of the proxy interface.
+ */
+struct rte_ifpx_info {
+ unsigned int if_index; /* entry valid iff if_index != 0 */
+ uint16_t mtu;
+ struct rte_ether_addr mac;
+ char if_name[RTE_ETH_NAME_MAX_LEN];
+};
+
+/**
+ * Get the properties of the proxy interface given port is bound to.
+ *
+ * @param port_id
+ * Id of the port for which to get proxy properties.
+ * @return
+ * Pointer to the proxy information structure.
+ */
+__rte_experimental
+const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_IF_PROXY_H_ */
--
2.17.1
* [dpdk-dev] [RFC PATCH 2/3] if_proxy: add preliminary Linux implementation
From: Andrzej Ostruszka @ 2020-01-14 14:25 UTC
To: dev, Thomas Monjalon
Cc: Jerin Jacob Kollanukkaran, Nithin Kumar Dabilpuram,
Pavan Nikhilesh Bhagavatula, Kiran Kumar Kokkilagadda,
Krzysztof Kanas
This commit adds a preliminary Linux implementation of the IF Proxy
library. It should allow one to play around with the idea and check its
usefulness.
Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com>
---
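Reviewer note (not part of the commit): request_info() in this patch
stores the interface index and the request type in the netlink sequence
number (index << 8 | type), and the handlers use it to drop replies that
belong to a different interface. The following standalone sketch shows
that packing scheme in isolation. The helper names are hypothetical, and
the REQ_* values are assumed to mirror RTM_GETLINK/RTM_GETADDR from
linux/rtnetlink.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror RTM_GETLINK and RTM_GETADDR. */
enum { REQ_GETLINK = 18, REQ_GETADDR = 22 };

/* Pack the interface index (0 means "global dump") and the request
 * type into one 32-bit sequence number. Only the low 24 bits of
 * if_index survive the shift.
 */
static uint32_t seq_pack(uint32_t if_index, uint8_t type)
{
	return if_index << 8 | type;
}

static uint32_t seq_if_index(uint32_t seq)
{
	return seq >> 8;
}

static uint8_t seq_type(uint32_t seq)
{
	return (uint8_t)(seq & 0xFF);
}

/* A reply to a global dump (index 0) is relevant for every interface;
 * a reply to an interface specific request is relevant only when it
 * describes that interface.
 */
static bool reply_is_relevant(uint32_t seq, uint32_t msg_if_index)
{
	uint32_t requested = seq_if_index(seq);

	return requested == 0 || requested == msg_if_index;
}
```

One consequence of this encoding is that interface indexes above 2^24 - 1
cannot be represented, which may be worth a comment in the final code.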
config/common_base | 5 +
lib/Makefile | 2 +
.../common/include/rte_eal_interrupts.h | 2 +
lib/librte_eal/linux/eal/eal_interrupts.c | 14 +-
lib/librte_if_proxy/Makefile | 25 +
lib/librte_if_proxy/meson.build | 7 +
lib/librte_if_proxy/rte_if_proxy.c | 803 ++++++++++++++++++
lib/meson.build | 2 +-
8 files changed, 855 insertions(+), 5 deletions(-)
create mode 100644 lib/librte_if_proxy/Makefile
create mode 100644 lib/librte_if_proxy/meson.build
create mode 100644 lib/librte_if_proxy/rte_if_proxy.c
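Reviewer note (not part of the commit): the address and route handlers
receive IPv4 addresses as four bytes in network order and hand them to
the callbacks as a host-order uint32_t together with a prefix depth. A
standalone sketch of that conversion, and of how a callback might turn
the depth into a mask, is given below. All three helpers are hypothetical
and do not exist in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build a host-order IPv4 value from four wire-order bytes, the same
 * arithmetic the patch performs with RTE_IPV4(ip[0], ..., ip[3]).
 */
static uint32_t ipv4_from_wire(const uint8_t ip[4])
{
	return (uint32_t)ip[0] << 24 | (uint32_t)ip[1] << 16 |
	       (uint32_t)ip[2] << 8 | (uint32_t)ip[3];
}

/* Convert a prefix depth (0..32) into a host-order netmask without
 * ever shifting by the full width of the type.
 */
static uint32_t depth_to_mask(uint8_t depth)
{
	return depth >= 32 ? UINT32_MAX : ~(UINT32_MAX >> depth);
}

/* Check whether a host-order address falls inside prefix/depth. */
static bool route_matches(uint32_t prefix, uint8_t depth, uint32_t addr)
{
	uint32_t mask = depth_to_mask(depth);

	return (addr & mask) == (prefix & mask);
}
```

A route_add callback could store (prefix, depth) pairs and use a check
like route_matches() for lookups, although a real application would more
likely feed them into an LPM table.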
diff --git a/config/common_base b/config/common_base
index 7dec7ed45..f20296750 100644
--- a/config/common_base
+++ b/config/common_base
@@ -1056,6 +1056,11 @@ CONFIG_RTE_LIBRTE_BPF_ELF=n
#
CONFIG_RTE_LIBRTE_IPSEC=y
+#
+# Compile librte_if_proxy
+#
+CONFIG_RTE_LIBRTE_IF_PROXY=y
+
#
# Compile the test application
#
diff --git a/lib/Makefile b/lib/Makefile
index 46b91ae1a..0a60f3656 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -118,6 +118,8 @@ DIRS-$(CONFIG_RTE_LIBRTE_TELEMETRY) += librte_telemetry
DEPDIRS-librte_telemetry := librte_eal librte_metrics librte_ethdev
DIRS-$(CONFIG_RTE_LIBRTE_RCU) += librte_rcu
DEPDIRS-librte_rcu := librte_eal
+DIRS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += librte_if_proxy
+DEPDIRS-librte_if_proxy := librte_eal
ifeq ($(CONFIG_RTE_EXEC_ENV_LINUX),y)
DIRS-$(CONFIG_RTE_LIBRTE_KNI) += librte_kni
diff --git a/lib/librte_eal/common/include/rte_eal_interrupts.h b/lib/librte_eal/common/include/rte_eal_interrupts.h
index b370c0d26..f3d39a5ce 100644
--- a/lib/librte_eal/common/include/rte_eal_interrupts.h
+++ b/lib/librte_eal/common/include/rte_eal_interrupts.h
@@ -35,7 +35,9 @@ enum rte_intr_handle_type {
RTE_INTR_HANDLE_EXT, /**< external handler */
RTE_INTR_HANDLE_VDEV, /**< virtual device */
RTE_INTR_HANDLE_DEV_EVENT, /**< device event handle */
+ RTE_INTR_HANDLE_NETLINK, /**< netlink notification handle */
RTE_INTR_HANDLE_VFIO_REQ, /**< VFIO request handle */
+
RTE_INTR_HANDLE_MAX /**< count of elements */
};
diff --git a/lib/librte_eal/linux/eal/eal_interrupts.c b/lib/librte_eal/linux/eal/eal_interrupts.c
index 14ebb108c..ccdd94002 100644
--- a/lib/librte_eal/linux/eal/eal_interrupts.c
+++ b/lib/librte_eal/linux/eal/eal_interrupts.c
@@ -680,6 +680,9 @@ rte_intr_enable(const struct rte_intr_handle *intr_handle)
break;
/* not used at this moment */
case RTE_INTR_HANDLE_ALARM:
+#if RTE_LIBRTE_IF_PROXY
+ case RTE_INTR_HANDLE_NETLINK:
+#endif
return -1;
#ifdef VFIO_PRESENT
case RTE_INTR_HANDLE_VFIO_MSIX:
@@ -796,6 +799,9 @@ rte_intr_disable(const struct rte_intr_handle *intr_handle)
break;
/* not used at this moment */
case RTE_INTR_HANDLE_ALARM:
+#if RTE_LIBRTE_IF_PROXY
+ case RTE_INTR_HANDLE_NETLINK:
+#endif
return -1;
#ifdef VFIO_PRESENT
case RTE_INTR_HANDLE_VFIO_MSIX:
@@ -889,12 +895,12 @@ eal_intr_process_interrupts(struct epoll_event *events, int nfds)
break;
#endif
#endif
- case RTE_INTR_HANDLE_VDEV:
case RTE_INTR_HANDLE_EXT:
- bytes_read = 0;
- call = true;
- break;
+ case RTE_INTR_HANDLE_VDEV:
case RTE_INTR_HANDLE_DEV_EVENT:
+#if RTE_LIBRTE_IF_PROXY
+ case RTE_INTR_HANDLE_NETLINK:
+#endif
bytes_read = 0;
call = true;
break;
diff --git a/lib/librte_if_proxy/Makefile b/lib/librte_if_proxy/Makefile
new file mode 100644
index 000000000..9dd5f4791
--- /dev/null
+++ b/lib/librte_if_proxy/Makefile
@@ -0,0 +1,25 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2019 Marvell International Ltd.
+
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# library name
+LIB = librte_if_proxy.a
+
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
+LDLIBS += -lrte_eal
+
+EXPORT_MAP := rte_if_proxy_version.map
+
+LIBABIVER := 1
+
+# all source are stored in SRCS-y
+SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) := rte_if_proxy.c
+
+# install this header file
+SYMLINK-$(CONFIG_RTE_LIBRTE_IF_PROXY)-include := rte_if_proxy.h
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_if_proxy/meson.build b/lib/librte_if_proxy/meson.build
new file mode 100644
index 000000000..f9ed410b6
--- /dev/null
+++ b/lib/librte_if_proxy/meson.build
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2019 Marvell International Ltd.
+
+version = 1
+allow_experimental_apis = true
+sources = files('rte_if_proxy.c')
+headers = files('rte_if_proxy.h')
diff --git a/lib/librte_if_proxy/rte_if_proxy.c b/lib/librte_if_proxy/rte_if_proxy.c
new file mode 100644
index 000000000..770462702
--- /dev/null
+++ b/lib/librte_if_proxy/rte_if_proxy.c
@@ -0,0 +1,803 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2020 Marvell International Ltd.
+ */
+
+#include <rte_if_proxy.h>
+#include <rte_interrupts.h>
+#include <rte_spinlock.h>
+#include <rte_string_fns.h>
+
+#include <stdbool.h>
+#include <unistd.h>
+#include <errno.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <sys/socket.h>
+#include <sys/queue.h>
+
+static
+int ifpx_log_type;
+#define IFPX_LOG(level, fmt, args...) \
+ rte_log(RTE_LOG_ ## level, ifpx_log_type, "%s(): " fmt "\n", \
+ __func__, ##args)
+
+static
+struct rte_intr_handle ifpx_irq = {
+ .type = RTE_INTR_HANDLE_NETLINK,
+ .fd = -1,
+};
+
+static
+unsigned int ifpx_pid;
+
+/* Port to proxy mapping table */
+static uint16_t ifpx_p2p[RTE_MAX_ETHPORTS];
+
+/* Since this library is really slow/config path we guard data structures with
+ * a lock - and only one for all of them should be enough. But only callback
+ * and proxies lists are protected, I don't expect the need to protect port to
+ * proxy map table above.
+ */
+static
+rte_spinlock_t ifpx_lock = RTE_SPINLOCK_INITIALIZER;
+
+/* List of configured proxies */
+struct ifpx_proxies_node {
+ TAILQ_ENTRY(ifpx_proxies_node) elem;
+ uint16_t proxy_id;
+ struct rte_ifpx_info info;
+};
+static
+TAILQ_HEAD(ifpx_proxies_head, ifpx_proxies_node) ifpx_proxies =
+ TAILQ_HEAD_INITIALIZER(ifpx_proxies);
+
+/* List of registered callbacks */
+struct ifpx_cbs_node {
+ TAILQ_ENTRY(ifpx_cbs_node) elem;
+ struct rte_ifpx_callbacks cbs;
+};
+static
+TAILQ_HEAD(ifpx_cbs_head, ifpx_cbs_node) ifpx_callbacks =
+ TAILQ_HEAD_INITIALIZER(ifpx_callbacks);
+
+static
+int request_info(int type, int index);
+
+uint64_t rte_ifpx_callbacks_available(void)
+{
+ return RTE_IFPX_MAC_CHANGE | RTE_IFPX_MTU_CHANGE |
+ RTE_IFPX_LINK_CHANGE | RTE_IFPX_ADDR_ADD |
+ RTE_IFPX_ADDR_DEL | RTE_IFPX_ADDR6_ADD |
+ RTE_IFPX_ADDR6_DEL | RTE_IFPX_ROUTE_ADD |
+ RTE_IFPX_ROUTE_DEL | RTE_IFPX_ROUTE6_ADD |
+ RTE_IFPX_ROUTE6_DEL;
+}
+
+uint16_t rte_ifpx_create(enum rte_ifpx_type type)
+{
+ char devargs[16] = { '\0' };
+ int dev_cnt = 0, nlen;
+ uint16_t port_id;
+
+ switch (type) {
+ case RTE_IFPX_DEFAULT:
+ case RTE_IFPX_TAP:
+ nlen = strlcpy(devargs, "net_tap", sizeof(devargs));
+ break;
+ case RTE_IFPX_KNI:
+ nlen = strlcpy(devargs, "net_kni", sizeof(devargs));
+ break;
+ default:
+ IFPX_LOG(ERR, "Unknown proxy type: %d", type);
+ return RTE_MAX_ETHPORTS;
+ }
+
+ RTE_ETH_FOREACH_DEV(port_id) {
+ if (strcmp(rte_eth_devices[port_id].device->driver->name,
+ devargs) == 0)
+ ++dev_cnt;
+ }
+ snprintf(devargs+nlen, sizeof(devargs)-nlen, "%d", dev_cnt);
+
+ return rte_ifpx_create_by_devarg(devargs);
+}
+
+uint16_t rte_ifpx_create_by_devarg(const char *devarg)
+{
+ uint16_t port_id = RTE_MAX_ETHPORTS;
+ struct rte_dev_iterator iter;
+
+ if (rte_dev_probe(devarg) < 0) {
+ IFPX_LOG(ERR, "Failed to create proxy port %s", devarg);
+ return RTE_MAX_ETHPORTS;
+ }
+
+ RTE_ETH_FOREACH_MATCHING_DEV(port_id, devarg, &iter) {
+ break;
+ }
+ if (port_id != RTE_MAX_ETHPORTS)
+ rte_eth_iterator_cleanup(&iter);
+
+ return port_id;
+}
+
+int rte_ifpx_destroy(uint16_t proxy_id)
+{
+ struct ifpx_proxies_node *px;
+ unsigned int i;
+ int ec = 0;
+
+ rte_spinlock_lock(&ifpx_lock);
+ TAILQ_FOREACH(px, &ifpx_proxies, elem) {
+ if (px->proxy_id == proxy_id)
+ break;
+ }
+ if (!px) {
+ ec = -EINVAL;
+ goto exit;
+ }
+ TAILQ_REMOVE(&ifpx_proxies, px, elem);
+ free(px);
+
+ /* Clear any bindings for this proxy. */
+ for (i = 0; i < RTE_DIM(ifpx_p2p); ++i) {
+ if (ifpx_p2p[i] == proxy_id)
+ ifpx_p2p[i] = RTE_MAX_ETHPORTS;
+ }
+
+ ec = rte_dev_remove(rte_eth_devices[proxy_id].device);
+exit:
+ rte_spinlock_unlock(&ifpx_lock);
+ return ec;
+}
+
+int rte_ifpx_port_bind(uint16_t port_id, uint16_t proxy_id)
+{
+ struct rte_eth_dev_info proxy_eth_info;
+ struct ifpx_proxies_node *px;
+ int ec;
+
+ if (port_id >= RTE_MAX_ETHPORTS || proxy_id >= RTE_MAX_ETHPORTS) {
+ IFPX_LOG(ERR, "Invalid port_id: %d", port_id);
+ return -EINVAL;
+ }
+
+ /* Do automatic rebinding but issue a warning since this is not
+ * considered to be a valid behaviour.
+ */
+ if (ifpx_p2p[port_id] != RTE_MAX_ETHPORTS) {
+ IFPX_LOG(WARNING, "Port already bound: %d -> %d", port_id,
+ ifpx_p2p[port_id]);
+ }
+
+ ec = rte_eth_dev_info_get(proxy_id, &proxy_eth_info);
+ if (ec < 0) {
+ IFPX_LOG(ERR, "Failed to read proxy dev info: %d", ec);
+ return ec;
+ }
+ if (proxy_eth_info.if_index == 0) {
+ IFPX_LOG(ERR, "Proxy with no IF index");
+ return -EINVAL;
+ }
+
+ /* Search for existing proxy - if not found add one to the list. */
+ rte_spinlock_lock(&ifpx_lock);
+ TAILQ_FOREACH(px, &ifpx_proxies, elem) {
+ if (px->proxy_id == proxy_id)
+ break;
+ }
+ if (!px) {
+ px = malloc(sizeof(*px));
+ if (!px) {
+ rte_spinlock_unlock(&ifpx_lock);
+ return -ENOMEM;
+ }
+ px->proxy_id = proxy_id;
+ px->info.if_index = proxy_eth_info.if_index;
+ rte_eth_dev_get_mtu(proxy_id, &px->info.mtu);
+ rte_eth_macaddr_get(proxy_id, &px->info.mac);
+ memset(px->info.if_name, 0, sizeof(px->info.if_name));
+ TAILQ_INSERT_TAIL(&ifpx_proxies, px, elem);
+ }
+ rte_spinlock_unlock(&ifpx_lock);
+ ifpx_p2p[port_id] = proxy_id;
+
+ if (ifpx_irq.fd != -1)
+ request_info(RTM_GETLINK, px->info.if_index);
+
+ return 0;
+}
+
+int rte_ifpx_port_unbind(uint16_t port_id)
+{
+ if (port_id >= RTE_MAX_ETHPORTS ||
+ ifpx_p2p[port_id] == RTE_MAX_ETHPORTS)
+ return -EINVAL;
+
+ ifpx_p2p[port_id] = RTE_MAX_ETHPORTS;
+ /* Proxy without any port bound is OK - that is the state of the proxy
+ * that has just been created, and it can still report routing
+ * information. So we do not even check if this is the case.
+ */
+
+ return 0;
+}
+
+rte_ifpx_cbs_hndl rte_ifpx_callbacks_register(const
+ struct rte_ifpx_callbacks *cbs)
+{
+ rte_ifpx_cbs_hndl cb_hndl = NULL;
+ struct ifpx_cbs_node *node;
+
+ if (!cbs)
+ return NULL;
+
+ rte_spinlock_lock(&ifpx_lock);
+ TAILQ_FOREACH(node, &ifpx_callbacks, elem) {
+ if (&node->cbs == cbs) {
+ cb_hndl = cbs;
+ goto exit;
+ }
+ }
+
+ node = malloc(sizeof(*node));
+ if (!node)
+ goto exit;
+
+ node->cbs = *cbs;
+ TAILQ_INSERT_TAIL(&ifpx_callbacks, node, elem);
+ cb_hndl = &node->cbs;
+exit:
+ rte_spinlock_unlock(&ifpx_lock);
+
+ return cb_hndl;
+}
+
+int rte_ifpx_callbacks_unregister(rte_ifpx_cbs_hndl cbs)
+{
+ struct ifpx_cbs_node *node;
+ int ec = -EINVAL;
+
+ if (!cbs)
+ return ec;
+
+ rte_spinlock_lock(&ifpx_lock);
+ TAILQ_FOREACH(node, &ifpx_callbacks, elem) {
+ if (&node->cbs == cbs) {
+ TAILQ_REMOVE(&ifpx_callbacks, node, elem);
+ free(node);
+ ec = 0;
+ break;
+ }
+ }
+ rte_spinlock_unlock(&ifpx_lock);
+
+ return ec;
+}
+
+uint16_t rte_ifpx_proxy_get(uint16_t port_id)
+{
+ if (port_id >= RTE_MAX_ETHPORTS)
+ return RTE_MAX_ETHPORTS;
+
+ return ifpx_p2p[port_id];
+}
+
+unsigned int rte_ifpx_port_get(uint16_t proxy_id,
+ uint16_t *ports, unsigned int num)
+{
+ unsigned int p, cnt = 0;
+
+ for (p = 0; p < RTE_DIM(ifpx_p2p); ++p) {
+ if (ifpx_p2p[p] == proxy_id) {
+ ++cnt;
+ if (ports && num > 0) {
+ *ports++ = p;
+ --num;
+ }
+ }
+ }
+ return cnt;
+}
+
+const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id)
+{
+ struct ifpx_proxies_node *px;
+
+ if (port_id >= RTE_MAX_ETHPORTS ||
+ ifpx_p2p[port_id] == RTE_MAX_ETHPORTS)
+ return NULL;
+
+ rte_spinlock_lock(&ifpx_lock);
+ TAILQ_FOREACH(px, &ifpx_proxies, elem) {
+ if (px->proxy_id == ifpx_p2p[port_id])
+ break;
+ }
+ rte_spinlock_unlock(&ifpx_lock);
+ RTE_ASSERT(px && "Internal IF Proxy library error");
+
+ return px ? &px->info : NULL;
+}
+
+static
+void handle_link(const struct nlmsghdr *h)
+{
+ const struct ifinfomsg *ifi = NLMSG_DATA(h);
+ int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
+ const struct rtattr *attrs[IFLA_MAX+1] = { NULL };
+ const struct rtattr *attr;
+ struct ifpx_proxies_node *px;
+ struct ifpx_cbs_node *cb;
+ uint16_t p;
+
+ IFPX_LOG(DEBUG, "\tLink action (%u): %u, 0x%x/0x%x (flags/changed)",
+ ifi->ifi_index, h->nlmsg_type, ifi->ifi_flags,
+ ifi->ifi_change);
+
+ rte_spinlock_lock(&ifpx_lock);
+ TAILQ_FOREACH(px, &ifpx_proxies, elem) {
+ if (px->info.if_index == (unsigned int)ifi->ifi_index)
+ break;
+ }
+ rte_spinlock_unlock(&ifpx_lock);
+
+ /* Drop messages that are not associated with any proxy */
+ if (!px)
+ return;
+ /* When message is a reply to request for specific interface then keep
+ * it only when it contains info for this interface.
+ */
+ if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 &&
+ (h->nlmsg_seq >> 8) != (unsigned int)ifi->ifi_index)
+ return;
+
+ for (attr = IFLA_RTA(ifi); RTA_OK(attr, alen);
+ attr = RTA_NEXT(attr, alen)) {
+ if (attr->rta_type > IFLA_MAX)
+ continue;
+ attrs[attr->rta_type] = attr;
+ }
+
+ rte_spinlock_lock(&ifpx_lock);
+ if (ifi->ifi_change & IFF_UP) {
+ TAILQ_FOREACH(cb, &ifpx_callbacks, elem) {
+ if (!cb->cbs.link_change)
+ continue;
+ for (p = 0; p < RTE_DIM(ifpx_p2p); ++p) {
+ if (ifpx_p2p[p] != px->proxy_id)
+ continue;
+ cb->cbs.link_change(p,
+ ifi->ifi_flags & IFF_UP);
+ }
+ }
+ }
+ if (attrs[IFLA_MTU]) {
+ uint16_t mtu = *(const int *)RTA_DATA(attrs[IFLA_MTU]);
+ if (mtu != px->info.mtu) {
+ px->info.mtu = mtu;
+ TAILQ_FOREACH(cb, &ifpx_callbacks, elem) {
+ if (!cb->cbs.mtu_change)
+ continue;
+ for (p = 0; p < RTE_DIM(ifpx_p2p); ++p) {
+ if (ifpx_p2p[p] != px->proxy_id)
+ continue;
+ cb->cbs.mtu_change(p, mtu);
+ }
+ }
+ }
+ }
+ if (attrs[IFLA_ADDRESS]) {
+ const struct rte_ether_addr *mac =
+ RTA_DATA(attrs[IFLA_ADDRESS]);
+
+ RTE_ASSERT(RTA_PAYLOAD(attrs[IFLA_ADDRESS]) ==
+ RTE_ETHER_ADDR_LEN);
+ if (memcmp(mac, &px->info.mac, RTE_ETHER_ADDR_LEN) != 0) {
+ memcpy(px->info.mac.addr_bytes, mac, RTE_ETHER_ADDR_LEN);
+ TAILQ_FOREACH(cb, &ifpx_callbacks, elem) {
+ if (!cb->cbs.mac_change)
+ continue;
+ for (p = 0; p < RTE_DIM(ifpx_p2p); ++p) {
+ if (ifpx_p2p[p] != px->proxy_id)
+ continue;
+ cb->cbs.mac_change(p, mac);
+ }
+ }
+ }
+ }
+ rte_spinlock_unlock(&ifpx_lock);
+ if (h->nlmsg_pid == ifpx_pid) {
+ RTE_ASSERT((h->nlmsg_seq & 0xFF) == RTM_GETLINK);
+ /* If this is reply for specific link request (not initial
+ * global dump) then follow up with address request, otherwise
+ * just store the interface name.
+ */
+ if (h->nlmsg_seq >> 8)
+ request_info(RTM_GETADDR, ifi->ifi_index);
+ else if (!px->info.if_name[0] && attrs[IFLA_IFNAME])
+ strlcpy(px->info.if_name, RTA_DATA(attrs[IFLA_IFNAME]),
+ sizeof(px->info.if_name));
+ }
+}
+
+static
+void handle_addr(const struct nlmsghdr *h, bool needs_del)
+{
+ const struct ifaddrmsg *ifa = NLMSG_DATA(h);
+ int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifa));
+ const struct rtattr *attrs[IFA_MAX+1] = { NULL };
+ const struct rtattr *attr;
+ struct ifpx_proxies_node *px;
+ struct ifpx_cbs_node *cb;
+ const uint8_t *ip;
+ uint16_t p;
+
+ rte_spinlock_lock(&ifpx_lock);
+ TAILQ_FOREACH(px, &ifpx_proxies, elem) {
+ if (px->info.if_index == ifa->ifa_index)
+ break;
+ }
+ rte_spinlock_unlock(&ifpx_lock);
+
+ /* Drop messages that are not associated with any proxy */
+ if (!px)
+ return;
+ /* When message is a reply to request for specific interface then keep
+ * it only when it contains info for this interface.
+ */
+ if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 &&
+ (h->nlmsg_seq >> 8) != ifa->ifa_index)
+ return;
+
+ for (attr = IFA_RTA(ifa); RTA_OK(attr, alen);
+ attr = RTA_NEXT(attr, alen)) {
+ if (attr->rta_type > IFA_MAX)
+ continue;
+ attrs[attr->rta_type] = attr;
+ }
+
+ rte_spinlock_lock(&ifpx_lock);
+ if (attrs[IFA_ADDRESS]) {
+ TAILQ_FOREACH(cb, &ifpx_callbacks, elem) {
+ struct rte_ifpx_callbacks *cbs = &cb->cbs;
+
+ ip = RTA_DATA(attrs[IFA_ADDRESS]);
+ if (ifa->ifa_family == AF_INET) {
+ /* address is in network order */
+ uint32_t ipv4 =
+ RTE_IPV4(ip[0], ip[1], ip[2], ip[3]);
+
+ for (p = 0; p < RTE_DIM(ifpx_p2p); ++p) {
+ if (ifpx_p2p[p] != px->proxy_id)
+ continue;
+ if (needs_del && cbs->addr_del)
+ cb->cbs.addr_del(p, ipv4);
+ else if (!needs_del && cbs->addr_add)
+ cb->cbs.addr_add(p, ipv4);
+ }
+ } else if (ifa->ifa_family == AF_INET6) {
+ for (p = 0; p < RTE_DIM(ifpx_p2p); ++p) {
+ if (ifpx_p2p[p] != px->proxy_id)
+ continue;
+ if (needs_del && cbs->addr6_del)
+ cb->cbs.addr6_del(p, ip);
+ else if (!needs_del && cbs->addr6_add)
+ cb->cbs.addr6_add(p, ip);
+ }
+ }
+ }
+ }
+ rte_spinlock_unlock(&ifpx_lock);
+}
+
+static
+void handle_route(const struct nlmsghdr *h, bool needs_del)
+{
+ const struct rtmsg *r = NLMSG_DATA(h);
+ int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*r));
+ const struct rtattr *attrs[RTA_MAX+1] = { NULL };
+ const struct rtattr *attr;
+ struct ifpx_cbs_node *node;
+ const uint8_t *ip;
+
+ for (attr = RTM_RTA(r); RTA_OK(attr, alen);
+ attr = RTA_NEXT(attr, alen)) {
+ if (attr->rta_type > RTA_MAX)
+ continue;
+ attrs[attr->rta_type] = attr;
+ }
+
+ rte_spinlock_lock(&ifpx_lock);
+ if (attrs[RTA_DST]) {
+ TAILQ_FOREACH(node, &ifpx_callbacks, elem) {
+ struct rte_ifpx_callbacks *cbs = &node->cbs;
+
+ ip = RTA_DATA(attrs[RTA_DST]);
+ if (r->rtm_family == AF_INET) {
+ /* address is in network order */
+ uint32_t ipv4 =
+ RTE_IPV4(ip[0], ip[1], ip[2], ip[3]);
+
+ if (needs_del && cbs->route_del)
+ cbs->route_del(ipv4, r->rtm_dst_len);
+ else if (!needs_del && cbs->route_add)
+ cbs->route_add(ipv4, r->rtm_dst_len);
+ } else if (r->rtm_family == AF_INET6) {
+ if (needs_del && cbs->route6_del)
+ cbs->route6_del(ip, r->rtm_dst_len);
+ else if (!needs_del && cbs->route6_add)
+ cbs->route6_add(ip, r->rtm_dst_len);
+ }
+ }
+ }
+ rte_spinlock_unlock(&ifpx_lock);
+}
+
+static
+int request_info(int type, int index)
+{
+ static rte_spinlock_t send_lock = RTE_SPINLOCK_INITIALIZER;
+ struct info_get {
+ struct nlmsghdr h;
+ union {
+ struct ifinfomsg ifm;
+ struct ifaddrmsg ifa;
+ struct rtmsg rtm;
+ } __rte_aligned(NLMSG_ALIGNTO);
+ } info_req;
+ int ret;
+
+ IFPX_LOG(DEBUG, "\tRequesting msg %d for: %d", type, index);
+
+ memset(&info_req, 0, sizeof(info_req));
+ /* First byte of these messages is family, so just make sure that this
+ * memset is enough to get all families.
+ */
+ RTE_ASSERT(AF_UNSPEC == 0);
+
+ info_req.h.nlmsg_pid = ifpx_pid;
+ info_req.h.nlmsg_type = type;
+ info_req.h.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
+ info_req.h.nlmsg_len = offsetof(struct info_get, ifm);
+
+ switch (type) {
+ case RTM_GETLINK:
+ info_req.h.nlmsg_len += sizeof(info_req.ifm);
+ info_req.ifm.ifi_index = index;
+ break;
+ case RTM_GETADDR:
+ info_req.h.nlmsg_len += sizeof(info_req.ifa);
+ info_req.ifa.ifa_index = index;
+ break;
+ case RTM_GETROUTE:
+ info_req.h.nlmsg_len += sizeof(info_req.rtm);
+ break;
+ default:
+ return -EINVAL;
+ }
+ /* Store request type (and if it is global or link specific) in 'seq'.
+ * Later it is used during handling of reply to continue requesting of
+ * information dump from system - if needed.
+ */
+ info_req.h.nlmsg_seq = index << 8 | type;
+
+ rte_spinlock_lock(&send_lock);
+ ret = send(ifpx_irq.fd, &info_req, info_req.h.nlmsg_len, 0);
+ if (ret < 0) {
+ rte_errno = errno;
+ IFPX_LOG(ERR, "Failed to send netlink msg: %d", rte_errno);
+ }
+ rte_spinlock_unlock(&send_lock);
+
+ return ret;
+}
+
+static
+void notify_cfg_finished(void)
+{
+ struct ifpx_cbs_node *node;
+
+ rte_spinlock_lock(&ifpx_lock);
+ TAILQ_FOREACH(node, &ifpx_callbacks, elem) {
+ if (!node->cbs.cfg_finished)
+ continue;
+ node->cbs.cfg_finished();
+ }
+ rte_spinlock_unlock(&ifpx_lock);
+}
+
+static
+void if_proxy_intr_callback(void *arg __rte_unused)
+{
+ struct nlmsghdr *h;
+ struct sockaddr_nl addr;
+ socklen_t addr_len = sizeof(addr); /* value-result arg of recvfrom() */
+ char buf[8192];
+ ssize_t len;
+
+restart:
+ len = recvfrom(ifpx_irq.fd, buf, sizeof(buf), 0,
+ (struct sockaddr *)&addr, &addr_len);
+ if (len < 0) {
+ if (errno == EINTR) {
+ IFPX_LOG(DEBUG, "recvfrom() interrupted");
+ goto restart;
+ }
+ IFPX_LOG(ERR, "Failed to read netlink msg: %zd (errno %d)",
+ len, errno);
+ return;
+ }
+ if (addr_len != sizeof(addr)) {
+ IFPX_LOG(ERR, "Invalid netlink addr size: %u", addr_len);
+ return;
+ }
+ IFPX_LOG(DEBUG, "Read %zd bytes (buf %zu) from %u/%u", len,
+ sizeof(buf), addr.nl_pid, addr.nl_groups);
+
+ for (h = (struct nlmsghdr *)buf; NLMSG_OK(h, len);
+ h = NLMSG_NEXT(h, len)) {
+ IFPX_LOG(DEBUG, "Recv msg: %u (%u/%u/%u seq/flags/pid)",
+ h->nlmsg_type, h->nlmsg_seq, h->nlmsg_flags,
+ h->nlmsg_pid);
+
+ switch (h->nlmsg_type) {
+ case RTM_NEWLINK:
+ case RTM_DELLINK:
+ handle_link(h);
+ break;
+ case RTM_NEWADDR:
+ case RTM_DELADDR:
+ handle_addr(h, h->nlmsg_type == RTM_DELADDR);
+ break;
+ case RTM_NEWROUTE:
+ case RTM_DELROUTE:
+ handle_route(h, h->nlmsg_type == RTM_DELROUTE);
+ break;
+ }
+
+ /* If this is a reply for global request then follow up with
+ * additional requests and notify about finish.
+ */
+ if (h->nlmsg_pid == ifpx_pid && (h->nlmsg_seq >> 8) == 0 &&
+ h->nlmsg_type == NLMSG_DONE) {
+ if ((h->nlmsg_seq & 0xFF) == RTM_GETLINK)
+ request_info(RTM_GETADDR, 0);
+ else if ((h->nlmsg_seq & 0xFF) == RTM_GETADDR)
+ request_info(RTM_GETROUTE, 0);
+ else {
+ RTE_ASSERT((h->nlmsg_seq & 0xFF) ==
+ RTM_GETROUTE);
+ notify_cfg_finished();
+ }
+ }
+ }
+ IFPX_LOG(DEBUG, "Finished msg loop: %zd bytes left", len);
+}
+
+int rte_ifpx_listen(void)
+{
+ struct sockaddr_nl addr = {
+ .nl_family = AF_NETLINK,
+ .nl_pid = 0,
+ };
+ socklen_t addr_len = sizeof(addr);
+ int ret;
+
+ if (ifpx_irq.fd != -1) {
+ rte_errno = EBUSY;
+ return -1;
+ }
+
+ addr.nl_groups = 1 << (RTNLGRP_LINK-1)
+ | 1 << (RTNLGRP_IPV4_IFADDR-1)
+ | 1 << (RTNLGRP_IPV6_IFADDR-1)
+ | 1 << (RTNLGRP_IPV4_ROUTE-1)
+ | 1 << (RTNLGRP_IPV6_ROUTE-1);
+
+ ifpx_irq.fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC,
+ NETLINK_ROUTE);
+ if (ifpx_irq.fd == -1) {
+ IFPX_LOG(ERR, "Failed to create netlink socket: %d", errno);
+ goto error;
+ }
+ /* Starting with kernel 4.20 you can request dump for a specific
+ * interface and kernel will filter out and send only relevant info.
+ * Otherwise NLM_F_DUMP will generate info for all interfaces and you
+ * need to filter them yourself.
+ */
+#ifdef NETLINK_GET_STRICT_CHK
+ ret = 1; /* use this var also as an input param */
+ ret = setsockopt(ifpx_irq.fd, SOL_NETLINK, NETLINK_GET_STRICT_CHK,
+ &ret, sizeof(ret));
+ if (ret < 0) {
+ IFPX_LOG(ERR, "Failed to set socket option: %d", errno);
+ goto error;
+ }
+#endif
+
+ ret = bind(ifpx_irq.fd, (struct sockaddr *)&addr, addr_len);
+ if (ret < 0) {
+ IFPX_LOG(ERR, "Failed to bind socket: %d", errno);
+ goto error;
+ }
+ ret = getsockname(ifpx_irq.fd, (struct sockaddr *)&addr, &addr_len);
+ if (ret < 0) {
+ IFPX_LOG(ERR, "Failed to get socket addr: %d", errno);
+ goto error;
+ } else {
+ ifpx_pid = addr.nl_pid;
+ IFPX_LOG(DEBUG, "Assigned port ID: %u", addr.nl_pid);
+ }
+
+ ret = rte_intr_callback_register(&ifpx_irq, if_proxy_intr_callback,
+ NULL);
+ if (ret < 0)
+ goto error;
+
+ request_info(RTM_GETLINK, 0);
+
+ return 0;
+
+error:
+ rte_errno = errno;
+ if (ifpx_irq.fd != -1) {
+ close(ifpx_irq.fd);
+ ifpx_irq.fd = -1;
+ }
+ return -1;
+}
+
+int rte_ifpx_close(void)
+{
+ int ec;
+ unsigned int p;
+ struct ifpx_cbs_node *cbs;
+ struct ifpx_proxies_node *px;
+
+ if (ifpx_irq.fd < 0)
+ return -EBADFD;
+
+restart:
+ ec = rte_intr_callback_unregister(&ifpx_irq,
+ if_proxy_intr_callback, NULL);
+ if (ec == -EAGAIN) /* unlikely but possible - at least I think so */
+ goto restart;
+
+ rte_spinlock_lock(&ifpx_lock);
+
+ close(ifpx_irq.fd);
+ ifpx_irq.fd = -1;
+ ifpx_pid = 0;
+
+ /* Clear callbacks. */
+ while (!TAILQ_EMPTY(&ifpx_callbacks)) {
+ cbs = TAILQ_FIRST(&ifpx_callbacks);
+ TAILQ_REMOVE(&ifpx_callbacks, cbs, elem);
+ free(cbs);
+ }
+
+ /* Clear proxies. */
+ while (!TAILQ_EMPTY(&ifpx_proxies)) {
+ px = TAILQ_FIRST(&ifpx_proxies);
+ TAILQ_REMOVE(&ifpx_proxies, px, elem);
+ free(px);
+ }
+
+ for (p = 0; p < RTE_DIM(ifpx_p2p); ++p)
+ ifpx_p2p[p] = RTE_MAX_ETHPORTS;
+
+ rte_spinlock_unlock(&ifpx_lock);
+
+ return 0;
+}
+
+RTE_INIT(if_proxy_init)
+{
+ unsigned int i;
+ for (i = 0; i < RTE_DIM(ifpx_p2p); ++i)
+ ifpx_p2p[i] = RTE_MAX_ETHPORTS;
+
+ ifpx_log_type = rte_log_register("lib.if_proxy");
+ if (ifpx_log_type >= 0)
+ rte_log_set_level(ifpx_log_type, RTE_LOG_WARNING);
+}
diff --git a/lib/meson.build b/lib/meson.build
index 0af3efab2..c913b33dd 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -19,7 +19,7 @@ libraries = [
'acl', 'bbdev', 'bitratestats', 'cfgfile',
'compressdev', 'cryptodev',
'distributor', 'efd', 'eventdev',
- 'gro', 'gso', 'ip_frag', 'jobstats',
+ 'gro', 'gso', 'if_proxy', 'ip_frag', 'jobstats',
'kni', 'latencystats', 'lpm', 'member',
'power', 'pdump', 'rawdev',
'rcu', 'rib', 'reorder', 'sched', 'security', 'stack', 'vhost',
--
2.17.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* [dpdk-dev] [RFC PATCH 3/3] if_proxy: add example, test and documentation
2020-01-14 14:25 [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library Andrzej Ostruszka
2020-01-14 14:25 ` [dpdk-dev] [RFC PATCH 1/3] lib: introduce IF proxy library (API) Andrzej Ostruszka
2020-01-14 14:25 ` [dpdk-dev] [RFC PATCH 2/3] if_proxy: add preliminary Linux implementation Andrzej Ostruszka
@ 2020-01-14 14:25 ` Andrzej Ostruszka
2020-01-14 15:16 ` [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library Morten Brørup
3 siblings, 0 replies; 21+ messages in thread
From: Andrzej Ostruszka @ 2020-01-14 14:25 UTC (permalink / raw)
To: dev, John McNamara, Marko Kovacevic
Cc: Jerin Jacob Kollanukkaran, Nithin Kumar Dabilpuram,
Pavan Nikhilesh Bhagavatula, Kiran Kumar Kokkilagadda,
Krzysztof Kanas
This commit adds a test, documentation and a small example.
The example creates one proxy port and binds all available ports to
it. You can then change the network configuration of this proxy port
and you should observe notifications from the corresponding
callbacks. Below is sample output (with some parts elided and some
comments added); 'dtap0' is the name of the proxy
interface.
sudo ./if_proxy -w 00:03.0 -w 00:04.0
...
Press ^C to quit
route add -> 10.0.0.0/16
route add -> 192.168.123.0/24
...
route6 add -> ::1/128
route6 add -> fe80::/64
route6 add -> fe80::ee05:deaf:6827:b435/128
...
[[ output on: ip link set dtap0 mtu 1600 ]]
mtu change for port 0 -> 1600
mtu change for port 1 -> 1600
[[ output on: ip link set dtap0 up ]]
port 0 going up
port 1 going up
route6 add -> ff00::/8
route6 add -> fe80::/64
address6 add for port 0 -> fe80::2436:17ff:fefd:94ed
address6 add for port 1 -> fe80::2436:17ff:fefd:94ed
route6 add -> fe80::2436:17ff:fefd:94ed/128
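A note for readers of the output above: the IPv4 callbacks receive
addresses in host byte order, so they have to be converted back to
network order before being handed to inet_ntop(). Below is a minimal
standalone sketch of that step. The helper name format_route4() is made
up for illustration and is not part of the library.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the library): format a host-order IPv4
 * prefix the way the example prints it, e.g. "10.0.0.0/16".
 * Returns 0 on success, -1 if the address cannot be formatted or the
 * output buffer is too small.
 */
static int
format_route4(uint32_t ip_host_order, uint8_t depth, char *out, size_t len)
{
	char addr[INET_ADDRSTRLEN];
	/* inet_ntop() expects the address in network byte order. */
	struct in_addr a = { .s_addr = htonl(ip_host_order) };
	int n;

	assert(out != NULL);
	if (inet_ntop(AF_INET, &a, addr, sizeof(addr)) == NULL)
		return -1;
	n = snprintf(out, len, "%s/%u", addr, (unsigned int)depth);
	/* Treat truncation as an error instead of returning a cut string. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Called with 0x0A000000 and a depth of 16 this yields "10.0.0.0/16",
which matches the first route line in the output above.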
Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com>
---
app/test/Makefile | 5 +
app/test/meson.build | 1 +
app/test/test_if_proxy.c | 431 +++++++++++++++++++++++++
doc/guides/prog_guide/if_proxy_lib.rst | 103 ++++++
doc/guides/prog_guide/index.rst | 1 +
examples/Makefile | 1 +
examples/if_proxy/Makefile | 58 ++++
examples/if_proxy/main.c | 203 ++++++++++++
examples/if_proxy/meson.build | 12 +
examples/meson.build | 2 +-
10 files changed, 816 insertions(+), 1 deletion(-)
create mode 100644 app/test/test_if_proxy.c
create mode 100644 doc/guides/prog_guide/if_proxy_lib.rst
create mode 100644 examples/if_proxy/Makefile
create mode 100644 examples/if_proxy/main.c
create mode 100644 examples/if_proxy/meson.build
diff --git a/app/test/Makefile b/app/test/Makefile
index 57930c00b..f621978d7 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -230,6 +230,11 @@ SRCS-$(CONFIG_RTE_LIBRTE_BPF) += test_bpf.c
SRCS-$(CONFIG_RTE_LIBRTE_RCU) += test_rcu_qsbr.c test_rcu_qsbr_perf.c
+ifeq ($(CONFIG_RTE_LIBRTE_IF_PROXY),y)
+SRCS-y += test_if_proxy.c
+LDLIBS += -lrte_if_proxy
+endif
+
SRCS-$(CONFIG_RTE_LIBRTE_IPSEC) += test_ipsec.c
SRCS-$(CONFIG_RTE_LIBRTE_IPSEC) += test_ipsec_sad.c
ifeq ($(CONFIG_RTE_LIBRTE_IPSEC),y)
diff --git a/app/test/meson.build b/app/test/meson.build
index fb49d804b..2a3b5fef2 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -61,6 +61,7 @@ test_sources = files('commands.c',
'test_hash_perf.c',
'test_hash_readwrite_lf.c',
'test_interrupts.c',
+ 'test_if_proxy.c',
'test_ipsec.c',
'test_ipsec_sad.c',
'test_kni.c',
diff --git a/app/test/test_if_proxy.c b/app/test/test_if_proxy.c
new file mode 100644
index 000000000..0ecfb79b4
--- /dev/null
+++ b/app/test/test_if_proxy.c
@@ -0,0 +1,431 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2020 Marvell International Ltd.
+ */
+
+#include "test.h"
+
+#include <rte_ethdev.h>
+#include <rte_if_proxy.h>
+
+#include <string.h>
+#include <unistd.h>
+#include <signal.h>
+#include <net/if.h>
+#include <arpa/inet.h>
+#include <pthread.h>
+#include <time.h>
+
+static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
+static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
+
+enum net_op {
+ INITIALIZED = 1U << 0,
+ LOOP_ROUTE = 1U << 1,
+ LOOP6_ROUTE = 1U << 2,
+ LINK_CHANGED = 1U << 3,
+ MAC_CHANGED = 1U << 4,
+ MTU_CHANGED = 1U << 5,
+ ADDR_ADD = 1U << 6,
+ ADDR_DEL = 1U << 7,
+ ROUTE_ADD = 1U << 8,
+ ROUTE_DEL = 1U << 9,
+ ADDR6_ADD = 1U << 10,
+ ADDR6_DEL = 1U << 11,
+ ROUTE6_ADD = 1U << 12,
+ ROUTE6_DEL = 1U << 13,
+};
+
+static unsigned int state;
+
+static struct {
+ struct rte_ether_addr mac_addr;
+ uint16_t port_id, mtu;
+ struct in_addr ipv4, route4;
+ struct in6_addr ipv6, route6;
+ uint16_t depth4, depth6;
+ int is_up;
+} net_cfg;
+
+static
+int unlock_notify(unsigned int op)
+{
+ /* the mutex is expected to be locked on entry */
+ RTE_VERIFY(pthread_mutex_trylock(&mutex) == EBUSY);
+ state |= op;
+
+ pthread_mutex_unlock(&mutex);
+ return pthread_cond_signal(&cond);
+}
+
+static
+int wait_for(unsigned int op_mask, unsigned int sec)
+{
+ struct timespec time;
+ int ec = pthread_mutex_trylock(&mutex);
+
+ /* the mutex is expected to be locked on entry */
+ RTE_VERIFY(ec == EBUSY);
+
+ ec = 0;
+ clock_gettime(CLOCK_REALTIME, &time);
+ time.tv_sec += sec;
+
+ while ((state & op_mask) != op_mask && ec == 0)
+ ec = pthread_cond_timedwait(&cond, &mutex, &time);
+
+ return ec;
+}
+
+static
+int expect(unsigned int op_mask, const char *fmt, ...)
+#ifdef __GNUC__
+ __attribute__((format(printf, 2, 3)))
+#endif
+ ; /* keep the terminator outside the #ifdef so non-GNU builds parse */
+static
+int expect(unsigned int op_mask, const char *fmt, ...)
+{
+ char cmd[128];
+ va_list args;
+ int ret;
+
+ state &= ~op_mask;
+ va_start(args, fmt);
+ vsnprintf(cmd, sizeof(cmd), fmt, args);
+ va_end(args);
+ ret = system(cmd);
+ if (ret == 0)
+ /* IPv6 address notifications seem to need that long delay. */
+ return wait_for(op_mask, 2);
+ return ret;
+}
+
+static
+void mac_change(uint16_t port_id, const struct rte_ether_addr *mac)
+{
+ pthread_mutex_lock(&mutex);
+ RTE_VERIFY(port_id == net_cfg.port_id);
+ if (memcmp(mac->addr_bytes, net_cfg.mac_addr.addr_bytes,
+ RTE_ETHER_ADDR_LEN) == 0) {
+ unlock_notify(MAC_CHANGED);
+ return;
+ }
+ pthread_mutex_unlock(&mutex);
+}
+
+static
+void mtu_change(uint16_t port_id, uint16_t mtu)
+{
+ pthread_mutex_lock(&mutex);
+ RTE_VERIFY(port_id == net_cfg.port_id);
+ if (net_cfg.mtu == mtu) {
+ unlock_notify(MTU_CHANGED);
+ return;
+ }
+ pthread_mutex_unlock(&mutex);
+}
+
+static
+void link_change(uint16_t port_id, int is_up)
+{
+ pthread_mutex_lock(&mutex);
+ RTE_VERIFY(port_id == net_cfg.port_id);
+ if (net_cfg.is_up == is_up) {
+ unlock_notify(LINK_CHANGED);
+ return;
+ }
+ pthread_mutex_unlock(&mutex);
+}
+
+static
+void addr_add(uint16_t port_id, uint32_t ip)
+{
+ pthread_mutex_lock(&mutex);
+ RTE_VERIFY(port_id == net_cfg.port_id);
+ if (net_cfg.ipv4.s_addr == ip) {
+ unlock_notify(ADDR_ADD);
+ return;
+ }
+ pthread_mutex_unlock(&mutex);
+}
+
+static
+void addr_del(uint16_t port_id, uint32_t ip)
+{
+ pthread_mutex_lock(&mutex);
+ RTE_VERIFY(port_id == net_cfg.port_id);
+ if (net_cfg.ipv4.s_addr == ip) {
+ unlock_notify(ADDR_DEL);
+ return;
+ }
+ pthread_mutex_unlock(&mutex);
+}
+
+static
+void addr6_add(uint16_t port_id, const uint8_t *ip)
+{
+ pthread_mutex_lock(&mutex);
+ RTE_VERIFY(port_id == net_cfg.port_id);
+ if (memcmp(ip, net_cfg.ipv6.s6_addr, 16) == 0) {
+ unlock_notify(ADDR6_ADD);
+ return;
+ }
+ pthread_mutex_unlock(&mutex);
+}
+
+static
+void addr6_del(uint16_t port_id, const uint8_t *ip)
+{
+ pthread_mutex_lock(&mutex);
+ RTE_VERIFY(port_id == net_cfg.port_id);
+ if (memcmp(ip, net_cfg.ipv6.s6_addr, 16) == 0) {
+ unlock_notify(ADDR6_DEL);
+ return;
+ }
+ pthread_mutex_unlock(&mutex);
+}
+
+static
+void route_add(uint32_t ip, uint8_t depth)
+{
+ pthread_mutex_lock(&mutex);
+ /* Since we are checking if during initialization we get some routing
+ * info we need to notify either when we are not initialized or when
+ * the exact route matches.
+ */
+ if (!(state & INITIALIZED) ||
+ (net_cfg.depth4 == depth && net_cfg.route4.s_addr == ip)) {
+ unlock_notify(ROUTE_ADD);
+ return;
+ }
+ pthread_mutex_unlock(&mutex);
+}
+
+static
+void route_del(uint32_t ip, uint8_t depth)
+{
+ pthread_mutex_lock(&mutex);
+ if (net_cfg.depth4 == depth && net_cfg.route4.s_addr == ip) {
+ unlock_notify(ROUTE_DEL);
+ return;
+ }
+ pthread_mutex_unlock(&mutex);
+}
+
+static
+void route6_add(const uint8_t *ip, uint8_t depth)
+{
+ pthread_mutex_lock(&mutex);
+ /* Since we are checking if during initialization we get some routing
+ * info we need to notify either when we are not initialized or when
+ * the exact route matches.
+ */
+ if (!(state & INITIALIZED) ||
+ (net_cfg.depth6 == depth &&
+ /* don't check for trailing zeros */
+ memcmp(ip, net_cfg.route6.s6_addr, depth/8) == 0)) {
+ unlock_notify(ROUTE6_ADD);
+ return;
+ }
+ pthread_mutex_unlock(&mutex);
+}
+
+static
+void route6_del(const uint8_t *ip, uint8_t depth)
+{
+ pthread_mutex_lock(&mutex);
+ if (net_cfg.depth6 == depth &&
+ /* don't check for trailing zeros */
+ memcmp(ip, net_cfg.route6.s6_addr, depth/8) == 0) {
+ unlock_notify(ROUTE6_DEL);
+ return;
+ }
+ pthread_mutex_unlock(&mutex);
+}
+
+static
+void cfg_finished(void)
+{
+ pthread_mutex_lock(&mutex);
+ unlock_notify(INITIALIZED);
+}
+
+static
+struct rte_ifpx_callbacks cbs = {
+ .mac_change = mac_change,
+ .mtu_change = mtu_change,
+ .link_change = link_change,
+ .addr_add = addr_add,
+ .addr_del = addr_del,
+ .addr6_add = addr6_add,
+ .addr6_del = addr6_del,
+ .route_add = route_add,
+ .route_del = route_del,
+ .route6_add = route6_add,
+ .route6_del = route6_del,
+ /* lib specific callback */
+ .cfg_finished = cfg_finished,
+};
+
+static int
+test_if_proxy(void)
+{
+ int ec;
+ char buf[INET6_ADDRSTRLEN];
+ const struct rte_ifpx_info *pinfo;
+
+ state = 0;
+ memset(&net_cfg, 0, sizeof(net_cfg));
+ /* Since we are not going to test RX/TX we can just create proxy and
+ * bind it to itself to test just notification functionality.
+ */
+ net_cfg.port_id = rte_ifpx_create(RTE_IFPX_DEFAULT);
+ RTE_VERIFY(net_cfg.port_id != RTE_MAX_ETHPORTS);
+ rte_ifpx_port_bind(net_cfg.port_id, net_cfg.port_id);
+ rte_ifpx_callbacks_register(&cbs);
+ rte_ifpx_listen();
+
+ pthread_mutex_lock(&mutex);
+ /* During initialization we should observe IPv4/6 loopback routes. */
+ net_cfg.route4.s_addr = RTE_IPV4(127, 0, 0, 1);
+ net_cfg.depth4 = 32;
+ memcpy(net_cfg.route6.s6_addr, in6addr_loopback.s6_addr, 16);
+ net_cfg.depth6 = 128;
+ ec = wait_for(INITIALIZED | ROUTE_ADD | ROUTE6_ADD, 2);
+ if (ec != 0) {
+ printf("Failed to obtain network configuration\n");
+ goto exit;
+ }
+ pinfo = rte_ifpx_info_get(net_cfg.port_id);
+ RTE_VERIFY(pinfo);
+
+ /* Make sure the link is down. */
+ net_cfg.is_up = 0;
+ ec = expect(LINK_CHANGED, "ip link set dev %s down", pinfo->if_name);
+ RTE_VERIFY(ec == ETIMEDOUT || ec == 0);
+
+ /* Test link up notification. */
+ net_cfg.is_up = 1;
+ ec = expect(LINK_CHANGED, "ip link set dev %s up", pinfo->if_name);
+ if (ec != 0) {
+ printf("Failed to notify about link going up\n");
+ goto exit;
+ }
+
+ /* Test for MAC changes notification. */
+ rte_eth_random_addr(net_cfg.mac_addr.addr_bytes);
+ rte_ether_format_addr(buf, sizeof(buf), &net_cfg.mac_addr);
+ ec = expect(MAC_CHANGED, "ip link set dev %s address %s",
+ pinfo->if_name, buf);
+ if (ec != 0) {
+ printf("Missing/wrong notification about mac change\n");
+ goto exit;
+ }
+
+ /* Test for MTU changes notification. */
+ net_cfg.mtu = pinfo->mtu + 100;
+ ec = expect(MTU_CHANGED, "ip link set dev %s mtu %d",
+ pinfo->if_name, net_cfg.mtu);
+ if (ec != 0) {
+ printf("Missing/wrong notification about mtu change\n");
+ goto exit;
+ }
+
+ /* Test for adding of IPv4 address - using address from TEST-2 pool.
+ * This test is specific to linux netlink behaviour - after adding
+ * address we get both notification about address being added and new
+ * route. So I check both.
+ */
+ net_cfg.ipv4.s_addr = RTE_IPV4(198, 51, 100, 14);
+ net_cfg.route4.s_addr = net_cfg.ipv4.s_addr;
+ net_cfg.depth4 = 32;
+ ec = expect(ADDR_ADD | ROUTE_ADD, "ip addr add 198.51.100.14 dev %s",
+ pinfo->if_name);
+ if (ec != 0) {
+ printf("Missing/wrong notifications about IPv4 address add\n");
+ goto exit;
+ }
+
+ /* Test for IPv4 address removal. See comment above for 'addr add'. */
+ ec = expect(ADDR_DEL | ROUTE_DEL, "ip addr del 198.51.100.14/32 dev %s",
+ pinfo->if_name);
+ if (ec != 0) {
+ printf("Missing/wrong notifications about IPv4 address del\n");
+ goto exit;
+ }
+
+ /* Test for adding IPv4 route. */
+ net_cfg.route4.s_addr = RTE_IPV4(198, 51, 100, 0);
+ net_cfg.depth4 = 24;
+ ec = expect(ROUTE_ADD, "ip route add 198.51.100.0/24 dev %s",
+ pinfo->if_name);
+ if (ec != 0) {
+ printf("Missing/wrong notifications about IPv4 route add\n");
+ goto exit;
+ }
+
+ /* Test for IPv4 route removal. */
+ ec = expect(ROUTE_DEL, "ip route del 198.51.100.0/24 dev %s",
+ pinfo->if_name);
+ if (ec != 0) {
+ printf("Missing/wrong notifications about IPv4 route del\n");
+ goto exit;
+ }
+
+ /* Now the same for IPv6 - with address from "documentation pool". */
+ inet_pton(AF_INET6, "2001:db8::dead:beef", net_cfg.ipv6.s6_addr);
+ /* This is specific to linux netlink behaviour - after adding address
+ * we get both notification about address being added and new route.
+ * So I wait for both.
+ */
+ memcpy(net_cfg.route6.s6_addr, net_cfg.ipv6.s6_addr, 16);
+ net_cfg.depth6 = 128;
+ ec = expect(ADDR6_ADD | ROUTE6_ADD,
+ "ip addr add 2001:db8::dead:beef dev %s",
+ pinfo->if_name);
+ if (ec != 0) {
+ printf("Missing/wrong notifications about IPv6 address add\n");
+ goto exit;
+ }
+
+ /* See comment above for 'addr6 add'. */
+ ec = expect(ADDR6_DEL | ROUTE6_DEL,
+ "ip addr del 2001:db8::dead:beef/128 dev %s",
+ pinfo->if_name);
+ if (ec != 0) {
+ printf("Missing/wrong notifications about IPv6 address del\n");
+ goto exit;
+ }
+
+ net_cfg.depth6 = 96;
+ ec = expect(ROUTE6_ADD, "ip route add 2001:db8::dead:0/96 dev %s",
+ pinfo->if_name);
+ if (ec != 0) {
+ printf("Missing/wrong notifications about IPv6 route add\n");
+ goto exit;
+ }
+
+ ec = expect(ROUTE6_DEL, "ip route del 2001:db8::dead:0/96 dev %s",
+ pinfo->if_name);
+ if (ec != 0) {
+ printf("Missing/wrong notifications about IPv6 route del\n");
+ goto exit;
+ }
+
+ /* Finally put link down and test for notification. */
+ net_cfg.is_up = 0;
+ ec = expect(LINK_CHANGED, "ip link set dev %s down", pinfo->if_name);
+ if (ec != 0) {
+ printf("Failed to notify about link going down\n");
+ goto exit;
+ }
+
+exit:
+ pthread_mutex_unlock(&mutex);
+ rte_ifpx_destroy(net_cfg.port_id);
+ rte_ifpx_close();
+
+ return ec;
+}
+
+REGISTER_TEST_COMMAND(if_proxy_autotest, test_if_proxy)
diff --git a/doc/guides/prog_guide/if_proxy_lib.rst b/doc/guides/prog_guide/if_proxy_lib.rst
new file mode 100644
index 000000000..dc1202cdf
--- /dev/null
+++ b/doc/guides/prog_guide/if_proxy_lib.rst
@@ -0,0 +1,103 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+ Copyright(C) 2019 Marvell International Ltd.
+
+.. _IF_Proxy_Library:
+
+IF Proxy Library
+================
+
+When a network interface is assigned to DPDK it usually disappears from
+the system.
+This way the user loses the ability to configure it via typical
+configuration tools and is left with basically two options:
+
+ - configure it via command line arguments,
+
+ - add support for live configuration via some IPC mechanism.
+
+The first option is static and the second one requires some work to add
+a communication loop (e.g. a separate thread listening/communicating on
+a socket).
+
+This library makes it possible to configure DPDK ports by using normal
+configuration utilities (e.g. from the iproute2 suite).
+It requires the user to configure additional DPDK ports that are visible
+to the system (such as Tap or KNI - actually any port that has a valid
+'if_index' in 'struct rte_eth_dev_info' will do) and designate them as
+port representors (proxies) in the system.
+
+Let's look at the typical intended usage with an example.
+Suppose that you have an application that handles traffic on two ports
+(in the white list below)::
+
+ ./app -w 00:14.0 -w 00:16.0 --vdev=net_tap0 --vdev=net_tap1
+
+So in addition you configure two proxy ports and in the application code
+you bind them to the "main" ports::
+
+ rte_ifpx_port_bind(port0, proxy0);
+ rte_ifpx_port_bind(port1, proxy1);
+
+This binding is a logical one - there is no automatic packet forwarding
+configured.
+This is because the library cannot tell upfront what portion of the
+traffic received on ports 0/1 should be redirected to the system via
+proxies, and it does not know how the application is structured (what
+packet processing engines it uses).
+Therefore it is the application writer's responsibility to include proxy
+ports in its packet processing and forward appropriate packets between
+proxies and ports.
+What the library actually does is get the network configuration from the
+system and listen for its changes.
+This information is then matched against the 'if_index' of the configured
+proxies (when applicable - routing information is global) and passed to
+the application via a set of callbacks that the user has to register::
+
+ rte_ifpx_callbacks_register(&cbs);
+
+Here 'cbs' is a 'struct rte_ifpx_callbacks' which has the following
+members::
+
+ void (*mac_change)(uint16_t port_id, const struct rte_ether_addr *mac);
+ void (*mtu_change)(uint16_t port_id, uint16_t mtu);
+ void (*link_change)(uint16_t port_id, int is_up);
+ /* IPv4 addresses are in host order */
+ void (*addr_add)(uint16_t port_id, uint32_t ip);
+ void (*addr_del)(uint16_t port_id, uint32_t ip);
+ void (*addr6_add)(uint16_t port_id, const uint8_t *ip);
+ void (*addr6_del)(uint16_t port_id, const uint8_t *ip);
+ void (*route_add)(uint32_t ip, uint8_t depth);
+ void (*route_del)(uint32_t ip, uint8_t depth);
+ void (*route6_add)(const uint8_t *ip, uint8_t depth);
+ void (*route6_del)(const uint8_t *ip, uint8_t depth);
+ /* lib specific callback - called when initial network configuration
+ * query is finished */
+ void (*cfg_finished)(void);
+
+So for example when the user issues the command::
+
+ ip link set dev dtap0 mtu 1600
+
+then the library will call the `mtu_change()` callback with port_id equal
+to 'port0' (id of the port bound to this proxy) and 'mtu' equal to 1600
+('dtap0' is the default interface name for 'net_tap0').
+The application can simply call `rte_eth_dev_set_mtu()` from this callback.
+In the same way `rte_eth_dev_default_mac_addr_set()` can be called from
+`mac_change()` and `rte_eth_dev_set_link_up/down()` can be called from
+`link_change()`, dispatching on the 'is_up' argument.
+
+Please note however that the context in which these callbacks are called
+is most probably different from the one in which packets are handled, so
+it is the application writer's responsibility to use proper
+synchronization mechanisms - if they are needed.
+
+If the application supports an IP protocol stack then it can use the
+callbacks for adding/removing addresses on the proxies and also for
+routing information (note that routing info is not associated with any
+port).
+E.g. the application can feed some LPM tables with these addresses and,
+upon reception of a packet on some port, match this packet against those
+tables to figure out what to do with it.
+If the decision is to pass it to the system then it can simply forward
+the packet to the proxy corresponding to the port on which it has been
+received by using the standard PMD TX interface.
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index dc4851c57..0a1541f34 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -57,6 +57,7 @@ Programmer's Guide
metrics_lib
bpf_lib
ipsec_lib
+ if_proxy_lib
source_org
dev_kit_build_system
dev_kit_root_make_help
diff --git a/examples/Makefile b/examples/Makefile
index feff79784..5aa9ab431 100644
--- a/examples/Makefile
+++ b/examples/Makefile
@@ -81,6 +81,7 @@ else
$(info vm_power_manager requires libvirt >= 0.9.3)
endif
endif
+DIRS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += if_proxy
DIRS-y += eventdev_pipeline
diff --git a/examples/if_proxy/Makefile b/examples/if_proxy/Makefile
new file mode 100644
index 000000000..dd0515fa4
--- /dev/null
+++ b/examples/if_proxy/Makefile
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Marvell International Ltd.
+
+# binary name
+APP = if_proxy
+
+# all source are stored in SRCS-y
+SRCS-y := main.c
+
+# Build using pkg-config variables if possible
+ifeq ($(shell pkg-config --exists libdpdk && echo 0),0)
+
+all: shared
+.PHONY: shared static
+shared: build/$(APP)-shared
+ ln -sf $(APP)-shared build/$(APP)
+static: build/$(APP)-static
+ ln -sf $(APP)-static build/$(APP)
+
+PKGCONF=pkg-config --define-prefix
+
+PC_FILE := $(shell $(PKGCONF) --path libdpdk)
+CFLAGS += -O3 $(shell $(PKGCONF) --cflags libdpdk)
+LDFLAGS_SHARED = $(shell $(PKGCONF) --libs libdpdk)
+LDFLAGS_STATIC = -Wl,-Bstatic $(shell $(PKGCONF) --static --libs libdpdk)
+
+build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build
+ $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED)
+
+build/$(APP)-static: $(SRCS-y) Makefile $(PC_FILE) | build
+ $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_STATIC)
+
+build:
+ @mkdir -p $@
+
+.PHONY: clean
+clean:
+ rm -f build/$(APP) build/$(APP)-static build/$(APP)-shared
+ test -d build && rmdir -p build || true
+
+else # Build using legacy build system
+
+ifeq ($(RTE_SDK),)
+$(error "Please define RTE_SDK environment variable")
+endif
+
+# Default target, detect a build directory, by looking for a path with a .config
+RTE_TARGET ?= $(notdir $(abspath $(dir $(firstword $(wildcard $(RTE_SDK)/*/.config)))))
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+CFLAGS += -O3
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += $(WERROR_FLAGS)
+LDLIBS += -lrte_if_proxy -lrte_ethdev -lrte_eal
+
+include $(RTE_SDK)/mk/rte.extapp.mk
+endif
diff --git a/examples/if_proxy/main.c b/examples/if_proxy/main.c
new file mode 100644
index 000000000..2195fb490
--- /dev/null
+++ b/examples/if_proxy/main.c
@@ -0,0 +1,203 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2020 Marvell International Ltd.
+ */
+
+#include <rte_if_proxy.h>
+
+#include <string.h>
+#include <unistd.h>
+#include <signal.h>
+#include <arpa/inet.h>
+
+static
+char buf[INET6_ADDRSTRLEN];
+
+static
+uint16_t proxy_id = RTE_MAX_ETHPORTS;
+
+static
+void mac_change(uint16_t port_id, const struct rte_ether_addr *mac)
+{
+ char buf[3*RTE_ETHER_ADDR_LEN];
+
+ rte_ether_format_addr(buf, sizeof(buf), mac);
+ printf("\tmac change for port %u -> %s\n", port_id, buf);
+}
+
+static
+void mtu_change(uint16_t port_id, uint16_t mtu)
+{
+ printf("\tmtu change for port %u -> %u\n", port_id, mtu);
+}
+
+static
+void link_change(uint16_t port_id, int is_up)
+{
+ printf("\tport %u going %s\n", port_id, is_up ? "up" : "down");
+}
+
+static
+void addr_add(uint16_t port_id, uint32_t ip)
+{
+ struct in_addr a = { .s_addr = htonl(ip) };
+
+ printf("\taddress add for port %u -> %s\n", port_id,
+ inet_ntop(AF_INET, &a, buf, sizeof(buf)));
+}
+
+static
+void addr_del(uint16_t port_id, uint32_t ip)
+{
+ struct in_addr a = { .s_addr = htonl(ip) };
+
+ printf("\taddress del for port %u -> %s\n", port_id,
+ inet_ntop(AF_INET, &a, buf, sizeof(buf)));
+}
+
+static
+void addr6_add(uint16_t port_id, const uint8_t *ip)
+{
+ struct in6_addr a;
+
+ memcpy(a.s6_addr, ip, 16);
+ printf("\taddress6 add for port %u -> %s\n", port_id,
+ inet_ntop(AF_INET6, &a, buf, sizeof(buf)));
+}
+
+static
+void addr6_del(uint16_t port_id, const uint8_t *ip)
+{
+ struct in6_addr a;
+
+ memcpy(a.s6_addr, ip, 16);
+ printf("\taddress6 del for port %u -> %s\n", port_id,
+ inet_ntop(AF_INET6, &a, buf, sizeof(buf)));
+}
+
+static
+void route_add(uint32_t ip, uint8_t depth)
+{
+ struct in_addr a = { .s_addr = htonl(ip) };
+
+ printf("\troute add -> %s/%u\n",
+ inet_ntop(AF_INET, &a, buf, sizeof(buf)), depth);
+}
+
+static
+void route_del(uint32_t ip, uint8_t depth)
+{
+ struct in_addr a = { .s_addr = htonl(ip) };
+
+ printf("\troute del -> %s/%u\n",
+ inet_ntop(AF_INET, &a, buf, sizeof(buf)), depth);
+}
+
+static
+void route6_add(const uint8_t *ip, uint8_t depth)
+{
+ struct in6_addr a;
+
+ memcpy(a.s6_addr, ip, 16);
+ printf("\troute6 add -> %s/%u\n",
+ inet_ntop(AF_INET6, &a, buf, sizeof(buf)), depth);
+}
+
+static
+void route6_del(const uint8_t *ip, uint8_t depth)
+{
+ struct in6_addr a;
+
+ memcpy(a.s6_addr, ip, 16);
+ printf("\troute6 del -> %s/%u\n",
+ inet_ntop(AF_INET6, &a, buf, sizeof(buf)), depth);
+}
+
+struct rte_ifpx_callbacks cbs = {
+ .mac_change = mac_change,
+ .mtu_change = mtu_change,
+ .link_change = link_change,
+ .addr_add = addr_add,
+ .addr_del = addr_del,
+ .addr6_add = addr6_add,
+ .addr6_del = addr6_del,
+ .route_add = route_add,
+ .route_del = route_del,
+ .route6_add = route6_add,
+ .route6_del = route6_del,
+};
+
+static
+void proxy_bind_change(int sig)
+{
+ uint16_t port;
+ if (sig == SIGUSR1)
+ port = 0;
+ else if (sig == SIGUSR2)
+ port = 1;
+ else
+ return;
+
+ if (port >= rte_eth_dev_count_avail()) {
+ printf("\tNot enough ports allocated!\n");
+ return;
+ }
+
+ if (rte_ifpx_proxy_get(port) == RTE_MAX_ETHPORTS) {
+ printf("\tbinding port %d to proxy\n", port);
+ rte_ifpx_port_bind(port, proxy_id);
+ } else {
+ printf("\tunbinding port %d\n", port);
+ rte_ifpx_port_unbind(port);
+ }
+}
+
+int
+main(int argc, char **argv)
+{
+ int i, sig, nb_ports;
+ sigset_t set;
+
+ /* init EAL */
+ i = rte_eal_init(argc, argv);
+ if (i < 0)
+ rte_exit(EXIT_FAILURE, "Invalid EAL arguments\n");
+ argc -= i;
+ argv += i;
+
+ nb_ports = rte_eth_dev_count_avail();
+ if (nb_ports == 0)
+ rte_exit(EXIT_FAILURE, "No Ethernet ports - bye\n");
+
+ proxy_id = rte_ifpx_create(RTE_IFPX_DEFAULT);
+ if (proxy_id >= RTE_MAX_ETHPORTS) {
+ printf("Failed to create default proxy\n");
+ return -1;
+ }
+ /* Bind all ports to the same proxy. */
+ for (i = 0; i < nb_ports; ++i)
+ rte_ifpx_port_bind(i, proxy_id);
+ rte_ifpx_callbacks_register(&cbs);
+ rte_ifpx_listen();
+
+ /* Since we do not process packets - only listen to net events - we only
+ * wait for signal either to quit or to change proxy binding.
+ */
+ signal(SIGUSR1, proxy_bind_change);
+ signal(SIGUSR2, proxy_bind_change);
+
+ sigemptyset(&set);
+ sigaddset(&set, SIGINT);
+ sigprocmask(SIG_BLOCK, &set, NULL);
+ printf("Press ^C to quit\n");
+ do {
+ i = sigwait(&set, &sig);
+ } while (i != 0 || sig != SIGINT);
+
+ RTE_ETH_FOREACH_DEV(i) {
+ printf("\nClosing port %d...\n", i);
+ rte_eth_dev_close(i);
+ }
+ printf("Bye\n");
+
+ return 0;
+}
diff --git a/examples/if_proxy/meson.build b/examples/if_proxy/meson.build
new file mode 100644
index 000000000..5f5826a90
--- /dev/null
+++ b/examples/if_proxy/meson.build
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Marvell International Ltd.
+
+# meson file, for building this example as part of a main DPDK build.
+#
+# To build this example as a standalone application with an already-installed
+# DPDK instance, use 'make'
+
+allow_experimental_apis = true
+sources = files(
+ 'main.c'
+)
diff --git a/examples/meson.build b/examples/meson.build
index 1f2b6f516..468ef8a90 100644
--- a/examples/meson.build
+++ b/examples/meson.build
@@ -16,7 +16,7 @@ all_examples = [
'eventdev_pipeline',
'fips_validation', 'flow_classify',
'flow_filtering', 'helloworld',
- 'ioat',
+ 'if_proxy', 'ioat',
'ip_fragmentation', 'ip_pipeline',
'ip_reassembly', 'ipsec-secgw',
'ipv4_multicast', 'kni',
--
2.17.1
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-14 14:25 [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library Andrzej Ostruszka
` (2 preceding siblings ...)
2020-01-14 14:25 ` [dpdk-dev] [RFC PATCH 3/3] if_proxy: add example, test and documentation Andrzej Ostruszka
@ 2020-01-14 15:16 ` Morten Brørup
2020-01-14 17:38 ` Andrzej Ostruszka
3 siblings, 1 reply; 21+ messages in thread
From: Morten Brørup @ 2020-01-14 15:16 UTC (permalink / raw)
To: Andrzej Ostruszka, dev
Cc: Jerin Jacob Kollanukkaran, Nithin Kumar Dabilpuram,
Pavan Nikhilesh Bhagavatula, Kiran Kumar Kokkilagadda,
Krzysztof Kanas
Andrzej,
Basically you are adding a very small subset of the Linux IP stack to interface with DPDK applications via callbacks. The library also seems to support interfacing to the route table, so it is not "interface proxy", but "IP stack proxy".
You already mention ARP table as future work. How about namespaces, ip tables, and other advanced features... I foresee the Devil in the details for any real use case.
Unless the library is an O/S wrapper to make Linux NETLINK-like messages available from other operating systems, I don't really see the value in this library... if it is Linux specific, why not just use NETLINK in the DPDK application's control plane?
Med venlig hilsen / kind regards
- Morten Brørup
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Andrzej Ostruszka
> Sent: Tuesday, January 14, 2020 3:25 PM
> To: dev@dpdk.org
> Cc: Jerin Jacob Kollanukkaran; Nithin Kumar Dabilpuram; Pavan Nikhilesh
> Bhagavatula; Kiran Kumar Kokkilagadda; Krzysztof Kanas
> Subject: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
>
> What is this useful for
> =======================
>
> Usually, when an ethernet port is assigned to DPDK it vanishes from the
> system and the user loses the ability to control it via normal configuration
> utilities (e.g. those from the iproute2 package). Moreover, by default a DPDK
> application is not aware of the network configuration of the system.
>
> To address both of these issues application needs to:
> - add some command line interface (or other mechanism) allowing for
> control of the port and its configuration
> - query the status of network configuration and monitor its changes
>
> The purpose of this library is to help with both of these tasks (as long
> as they remain in the domain of configuration available to the system). In
> other words, if a DPDK application has some special needs that cannot be
> addressed by the normal system configuration utilities, then they need
> to be solved by the application itself.
>
> The connection between DPDK and system is based on the existence of
> ports that are visible to both DPDK and system (like Tap, KNI and
> possibly some other drivers). These ports serve as interface proxies.
>
> Let's visualize the action of the library by the following example:
>
> Linux | DPDK
> ==============================================================
> |
> | +-------+ +-------+
> | | Port1 | | Port2 |
> "ip link set dev tap1 mtu 1600" | +-------+ +-------+
> | | ^ ^
> | +------+ | mtu_change |
> `->| Tap1 |---' callback |
> +------+ |
> "ip addr add 198.51.100.14 \ | |
> dev tap2" | |
> | +------+ |
> `->| Tap2 |-------------------'
> +------+ addr_add callback
> |
> "ip route add 198.0.2.0/24 \ |
> dev eth0" |
> | | route_add callback
> `------------->
> |
>
> So we have two ports, Port1 and Port2, that are not visible to the system.
> We create two proxy interfaces (here based on the Tap driver) and bind the
> ports to their proxies. When the user issues a command changing the MTU of
> the Tap1 interface, the library notes this and calls the "mtu_change"
> callback for Port1. Similarly, when the user adds an IPv4 address to the
> Tap2 interface, the "addr_add" callback is called for Port2. Note also that
> not only port related callbacks are available - for example you can also
> get information about the routing table. See below for a complete list of
> available callbacks.
>
> Please note that nothing has been mentioned about forwarding of the
> packets between system and DPDK. Since the proxies are normal DPDK
> ports you can receive/send to them via usual RX/TX burst API. However
> since the library is not aware of the structure of packet processing
> used by the application it cannot automatically forward the packets -
> it
> is responsibility of the application to include proxy ports into its
> packet processing engine.
>
> As mentioned above the intention of the library is to:
> - provide information about network configuration that would allow
> application to decide what to do with the packets received on DPDK
> ports,
> - allow for control of the ports via standard configuration utilities
>
> Although the library only helps you to identify proxy for given port
> (and vice versa) and calls appropriate callbacks it does open some
> interesting possibilities. For example you can use the proxy ports to
> forward packets for protocols that you do not wish to handle in DPDK
> application to the system protocol stack and just listen to the
> configuration changes - so that way you can "offload" handling of those
> protocols to the system.
>
>
> Why this RFC
> ============
>
> We would like to solicit some input from the community:
> - regarding usefulness of this library
> - what is missing or what needs to be changed
> - about currently proposed API
> - any other suggestions and/or improvements are also welcome
>
>
> How to use it
> =============
>
> Usage of this library is rather simple. You have to:
> 1. Create proxy (if you don't have port suitable for being proxy or you
> have one but do not wish to use it as a proxy).
> 2. Bind port to proxy.
> 3. Register callbacks.
> 4. Start listening to the network configuration.
>
> The only mandatory requirement for DPDK port to be able to act as
> a proxy is that it is visible in the system - this is checked during
> port to proxy binding by calling rte_eth_dev_info_get() on proxy port
> and inspecting 'if_index' field (it has to be non-zero).
> One can create such port in the application by calling:
>
> proxy_id = rte_ifpx_create(RTE_IFPX_DEFAULT);
>
> Upon success this returns id of DPDK proxy port created
> (RTE_MAX_ETHPORTS on failure). The argument selects type of proxy port
> to create (currently Tap/KNI only). This function actually is just
> a wrapper around:
>
> uint16_t rte_ifpx_create_by_devarg(const char *devarg);
>
> creating valid 'devarg' string for the chosen type of proxy. If you
> have
> other driver capable of acting as a proxy you can call
> rte_ifpx_create_by_devarg() directly passing appropriate argument.
>
> Once you have id of both port and proxy you can bind the two via:
>
> rte_ifpx_port_bind(port_id, proxy_id);
>
> This creates logical binding - as mentioned above there is no automatic
> packet forwarding. With this binding whenever user changes the state
> of
> proxy interface in the system (link up/down, change mac/mtu, add/remove
> IPv4/IPv6) you get appropriate callback called for the bound port.
>
> So far we've mentioned several times that the library calls callbacks.
> They are grouped in 'struct rte_ifpx_callbacks' and user provides them
> to the library via:
>
> rte_ifpx_callbacks_register(&cbs);
>
> It is worth mentioning that the context (lcore/thread) in which these
> callbacks are called is implementation defined. It might differ
> between
> different platforms, so the application needs to assume that some kind
> of inter lcore/thread synchronization/communication is required.
>
> Once we have bindings in place and callbacks registered, the only
> essential part that remains is to get the current network configuration
> and start listening to its changes. This is accomplished via a call
> to:
>
> rte_ifpx_listen();
>
> And basically this is all one needs to understand how to use this
> library. Other less essential parts include:
> - ability to query what callbacks are available for given platform
> - getting mapping between proxy and port
> - unbinding the ports from proxy
> - destroying proxy port
> - closing the listening service
> - getting basic information about proxy
>
>
> Currently available features and implementation
> ===============================================
>
> The library's API is system independent but it obviously needs some
> system dependent parts. We provide an example Linux implementation (based
> on netlink sockets). A very similar implementation is possible for
> FreeBSD (using PF_ROUTE sockets). A Windows implementation would need
> to differ considerably (the IP Helper library would probably be of some
> help).
>
> Here is the list of currently implemented callbacks:
>
> struct rte_ifpx_callbacks {
> void (*mac_change)(uint16_t port_id, const struct rte_ether_addr
> *mac);
> void (*mtu_change)(uint16_t port_id, uint16_t mtu);
> void (*link_change)(uint16_t port_id, int is_up);
> void (*addr_add)(uint16_t port_id, uint32_t ip);
> void (*addr_del)(uint16_t port_id, uint32_t ip);
> void (*addr6_add)(uint16_t port_id, const uint8_t *ip);
> void (*addr6_del)(uint16_t port_id, const uint8_t *ip);
> void (*route_add)(uint32_t ip, uint8_t depth);
> void (*route_del)(uint32_t ip, uint8_t depth);
> void (*route6_add)(const uint8_t *ip, uint8_t depth);
> void (*route6_del)(const uint8_t *ip, uint8_t depth);
> void (*cfg_finished)(void);
> };
>
> They are all rather self-descriptive with the exception of the last
> one.
> When the user calls rte_ifpx_listen() the library first queries the
> system for its current configuration. That might require several
> request/reply exchanges between DPDK and system and once it is finished
> this callback is called to let application know that all info has been
> gathered.
>
> BTW at the moment all IPv4 addresses are passed in host order.
>
> It is also worth mentioning that while the typical case would be a 1-to-1
> mapping between port and proxy, a 1-to-many mapping is also supported.
> In that case port related callbacks will be called for each port bound
> to the given proxy interface, and it is the application's responsibility
> to define the semantics of such a mapping (e.g. all changes apply to all
> ports, or link changes apply to all but others are accepted in "round
> robin" fashion, or ...).
>
> As mentioned above Linux implementation is based on netlink socket.
> This socket is registered as file descriptor in EAL interrupts
> (similarly to how EAL alarms are implemented).
>
>
> What is inside this RFC
> =======================
> - 1 commit for API
> - 1 commit for implementation - this is just to show PoC, and allow for
> early playing around with the idea (e.g. run the test/example from
> the
> next commit)
> - 1 commit for test/example - just to show how this can be used
>
>
> Next steps
> ==========
>
> - gather community feedback
> - polish the implementation:
> * call the notification callbacks without lock held (at the moment
> attempts to modify callbacks from within the callback would
> deadlock)
> * separate the system dependent parts from the rest so that it is
> easy
> to figure out what needs to be reimplemented on different platforms
> * apply community suggestions - if any
> - add neighbour callbacks (ARP table)
>
> Best regards
> Andrzej Ostruszka
>
> Andrzej Ostruszka (3):
> lib: introduce IF proxy library (API)
> if_proxy: add preliminary Linux implementation
> if_proxy: add example, test and documentation
>
> app/test/Makefile | 5 +
> app/test/meson.build | 1 +
> app/test/test_if_proxy.c | 431 ++++++++++
> config/common_base | 5 +
> doc/guides/prog_guide/if_proxy_lib.rst | 103 +++
> doc/guides/prog_guide/index.rst | 1 +
> examples/Makefile | 1 +
> examples/if_proxy/Makefile | 58 ++
> examples/if_proxy/main.c | 203 +++++
> examples/if_proxy/meson.build | 12 +
> examples/meson.build | 2 +-
> lib/Makefile | 2 +
> .../common/include/rte_eal_interrupts.h | 2 +
> lib/librte_eal/linux/eal/eal_interrupts.c | 14 +-
> lib/librte_if_proxy/Makefile | 25 +
> lib/librte_if_proxy/meson.build | 7 +
> lib/librte_if_proxy/rte_if_proxy.c | 803 ++++++++++++++++++
> lib/librte_if_proxy/rte_if_proxy.h | 364 ++++++++
> lib/meson.build | 2 +-
> 19 files changed, 2035 insertions(+), 6 deletions(-)
> create mode 100644 app/test/test_if_proxy.c
> create mode 100644 doc/guides/prog_guide/if_proxy_lib.rst
> create mode 100644 examples/if_proxy/Makefile
> create mode 100644 examples/if_proxy/main.c
> create mode 100644 examples/if_proxy/meson.build
> create mode 100644 lib/librte_if_proxy/Makefile
> create mode 100644 lib/librte_if_proxy/meson.build
> create mode 100644 lib/librte_if_proxy/rte_if_proxy.c
> create mode 100644 lib/librte_if_proxy/rte_if_proxy.h
>
> --
> 2.17.1
>
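For reference, the quoted cover letter notes that IPv4 addresses are passed to the callbacks in host order, so a callback has to convert before using the usual socket helpers. A minimal sketch of that conversion, mirroring what route_add() in the example application does; format_ipv4_host_order() is a hypothetical helper name used only for illustration, not part of the proposed API:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Format an IPv4 address given in host byte order (the order in which
 * the RFC says callbacks receive it) as dotted-quad text.
 * Returns buf on success, NULL on error.
 * Hypothetical helper, not part of the proposed library. */
static const char *
format_ipv4_host_order(uint32_t ip, char *buf, size_t len)
{
	/* inet_ntop() expects network byte order, hence the htonl(). */
	struct in_addr a = { .s_addr = htonl(ip) };

	return inet_ntop(AF_INET, &a, buf, (socklen_t)len);
}
```

A caller would pass a buffer of at least INET_ADDRSTRLEN bytes, for example from within an addr_add or route_add callback.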
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-14 15:16 ` [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library Morten Brørup
@ 2020-01-14 17:38 ` Andrzej Ostruszka
2020-01-15 10:15 ` Bruce Richardson
0 siblings, 1 reply; 21+ messages in thread
From: Andrzej Ostruszka @ 2020-01-14 17:38 UTC (permalink / raw)
To: dev
On 1/14/20 4:16 PM, Morten Brørup wrote:
> Andrzej,
Hello Morten
> Basically you are adding a very small subset of the Linux IP stack
> to interface with DPDK applications via callbacks.
Yes, at the moment this is limited - we'd prefer first to solicit
some input from community.
> The library also seems to support interfacing to the route table,
> so it is not "interface proxy" but "IP stack proxy".
True, to some extent - for example you can bring the interface up and
down which has nothing to do with IP stack. As for the name of the
library - that is actually part where we are completely open. The proxy
represents port (thus the name) but that is not all, so any better name
proposals are welcome.
> You already mention ARP table as future work. How about namespaces,
> ip tables, and other advanced features... I foresee the Devil in the
> details for any real use case.
Right now I don't know what other things are needed. This idea is still
at an early stage. However, imagine you'd like to use DPDK to speed up packet
processing of the IP stack - would you like to implement all the protocols
that are needed? Or would you rather let the system handle the control path,
handle only the data path yourself and sniff the control params from the
system?
> Unless the library is an O/S wrapper to make Linux NETLINK-like messages
> available from other operating systems, ...
The idea is to have this system independent - and in the "next steps"
I've mentioned splitting the current implementation into common and
system-dependent parts. AFAIK we do not plan to provide implementations
for other systems, but would like to shape it so that it is clear how to do
that. As mentioned in the description, a FreeBSD implementation could be
really similar, but the Windows one would probably require some thread
periodically polling the system with "IP Helper" lib calls - I'm not a
Windows programmer. So no - the intent is not to provide "NETLINK-like
messages" to other systems.
> ... I don't really see the value in this library... if it is Linux
> specific, why not just use NETLINK in the DPDK application's control
> plane?
NETLINK is just the Linux specific mechanism. And if you confine
yourself only to a Linux specific world - you can think of this library
as what you have just described :). A free implementation of NETLINK
handling - with a defined API.
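To give a feel for what handling NETLINK directly involves, here is a minimal sketch (an illustration only, not the code from this RFC) that pulls the MTU out of a single RTM_NEWLINK message using the standard rtnetlink macros. link_msg_get_mtu() is a hypothetical name; a real listener would also have to open an AF_NETLINK socket, subscribe to RTMGRP_LINK and walk the received buffer with NLMSG_OK()/NLMSG_NEXT().

```c
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <stdint.h>
#include <string.h>

/* Extract the MTU from one RTM_NEWLINK message.
 * Returns 0 and stores the value in *mtu, or -1 if the message is not
 * RTM_NEWLINK, is too short, or carries no IFLA_MTU attribute.
 * Hypothetical helper for illustration only. */
static int
link_msg_get_mtu(struct nlmsghdr *nlh, uint32_t *mtu)
{
	struct ifinfomsg *ifi;
	struct rtattr *rta;
	int len;

	if (nlh->nlmsg_type != RTM_NEWLINK ||
	    nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*ifi)))
		return -1;

	ifi = NLMSG_DATA(nlh);
	rta = IFLA_RTA(ifi);          /* first attribute after ifinfomsg */
	len = IFLA_PAYLOAD(nlh);      /* bytes of attributes that follow */

	/* Walk the attribute list looking for the MTU. */
	for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
		if (rta->rta_type == IFLA_MTU &&
		    RTA_PAYLOAD(rta) >= sizeof(*mtu)) {
			memcpy(mtu, RTA_DATA(rta), sizeof(*mtu));
			return 0;
		}
	}
	return -1;
}
```

The point of the library is that an application gets the equivalent of this through a "mtu_change" callback without touching the message format itself.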
Best regards
Andrzej
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-14 17:38 ` Andrzej Ostruszka
@ 2020-01-15 10:15 ` Bruce Richardson
2020-01-15 11:27 ` Jerin Jacob
2020-01-15 12:28 ` Morten Brørup
0 siblings, 2 replies; 21+ messages in thread
From: Bruce Richardson @ 2020-01-15 10:15 UTC (permalink / raw)
To: Andrzej Ostruszka; +Cc: dev
On Tue, Jan 14, 2020 at 06:38:37PM +0100, Andrzej Ostruszka wrote:
> On 1/14/20 4:16 PM, Morten Brørup wrote:
> > Andrzej,
>
> Hello Morten
>
> > Basically you are adding a very small subset of the Linux IP stack
> > to interface with DPDK applications via callbacks.
>
> Yes, at the moment this is limited - we'd prefer first to solicit
> some input from community.
>
> > The library also seems to support interfacing to the route table,
> > so it is not "interface proxy" but "IP stack proxy".
>
> True, to some extent - for example you can bring the interface up and
> down which has nothing to do with IP stack. As for the name of the
> library - that is actually part where we are completely open. The proxy
> represents port (thus the name) but that is not all, so any better name
> proposals are welcome.
>
> > You already mention ARP table as future work. How about namespaces,
> > ip tables, and other advanced features... I foresee the Devil in the
> > details for any real use case.
>
> Right now I don't know what other things are needed. This idea is still
> early. However imagine you'd like to use DPDK to speed up packet
> processing of IP stack - would you like to implement all the protocols
> that are needed? Or just let the system handle the control path and
> handle the data path and sniff the control params from the system.
>
Like Morten, I'd be a bit concerned at the possible scope of the work if we
start pulling in functionality from the IP stack like ARP etc. To avoid
this becoming a massive effort, how useful would it be if we just limited
the scope to physical NIC setup only, and did not do anything above the l2
layer?
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-15 10:15 ` Bruce Richardson
@ 2020-01-15 11:27 ` Jerin Jacob
2020-01-15 12:28 ` Morten Brørup
1 sibling, 0 replies; 21+ messages in thread
From: Jerin Jacob @ 2020-01-15 11:27 UTC (permalink / raw)
To: Bruce Richardson, mb; +Cc: dpdk-dev, Andrzej Ostruszka
On Wed, Jan 15, 2020 at 3:45 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Tue, Jan 14, 2020 at 06:38:37PM +0100, Andrzej Ostruszka wrote:
> > On 1/14/20 4:16 PM, Morten Brørup wrote:
> > > Andrzej,
> >
> > Hello Morten
> >
> > > Basically you are adding a very small subset of the Linux IP stack
> > > to interface with DPDK applications via callbacks.
> >
> > Yes, at the moment this is limited - we'd prefer first to solicit
> > some input from community.
> >
> > > The library also seems to support interfacing to the route table,
> > > so it is not "interface proxy" but "IP stack proxy".
> >
> > True, to some extent - for example you can bring the interface up and
> > down which has nothing to do with IP stack. As for the name of the
> > library - that is actually part where we are completely open. The proxy
> > represents port (thus the name) but that is not all, so any better name
> > proposals are welcome.
> >
> > > You already mention ARP table as future work. How about namespaces,
> > > ip tables, and other advanced features... I foresee the Devil in the
> > > details for any real use case.
> >
> > Right now I don't know what other things are needed. This idea is still
> > early. However imagine you'd like to use DPDK to speed up packet
> > processing of IP stack - would you like to implement all the protocols
> > that are needed? Or just let the system handle the control path and
> > handle the data path and sniff the control params from the system.
> >
> Like Morten, I'd be a bit concerned at the possible scope of the work if we
> start pulling in functionality from the IP stack like ARP etc. To avoid
> this becoming a massive effort, how useful would it be if we just limited
> the scope to physical NIC setup only, and did not do anything above the l2
> layer?
Like the IPSec library, Marvell would like to add support for additional
protocols (probably beginning with IPv4 and UDP) to DPDK. One of our concerns
was the control plane interface for those protocols, for effective use in
DPDK. Since DPDK now supports FreeBSD and Windows, we cannot use NETLINK
directly in the library. The sole intention of this library is to abstract
the control plane interface. We can start with only the L2 layer for now,
but in the future, when we add the L3 layer, we will need to add the
additional items.
Suggestions?
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-15 10:15 ` Bruce Richardson
2020-01-15 11:27 ` Jerin Jacob
@ 2020-01-15 12:28 ` Morten Brørup
2020-01-15 12:57 ` Jerin Jacob
2020-01-15 14:09 ` Bruce Richardson
1 sibling, 2 replies; 21+ messages in thread
From: Morten Brørup @ 2020-01-15 12:28 UTC (permalink / raw)
To: Bruce Richardson, Andrzej Ostruszka; +Cc: dev
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, January 15, 2020 11:16 AM
>
> On Tue, Jan 14, 2020 at 06:38:37PM +0100, Andrzej Ostruszka wrote:
> > On 1/14/20 4:16 PM, Morten Brørup wrote:
> > > Andrzej,
> >
> > Hello Morten
> >
> > > Basically you are adding a very small subset of the Linux IP stack
> > > to interface with DPDK applications via callbacks.
> >
> > Yes, at the moment this is limited - we'd prefer first to solicit
> > some input from community.
> >
> > > The library also seems to support interfacing to the route table,
> > > so it is not "interface proxy" but "IP stack proxy".
> >
> > True, to some extent - for example you can bring the interface up and
> > down which has nothing to do with IP stack. As for the name of the
> > library - that is actually part where we are completely open. The
> proxy
> > represents port (thus the name) but that is not all, so any better
> name
> > proposals are welcome.
> >
> > > You already mention ARP table as future work. How about namespaces,
> > > ip tables, and other advanced features... I foresee the Devil in
> the
> > > details for any real use case.
> >
> > Right now I don't know what other things are needed. This idea is
> still
> > early. However imagine you'd like to use DPDK to speed up packet
> > processing of IP stack - would you like to implement all the
> protocols
> > that are needed? Or just let the system handle the control path and
> > handle the data path and sniff the control params from the system.
> >
> Like Morten, I'd be a bit concerned at the possible scope of the work
> if we
> start pulling in functionality from the IP stack like ARP etc. To avoid
> this becoming a massive effort, how useful would it be if we just
> limited
> the scope to physical NIC setup only, and did not do anything above the
> l2
> layer?
Think about it... Regardless of scope, this is clearly a control plane API, not a data plane API.
It provides a proxy API for the O/S control plane (NETLINK in the case of Linux), so the DPDK application can use the user interface that the O/S already provides (e.g. "ip link set dev tap1 mtu 1600" etc.) for its control plane, instead of implementing its own CLI (or GUI or whatever).
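To make the proxy idea concrete, here is a rough sketch of how a change observed on a proxy interface could be fanned out to the callbacks for every bound port (the cover letter allows several ports per proxy). All names below (struct callbacks, port_bind(), notify_mtu_change(), MAX_PORTS, NO_PROXY) are hypothetical stand-ins chosen for this illustration; they are not the RFC's implementation and use no DPDK headers.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 8           /* hypothetical limit for this sketch */
#define NO_PROXY  UINT16_MAX  /* marks a port that is not bound */

/* Hypothetical stand-in for the RFC's callback table (MTU only). */
struct callbacks {
	void (*mtu_change)(uint16_t port_id, uint16_t mtu);
};

static uint16_t port_to_proxy[MAX_PORTS]; /* proxy bound to each port */
static uint16_t port_mtu[MAX_PORTS];      /* state kept by the callback */

/* Mark every port as unbound. */
static void
bindings_init(void)
{
	for (size_t p = 0; p < MAX_PORTS; p++)
		port_to_proxy[p] = NO_PROXY;
}

/* Record a logical binding; returns 0 on success, -1 on a bad port id. */
static int
port_bind(uint16_t port_id, uint16_t proxy_id)
{
	if (port_id >= MAX_PORTS)
		return -1;
	port_to_proxy[port_id] = proxy_id;
	return 0;
}

/* Example application callback: just remember the new MTU. */
static void
record_mtu(uint16_t port_id, uint16_t mtu)
{
	port_mtu[port_id] = mtu;
}

/* Fan an MTU change seen on proxy_id out to every port bound to it.
 * Returns the number of callbacks invoked. */
static unsigned int
notify_mtu_change(const struct callbacks *cbs, uint16_t proxy_id,
		  uint16_t mtu)
{
	unsigned int calls = 0;

	if (cbs == NULL || cbs->mtu_change == NULL)
		return 0;
	for (uint16_t p = 0; p < MAX_PORTS; p++) {
		if (port_to_proxy[p] == proxy_id) {
			cbs->mtu_change(p, mtu);
			calls++;
		}
	}
	return calls;
}
```

With two ports bound to the same proxy, a single "ip link set dev tap1 mtu 1600" would in this sketch result in two mtu_change calls, and it is up to the application to decide what that should mean.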
In order to provide significant value, it will have to grow massively, so I can use it as imagined: To make a Linux firewall where the DPDK application handles the data plane, and the normal Linux commands are used for setting up the firewall, incl. firewall rules, port forwarding, NAPT, etc. The Devil is in the details here!
Although I like the concept and idea behind it, I don't think a control plane proxy API belongs in DPDK. But it could possibly be hosted by the DPDK project, if approved as such.
Med venlig hilsen / kind regards
- Morten Brørup
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-15 12:28 ` Morten Brørup
@ 2020-01-15 12:57 ` Jerin Jacob
2020-01-15 15:30 ` Morten Brørup
2020-01-15 14:09 ` Bruce Richardson
1 sibling, 1 reply; 21+ messages in thread
From: Jerin Jacob @ 2020-01-15 12:57 UTC (permalink / raw)
To: Morten Brørup; +Cc: Bruce Richardson, Andrzej Ostruszka, dpdk-dev
On Wed, Jan 15, 2020 at 5:58 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Wednesday, January 15, 2020 11:16 AM
> >
> > On Tue, Jan 14, 2020 at 06:38:37PM +0100, Andrzej Ostruszka wrote:
> > > On 1/14/20 4:16 PM, Morten Brørup wrote:
> > > > Andrzej,
> > >
> > > Hello Morten
> > >
> > > > Basically you are adding a very small subset of the Linux IP stack
> > > > to interface with DPDK applications via callbacks.
> > >
> > > Yes, at the moment this is limited - we'd prefer first to solicit
> > > some input from community.
> > >
> > > > The library also seems to support interfacing to the route table,
> > > > so it is not "interface proxy" but "IP stack proxy".
> > >
> > > True, to some extent - for example you can bring the interface up and
> > > down which has nothing to do with IP stack. As for the name of the
> > > library - that is actually part where we are completely open. The
> > proxy
> > > represents port (thus the name) but that is not all, so any better
> > name
> > > proposals are welcome.
> > >
> > > > You already mention ARP table as future work. How about namespaces,
> > > > ip tables, and other advanced features... I foresee the Devil in
> > the
> > > > details for any real use case.
> > >
> > > Right now I don't know what other things are needed. This idea is
> > still
> > > early. However imagine you'd like to use DPDK to speed up packet
> > > processing of IP stack - would you like to implement all the
> > protocols
> > > that are needed? Or just let the system handle the control path and
> > > handle the data path and sniff the control params from the system.
> > >
> > Like Morten, I'd be a bit concerned at the possible scope of the work
> > if we
> > start pulling in functionality from the IP stack like ARP etc. To avoid
> > this becoming a massive effort, how useful would it be if we just
> > limited
> > the scope to physical NIC setup only, and did not do anything above the
> > l2
> > layer?
>
> Think about it... Regardless of scope, this is clearly a control plane API, not a data plane API.
>
> It provides a proxy API for the O/S control plane (NETLINK in the case of Linux), so the DPDK application can use the user interface that the O/S already provides (e.g. "ip link set dev tap1 mtu 1600" etc.) for its control plane, instead of implementing its own CLI (or GUI or whatever).
Yes.
>
> In order to provide significant value, it will have to grow massively, so I can use it as imagined: To make a Linux firewall where the DPDK application handles the data plane, and the normal Linux commands are used for setting up the firewall, incl. firewall rules, port forwarding, NAPT, etc.. The Devil is in the details here!
Yes.
Another use case would be exception handling, where DPDK may not
handle all the traffic: traffic such as ARP can be redirected to the OS.
This would enable DP to focus on the real fast path protocols such as IPv4,
UDP etc.
>
> Although I like the concept and idea behind it, I don't think a control plane proxy API belongs in DPDK. But it could possibly be hosted by the DPDK project if approved as such.
Why? rte_flow and rte_tm are control plane APIs, and they are part of
DPDK. IMO, in order to make effective use of the data plane, the control
plane has to be integrated with it in an OS-independent way.
>
>
> Med venlig hilsen / kind regards
> - Morten Brørup
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-15 12:57 ` Jerin Jacob
@ 2020-01-15 15:30 ` Morten Brørup
2020-01-15 16:04 ` Jerin Jacob
0 siblings, 1 reply; 21+ messages in thread
From: Morten Brørup @ 2020-01-15 15:30 UTC (permalink / raw)
To: Jerin Jacob; +Cc: Bruce Richardson, Andrzej Ostruszka, dpdk-dev
> -----Original Message-----
> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Wednesday, January 15, 2020 1:57 PM
>
> On Wed, Jan 15, 2020 at 5:58 PM Morten Brørup
> <mb@smartsharesystems.com> wrote:
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> Richardson
> > > Sent: Wednesday, January 15, 2020 11:16 AM
> > >
> > > On Tue, Jan 14, 2020 at 06:38:37PM +0100, Andrzej Ostruszka wrote:
> > > > On 1/14/20 4:16 PM, Morten Brørup wrote:
> > > > > Andrzej,
> > > >
> > > > Hello Morten
> > > >
> > > > > > Basically you are adding a very small subset of the Linux IP stack
> > > > > > to interface with DPDK applications via callbacks.
> > > >
> > > > Yes, at the moment this is limited - we'd prefer first to solicit
> > > > some input from community.
> > > >
> > > > > The library also seems to support interfacing to the route
> table,
> > > > > so it is not "interface proxy" but "IP stack proxy".
> > > >
> > > > True, to some extent - for example you can bring the interface up
> and
> > > > down which has nothing to do with IP stack. As for the name of
> the
> > > > library - that is actually part where we are completely open.
> The
> > > proxy
> > > > represents port (thus the name) but that is not all, so any
> better
> > > name
> > > > proposals are welcome.
> > > >
> > > > > You already mention ARP table as future work. How about
> namespaces,
> > > > > ip tables, and other advanced features... I foresee the Devil
> in
> > > the
> > > > > details for any real use case.
> > > >
> > > > Right now I don't know what other things are needed. This idea
> is
> > > still
> > > > early. However imagine you'd like to use DPDK to speed up packet
> > > > processing of IP stack - would you like to implement all the
> > > protocols
> > > > that are needed? Or just let the system handle the control path
> and
> > > > handle the data path and sniff the control params from the
> system.
> > > >
> > > Like Morten, I'd be a bit concerned at the possible scope of the
> work
> > > if we
> > > start pulling in functionality from the IP stack like ARP etc. To
> avoid
> > > this becoming a massive effort, how useful would it be if we just
> > > limited
> > > the scope to physical NIC setup only, and did not do anything above
> the
> > > l2
> > > layer?
> >
> > Think about it... Regardless of scope, this is clearly a control
> plane API, not a data plane API.
> >
> > It provides a proxy API for the O/S control plane (NETLINK in the
> case of Linux), so the DPDK application can use the user interface that
> the O/S already provides (e.g. "ip link set dev tap1 mtu 1600" etc.)
> for its control plane, instead of implementing its own CLI (or GUI or
> whatever).
>
> Yes.
>
> >
> > In order to provide significant value, it will have to grow
> massively, so I can use it as imagined: To make a Linux firewall where
> the DPDK application handles the data plane, and the normal Linux
> commands are used for setting up the firewall, incl. firewall rules,
> port forwarding, NAPT, etc.. The Devil is in the details here!
>
> Yes.
> Another use case would be to handle exception where DPDK may not
> handle all the traffic, Traffic such ARP can be redirected to OS. This
> would enable DP to focus on the real fast path protocols such as IPv4,
> UDP etc.
>
These are use cases for DPDK being used in an environment where the IP stack features provided by Linux suffice. It would be great for a simple CPE or Wi-Fi router, e.g. OpenWRT with a DPDK data plane replacing the Linux kernel's data plane.
For this use case, I think an example application would be a much more useful way to achieve your goal. Implementing it as an application will also uncover what is really needed, instead of us all speculating about what a proxy library might need to include.
But consider an advanced router with VRFs, VLANs, policy-based routing, multiple WANs provided through network namespaces... the library will be huge!
> >
> > Although I like the concept and idea behind it, I don't think a
> control plane proxy API belongs in DPDK. But it could possibly be
> hosted by the DPDK project if approved as such.
>
> Why? rte_flow, rte_tm all control plane APIs and it is part of
> DPDK.
Yes, there are some DPDK libraries leaning more towards control plane than data plane. Another example to prove your point: The whole process scheduling library has very little to do with packet processing. Vaguely related features are creeping in when objections are not strong enough.
> IMO, in order to have effective use of data plane, the control
> plane has to be integrated together in an OS-independent way.
>
Also remember that not all DPDK applications need an IP stack resembling what Linux has. E.g. the SmartShare StraightShaper is a transparent bandwidth optimization appliance, and it doesn't perform any routing, it doesn't use any O/S-like features in the data path, and thus it doesn't need to integrate with the IP stack in the O/S. (The management interface uses the Linux IP stack, but it is completely isolated from the DPDK application's data plane.) The same can be said about e.g. T-Rex.
Obviously, not all DPDK applications use all DPDK libraries, and since I'm not obligated to use it, I'm not strongly opposed to it. I only question its usefulness outside of the specific use case of replacing the fast path in the Linux kernel.
-Morten
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-15 15:30 ` Morten Brørup
@ 2020-01-15 16:04 ` Jerin Jacob
2020-01-15 18:15 ` Morten Brørup
0 siblings, 1 reply; 21+ messages in thread
From: Jerin Jacob @ 2020-01-15 16:04 UTC (permalink / raw)
To: Morten Brørup; +Cc: Bruce Richardson, Andrzej Ostruszka, dpdk-dev
On Wed, Jan 15, 2020 at 9:00 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > -----Original Message-----
> > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > Sent: Wednesday, January 15, 2020 1:57 PM
> >
> > On Wed, Jan 15, 2020 at 5:58 PM Morten Brørup
> > <mb@smartsharesystems.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > Richardson
> > > > Sent: Wednesday, January 15, 2020 11:16 AM
> > > >
> > > > On Tue, Jan 14, 2020 at 06:38:37PM +0100, Andrzej Ostruszka wrote:
> > > > > On 1/14/20 4:16 PM, Morten Brørup wrote:
> > > > > > Andrzej,
> > > > >
> > > > > Hello Morten
> > > > >
> > > > > > Basically you are adding a very small subset of the Linux IP stack
> > > > > > to interface with DPDK applications via callbacks.
> > > > >
> > > > > Yes, at the moment this is limited - we'd prefer first to solicit
> > > > > some input from community.
> > > > >
> > > > > > The library also seems to support interfacing to the route
> > table,
> > > > > > so it is not "interface proxy" but "IP stack proxy".
> > > > >
> > > > > True, to some extent - for example you can bring the interface up
> > and
> > > > > down which has nothing to do with IP stack. As for the name of
> > the
> > > > > library - that is actually part where we are completely open.
> > The
> > > > proxy
> > > > > represents port (thus the name) but that is not all, so any
> > better
> > > > name
> > > > > proposals are welcome.
> > > > >
> > > > > > You already mention ARP table as future work. How about
> > namespaces,
> > > > > > ip tables, and other advanced features... I foresee the Devil
> > in
> > > > the
> > > > > > details for any real use case.
> > > > >
> > > > > Right now I don't know what other things are needed. This idea
> > is
> > > > still
> > > > > early. However imagine you'd like to use DPDK to speed up packet
> > > > > processing of IP stack - would you like to implement all the
> > > > protocols
> > > > > that are needed? Or just let the system handle the control path
> > and
> > > > > handle the data path and sniff the control params from the
> > system.
> > > > >
> > > > Like Morten, I'd be a bit concerned at the possible scope of the
> > work
> > > > if we
> > > > start pulling in functionality from the IP stack like ARP etc. To
> > avoid
> > > > this becoming a massive effort, how useful would it be if we just
> > > > limited
> > > > the scope to physical NIC setup only, and did not do anything above
> > the
> > > > l2
> > > > layer?
> > >
> > > Think about it... Regardless of scope, this is clearly a control
> > plane API, not a data plane API.
> > >
> > > It provides a proxy API for the O/S control plane (NETLINK in the
> > case of Linux), so the DPDK application can use the user interface that
> > the O/S already provides (e.g. "ip link set dev tap1 mtu 1600" etc.)
> > for its control plane, instead of implementing its own CLI (or GUI or
> > whatever).
> >
> > Yes.
> >
> > >
> > > In order to provide significant value, it will have to grow
> > massively, so I can use it as imagined: To make a Linux firewall where
> > the DPDK application handles the data plane, and the normal Linux
> > commands are used for setting up the firewall, incl. firewall rules,
> > port forwarding, NAPT, etc.. The Devil is in the details here!
> >
> > Yes.
> > Another use case would be to handle exception where DPDK may not
> > handle all the traffic, Traffic such ARP can be redirected to OS. This
> > would enable DP to focus on the real fast path protocols such as IPv4,
> > UDP etc.
> >
>
> These are use cases for DPDK being used in an environment where the IP stack features provided by Linux suffices. It would be great for a simple CPE or Wi-Fi router, e.g. OpenWRT with a DPDK data plane replacing the Linux kernel's data plane.
IMO, it is not replacing the Linux IP stack; instead, it uses the slow path
services from Linux or any other OS. The use cases would vary from a simple
Wi-Fi router to a 5G transport stack.
>
> For this use case, I think an example application would be a much more useful way to achieve your goal. Implementing it as an application will also uncover what is really needed, instead of us all speculating about what a proxy library might need to include.
>
> But consider an advanced router with VRFs, VLANs, policy based routing, multiple WANs provided through network namespaces... the library will be huge!
We thought of adding the infrastructure first and scaling it up on a
per-need basis. There is no such infrastructure in DPDK now.
At least, if someone wishes to contribute to this area, there
should be a path to improve things wrt the current situation.
>
> > >
> > > Although I like the concept and idea behind it, I don't think a
> > control plane proxy API belongs in DPDK. But it could possibly be
> > hosted by the DPDK project if approved as such.
> >
> > Why? rte_flow, rte_tm all control plane APIs and it is part of
> > DPDK.
>
> Yes, there are some DPDK libraries leaning more towards control plane than data plane. Another example to prove your point: The whole process scheduling library has very little to do with packet processing. Vaguely related features are creeping in when objections are not strong enough.
Yes. That's the reason why the control path vs data path argument
doesn't have any value.
If it is useful for packet processing use cases, then have it.
>
> > IMO, in order to have effective use of data plane, the control
> > plane has to be integrated together in an OS-independent way.
> >
>
> Also remember that not all DPDK applications need an IP stack resembling what Linux has. E.g. the SmartShare StraightShaper is a transparent bandwidth optimization appliance, and it doesn't perform any routing, it doesn't use any O/S-like features in the data path, and thus it doesn't need to integrate with the IP stack in the O/S. (The management interface uses the Linux IP stack, but it is completely isolated from the DPDK application's data plane.) The same can be said about e.g. T-Rex.
>
> Obviously, not all DPDK applications use all DPDK libraries, and since I'm not obligated to use it, I'm not strongly opposed against it. I only question its usefulness outside of the specific use case of replacing the fast path in the Linux kernel.
Of course. We still follow the "à la carte" model, where we do not
force the end-user application to use the library. You can always
use whatever control path makes sense for the end-user
application.
But if some application wants to write control plane SW that needs to
work on Linux/FreeBSD/Windows or some RTOS, then it can be used (again, if
someone wishes to do so).
We can also provide the means for NOPing out the callbacks, or for
overriding them with something specific to the end-user library, so that
complete flexibility still remains with the application wrt the usage.
>
> -Morten
>
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-15 16:04 ` Jerin Jacob
@ 2020-01-15 18:15 ` Morten Brørup
2020-01-16 7:15 ` Jerin Jacob
2020-01-16 9:09 ` Andrzej Ostruszka
0 siblings, 2 replies; 21+ messages in thread
From: Morten Brørup @ 2020-01-15 18:15 UTC (permalink / raw)
To: Jerin Jacob; +Cc: Bruce Richardson, Andrzej Ostruszka, dpdk-dev
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> Sent: Wednesday, January 15, 2020 5:04 PM
>
> On Wed, Jan 15, 2020 at 9:00 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > > -----Original Message-----
> > > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > Sent: Wednesday, January 15, 2020 1:57 PM
> > >
> > > On Wed, Jan 15, 2020 at 5:58 PM Morten Brørup
> > > <mb@smartsharesystems.com> wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > Richardson
> > > > > Sent: Wednesday, January 15, 2020 11:16 AM
> > > > >
> > > > > On Tue, Jan 14, 2020 at 06:38:37PM +0100, Andrzej Ostruszka wrote:
> > > > > > On 1/14/20 4:16 PM, Morten Brørup wrote:
> > > > > > > Andrzej,
> > > > > >
> > > > > > Hello Morten
> > > > > >
> > > > > > > Basically you are adding a very small subset of the Linux IP stack
> > > > > > > to interface with DPDK applications via callbacks.
> > > > > >
> > > > > > Yes, at the moment this is limited - we'd prefer first to solicit
> > > > > > some input from community.
> > > > > >
> > > > > > > The library also seems to support interfacing to the route
> > > table,
> > > > > > > so it is not "interface proxy" but "IP stack proxy".
> > > > > >
> > > > > > True, to some extent - for example you can bring the interface up
> > > and
> > > > > > down which has nothing to do with IP stack. As for the name of
> > > the
> > > > > > library - that is actually part where we are completely open.
> > > The
> > > > > proxy
> > > > > > represents port (thus the name) but that is not all, so any
> > > better
> > > > > name
> > > > > > proposals are welcome.
> > > > > >
> > > > > > > You already mention ARP table as future work. How about
> > > namespaces,
> > > > > > > ip tables, and other advanced features... I foresee the Devil
> > > in
> > > > > the
> > > > > > > details for any real use case.
> > > > > >
> > > > > > Right now I don't know what other things are needed. This idea
> > > is
> > > > > still
> > > > > > early. However imagine you'd like to use DPDK to speed up packet
> > > > > > processing of IP stack - would you like to implement all the
> > > > > protocols
> > > > > > that are needed? Or just let the system handle the control path
> > > and
> > > > > > handle the data path and sniff the control params from the
> > > system.
> > > > > >
> > > > > Like Morten, I'd be a bit concerned at the possible scope of the
> > > work
> > > > > if we
> > > > > start pulling in functionality from the IP stack like ARP etc. To
> > > avoid
> > > > > this becoming a massive effort, how useful would it be if we just
> > > > > limited
> > > > > the scope to physical NIC setup only, and did not do anything above
> > > the
> > > > > l2
> > > > > layer?
> > > >
> > > > Think about it... Regardless of scope, this is clearly a control
> > > plane API, not a data plane API.
> > > >
> > > > It provides a proxy API for the O/S control plane (NETLINK in the
> > > case of Linux), so the DPDK application can use the user interface that
> > > the O/S already provides (e.g. "ip link set dev tap1 mtu 1600" etc.)
> > > for its control plane, instead of implementing its own CLI (or GUI or
> > > whatever).
> > >
> > > Yes.
> > >
> > > >
> > > > In order to provide significant value, it will have to grow
> > > massively, so I can use it as imagined: To make a Linux firewall where
> > > the DPDK application handles the data plane, and the normal Linux
> > > commands are used for setting up the firewall, incl. firewall rules,
> > > port forwarding, NAPT, etc.. The Devil is in the details here!
> > >
> > > Yes.
> > > Another use case would be to handle exception where DPDK may not
> > > handle all the traffic, Traffic such ARP can be redirected to OS. This
> > > would enable DP to focus on the real fast path protocols such as IPv4,
> > > UDP etc.
> > >
> >
> > These are use cases for DPDK being used in an environment where the IP
> stack features provided by Linux suffices. It would be great for a simple
> CPE or Wi-Fi router, e.g. OpenWRT with a DPDK data plane replacing the
> Linux kernel's data plane.
>
> IMO, it not replacing the Linux IP stack, instead, using the slow path
> services from Linux or any OS. The use case would vary from simple
> WiFI router to 5G transport stack.
I only mentioned the special case where a DPDK application uses the Linux slow path services (through your proxy API) and the DPDK application handles the data plane instead of the packets being handled by the Linux kernel's data plane. But I agree, the concept is broader than that.
>
> >
> > For this use case, I think an example application would be a much more
> useful way to achieve your goal. Implementing it as an application will
> also uncover what is really needed, instead of us all speculating about
> what a proxy library might need to include.
> >
> > But consider an advanced router with VRFs, VLANs, policy based routing,
> multiple WANs provided through network namespaces... the library will be
> huge!
>
> We thought of adding the infrastructure and the need per basics, we
> can scale it up. There is no such infrastructure now with DPDK.
> At least if someone wishes to contribute to this these area then there
> should be the path to improve things wrt current situation.
>
> >
> > > >
> > > > Although I like the concept and idea behind it, I don't think a
> > > control plane proxy API belongs in DPDK. But it could possibly be
> > > hosted by the DPDK project if approved as such.
> > >
> > > Why? rte_flow, rte_tm all control plane APIs and it is part of
> > > DPDK.
> >
> > Yes, there are some DPDK libraries leaning more towards control plane
> than data plane. Another example to prove your point: The whole process
> scheduling library has very little to do with packet processing. Vaguely
> related features are creeping in when objections are not strong enough.
>
> Yes. That's the reason for the control path vs data path argument that
> doesn't have any value.
> If it is useful for packet processing use cases then have it.
>
> >
> > > IMO, in order to have effective use of data plane, the control
> > > plane has to be integrated together in an OS-independent way.
> > >
> >
> > Also remember that not all DPDK applications need an IP stack resembling
> what Linux has. E.g. the SmartShare StraightShaper is a transparent
> bandwidth optimization appliance, and it doesn't perform any routing, it
> doesn't use any O/S-like features in the data path, and thus it doesn't
> need to integrate with the IP stack in the O/S. (The management interface
> uses the Linux IP stack, but it is completely isolated from the DPDK
> application's data plane.) The same can be said about e.g. T-Rex.
> >
> > Obviously, not all DPDK applications use all DPDK libraries, and since
> I'm not obligated to use it, I'm not strongly opposed against it. I only
> question its usefulness outside of the specific use case of replacing the
> fast path in the Linux kernel.
>
> Of course, We still follow the "À la carte" model, where we are not
> forcing to use the library in the end-user application. You can always
> use whatever control path that makes sense with the end-user
> applications.
> But if some application wants to write control plane SW that needs to
> work Linux/FreeBSD/Windows or other RTOS then it can be used (Again if
> someone wishes to do so).
> We can also provide the means for NOPing out the callbacks or override
> with something it is the specific end-user library as well, so that
> complete flexibly will be still with the application wrt the usage.
>
OK, you convinced me that a general API for interfacing to the O/S control plane might be useful. So let me switch from arguing against it to providing some constructive feedback:
You should consider that most DPDK APIs are not thread safe, meaning that their internal structures cannot be manipulated/reconfigured by a control plane thread while data plane threads are accessing them. E.g. a route cannot be added in the DPDK route library while it is also being used for lookups by a DPDK data plane thread. The same goes for the hash table library. This means that callbacks are probably not the right design pattern.
AFAIK, the DPDK documentation doesn't mention any "best practices" for interaction between the control plane and data plane threads, so I understand why you chose a design pattern similar to the NIC Link Status Change interrupt design pattern.
Furthermore, I have now skimmed the other parts of your patch set. If I got it right, it looks like there's a limit of 64 callbacks; this will probably not suffice in the long run.
And on the administrative side, I assume one of you guys will volunteer as the maintainer of this library?
-Morten
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-15 18:15 ` Morten Brørup
@ 2020-01-16 7:15 ` Jerin Jacob
2020-01-16 9:11 ` Morten Brørup
2020-01-16 9:09 ` Andrzej Ostruszka
1 sibling, 1 reply; 21+ messages in thread
From: Jerin Jacob @ 2020-01-16 7:15 UTC (permalink / raw)
To: Morten Brørup; +Cc: Bruce Richardson, Andrzej Ostruszka, dpdk-dev
On Wed, Jan 15, 2020 at 11:45 PM Morten Brørup <mb@smartsharesystems.com> wrote:
> > > > IMO, in order to have effective use of data plane, the control
> > > > plane has to be integrated together in an OS-independent way.
> > > >
> > >
> > > Also remember that not all DPDK applications need an IP stack resembling
> > what Linux has. E.g. the SmartShare StraightShaper is a transparent
> > bandwidth optimization appliance, and it doesn't perform any routing, it
> > doesn't use any O/S-like features in the data path, and thus it doesn't
> > need to integrate with the IP stack in the O/S. (The management interface
> > uses the Linux IP stack, but it is completely isolated from the DPDK
> > application's data plane.) The same can be said about e.g. T-Rex.
> > >
> > > Obviously, not all DPDK applications use all DPDK libraries, and since
> > I'm not obligated to use it, I'm not strongly opposed against it. I only
> > question its usefulness outside of the specific use case of replacing the
> > fast path in the Linux kernel.
> >
> > Of course, We still follow the "À la carte" model, where we are not
> > forcing to use the library in the end-user application. You can always
> > use whatever control path that makes sense with the end-user
> > applications.
> > But if some application wants to write control plane SW that needs to
> > work Linux/FreeBSD/Windows or other RTOS then it can be used (Again if
> > someone wishes to do so).
> > We can also provide the means for NOPing out the callbacks or override
> > with something it is the specific end-user library as well, so that
> > complete flexibly will be still with the application wrt the usage.
> >
>
> OK, you convinced me that a general API for interfacing to the O/S control plane might be useful. So let me switch from arguing against it to providing some constructive feedback:
Good news :-)
>
> You should consider that most DPDK APIs are not thread safe, meaning that their internal structures cannot be manipulated/reconfigured by a control plane thread while data plane threads are accessing them. E.g. a route cannot be added in the DPDK route library while it is also being used by for lookups by a DPDK data plane thread. The same goes for the hash table library. This means that callbacks are probably not the right design pattern.
I think we can have only two design patterns for this case:
1) Push model (i.e. callback). In this case, DP gets the callback; if it
is not the correct time to apply the configuration, then DP can store
it in its own queue and pull it later.
2) Pull model. In this case, the library stores the events. When DP
needs the events, it can pull them from the library.
Do you have any other model in mind? And what is your preference between the two?
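The two models can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names `struct ev`, `push_deliver`, `pull_store`, `pull_next` and `defer_cb` are hypothetical and do not come from the RFC patches, and the queue is single-threaded for brevity. The `defer_cb` callback shows the case described for the push model, where the callback only stores the event in an application-owned queue for later.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event record; not the layout used by the RFC patches. */
enum ev_type { EV_MTU_CHANGE, EV_LINK_CHANGE };

struct ev {
	enum ev_type type;
	uint16_t port_id;
	uint32_t value; /* new MTU, or 1/0 for link up/down */
};

/* ---- Pull model: events are stored, the consumer polls for them. ---- */
#define EVQ_SIZE 8u /* power of two, so index wrap-around is a mask */

struct pull_src {
	struct ev q[EVQ_SIZE];
	unsigned int head; /* next slot to write */
	unsigned int tail; /* next slot to read */
};

/* Producer side: returns 0 on success, -1 if the queue is full. */
static int pull_store(struct pull_src *s, const struct ev *e)
{
	if (s->head - s->tail == EVQ_SIZE)
		return -1;
	s->q[s->head++ & (EVQ_SIZE - 1)] = *e;
	return 0;
}

/* Consumer side: returns 1 and fills *out if an event was pending. */
static int pull_next(struct pull_src *s, struct ev *out)
{
	if (s->head == s->tail)
		return 0;
	*out = s->q[s->tail++ & (EVQ_SIZE - 1)];
	return 1;
}

/* ---- Push model: the event source calls the application. ---- */
typedef void (*ev_cb_t)(const struct ev *e, void *arg);

struct push_src {
	ev_cb_t cb;
	void *arg;
};

/* Called by the (hypothetical) library when the OS reports a change. */
static void push_deliver(struct push_src *s, const struct ev *e)
{
	if (s->cb != NULL)
		s->cb(e, s->arg);
}

/* Push-model callback that defers the work: it only queues the event in
 * an application-owned queue, to be drained when the DP is ready. */
static void defer_cb(const struct ev *e, void *arg)
{
	(void)pull_store((struct pull_src *)arg, e);
}
```

With `defer_cb` registered, the push model degenerates into a pull model whose queue is owned by the application rather than by the library, so the main difference between the two is who owns the queue.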
>
> AFAIK, the DPDK documentation doesn't mention any "best practices" for interaction between the control plane and data plans threads, so I understand why you chose a design pattern similar to the NIC Link Status Change interrupt design pattern.
>
> Furthermore, I have now skimmed the other parts of your patch set. If I got it right, it looks like there's a limit of 64 callbacks; this will probably not suffice in the long run.
Yes. We will increase it.
> And on the administrative side, I assume one of you guys will volunteer as the maintainer of this library?
Yes
>
> -Morten
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-16 7:15 ` Jerin Jacob
@ 2020-01-16 9:11 ` Morten Brørup
0 siblings, 0 replies; 21+ messages in thread
From: Morten Brørup @ 2020-01-16 9:11 UTC (permalink / raw)
To: Jerin Jacob; +Cc: Bruce Richardson, Andrzej Ostruszka, dpdk-dev
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> Sent: Thursday, January 16, 2020 8:15 AM
>
> On Wed, Jan 15, 2020 at 11:45 PM Morten Brørup
> <mb@smartsharesystems.com> wrote:
>
> > > > > IMO, in order to have effective use of data plane, the control
> > > > > plane has to be integrated together in an OS-independent way.
> > > > >
> > > >
> > > > Also remember that not all DPDK applications need an IP stack
> resembling
> > > what Linux has. E.g. the SmartShare StraightShaper is a transparent
> > > bandwidth optimization appliance, and it doesn't perform any
> routing, it
> > > doesn't use any O/S-like features in the data path, and thus it
> doesn't
> > > need to integrate with the IP stack in the O/S. (The management
> interface
> > > uses the Linux IP stack, but it is completely isolated from the
> DPDK
> > > application's data plane.) The same can be said about e.g. T-Rex.
> > > >
> > > > Obviously, not all DPDK applications use all DPDK libraries, and
> since
> > > I'm not obligated to use it, I'm not strongly opposed against it. I
> only
> > > question its usefulness outside of the specific use case of
> replacing the
> > > fast path in the Linux kernel.
> > >
> > > Of course, We still follow the "À la carte" model, where we are not
> > > forcing to use the library in the end-user application. You can
> always
> > > use whatever control path that makes sense with the end-user
> > > applications.
> > > But if some application wants to write control plane SW that needs
> to
> > > work Linux/FreeBSD/Windows or other RTOS then it can be used (Again
> if
> > > someone wishes to do so).
> > > We can also provide the means for NOPing out the callbacks or
> override
> > > with something it is the specific end-user library as well, so that
> > > complete flexibly will be still with the application wrt the usage.
> > >
> >
> > OK, you convinced me that a general API for interfacing to the O/S
> control plane might be useful. So let me switch from arguing against it
> to providing some constructive feedback:
>
> Good news :-)
>
> >
> > You should consider that most DPDK APIs are not thread safe, meaning
> that their internal structures cannot be manipulated/reconfigured by a
> control plane thread while data plane threads are accessing them. E.g.
> a route cannot be added in the DPDK route library while it is also
> being used by for lookups by a DPDK data plane thread. The same goes
> for the hash table library. This means that callbacks are probably not
> the right design pattern.
>
> I think, we can have only two design patterns for this case.
>
> 1) push model(i.e callback). In this case, DP gets the callback, if it
> is not the correct time to apply the configuration then DP can store
> it in its own queue and pull it latter.
> 2) pull model. In this case, the library stores the events. When DP
> needs the events, it can pull the events from the library.
>
> Do you have any other model in mind? and what is your preference among
> two?
>
This library interfaces to the O/S on the one side, and a DPDK application on the other side.
Looking at the interface towards the DPDK application, I would personally prefer a pull model. It will allow the DPDK application to handle the events when it is convenient and safe for the DPDK application to manipulate its non-thread-safe data structures.
Looking at the interface towards the O/S, Linux Netlink is well defined and described in RFC 3549, and message queues (e.g. DPDK rings) seem like a perfect match for this.
I don't know enough about the Windows network stack to tell if the same applies here, so you should look into this before proceeding. On the other hand, the "memif" Memory Interface PMD is Linux only; so you could also consider limiting your library support to operating systems often being used as routers, i.e. Linux and BSD, and explicitly omit Windows support.
I have no preferences about the message format, but since Linux Netlink is described in an RFC, you could consider using this exact message format or a closely related message format. The RFC authors probably thought this through very thoroughly.
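As an illustration of a Netlink-like message format, here is a minimal sketch in C. It assumes a locally defined header, `struct msg_hdr`, modeled on the Netlink message header fields (length, type, flags, sequence number, sender id) with 4-byte message alignment; the helper names and the message type values used are hypothetical, and nothing here is taken from the RFC patches.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local header modeled on the Netlink message header. It is defined
 * here (instead of using the OS header) to keep the sketch portable. */
struct msg_hdr {
	uint32_t len;   /* total message length, header included */
	uint16_t type;  /* message type (values are application-defined here) */
	uint16_t flags;
	uint32_t seq;
	uint32_t pid;
};

/* Messages start on 4-byte boundaries, as in Netlink. */
#define MSG_ALIGN(n) (((n) + 3u) & ~3u)

/* Append one message at offset 'off'. Returns the new offset, or 0 if
 * the message does not fit (a valid new offset is never 0). */
static size_t msg_append(uint8_t *buf, size_t cap, size_t off,
			 uint16_t type, const void *payload, uint32_t plen)
{
	struct msg_hdr h = { 0, type, 0, 0, 0 };
	size_t total;

	if (off > cap || plen > cap)
		return 0;
	h.len = (uint32_t)sizeof(h) + plen;
	total = MSG_ALIGN(h.len);
	if (total > cap - off)
		return 0;
	memset(buf + off, 0, total); /* zero the padding bytes as well */
	memcpy(buf + off, &h, sizeof(h));
	if (plen > 0)
		memcpy(buf + off + sizeof(h), payload, plen);
	return off + total;
}

/* Walk the messages in buf[0..used) and count those of a given type. */
static unsigned int msg_count_type(const uint8_t *buf, size_t used,
				   uint16_t type)
{
	unsigned int n = 0;
	size_t off = 0;

	while (off <= used && used - off >= sizeof(struct msg_hdr)) {
		struct msg_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || h.len > used - off)
			break; /* malformed or truncated message */
		if (h.type == type)
			n++;
		off += MSG_ALIGN(h.len);
	}
	return n;
}
```

A buffer built this way could be carried through a message queue as a single element, and the consumer would walk it with a loop like the one in `msg_count_type`.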
> >
> > AFAIK, the DPDK documentation doesn't mention any "best practices"
> for interaction between the control plane and data plans threads, so I
> understand why you chose a design pattern similar to the NIC Link
> Status Change interrupt design pattern.
> >
> > Furthermore, I have now skimmed the other parts of your patch set. If
> I got it right, it looks like there's a limit of 64 callbacks; this
> will probably not suffice in the long run.
>
> Yes. We will increase it.
>
> > And on the administrative side, I assume one of you guys will
> volunteer as the maintainer of this library?
>
> Yes
Great.
-Morten
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-15 18:15 ` Morten Brørup
2020-01-16 7:15 ` Jerin Jacob
@ 2020-01-16 9:09 ` Andrzej Ostruszka
2020-01-16 9:30 ` Morten Brørup
1 sibling, 1 reply; 21+ messages in thread
From: Andrzej Ostruszka @ 2020-01-16 9:09 UTC (permalink / raw)
To: Morten Brørup, Jerin Jacob; +Cc: Bruce Richardson, dpdk-dev
On 1/15/20 7:15 PM, Morten Brørup wrote:
[...]
> OK, you convinced me that a general API for interfacing to the O/S
> control plane might be useful.
Glad to hear that.
[...]
> You should consider that most DPDK APIs are not thread safe,
> meaning that their internal structures cannot be manipulated/reconfigured
> by a control plane thread while data plane threads are accessing them.
> E.g. a route cannot be added in the DPDK route library while it is also
> being used by for lookups by a DPDK data plane thread. The same goes
> for the hash table library.
You are already thinking about modification of the application data.
That is actually beyond the scope of the library. The intention of the
library is to provide notification of a change. It is meant to be
the task of the callback (provided by the user) to act on the change.
It can store the change to be picked up at the next packet burst
iteration, or use some RCU synchronization, or even stop the world and
push the change (if the writer of the application deems that appropriate).
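The first option (store the change and pick it up at the next packet burst iteration) could look roughly like the sketch below. It is not part of the patch set: the single-producer/single-consumer ring built on C11 atomics, and the names `on_mtu_change`, `apply_pending` and `port_mtu`, are assumptions made for illustration. A real application would more likely use an existing DPDK ring for the hand-off.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 16u /* power of two, so index wrap-around is a mask */
#define MAX_PORTS 4u  /* arbitrary small value for the sketch */

struct mtu_change {
	uint16_t port_id;
	uint16_t mtu;
};

/* Single-producer/single-consumer ring: the control thread is the only
 * writer of 'head', the data plane thread the only writer of 'tail'. */
struct spsc_ring {
	struct mtu_change slot[RING_SIZE];
	_Atomic unsigned int head;
	_Atomic unsigned int tail;
};

static int ring_put(struct spsc_ring *r, struct mtu_change c)
{
	unsigned int h = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned int t = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (h - t == RING_SIZE)
		return -1; /* full: the caller decides whether to drop or retry */
	r->slot[h & (RING_SIZE - 1)] = c;
	atomic_store_explicit(&r->head, h + 1, memory_order_release);
	return 0;
}

static int ring_get(struct spsc_ring *r, struct mtu_change *c)
{
	unsigned int t = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned int h = atomic_load_explicit(&r->head, memory_order_acquire);

	if (h == t)
		return 0; /* nothing pending */
	*c = r->slot[t & (RING_SIZE - 1)];
	atomic_store_explicit(&r->tail, t + 1, memory_order_release);
	return 1;
}

/* Application state that only the data plane thread may touch. */
static uint16_t port_mtu[MAX_PORTS];

/* Body of a hypothetical user callback: it runs in the control thread
 * and only records the change, it does not touch port_mtu[]. */
static int on_mtu_change(struct spsc_ring *r, uint16_t port_id, uint16_t mtu)
{
	struct mtu_change c = { port_id, mtu };

	return ring_put(r, c);
}

/* Called by the data plane loop between packet bursts. Returns the
 * number of changes taken from the ring. */
static unsigned int apply_pending(struct spsc_ring *r)
{
	struct mtu_change c;
	unsigned int n = 0;

	while (ring_get(r, &c)) {
		if (c.port_id < MAX_PORTS)
			port_mtu[c.port_id] = c.mtu;
		n++;
	}
	return n;
}
```

The point of the split is that `port_mtu[]` is written from one thread only, so the application data itself needs no locking; only the small ring is shared between the control and data plane threads.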
> This means that callbacks are probably not the right design pattern.
What are the other possibilities? The library could keep a "copy" of the
interesting configuration, periodically update it and mark the
changes to let the application notice. But that would be inefficient - I
would have to query all data to check for the diff. So I think the
callback is the right design - we get only changes. However, please note
the explanation above: it is up to the application writer to provide
a callback that fits the design of the application and, in cooperation
with it, moves the network config change into internal data structures.
> Furthermore, I have now skimmed the other parts of your patch set.
> If I got it right, it looks like there's a limit of 64 callbacks;
> this will probably not suffice in the long run.
This is interesting. What has given you that impression? I'm really
curious since I've written it :). There is a limit on the number of
proxies (but this is the same as the limit on DPDK ports - so not really a
limitation of this lib). BTW, since this is a slow path and I don't
need fast access, I keep proxies in a list, so that only the active ones
have memory allocated.
Each type of callback is just a member of the rte_ifpx_callbacks struct -
and yes, as you previously noted, this struct will grow as additional
functionality is added, but there is no real limit on it. At the moment
callbacks are meant to be global - there is a list of callback sets
(ifpx_callbacks) that is common to all proxies.
I expect that the most common use will be just one set of callbacks per
application. But instead of having just one global var I keep a list of
sets so that many can be registered. There are other options possible:
- each type of callback can be a list
- callbacks could be "per proxy" - meaning that each proxy port could
have its own callbacks
The first one could be beneficial if the user wants many callbacks
registered for some particular type of notification and is not
interested in the others.
The second one can be useful if different proxies should be treated
differently - in that case one could avoid conditionals in the callback
that switch behaviour depending on the proxy used.
But again, these kinds of use are not what I expect to be the common
case, so I went with the current design.
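The current design (a global list of callback sets, each set being a
struct of optional callbacks) can be sketched roughly as follows. This is
an illustration only: struct cb_set, its members and the helper functions
are hypothetical stand-ins and do not reproduce the actual
rte_ifpx_callbacks layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a callback set; member names are illustrative. */
struct cb_set {
	void (*mtu_change)(uint16_t port_id, uint16_t mtu, void *ctx);
	void (*link_change)(uint16_t port_id, int is_up, void *ctx);
	void *ctx;           /* opaque user data handed back to the callbacks */
	struct cb_set *next; /* next registered set */
};

static struct cb_set *cb_sets; /* one global list shared by all proxies */

/* Prepend a caller-owned set to the global list. */
static void cb_set_register(struct cb_set *set)
{
	assert(set != NULL);
	set->next = cb_sets;
	cb_sets = set;
}

/* Deliver an MTU change to every set that provides a handler for it.
 * Returns the number of callbacks invoked. */
static int notify_mtu_change(uint16_t port_id, uint16_t mtu)
{
	int called = 0;

	for (struct cb_set *s = cb_sets; s != NULL; s = s->next) {
		if (s->mtu_change == NULL)
			continue; /* this set is not interested in MTU changes */
		s->mtu_change(port_id, mtu, s->ctx);
		called++;
	}
	return called;
}

/* Example callback: remember the last MTU seen in the caller's context. */
static void record_mtu(uint16_t port_id, uint16_t mtu, void *ctx)
{
	(void)port_id;
	*(uint16_t *)ctx = mtu;
}
```

Leaving a member NULL is how a set opts out of a notification type, which
covers part of what the "each type of callback is a list" alternative
would offer without changing the struct.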
> And on the administrative side, I assume one of you guys will volunteer
> as the maintainer of this library?
Yes.
With regards
Andrzej Ostruszka
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-16 9:09 ` Andrzej Ostruszka
@ 2020-01-16 9:30 ` Morten Brørup
2020-01-16 10:42 ` Andrzej Ostruszka
0 siblings, 1 reply; 21+ messages in thread
From: Morten Brørup @ 2020-01-16 9:30 UTC (permalink / raw)
To: Andrzej Ostruszka, Jerin Jacob; +Cc: Bruce Richardson, dpdk-dev
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Andrzej Ostruszka
> Sent: Thursday, January 16, 2020 10:10 AM
>
> On 1/15/20 7:15 PM, Morten Brørup wrote:
> [...]
> > OK, you convinced me that a general API for interfacing to the O/S
> > control plane might be useful.
>
> Glad to hear that.
>
> [...]
> > You should consider that most DPDK APIs are not thread safe,
> > meaning that their internal structures cannot be
> > manipulated/reconfigured by a control plane thread while data plane
> > threads are accessing them.
> > E.g. a route cannot be added in the DPDK route library while it is
> > also being used for lookups by a DPDK data plane thread. The same
> > goes for the hash table library.
>
> You are thinking already about modification of the application data.
> That is actually beyond the scope of the library.
Yes, it is beyond the scope of the library; but I prefer the library to be designed for how typical applications are going to use it.
I suggest that you supplement the library with an example DPDK application that is a simple IPv4 router, forwarding packets and responding to ARP requests - according to its configuration applied in the O/S via your proxy library. You could even add support for relevant ICMP packets (e.g. respond to ICMP Echo Request and send TTL Exceeded when appropriate). It will help you determine what is required by the library, and how the library best interfaces to a "typical" DPDK application.
> The intention of the
> library is to provide with notification of a change. It is meant to be
> the task of the callback (provided by the user) to act on the change.
> It can store the change to be picked up at the next packet burst
> iteration, or use some RCU synchronization or even stop the world and
> push the change (if the writer of application deems that appropriate).
>
> > This means that callbacks are probably not the right design pattern.
>
> What are other possibilities? The library could keep "copy" of the
> interesting configuration and periodically update it and mark the
> changes to let application notice. But that would be inefficient - I
> would have to query all data to check for the diff. So I think the
> callback is the right design - we get only changes. However please
> note
> above explanation, that it is up to application writer to provide
> callback that would fit design of the application and in cooperation
> with it will move the network config change into internal data
> structures.
>
I think a poll based design pattern is more appropriate. Getting a Netlink message from the O/S and converting it to a callback in the library, and then converting it back to a message in the DPDK application seems like crossing the river to get water.
> > Furthermore, I have now skimmed the other parts of your patch set.
> > If I got it right, it looks like there's a limit of 64 callbacks;
> > this will probably not suffice in the long run.
>
> This is interesting. What has given you that impression? I'm really
> curious since I've written it :).
It was a bitmap of wanted callbacks. I only skimmed the source code, so I'm probably wrong about this. Forget I mentioned it.
> There is a limit on a number of
> proxies (but this is the same as limit on DPDK ports - so not really a
> limitation of this lib). BTW since this is a slow path, and I don't
> need a fast access I keep proxies in a list, so that only those active
> have allocated memory.
>
> Each type of callback is just a member of rte_ifpx_callbacks struct -
> and yes, as you previously noted, this struct will grow with additional
> functionality added, but there is no real limit on it. At the moment
> callbacks are meant to be global - there is a list of callback sets
> (ifpx_callbacks) that is common for all proxies.
>
> I expect that the most common use will be just one set of callbacks for
> application. But instead of having just one global var I keep a list
> of
> sets so many can be registered. There are other options possible:
> - each type of callback can be a list
> - callbacks could be "per proxy" - meaning that each proxy port could
> have its own callbacks
>
> The first one could be beneficial if user wants many callbacks
> registered for some particular type of notification and is not
> interested in others.
> The second one can be useful if different proxies should be treated
> differently - in that case one could avoid conditionals in callback
> switching behaviour depending on the proxy used.
>
> But again this kind of uses are not what I expect as a common use case
> so I went with current design.
>
> > And on the administrative side, I assume one of you guys will
> volunteer
> > as the maintainer of this library?
>
> Yes.
>
> With regards
> Andrzej Ostruszka
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-16 9:30 ` Morten Brørup
@ 2020-01-16 10:42 ` Andrzej Ostruszka
2020-01-16 10:58 ` Morten Brørup
0 siblings, 1 reply; 21+ messages in thread
From: Andrzej Ostruszka @ 2020-01-16 10:42 UTC (permalink / raw)
To: Morten Brørup, Jerin Jacob; +Cc: Bruce Richardson, dpdk-dev
On 1/16/20 10:30 AM, Morten Brørup wrote:
[...]
>> You are thinking already about modification of the application data.
>> That is actually beyond the scope of the library.
>
> Yes, it is beyond the scope of the library; but I prefer the library to
> be designed for how typical applications are going to use it.
>
> I suggest that you supplement the library with an example DPDK application
> that is a simple IPv4 router, forwarding packets and responding to ARP
> requests - according to its configuration applied in the O/S via your proxy
> library. You could even add support for relevant ICMP packets (e.g. respond
> to ICMP Echo Request and send TTL Exceeded when appropriate).
Actually our thinking was more along these lines: such a router would
see these control packets, so it would send them (tx burst) to the proxy
port and let the system stack do its job: change the config and possibly
send a reply. The former would be picked up by listening on NETLINK (in
Linux) and the latter would just be read from the proxy port and
forwarded to the bound port. That way the DPDK application would not
have to re-implement these control protocols.
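A minimal sketch of the classification step such a router would need,
assuming untagged Ethernet frames and treating ARP and IPv4 ICMP as the
"control" traffic to hand to the proxy port. The function name and the
constants are hypothetical; a real application would likely also check
whether the packet is addressed to the router itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_ETH_HDR_LEN    14     /* dst MAC + src MAC + EtherType */
#define EX_IPV4_HDR_MIN   20     /* IPv4 header without options */
#define EX_ETHERTYPE_IPV4 0x0800
#define EX_ETHERTYPE_ARP  0x0806
#define EX_IPPROTO_ICMP   1

/* Return true if the frame should go to the proxy port (system stack)
 * instead of the fast path. Assumes no VLAN tag is present. */
static bool is_control_frame(const uint8_t *frame, size_t len)
{
	assert(frame != NULL);
	if (len < EX_ETH_HDR_LEN)
		return false;

	/* EtherType is big-endian at byte offset 12. */
	uint16_t ethertype = (uint16_t)((frame[12] << 8) | frame[13]);

	if (ethertype == EX_ETHERTYPE_ARP)
		return true;
	if (ethertype == EX_ETHERTYPE_IPV4 &&
	    len >= EX_ETH_HDR_LEN + EX_IPV4_HDR_MIN)
		/* The IPv4 protocol field is at offset 9 of the IP header. */
		return frame[EX_ETH_HDR_LEN + 9] == EX_IPPROTO_ICMP;
	return false;
}
```

Frames for which this returns true would be sent with a tx burst on the
proxy port; everything read back from the proxy port would be forwarded
to the bound port unchanged.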
> It will help you determine what is required by the library, and how
> the library best interfaces to a "typical" DPDK application.
Yes indeed, that kind of usage discovery exercise would be good.
> I think a poll based design pattern is more appropriate. Getting a Netlink
> message from the O/S and converting it to a callback in the library, and
> then converting it back to a message in the DPDK application seems like
> crossing the river to get water.
You'd still need to repack the message and that could be the job of the
callback.
At the moment we don't have much experience with the library and to me
the callback is the more generic approach, with which one can achieve
different designs. However, nothing here is carved in stone, so if we
figure out that this is too generic we will change it.
[...]
> It was a bitmap of wanted callbacks.
Aaa, right. Currently the set of available callbacks is returned as a
bitmask. This API will change if we find that we need more callbacks.
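To illustrate where a 64-entry ceiling would come from, here is a sketch
that reports the available callbacks as a bitmask, one bit per event
type. The 64-bit width, the enum and the struct are assumptions made for
this example and are not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event types; each one owns a bit in the returned mask. */
enum ev_type { EV_MTU_CHANGE, EV_LINK_CHANGE, EV_ADDR_CHANGE, EV_NUM };

/* Hypothetical callback table indexed by event type. */
struct ev_callbacks {
	void (*handler[EV_NUM])(const void *event);
};

/* Build a mask with bit i set when a handler for event type i exists. */
static uint64_t available_callbacks(const struct ev_callbacks *cbs)
{
	_Static_assert(EV_NUM <= 64, "a 64-bit mask holds at most 64 types");
	uint64_t mask = 0;

	assert(cbs != NULL);
	for (int i = 0; i < EV_NUM; i++)
		if (cbs->handler[i] != NULL)
			mask |= UINT64_C(1) << i;
	return mask;
}

/* Placeholder handler used only to populate the table in examples. */
static void noop_handler(const void *event)
{
	(void)event;
}
```

With this shape the limit sits only in the query function; the callback
struct itself can keep growing, and the query would have to change (for
example to an array of words) once more than 64 types exist.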
With regards
Andrzej Ostruszka
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-16 10:42 ` Andrzej Ostruszka
@ 2020-01-16 10:58 ` Morten Brørup
2020-01-16 12:06 ` Andrzej Ostruszka
0 siblings, 1 reply; 21+ messages in thread
From: Morten Brørup @ 2020-01-16 10:58 UTC (permalink / raw)
To: Andrzej Ostruszka, Jerin Jacob; +Cc: Bruce Richardson, dpdk-dev
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Andrzej Ostruszka
> Sent: Thursday, January 16, 2020 11:43 AM
>
> On 1/16/20 10:30 AM, Morten Brørup wrote:
> [...]
> >> You are thinking already about modification of the application data.
> >> That is actually beyond the scope of the library.
> >
> > Yes, it is beyond the scope of the library; but I prefer the library
> to
> > be designed for how typical applications are going to use it.
> >
> > I suggest that you supplement the library with an example DPDK
> application
> > that is a simple IPv4 router, forwarding packets and responding to
> ARP
> > requests - according to its configuration applied in the O/S via your
> proxy
> > library. You could even add support for relevant ICMP packets (e.g.
> respond
> > to ICMP Echo Request and send TTL Exceeded when appropriate).
>
> Actually our thinking was more along the way: such router would see
> these control packets so it will send them (tx burst) to proxy port and
> let the system stack do its job: change config and possibly send reply.
> The former would be listened on NETLINK (in Linux) and the latter would
> be just read from proxy port and forwarded to the bound port. That way
> DPDK application would not have to re-implement these control
> protocols.
>
You are right. I momentarily forgot that.
And the example application will show how to do this.
> > It will help you determine what is required by the library, and how
> > the library best interfaces to a "typical" DPDK application.
>
> Yes indeed, that kind usage discovery exercise would be good.
>
> > I think a poll based design pattern is more appropriate. Getting a
> Netlink
> > message from the O/S and converting it to a callback in the library,
> and
> > then converting it back to a message in the DPDK application seems
> like
> > crossing the river to get water.
>
> You'd still need to repack the message and that could be the job of the
> callback.
>
> At the moment we don't have much experience with the library and to me
> the callback is more generic approach with which one can achieve
> different designs. However nothing here is carved in stone so if we
> figure out that this is too generic we will change it.
>
Please re-read my reply to Jerin Jacob why I prefer a pull model instead:
https://mails.dpdk.org/archives/dev/2020-January/155386.html
Take a stab at the example application, and see which design pattern is the best fit.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-16 10:58 ` Morten Brørup
@ 2020-01-16 12:06 ` Andrzej Ostruszka
0 siblings, 0 replies; 21+ messages in thread
From: Andrzej Ostruszka @ 2020-01-16 12:06 UTC (permalink / raw)
To: Morten Brørup, Jerin Jacob; +Cc: Bruce Richardson, dpdk-dev
Morten
First of all thank you for your feedback. If anything else pops into
your mind please do not hesitate to share it.
We just had a quick internal discussion and we decided that we'll try to
come up with both options (callback and message queue).
On 1/16/20 11:58 AM, Morten Brørup wrote:
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Andrzej Ostruszka
>> Sent: Thursday, January 16, 2020 11:43 AM
[...]
>> You'd still need to repack the message and that could be the job of the
>> callback.
>>
>> At the moment we don't have much experience with the library and to me
>> the callback is more generic approach with which one can achieve
>> different designs. However nothing here is carved in stone so if we
>> figure out that this is too generic we will change it.
>>
>
> Please re-read my reply to Jerin Jacob why I prefer a pull model instead:
> https://mails.dpdk.org/archives/dev/2020-January/155386.html
Yes - I got your point the first time. The remark above was not meant to
imply that "pull mode" is not a valid way (it is perfectly valid and
probably the most often used in DPDK). I just noted that even with only
the callback-level API one can still implement it.
But it is true that this way would impose more burden on the application
writer - so instead we now plan to provide both options.
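For illustration, this is roughly the burden in question: a pull model
built on top of the callback API, where the callback only enqueues an
event and the application polls the queue from its main loop. Everything
below (the event layout, evq_push, evq_poll, on_mtu_change) is a
hypothetical sketch; it is single-threaded for brevity, and a real
application would more likely use a lock-free ring such as rte_ring
between the control and data-plane threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum cfg_ev_type { CFG_EV_MTU, CFG_EV_LINK };

/* Hypothetical event record handed from the callback to the poller. */
struct cfg_event {
	enum cfg_ev_type type;
	uint16_t port_id;
	uint32_t value; /* new MTU, or link state for CFG_EV_LINK */
};

#define EVQ_SIZE 8 /* power of two so that indices can be masked */

struct evq {
	struct cfg_event slot[EVQ_SIZE];
	unsigned int head; /* count of events ever written */
	unsigned int tail; /* count of events ever read */
};

/* Append one event; returns false when the queue is full. */
static bool evq_push(struct evq *q, const struct cfg_event *ev)
{
	assert(q != NULL && ev != NULL);
	if (q->head - q->tail == EVQ_SIZE)
		return false;
	q->slot[q->head++ & (EVQ_SIZE - 1)] = *ev;
	return true;
}

/* Application side: fetch the oldest event; false when none is pending. */
static bool evq_poll(struct evq *q, struct cfg_event *ev)
{
	assert(q != NULL && ev != NULL);
	if (q->head == q->tail)
		return false;
	*ev = q->slot[q->tail++ & (EVQ_SIZE - 1)];
	return true;
}

/* Callback adapter: repack the notification as an event and enqueue it. */
static void on_mtu_change(uint16_t port_id, uint16_t mtu, void *ctx)
{
	struct cfg_event ev = {
		.type = CFG_EV_MTU, .port_id = port_id, .value = mtu,
	};

	(void)evq_push(ctx, &ev); /* event is dropped if the queue is full */
}
```

Providing the queue inside the library would spare every application
from writing this adapter, while keeping the plain callbacks available
for designs that do not need it.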
> Take a stab at the example application, and see which design pattern is the best fit.
We will. It is definitely a good idea to work things out in "battle".
With regards
Andrzej Ostruszka
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [RFC PATCH 0/3] introduce IF proxy library
2020-01-15 12:28 ` Morten Brørup
2020-01-15 12:57 ` Jerin Jacob
@ 2020-01-15 14:09 ` Bruce Richardson
1 sibling, 0 replies; 21+ messages in thread
From: Bruce Richardson @ 2020-01-15 14:09 UTC (permalink / raw)
To: Morten Brørup; +Cc: Andrzej Ostruszka, dev
On Wed, Jan 15, 2020 at 01:28:46PM +0100, Morten Brørup wrote:
> > -----Original Message----- From: dev [mailto:dev-bounces@dpdk.org] On
> > Behalf Of Bruce Richardson Sent: Wednesday, January 15, 2020 11:16 AM
> >
> > On Tue, Jan 14, 2020 at 06:38:37PM +0100, Andrzej Ostruszka wrote:
> > > On 1/14/20 4:16 PM, Morten Brørup wrote:
> > > > Andrzej,
> > >
> > > Hello Morten
> > >
> > > > Basically you are adding a very small subset of the Linux IP stack
> > > > to interface with DPDK applications via callbacks.
> > >
> > > Yes, at the moment this is limited - we'd prefer first to solicit
> > > some input from community.
> > >
> > > > The library also seems to support interfacing to the route table,
> > > > so it is not "interface proxy" but "IP stack proxy".
> > >
> > > True, to some extent - for example you can bring the interface up and
> > > down which has nothing to do with IP stack. As for the name of the
> > > library - that is actually part where we are completely open. The
> > proxy
> > > represents port (thus the name) but that is not all, so any better
> > name
> > > proposals are welcome.
> > >
> > > > You already mention ARP table as future work. How about namespaces,
> > > > ip tables, and other advanced features... I foresee the Devil in
> > the
> > > > details for any real use case.
> > >
> > > Right now I don't know what other things are needed. This idea is
> > still
> > > early. However imagine you'd like to use DPDK to speed up packet
> > > processing of IP stack - would you like to implement all the
> > protocols
> > > that are needed? Or just let the system handle the control path and
> > > handle the data path and sniff the control params from the system.
> > >
> > Like Morten, I'd be a bit concerned at the possible scope of the work
> > if we start pulling in functionality from the IP stack like ARP etc. To
> > avoid this becoming a massive effort, how useful would it be if we just
> > limited the scope to physical NIC setup only, and did not do anything
> > above the l2 layer?
>
> Think about it... Regardless of scope, this is clearly a control plane
> API, not a data plane API.
>
> It provides a proxy API for the O/S control plane (NETLINK in the case of
> Linux), so the DPDK application can use the user interface that the O/S
> already provides (e.g. "ip link set dev tap1 mtu 1600" etc.) for its
> control plane, instead of implementing its own CLI (or GUI or whatever).
>
> In order to provide significant value, it will have to grow massively, so
> I can use it as imagined: To make a Linux firewall where the DPDK
> application handles the data plane, and the normal Linux commands are
> used for setting up the firewall, incl. firewall rules, port forwarding,
> NAPT, etc.. The Devil is in the details here!
>
> Although I like the concept and idea behind it, I don't think a control
> plane proxy API belongs in DPDK. But it could possibly be hosted by the
> DPDK project, if approved as such.
>
Personally, I wouldn't worry too much about control plane vs user plane
for this; if it's of significant benefit to DPDK users then it should be
considered.
/Bruce
^ permalink raw reply [flat|nested] 21+ messages in thread