* [dpdk-dev] [PATCH 0/4] Introduce IF proxy library
@ 2020-03-06 16:41 Andrzej Ostruszka
  2020-03-06 16:41 ` [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library Andrzej Ostruszka
                   ` (8 more replies)
  0 siblings, 9 replies; 64+ messages in thread
From: Andrzej Ostruszka @ 2020-03-06 16:41 UTC
  To: dev

What is this useful for
=======================

Usually, when an ethernet port is assigned to DPDK it vanishes from the
system and the user loses the ability to control it via normal
configuration utilities (e.g. those from the iproute2 package).  Moreover,
by default a DPDK application is not aware of the network configuration
of the system.

To address both of these issues the application needs to:
- add some command line interface (or other mechanism) allowing for
  control of the port and its configuration
- query the status of network configuration and monitor its changes

The purpose of this library is to help with both of these tasks (as long
as they remain in the domain of configuration available to the system).
In other words, if a DPDK application has special needs that cannot be
addressed by the normal system configuration utilities, then they need
to be solved by the application itself.

The connection between DPDK and the system is based on the existence of
ports that are visible to both DPDK and the system (like Tap, KNI and
possibly some other drivers).  These ports serve as interface proxies.

Let's visualize the action of the library with the following example:

              Linux             |              DPDK
==============================================================
                                |
                                |   +-------+       +-------+
                                |   | Port1 |       | Port2 |
"ip link set dev tap1 mtu 1600" |   +-------+       +-------+
                          |     |       ^              ^ ^
                          |  +------+   | mtu_change   | |
                          `->| Tap1 |---' callback     | |
                             +------+                  | |
"ip addr add 198.51.100.14 \    |                      | |
                  dev tap2"     |                      | |
                          |  +------+                  | |
                          +->| Tap2 |------------------' |
                          |  +------+ addr_add callback  |
"ip route add 198.0.2.0/24 \    |                        |
                  dev tap2"     |   route_add callback   |
                          |     |                        |
                          `------------------------------'

So we have two ports, Port1 and Port2, that are not visible to the system.
We create two proxy interfaces (here based on the Tap driver) and bind the
ports to their proxies.  When the user issues a command changing the MTU of
the Tap1 interface, the library notes this and calls the "mtu_change"
callback for Port1.  Similarly, when the user adds an IPv4 address to the
Tap2 interface the "addr_add" callback is called for Port2, and the same
happens for the configuration of a routing rule pointing to Tap2.

Apart from callbacks this library can notify about changes by adding
events to notification queues.  See below for more information about that
and a complete list of available callbacks.

Please note that nothing has been mentioned about forwarding of the
packets between the system and DPDK.  Since the proxies are normal DPDK
ports you can receive from and send to them via the usual RX/TX burst API.
However, since the library is not aware of the structure of packet
processing used by the application, it cannot automatically forward the
packets - it is the responsibility of the application to include proxy
ports in its packet processing engine.

As mentioned above the intention of the library is to:
- provide information about network configuration that would allow the
  application to decide what to do with the packets received on DPDK
  ports,
- allow for control of the ports via standard configuration utilities

Although the library only helps you to identify the proxy for a given
port (and vice versa) and calls the appropriate callbacks, it does open
some interesting possibilities.  For example you can use the proxy ports
to forward packets for protocols that you do not wish to handle in the
DPDK application to the system protocol stack and just listen to the
configuration changes - that way you can "offload" handling of those
protocols to the system.

How to use it
=============

Usage of this library is rather simple.  You have to:
1. Create a proxy (if you don't have a port suitable for being a proxy or
   you have one but do not wish to use it as a proxy).
2. Bind the port to the proxy.
3. Register callbacks and/or event queues.
4. Start listening to the network configuration.

The only mandatory requirement for a DPDK port to be able to act as a
proxy is that it is visible in the system - this is checked during port
to proxy binding by calling rte_eth_dev_info_get() on the proxy port and
inspecting the 'if_index' field (it has to be non-zero).
One can create such a port in the application by calling:

  proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT);

Upon success this returns the id of the DPDK proxy port created
(RTE_MAX_ETHPORTS on failure).  The argument selects the type of proxy
port to create (currently Tap/KNI only).  This function is actually just
a wrapper that builds a valid 'devarg' string for the chosen type of
proxy and passes it to:

  uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg);

If you have another driver capable of acting as a proxy you can call
rte_ifpx_proxy_create_by_devarg() directly, passing an appropriate
argument.
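
As a minimal sketch of the wrapper described above: the helper below models
how a 'devarg' string could be assembled from a driver name and a running
index, loosely following what rte_ifpx_proxy_create() does in patch 1/4.
The names make_proxy_devarg and enum proxy_type are hypothetical stand-ins
for the library's own, and no DPDK call is made here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the proxy type selector. */
enum proxy_type { PROXY_DEFAULT, PROXY_TAP, PROXY_KNI };

/* Write "<driver><existing>" (e.g. "net_tap0") into buf.
 * 'existing' is the number of devices of that driver already present.
 * Returns 0 on success, -1 on unknown type or when buf is too small. */
int
make_proxy_devarg(enum proxy_type type, int existing, char *buf, size_t len)
{
	const char *driver;
	int n;

	assert(buf != NULL);
	switch (type) {
	case PROXY_DEFAULT:	/* the default proxy type is Tap */
	case PROXY_TAP:
		driver = "net_tap";
		break;
	case PROXY_KNI:
		driver = "net_kni";
		break;
	default:
		return -1;
	}
	n = snprintf(buf, len, "%s%d", driver, existing);
	/* snprintf reports the untruncated length, so detect truncation. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting string is what would then be handed to the by_devarg variant.
Numbering by the count of existing devices keeps names unique as long as
devices are not removed in between, which is an assumption of this sketch.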

Once you have the ids of both port and proxy you can bind the two via:

  rte_ifpx_port_bind(port_id, proxy_id);

This creates a logical binding - as mentioned above there is no automatic
packet forwarding.  With this binding, whenever the user changes the state
of the proxy interface in the system (link up/down, change of MAC/MTU,
add/remove of IPv4/IPv6 addresses) you get the appropriate notification
for the bound port.

So far we've mentioned several times that the library calls callbacks.
They are grouped in 'struct rte_ifpx_callbacks' and the user provides them
to the library via:

  rte_ifpx_callbacks_register(&cbs);

It is worth mentioning that the context (lcore/thread) in which these
callbacks are called is implementation defined.  It might differ between
platforms, so the application needs to assume that some kind of inter
lcore/thread synchronization/communication is required.

Apart from notification via callbacks this library also supports
notifying about the changes by adding events to the configured
notification queues.  The queues are registered via:

  int rte_ifpx_queue_add(struct rte_ring *r);

and the actual logic used is: if there is a callback registered then it is
called; if it returns non-zero then the event is considered completed,
otherwise the event is added to each configured notification queue.
That way the application can use the callback to update data structures
that are safe to modify from a single writer, or to do common
preprocessing steps (if any are needed), while data that is replicated
can be updated during handling of the queued events.

Once we have bindings in place and notification configured, the only
essential part that remains is to get the current network configuration
and start listening to its changes.  This is accomplished via a call to:

  rte_ifpx_listen();

And basically this is all one needs to understand how to use this
library.
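
The callback-then-queue rule described above can be sketched with a small
self-contained model.  This is not the library code: struct event,
struct queue, struct notifier and notify() are hypothetical names, and a
fixed-size array stands in for the rte_ring based queues.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUES  4
#define QUEUE_DEPTH 8

struct event { int type; int port_id; };

/* A callback returns non-zero when it has fully handled the event. */
typedef int (*event_cb)(const struct event *ev);

struct queue { struct event items[QUEUE_DEPTH]; size_t count; };

struct notifier {
	event_cb cb;				/* may be NULL */
	struct queue *queues[MAX_QUEUES];
	size_t num_queues;
};

/* Deliver one event: the callback goes first and, unless it reports the
 * event as completed, a copy is appended to every registered queue.
 * Returns the number of queues that received the event. */
size_t
notify(struct notifier *n, const struct event *ev)
{
	size_t i, queued = 0;

	assert(n != NULL && ev != NULL);
	if (n->cb != NULL && n->cb(ev) != 0)
		return 0;	/* consumed by the callback */
	for (i = 0; i < n->num_queues; ++i) {
		struct queue *q = n->queues[i];

		if (q->count < QUEUE_DEPTH) {	/* full queues drop the event */
			q->items[q->count++] = *ev;
			++queued;
		}
	}
	return queued;
}

/* Two example callbacks: one consumes the event, one passes it on. */
int consume(const struct event *ev) { (void)ev; return 1; }
int pass_on(const struct event *ev) { (void)ev; return 0; }
```

With no callback, or with one returning zero, every queue gets its own
copy of the event; a callback returning non-zero stops the delivery.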

Other less essential parts include:
- ability to query what events are available for a given platform
- getting the mapping between proxy and port
- unbinding the ports from a proxy
- destroying a proxy port
- closing the listening service
- getting basic information about a proxy

Currently available features and implementation
===============================================

The library's API is system independent but it obviously needs some
system dependent parts.  We provide an example Linux implementation
(based on netlink sockets).  A very similar implementation is possible
for FreeBSD (using PF_ROUTE sockets).  A Windows implementation would
have to differ considerably (the IP Helper library would probably be of
some help).

Here is the list of currently implemented callbacks:

struct rte_ifpx_callbacks {
  int (*mac_change)(const struct rte_ifpx_mac_change *event);
  int (*mtu_change)(const struct rte_ifpx_mtu_change *event);
  int (*link_change)(const struct rte_ifpx_link_change *event);
  int (*addr_add)(const struct rte_ifpx_addr_change *event);
  int (*addr_del)(const struct rte_ifpx_addr_change *event);
  int (*addr6_add)(const struct rte_ifpx_addr6_change *event);
  int (*addr6_del)(const struct rte_ifpx_addr6_change *event);
  int (*route_add)(const struct rte_ifpx_route_change *event);
  int (*route_del)(const struct rte_ifpx_route_change *event);
  int (*route6_add)(const struct rte_ifpx_route6_change *event);
  int (*route6_del)(const struct rte_ifpx_route6_change *event);
  int (*neigh_add)(const struct rte_ifpx_neigh_change *event);
  int (*neigh_del)(const struct rte_ifpx_neigh_change *event);
  int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event);
  int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event);
  int (*cfg_done)(void);
};

They are all rather self-descriptive with the exception of the last one.
When the user calls rte_ifpx_listen() the library first queries the
system for its current configuration.
That might require several request/reply exchanges between DPDK and the
system; once this is finished the cfg_done callback is called to let the
application know that all info has been gathered.

It is also worth mentioning that while the typical case would be a 1-to-1
mapping between port and proxy, a 1-to-many mapping is also supported.
In that case port related callbacks will be called for each port bound to
the given proxy interface, and it is the application's responsibility to
define the semantics of such a mapping (e.g. all changes apply to all
ports, or link changes apply to all but others are accepted in "round
robin" fashion, or ...).

As mentioned above the Linux implementation is based on a netlink socket.
This socket is registered as a file descriptor in EAL interrupts
(similarly to how EAL alarms are implemented).

What has changed since the RFC
==============================

- Platform dependent parts have been separated into an ifpx_platform
  structure with callbacks for initialization, getting information about
  the interface, listening to the changes and closing of the library.
  That should allow easier reimplementation.

- The notification scheme has been changed - instead of having just
  callbacks, event queueing is now also available (or a mix of the two).

- Events are now filtered so that only those related to the proxy ports
  are reported - previously all network configuration changes were
  reported.  A DPDK application does not need to know the whole
  configuration, only the portion related to the proxy ports.  If a
  packet arrives that does not match the rules then it can be forwarded
  via the proxy to the system, which decides what to do with it.  If that
  is not desired and such packets should be dropped, then a null port can
  be created with a proxy and e.g. a default route installed on it.

- Removed the previous example, which was just printing notifications.
  Instead added a simplified version of l3fwd (with vectorization and
  other performance improvements stripped) that should serve as an
  example of using this library in real applications.
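
To illustrate the 1-to-many mapping mentioned above, here is a small model
of a port-to-proxy binding table, loosely following the ifpx_ports array in
patch 1/4 (a proxy maps to itself, an unbound port maps to a sentinel).
MAX_PORTS, bind_port() and ports_of_proxy() are hypothetical names, and the
extra validity checks are assumptions of this sketch, not library behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 8		/* stand-in for RTE_MAX_ETHPORTS */
#define UNBOUND   MAX_PORTS	/* sentinel: port has no proxy */

static uint16_t bindings[MAX_PORTS];

void
bindings_init(void)
{
	unsigned int i;

	for (i = 0; i < MAX_PORTS; ++i)
		bindings[i] = UNBOUND;
}

/* Bind 'port' to 'proxy'.  Several ports may share one proxy. */
int
bind_port(uint16_t port, uint16_t proxy)
{
	if (port >= MAX_PORTS || proxy >= MAX_PORTS || port == proxy)
		return -1;
	if (bindings[port] == port)	/* port is itself a proxy */
		return -1;
	if (bindings[proxy] != UNBOUND && bindings[proxy] != proxy)
		return -1;		/* 'proxy' is bound as a plain port */
	bindings[proxy] = proxy;	/* a proxy maps to itself */
	bindings[port] = proxy;
	return 0;
}

/* Store up to 'num' ports bound to 'proxy' in 'out'; return how many
 * ports are bound in total (the proxy's own entry is skipped). */
unsigned int
ports_of_proxy(uint16_t proxy, uint16_t *out, unsigned int num)
{
	unsigned int p, cnt = 0;

	assert(out != NULL || num == 0);
	for (p = 0; p < MAX_PORTS; ++p) {
		if (bindings[p] != proxy || p == proxy)
			continue;
		if (cnt < num)
			out[cnt] = (uint16_t)p;
		++cnt;
	}
	return cnt;
}
```

An event for a proxy would then be delivered once per entry returned by
ports_of_proxy(), which is where the application-defined semantics
(apply to all, round robin, ...) come in.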

With regards
Andrzej Ostruszka

Andrzej Ostruszka (4):
  lib: introduce IF Proxy library
  if_proxy: add library documentation
  if_proxy: add simple functionality test
  if_proxy: add example application

 MAINTAINERS                                  |    6 +
 app/test/Makefile                            |    5 +
 app/test/meson.build                         |    4 +
 app/test/test_if_proxy.c                     |  706 +++++++++++
 config/common_base                           |    5 +
 config/common_linux                          |    1 +
 doc/guides/prog_guide/if_proxy_lib.rst       |  142 +++
 doc/guides/prog_guide/index.rst              |    1 +
 examples/Makefile                            |    1 +
 examples/l3fwd-ifpx/Makefile                 |   60 +
 examples/l3fwd-ifpx/l3fwd.c                  | 1123 +++++++++++++++++
 examples/l3fwd-ifpx/l3fwd.h                  |   98 ++
 examples/l3fwd-ifpx/main.c                   |  729 +++++++++++
 examples/l3fwd-ifpx/meson.build              |   11 +
 examples/meson.build                         |    2 +-
 lib/Makefile                                 |    2 +
 .../common/include/rte_eal_interrupts.h      |    2 +
 lib/librte_eal/linux/eal/eal_interrupts.c    |   14 +-
 lib/librte_if_proxy/Makefile                 |   29 +
 lib/librte_if_proxy/if_proxy_common.c        |  494 ++++++++
 lib/librte_if_proxy/if_proxy_priv.h          |   97 ++
 lib/librte_if_proxy/linux/Makefile           |    4 +
 lib/librte_if_proxy/linux/if_proxy.c         |  552 ++++++++
 lib/librte_if_proxy/meson.build              |   19 +
 lib/librte_if_proxy/rte_if_proxy.h           |  561 ++++++++
 lib/librte_if_proxy/rte_if_proxy_version.map |   19 +
 lib/meson.build                              |    2 +-
 27 files changed, 4683 insertions(+), 6 deletions(-)
 create mode 100644 app/test/test_if_proxy.c
 create mode 100644 doc/guides/prog_guide/if_proxy_lib.rst
 create mode 100644 examples/l3fwd-ifpx/Makefile
 create mode 100644 examples/l3fwd-ifpx/l3fwd.c
 create mode 100644 examples/l3fwd-ifpx/l3fwd.h
 create mode 100644 examples/l3fwd-ifpx/main.c
 create mode 100644 examples/l3fwd-ifpx/meson.build
 create mode 100644 lib/librte_if_proxy/Makefile
 create mode 100644 lib/librte_if_proxy/if_proxy_common.c
 create mode 100644 lib/librte_if_proxy/if_proxy_priv.h
 create mode 100644 lib/librte_if_proxy/linux/Makefile
 create mode 100644 lib/librte_if_proxy/linux/if_proxy.c
 create mode 100644 lib/librte_if_proxy/meson.build
 create mode 100644 lib/librte_if_proxy/rte_if_proxy.h
 create mode 100644 lib/librte_if_proxy/rte_if_proxy_version.map

--
2.17.1

* [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library
  2020-03-06 16:41 [dpdk-dev] [PATCH 0/4] Introduce IF proxy library Andrzej Ostruszka
@ 2020-03-06 16:41 ` Andrzej Ostruszka
  2020-03-31 12:36   ` Harman Kalra
  2020-04-01  5:29   ` Varghese, Vipin
  2020-03-06 16:41 ` [dpdk-dev] [PATCH 2/4] if_proxy: add library documentation Andrzej Ostruszka
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 64+ messages in thread
From: Andrzej Ostruszka @ 2020-03-06 16:41 UTC
  To: dev, Thomas Monjalon

This library allows designating ports visible to the system (such as
Tun/Tap or KNI) as port representors serving as proxies for other DPDK
ports.  When such a proxy is configured this library initially queries
the network configuration from the system and later monitors its changes.

The information gathered is passed to the application either via a set of
user registered callbacks or as an event added to the configured
notification queue (or a combination of these two mechanisms).  This way
the user can use normal network utilities (like those from the iproute2
suite) to configure DPDK ports.

Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 3 + config/common_base | 5 + config/common_linux | 1 + lib/Makefile | 2 + .../common/include/rte_eal_interrupts.h | 2 + lib/librte_eal/linux/eal/eal_interrupts.c | 14 +- lib/librte_if_proxy/Makefile | 29 + lib/librte_if_proxy/if_proxy_common.c | 494 +++++++++++++++ lib/librte_if_proxy/if_proxy_priv.h | 97 +++ lib/librte_if_proxy/linux/Makefile | 4 + lib/librte_if_proxy/linux/if_proxy.c | 552 +++++++++++++++++ lib/librte_if_proxy/meson.build | 19 + lib/librte_if_proxy/rte_if_proxy.h | 561 ++++++++++++++++++ lib/librte_if_proxy/rte_if_proxy_version.map | 19 + lib/meson.build | 2 +- 15 files changed, 1799 insertions(+), 5 deletions(-) create mode 100644 lib/librte_if_proxy/Makefile create mode 100644 lib/librte_if_proxy/if_proxy_common.c create mode 100644 lib/librte_if_proxy/if_proxy_priv.h create mode 100644 lib/librte_if_proxy/linux/Makefile create mode 100644 lib/librte_if_proxy/linux/if_proxy.c create mode 100644 lib/librte_if_proxy/meson.build create mode 100644 lib/librte_if_proxy/rte_if_proxy.h create mode 100644 lib/librte_if_proxy/rte_if_proxy_version.map diff --git a/MAINTAINERS b/MAINTAINERS index f4e0ed8e0..aec7326ca 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1469,6 +1469,9 @@ F: examples/bpf/ F: app/test/test_bpf.c F: doc/guides/prog_guide/bpf_lib.rst +IF Proxy - EXPERIMENTAL +M: Andrzej Ostruszka <aostruszka@marvell.com> +F: lib/librte_if_proxy/ Test Applications ----------------- diff --git a/config/common_base b/config/common_base index 7ca2f28b1..dcc0a0650 100644 --- a/config/common_base +++ b/config/common_base @@ -1075,6 +1075,11 @@ CONFIG_RTE_LIBRTE_BPF_ELF=n # CONFIG_RTE_LIBRTE_IPSEC=y +# +# Compile librte_if_proxy +# +CONFIG_RTE_LIBRTE_IF_PROXY=n + # # Compile the test application # diff --git a/config/common_linux b/config/common_linux index 816810671..1244eb0ae 100644 --- a/config/common_linux +++ b/config/common_linux @@ -16,6 +16,7 @@ 
CONFIG_RTE_LIBRTE_VHOST_NUMA=y CONFIG_RTE_LIBRTE_VHOST_POSTCOPY=n CONFIG_RTE_LIBRTE_PMD_VHOST=y CONFIG_RTE_LIBRTE_IFC_PMD=y +CONFIG_RTE_LIBRTE_IF_PROXY=y CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y CONFIG_RTE_LIBRTE_PMD_MEMIF=y CONFIG_RTE_LIBRTE_PMD_SOFTNIC=y diff --git a/lib/Makefile b/lib/Makefile index 46b91ae1a..6a20806f1 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -118,6 +118,8 @@ DIRS-$(CONFIG_RTE_LIBRTE_TELEMETRY) += librte_telemetry DEPDIRS-librte_telemetry := librte_eal librte_metrics librte_ethdev DIRS-$(CONFIG_RTE_LIBRTE_RCU) += librte_rcu DEPDIRS-librte_rcu := librte_eal +DIRS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += librte_if_proxy +DEPDIRS-librte_if_proxy := librte_eal librte_ethdev ifeq ($(CONFIG_RTE_EXEC_ENV_LINUX),y) DIRS-$(CONFIG_RTE_LIBRTE_KNI) += librte_kni diff --git a/lib/librte_eal/common/include/rte_eal_interrupts.h b/lib/librte_eal/common/include/rte_eal_interrupts.h index 773a34a42..296a3853d 100644 --- a/lib/librte_eal/common/include/rte_eal_interrupts.h +++ b/lib/librte_eal/common/include/rte_eal_interrupts.h @@ -36,6 +36,8 @@ enum rte_intr_handle_type { RTE_INTR_HANDLE_VDEV, /**< virtual device */ RTE_INTR_HANDLE_DEV_EVENT, /**< device event handle */ RTE_INTR_HANDLE_VFIO_REQ, /**< VFIO request handle */ + RTE_INTR_HANDLE_NETLINK, /**< netlink notification handle */ + RTE_INTR_HANDLE_MAX /**< count of elements */ }; diff --git a/lib/librte_eal/linux/eal/eal_interrupts.c b/lib/librte_eal/linux/eal/eal_interrupts.c index cb8e10709..16236a8c4 100644 --- a/lib/librte_eal/linux/eal/eal_interrupts.c +++ b/lib/librte_eal/linux/eal/eal_interrupts.c @@ -680,6 +680,9 @@ rte_intr_enable(const struct rte_intr_handle *intr_handle) break; /* not used at this moment */ case RTE_INTR_HANDLE_ALARM: +#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: +#endif return -1; #ifdef VFIO_PRESENT case RTE_INTR_HANDLE_VFIO_MSIX: @@ -796,6 +799,9 @@ rte_intr_disable(const struct rte_intr_handle *intr_handle) break; /* not used at this moment */ case RTE_INTR_HANDLE_ALARM: 
+#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: +#endif return -1; #ifdef VFIO_PRESENT case RTE_INTR_HANDLE_VFIO_MSIX: @@ -889,12 +895,12 @@ eal_intr_process_interrupts(struct epoll_event *events, int nfds) break; #endif #endif - case RTE_INTR_HANDLE_VDEV: case RTE_INTR_HANDLE_EXT: - bytes_read = 0; - call = true; - break; + case RTE_INTR_HANDLE_VDEV: case RTE_INTR_HANDLE_DEV_EVENT: +#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: +#endif bytes_read = 0; call = true; break; diff --git a/lib/librte_if_proxy/Makefile b/lib/librte_if_proxy/Makefile new file mode 100644 index 000000000..43cb702a2 --- /dev/null +++ b/lib/librte_if_proxy/Makefile @@ -0,0 +1,29 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(C) 2020 Marvell International Ltd. + +include $(RTE_SDK)/mk/rte.vars.mk + +# library name +LIB = librte_if_proxy.a + +CFLAGS += -DALLOW_EXPERIMENTAL_API +CFLAGS += -O3 +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) +LDLIBS += -lrte_eal -lrte_ethdev + +EXPORT_MAP := rte_if_proxy_version.map + +LIBABIVER := 1 + +# all source are stored in SRCS-y +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) := if_proxy_common.c + +SYSDIR := $(patsubst "%app",%,$(CONFIG_RTE_EXEC_ENV)) +include $(SRCDIR)/$(SYSDIR)/Makefile + +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += $(addprefix $(SYSDIR)/,$(SRCS)) + +# install this header file +SYMLINK-$(CONFIG_RTE_LIBRTE_IF_PROXY)-include := rte_if_proxy.h + +include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_if_proxy/if_proxy_common.c b/lib/librte_if_proxy/if_proxy_common.c new file mode 100644 index 000000000..230727d0c --- /dev/null +++ b/lib/librte_if_proxy/if_proxy_common.c @@ -0,0 +1,494 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ + +#include <if_proxy_priv.h> +#include <rte_string_fns.h> + + +/* Definitions of data mentioned in if_proxy_priv.h and local ones. 
*/
+int ifpx_log_type;
+
+uint16_t ifpx_ports[RTE_MAX_ETHPORTS];
+
+rte_spinlock_t ifpx_lock = RTE_SPINLOCK_INITIALIZER;
+
+struct ifpx_proxies_head ifpx_proxies = TAILQ_HEAD_INITIALIZER(ifpx_proxies);
+
+struct ifpx_queue_node {
+	TAILQ_ENTRY(ifpx_queue_node) elem;
+	uint16_t state;
+	struct rte_ring *r;
+};
+static
+TAILQ_HEAD(ifpx_queues_head, ifpx_queue_node) ifpx_queues =
+		TAILQ_HEAD_INITIALIZER(ifpx_queues);
+
+/* All function pointers have the same size - so use this one to typecast
+ * different callbacks in rte_ifpx_callbacks and test their presence in a
+ * generic way.
+ */
+union cb_ptr_t {
+	int (*f_ptr)(void*); /* type for normal event notification */
+	int (*cfg_done)(void); /* lib notification for finished config */
+};
+union {
+	struct rte_ifpx_callbacks cbs;
+	union cb_ptr_t funcs[RTE_IFPX_NUM_EVENTS];
+} ifpx_callbacks;
+
+uint64_t rte_ifpx_events_available(void)
+{
+	/* All events are supported on Linux. */
+	return (1ULL << RTE_IFPX_NUM_EVENTS) - 1;
+}
+
+uint16_t rte_ifpx_proxy_create(enum rte_ifpx_proxy_type type)
+{
+	char devargs[16] = { '\0' };
+	int dev_cnt = 0, nlen;
+	uint16_t port_id;
+
+	switch (type) {
+	case RTE_IFPX_DEFAULT:
+	case RTE_IFPX_TAP:
+		nlen = strlcpy(devargs, "net_tap", sizeof(devargs));
+		break;
+	case RTE_IFPX_KNI:
+		nlen = strlcpy(devargs, "net_kni", sizeof(devargs));
+		break;
+	default:
+		IFPX_LOG(ERR, "Unknown proxy type: %d", type);
+		return RTE_MAX_ETHPORTS;
+	}
+
+	RTE_ETH_FOREACH_DEV(port_id) {
+		if (strcmp(rte_eth_devices[port_id].device->driver->name,
+			   devargs) == 0)
+			++dev_cnt;
+	}
+	snprintf(devargs+nlen, sizeof(devargs)-nlen, "%d", dev_cnt);
+
+	return rte_ifpx_proxy_create_by_devarg(devargs);
+}
+
+uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg)
+{
+	uint16_t port_id = RTE_MAX_ETHPORTS;
+	struct rte_dev_iterator iter;
+
+	if (rte_dev_probe(devarg) < 0) {
+		IFPX_LOG(ERR, "Failed to create proxy port %s", devarg);
+		return RTE_MAX_ETHPORTS;
+	}
+
+	if (rte_eth_iterator_init(&iter, devarg) == 0) {
+		port_id = rte_eth_iterator_next(&iter);
+		if (port_id != RTE_MAX_ETHPORTS)
+			rte_eth_iterator_cleanup(&iter);
+	}
+
+	return port_id;
+}
+
+int ifpx_proxy_destroy(struct ifpx_proxy_node *px)
+{
+	unsigned int i;
+	uint16_t proxy_id = px->proxy_id;
+
+	TAILQ_REMOVE(&ifpx_proxies, px, elem);
+	free(px);
+
+	/* Clear any bindings for this proxy. */
+	for (i = 0; i < RTE_DIM(ifpx_ports); ++i) {
+		if (ifpx_ports[i] == proxy_id) {
+			if (i == proxy_id) /* this entry is for proxy itself */
+				ifpx_ports[i] = RTE_MAX_ETHPORTS;
+			else
+				rte_ifpx_port_unbind(i);
+		}
+	}
+
+	return rte_dev_remove(rte_eth_devices[proxy_id].device);
+}
+
+int rte_ifpx_proxy_destroy(uint16_t proxy_id)
+{
+	struct ifpx_proxy_node *px;
+	int ec = 0;
+
+	rte_spinlock_lock(&ifpx_lock);
+	TAILQ_FOREACH(px, &ifpx_proxies, elem) {
+		if (px->proxy_id == proxy_id)
+			break;
+	}
+	if (!px) {
+		ec = -EINVAL;
+		goto exit;
+	}
+	if (px->state & IN_USE)
+		px->state |= DEL_PENDING;
+	else
+		ec = ifpx_proxy_destroy(px);
+exit:
+	rte_spinlock_unlock(&ifpx_lock);
+	return ec;
+}
+
+int rte_ifpx_queue_add(struct rte_ring *r)
+{
+	struct ifpx_queue_node *node;
+	int ec = 0;
+
+	if (!r)
+		return -EINVAL;
+
+	rte_spinlock_lock(&ifpx_lock);
+	TAILQ_FOREACH(node, &ifpx_queues, elem) {
+		if (node->r == r) {
+			ec = -EEXIST;
+			goto exit;
+		}
+	}
+
+	node = malloc(sizeof(*node));
+	if (!node) {
+		ec = -ENOMEM;
+		goto exit;
+	}
+
+	node->r = r;
+	TAILQ_INSERT_TAIL(&ifpx_queues, node, elem);
+exit:
+	rte_spinlock_unlock(&ifpx_lock);
+
+	return ec;
+}
+
+int rte_ifpx_queue_remove(struct rte_ring *r)
+{
+	struct ifpx_queue_node *node, *next;
+	int ec = -EINVAL;
+
+	if (!r)
+		return ec;
+
+	rte_spinlock_lock(&ifpx_lock);
+	for (node = TAILQ_FIRST(&ifpx_queues); node; node = next) {
+		next = TAILQ_NEXT(node, elem);
+		if (node->r != r)
+			continue;
+		TAILQ_REMOVE(&ifpx_queues, node, elem);
+		free(node);
+		ec = 0;
+		break;
+	}
+	rte_spinlock_unlock(&ifpx_lock);
+
+	return ec;
+}
+
+int rte_ifpx_port_bind(uint16_t port_id,
uint16_t proxy_id) +{ + struct rte_eth_dev_info proxy_eth_info; + struct ifpx_proxy_node *px; + int ec; + + if (port_id >= RTE_MAX_ETHPORTS || proxy_id >= RTE_MAX_ETHPORTS || + /* port is a proxy */ + ifpx_ports[port_id] == port_id) { + IFPX_LOG(ERR, "Invalid port_id: %d", port_id); + return -EINVAL; + } + + /* Do automatic rebinding but issue a warning since this is not + * considered to be a valid behaviour. + */ + if (ifpx_ports[port_id] != RTE_MAX_ETHPORTS) { + IFPX_LOG(WARNING, "Port already bound: %d -> %d", port_id, + ifpx_ports[port_id]); + } + + /* Search for existing proxy - if not found add one to the list. */ + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == proxy_id) + break; + } + if (!px) { + ec = rte_eth_dev_info_get(proxy_id, &proxy_eth_info); + if (ec < 0 || proxy_eth_info.if_index == 0) { + IFPX_LOG(ERR, "Invalid proxy: %d", proxy_id); + rte_spinlock_unlock(&ifpx_lock); + return ec < 0 ? ec : -EINVAL; + } + px = malloc(sizeof(*px)); + if (!px) { + rte_spinlock_unlock(&ifpx_lock); + return -ENOMEM; + } + px->proxy_id = proxy_id; + px->info.if_index = proxy_eth_info.if_index; + rte_eth_dev_get_mtu(proxy_id, &px->info.mtu); + rte_eth_macaddr_get(proxy_id, &px->info.mac); + memset(px->info.if_name, 0, sizeof(px->info.if_name)); + TAILQ_INSERT_TAIL(&ifpx_proxies, px, elem); + ifpx_ports[proxy_id] = proxy_id; + } + rte_spinlock_unlock(&ifpx_lock); + ifpx_ports[port_id] = proxy_id; + + /* Add proxy MAC to the port - since port will often just forward + * packets from the proxy/system they will be sent with proxy MAC as + * src. In order to pass communication in other direction we should be + * accepting packets with proxy MAC as dst. 
+ */ + rte_eth_dev_mac_addr_add(port_id, &px->info.mac, 0); + + if (ifpx_platform.get_info) + ifpx_platform.get_info(px->info.if_index); + + return 0; +} + +int rte_ifpx_port_unbind(uint16_t port_id) +{ + if (port_id >= RTE_MAX_ETHPORTS || + ifpx_ports[port_id] == RTE_MAX_ETHPORTS || + /* port is a proxy */ + ifpx_ports[port_id] == port_id) + return -EINVAL; + + ifpx_ports[port_id] = RTE_MAX_ETHPORTS; + /* Proxy without any port bound is OK - that is the state of the proxy + * that has just been created, and it can still report routing + * information. So we do not even check if this is the case. + */ + + return 0; +} + +int rte_ifpx_callbacks_register(const struct rte_ifpx_callbacks *cbs) +{ + if (!cbs) + return -EINVAL; + + rte_spinlock_lock(&ifpx_lock); + ifpx_callbacks.cbs = *cbs; + rte_spinlock_unlock(&ifpx_lock); + + return 0; +} + +void rte_ifpx_callbacks_unregister(void) +{ + rte_spinlock_lock(&ifpx_lock); + memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs)); + rte_spinlock_unlock(&ifpx_lock); +} + +uint16_t rte_ifpx_proxy_get(uint16_t port_id) +{ + if (port_id >= RTE_MAX_ETHPORTS) + return RTE_MAX_ETHPORTS; + + return ifpx_ports[port_id]; +} + +unsigned int rte_ifpx_port_get(uint16_t proxy_id, + uint16_t *ports, unsigned int num) +{ + unsigned int p, cnt = 0; + + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] == proxy_id && ifpx_ports[p] != p) { + ++cnt; + if (ports && num > 0) { + *ports++ = p; + --num; + } + } + } + return cnt; +} + +const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id) +{ + struct ifpx_proxy_node *px; + + if (port_id >= RTE_MAX_ETHPORTS || + ifpx_ports[port_id] == RTE_MAX_ETHPORTS) + return NULL; + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == ifpx_ports[port_id]) + break; + } + rte_spinlock_unlock(&ifpx_lock); + RTE_ASSERT(px && "Internal IF Proxy library error"); + + return &px->info; +} + +static +void queue_event(const struct rte_ifpx_event *ev, 
struct rte_ring *r) +{ + struct rte_ifpx_event *e = malloc(sizeof(*ev)); + + if (!e) { + IFPX_LOG(ERR, "Failed to allocate event!"); + return; + } + RTE_ASSERT(r); + + *e = *ev; + rte_ring_sp_enqueue(r, e); +} + +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px) +{ + struct ifpx_queue_node *q; + int done = 0; + uint16_t p, proxy_id; + + if (px) { + if (px->state & DEL_PENDING) + return; + proxy_id = px->proxy_id; + RTE_ASSERT(proxy_id != RTE_MAX_ETHPORTS); + px->state |= IN_USE; + } else + proxy_id = RTE_MAX_ETHPORTS; + + RTE_ASSERT(ev); + /* This function is expected to be called with a lock held. */ + RTE_ASSERT(rte_spinlock_trylock(&ifpx_lock) == 0); + + if (ifpx_callbacks.funcs[ev->type].f_ptr) { + union cb_ptr_t cb = ifpx_callbacks.funcs[ev->type]; + + /* Drop the lock for the time of callback call. */ + rte_spinlock_unlock(&ifpx_lock); + if (px) { + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] != proxy_id || + ifpx_ports[p] == p) + continue; + ev->data.port_id = p; + done = cb.f_ptr(&ev->data) || done; + } + } else { + RTE_ASSERT(ev->type == RTE_IFPX_CFG_DONE); + done = cb.cfg_done(); + } + rte_spinlock_lock(&ifpx_lock); + } + if (done) + goto exit; + + /* Event not "consumed" yet so try to notify via queues. */ + TAILQ_FOREACH(q, &ifpx_queues, elem) { + if (px) { + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] != proxy_id || + ifpx_ports[p] == p) + continue; + /* Set the port_id - the remaining params should + * be filled before calling this function. 
+				 */
+				ev->data.port_id = p;
+				queue_event(ev, q->r);
+			}
+		} else
+			queue_event(ev, q->r);
+	}
+exit:
+	if (px)
+		px->state &= ~IN_USE;
+}
+
+void ifpx_cleanup_proxies(void)
+{
+	struct ifpx_proxy_node *px, *next;
+	for (px = TAILQ_FIRST(&ifpx_proxies); px; px = next) {
+		next = TAILQ_NEXT(px, elem);
+		if (px->state & DEL_PENDING)
+			ifpx_proxy_destroy(px);
+	}
+}
+
+int rte_ifpx_listen(void)
+{
+	int ec;
+
+	if (!ifpx_platform.listen)
+		return -ENOTSUP;
+
+	ec = ifpx_platform.listen();
+	if (ec == 0 && ifpx_platform.get_info)
+		ifpx_platform.get_info(0);
+
+	return ec;
+}
+
+int rte_ifpx_close(void)
+{
+	struct ifpx_proxy_node *px;
+	struct ifpx_queue_node *q;
+	unsigned int p;
+	int ec = 0;
+
+	if (ifpx_platform.close) {
+		ec = ifpx_platform.close();
+		if (ec != 0)
+			IFPX_LOG(ERR, "Platform 'close' callback failed.");
+	}
+
+	rte_spinlock_lock(&ifpx_lock);
+	/* Remove queues. */
+	while (!TAILQ_EMPTY(&ifpx_queues)) {
+		q = TAILQ_FIRST(&ifpx_queues);
+		TAILQ_REMOVE(&ifpx_queues, q, elem);
+		free(q);
+	}
+
+	/* Clear callbacks. */
+	memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs));
+
+	/* Unbind ports. */
+	for (p = 0; p < RTE_DIM(ifpx_ports); ++p) {
+		if (ifpx_ports[p] == RTE_MAX_ETHPORTS)
+			continue;
+		if (ifpx_ports[p] == p)
+			/* port is a proxy - just clear entry */
+			ifpx_ports[p] = RTE_MAX_ETHPORTS;
+		else
+			rte_ifpx_port_unbind(p);
+	}
+
+	/* Clear proxies.
*/ + while (!TAILQ_EMPTY(&ifpx_proxies)) { + px = TAILQ_FIRST(&ifpx_proxies); + TAILQ_REMOVE(&ifpx_proxies, px, elem); + free(px); + } + + rte_spinlock_unlock(&ifpx_lock); + + return ec; +} + +RTE_INIT(if_proxy_init) +{ + unsigned int i; + for (i = 0; i < RTE_DIM(ifpx_ports); ++i) + ifpx_ports[i] = RTE_MAX_ETHPORTS; + + ifpx_log_type = rte_log_register("lib.if_proxy"); + if (ifpx_log_type >= 0) + rte_log_set_level(ifpx_log_type, RTE_LOG_WARNING); + + if (ifpx_platform.init) + ifpx_platform.init(); +} diff --git a/lib/librte_if_proxy/if_proxy_priv.h b/lib/librte_if_proxy/if_proxy_priv.h new file mode 100644 index 000000000..2fbf9127a --- /dev/null +++ b/lib/librte_if_proxy/if_proxy_priv.h @@ -0,0 +1,97 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ +#ifndef _IF_PROXY_PRIV_H_ +#define _IF_PROXY_PRIV_H_ + +#include <rte_if_proxy.h> +#include <rte_spinlock.h> + +extern int ifpx_log_type; +#define IFPX_LOG(level, fmt, args...) \ + rte_log(RTE_LOG_ ## level, ifpx_log_type, "%s(): " fmt "\n", \ + __func__, ##args) + +/* Table keeping mapping between port and their proxies. */ +extern +uint16_t ifpx_ports[RTE_MAX_ETHPORTS]; + +/* Callbacks and proxies are kept in linked lists. Since this library is really + * a slow/config path we guard them with a lock - and only one for all of them + * should be enough. We don't expect a need to protect other data structures - + * e.g. data for given port is expected be accessed/modified from single thread. + */ +extern rte_spinlock_t ifpx_lock; + +enum ifpx_node_status { + IN_USE = 1U << 0, + DEL_PENDING = 1U << 1, +}; + +/* List of configured proxies */ +struct ifpx_proxy_node { + TAILQ_ENTRY(ifpx_proxy_node) elem; + uint16_t proxy_id; + uint16_t state; + struct rte_ifpx_info info; +}; +extern +TAILQ_HEAD(ifpx_proxies_head, ifpx_proxy_node) ifpx_proxies; + +/* This function should be called by the implementation whenever it notices + * change in the network configuration. 
The arguments are: + * - ev : pointer to filled event data structure (all fields are expected to be + * filled, with the exception of 'port_id' for all proxy/port related + * events: this function clones the event notification for each bound port + * and fills 'port_id' appropriately). + * - px : proxy node when given event is proxy/port related, otherwise pass NULL + */ +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px); + +/* This function should be called by the implementation whenever it is done with + * notification about network configuration change. It is only really needed + * for the case of callback based API - from the callback the user might attempt + * to remove callbacks/proxies. Removing of callbacks is handled by the + * ifpx_notify_event() function above, however only the implementation really + * knows when notification for given proxy is finished so it is its duty to call + * this function to clean up all proxies that have been marked for deletion. + */ +void ifpx_cleanup_proxies(void); + +/* This is the internal function removing the proxy from the list. It is + * related to the notification function above and intended to be used by the + * platform implementation for the case of callback based API. + * During notification via callback the internal lock is released so that + * operation would not deadlock on an attempt to take a lock. However + * modification (destruction) is not really performed - instead the + * callbacks/proxies are marked as "to be deleted". + * Handling of callbacks that are "to be deleted" is done by the + * ifpx_notify_event() function itself however it cannot delete the proxies (in + * particular the proxy passed as an argument) since they might still be + * referred to by the calling function. So it is a responsibility of the platform + * implementation to check after calling notification function if there are any + * proxies to be removed and use ifpx_proxy_destroy() to actually release them. 
+ */ +int ifpx_proxy_destroy(struct ifpx_proxy_node *px); + +/* Every implementation should provide definition of this structure: + * - init : called during library initialization (NULL when not needed) + * - listen : this function should start service listening to the network + * configuration events/changes, + * - close : this function should close the service started by listen() + * - get_info : this function should query system for current configuration of + * interface with index 'if_index'. After successful initialization of + * listening service this function is called with 0 as an argument. In that + * case configuration of all ports should be obtained - and when this + * procedure completes a RTE_IFPX_CFG_DONE event should be signaled via + * ifpx_notify_event(). + */ +extern +struct ifpx_platform_callbacks { + void (*init)(void); + int (*listen)(void); + int (*close)(void); + void (*get_info)(int if_index); +} ifpx_platform; + +#endif /* _IF_PROXY_PRIV_H_ */ diff --git a/lib/librte_if_proxy/linux/Makefile b/lib/librte_if_proxy/linux/Makefile new file mode 100644 index 000000000..275b7e1e3 --- /dev/null +++ b/lib/librte_if_proxy/linux/Makefile @@ -0,0 +1,4 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(C) 2020 Marvell International Ltd. + +SRCS += if_proxy.c diff --git a/lib/librte_if_proxy/linux/if_proxy.c b/lib/librte_if_proxy/linux/if_proxy.c new file mode 100644 index 000000000..bf851c096 --- /dev/null +++ b/lib/librte_if_proxy/linux/if_proxy.c @@ -0,0 +1,552 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. 
+ */ +#include <if_proxy_priv.h> +#include <rte_interrupts.h> +#include <rte_string_fns.h> + +#include <stdbool.h> +#include <unistd.h> +#include <errno.h> +#include <sys/socket.h> +#include <linux/rtnetlink.h> +#include <linux/if.h> + +static +struct rte_intr_handle ifpx_irq = { + .type = RTE_INTR_HANDLE_NETLINK, + .fd = -1, +}; + +static +unsigned int ifpx_pid; + +static +int request_info(int type, int index) +{ + static rte_spinlock_t send_lock = RTE_SPINLOCK_INITIALIZER; + struct info_get { + struct nlmsghdr h; + union { + struct ifinfomsg ifm; + struct ifaddrmsg ifa; + struct rtmsg rtm; + struct ndmsg ndm; + } __rte_aligned(NLMSG_ALIGNTO); + } info_req; + int ret; + + memset(&info_req, 0, sizeof(info_req)); + /* First byte of these messages is family, so just make sure that this + * memset is enough to get all families. + */ + RTE_ASSERT(AF_UNSPEC == 0); + + info_req.h.nlmsg_pid = ifpx_pid; + info_req.h.nlmsg_type = type; + info_req.h.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP; + info_req.h.nlmsg_len = offsetof(struct info_get, ifm); + + switch (type) { + case RTM_GETLINK: + info_req.h.nlmsg_len += sizeof(info_req.ifm); + info_req.ifm.ifi_index = index; + break; + case RTM_GETADDR: + info_req.h.nlmsg_len += sizeof(info_req.ifa); + info_req.ifa.ifa_index = index; + break; + case RTM_GETROUTE: + info_req.h.nlmsg_len += sizeof(info_req.rtm); + break; + case RTM_GETNEIGH: + info_req.h.nlmsg_len += sizeof(info_req.ndm); + break; + default: + IFPX_LOG(WARNING, "Unhandled message type: %d", type); + return -EINVAL; + } + /* Store request type (and if it is global or link specific) in 'seq'. + * Later it is used during handling of reply to continue requesting of + * information dump from system - if needed. 
+ */ + info_req.h.nlmsg_seq = index << 8 | type; + + IFPX_LOG(DEBUG, "\tRequesting msg %d for: %u", type, index); + + rte_spinlock_lock(&send_lock); + ret = send(ifpx_irq.fd, &info_req, info_req.h.nlmsg_len, 0); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to send netlink msg: %d", errno); + rte_errno = errno; + } + rte_spinlock_unlock(&send_lock); + + return ret; +} + +static +void handle_link(const struct nlmsghdr *h) +{ + const struct ifinfomsg *ifi = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi)); + const struct rtattr *attrs[IFLA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct ifpx_proxy_node *px; + struct rte_ifpx_event ev; + + IFPX_LOG(DEBUG, "\tLink action (%u): %u, 0x%x/0x%x (flags/changed)", + ifi->ifi_index, h->nlmsg_type, ifi->ifi_flags, + ifi->ifi_change); + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == (unsigned int)ifi->ifi_index) + break; + } + + /* Drop messages that are not associated with any proxy */ + if (!px) + goto exit; + /* When message is a reply to request for specific interface then keep + * it only when it contains info for this interface. 
+ */ + if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 && + (h->nlmsg_seq >> 8) != (unsigned)ifi->ifi_index) + goto exit; + + for (attr = IFLA_RTA(ifi); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > IFLA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + if (ifi->ifi_change & IFF_UP) { + ev.type = RTE_IFPX_LINK_CHANGE; + ev.link_change.is_up = ifi->ifi_flags & IFF_UP; + ifpx_notify_event(&ev, px); + } + if (attrs[IFLA_MTU]) { + uint16_t mtu = *(const int *)RTA_DATA(attrs[IFLA_MTU]); + if (mtu != px->info.mtu) { + px->info.mtu = mtu; + ev.type = RTE_IFPX_MTU_CHANGE; + ev.mtu_change.mtu = mtu; + ifpx_notify_event(&ev, px); + } + } + if (attrs[IFLA_ADDRESS]) { + const struct rte_ether_addr *mac = + RTA_DATA(attrs[IFLA_ADDRESS]); + + RTE_ASSERT(RTA_PAYLOAD(attrs[IFLA_ADDRESS]) == + RTE_ETHER_ADDR_LEN); + if (memcmp(mac, &px->info.mac, RTE_ETHER_ADDR_LEN) != 0) { + rte_ether_addr_copy(mac, &px->info.mac); + ev.type = RTE_IFPX_MAC_CHANGE; + rte_ether_addr_copy(mac, &ev.mac_change.mac); + ifpx_notify_event(&ev, px); + } + } + if (h->nlmsg_pid == ifpx_pid) { + RTE_ASSERT((h->nlmsg_seq & 0xFF) == RTM_GETLINK); + /* If this is reply for specific link request (not initial + * global dump) then follow up with address request, otherwise + * just store the interface name. 
+ */ + if (h->nlmsg_seq >> 8) + request_info(RTM_GETADDR, ifi->ifi_index); + else if (!px->info.if_name[0] && attrs[IFLA_IFNAME]) + strlcpy(px->info.if_name, RTA_DATA(attrs[IFLA_IFNAME]), + sizeof(px->info.if_name)); + } + + ifpx_cleanup_proxies(); +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +static +void handle_addr(const struct nlmsghdr *h, bool needs_del) +{ + const struct ifaddrmsg *ifa = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifa)); + const struct rtattr *attrs[IFA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct ifpx_proxy_node *px; + struct rte_ifpx_event ev; + const uint8_t *ip; + + IFPX_LOG(DEBUG, "\tAddr action (%u): %u, family: %u", + ifa->ifa_index, h->nlmsg_type, ifa->ifa_family); + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == ifa->ifa_index) + break; + } + + /* Drop messages that are not associated with any proxy */ + if (!px) + goto exit; + /* When message is a reply to request for specific interface then keep + * it only when it contains info for this interface. + */ + if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 && + (h->nlmsg_seq >> 8) != ifa->ifa_index) + goto exit; + + for (attr = IFA_RTA(ifa); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > IFA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + if (attrs[IFA_ADDRESS]) { + ip = RTA_DATA(attrs[IFA_ADDRESS]); + if (ifa->ifa_family == AF_INET) { + ev.type = needs_del ? RTE_IFPX_ADDR_DEL + : RTE_IFPX_ADDR_ADD; + ev.addr_change.ip = + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + } else { + ev.type = needs_del ? 
RTE_IFPX_ADDR6_DEL + : RTE_IFPX_ADDR6_ADD; + memcpy(ev.addr6_change.ip, ip, 16); + } + ifpx_notify_event(&ev, px); + ifpx_cleanup_proxies(); + } +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +static +void handle_route(const struct nlmsghdr *h, bool needs_del) +{ + const struct rtmsg *r = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*r)); + const struct rtattr *attrs[RTA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct rte_ifpx_event ev; + struct ifpx_proxy_node *px = NULL; + const uint8_t *ip; + + IFPX_LOG(DEBUG, "\tRoute action: %u, family: %u", + h->nlmsg_type, r->rtm_family); + + for (attr = RTM_RTA(r); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > RTA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + memset(&ev, 0, sizeof(ev)); + ev.type = RTE_IFPX_NUM_EVENTS; + + rte_spinlock_lock(&ifpx_lock); + if (attrs[RTA_OIF]) { + int if_index = *((int32_t*)RTA_DATA(attrs[RTA_OIF])); + + if (if_index > 0) { + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == (uint32_t)if_index) + break; + } + } + } + /* We are only interested in routes related to the proxy interfaces and + * we need to have dst - otherwise skip the message. + */ + if (!px || !attrs[RTA_DST]) + goto exit; + + ip = RTA_DATA(attrs[RTA_DST]); + /* This is common to both IPv4/6. */ + ev.route_change.depth = r->rtm_dst_len; + if (r->rtm_family == AF_INET) { + ev.type = needs_del ? RTE_IFPX_ROUTE_DEL + : RTE_IFPX_ROUTE_ADD; + ev.route_change.ip = + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + } else { + ev.type = needs_del ? 
RTE_IFPX_ROUTE6_DEL + : RTE_IFPX_ROUTE6_ADD; + memcpy(ev.route6_change.ip, ip, 16); + } + if (attrs[RTA_GATEWAY]) { + ip = RTA_DATA(attrs[RTA_GATEWAY]); + if (r->rtm_family == AF_INET) + ev.route_change.gateway = + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + else + memcpy(ev.route6_change.gateway, ip, 16); + } + + ifpx_notify_event(&ev, px); + /* Let's check for proxies to remove here too - just in case somebody + * removed the non-proxy related callback. + */ + ifpx_cleanup_proxies(); +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +/* Link, addr and route related messages seem to have this macro defined but not + * neighbour one. Define one if it is missing - const qualifiers added just to + * silence compiler - for some reason it is not needed in equivalent macros for + * other messages and here compiler is complaining about (char*) cast on pointer + * to const. + */ +#ifndef NDA_RTA +#define NDA_RTA(r) ((const struct rtattr*)(((const char*)(r)) + \ + NLMSG_ALIGN(sizeof(struct ndmsg)))) +#endif + +static +void handle_neigh(const struct nlmsghdr *h, bool needs_del) +{ + const struct ndmsg *n = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*n)); + const struct rtattr *attrs[NDA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct ifpx_proxy_node *px; + struct rte_ifpx_event ev; + const uint8_t *ip; + + IFPX_LOG(DEBUG, "\tNeighbour action: %u, family: %u, state: %u, if: %d", + h->nlmsg_type, n->ndm_family, n->ndm_state, n->ndm_ifindex); + + for (attr = NDA_RTA(n); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > NDA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + memset(&ev, 0, sizeof(ev)); + ev.type = RTE_IFPX_NUM_EVENTS; + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == (unsigned)n->ndm_ifindex) + break; + } + /* We need only subset of neighbourhood related to proxy interfaces. 
+ * lladdr seems to be needed only for adding new entry - modifications + * (also reported via RTM_NEWNEIGH) and deletion include only dst. + */ + if (!px || !attrs[NDA_DST] || (!needs_del && !attrs[NDA_LLADDR])) + goto exit; + + ip = RTA_DATA(attrs[NDA_DST]); + if (n->ndm_family == AF_INET) { + ev.type = needs_del ? RTE_IFPX_NEIGH_DEL + : RTE_IFPX_NEIGH_ADD; + ev.neigh_change.ip = + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + } else { + ev.type = needs_del ? RTE_IFPX_NEIGH6_DEL + : RTE_IFPX_NEIGH6_ADD; + memcpy(ev.neigh6_change.ip, ip, 16); + } + if (attrs[NDA_LLADDR]) + rte_ether_addr_copy(RTA_DATA(attrs[NDA_LLADDR]), + &ev.neigh_change.mac); + + ifpx_notify_event(&ev, px); + /* Let's check for proxies to remove here too - just in case somebody + * removed the non-proxy related callback. + */ + ifpx_cleanup_proxies(); +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +static +void if_proxy_intr_callback(void *arg __rte_unused) +{ + struct nlmsghdr *h; + struct sockaddr_nl addr; + socklen_t addr_len = sizeof(addr); /* value-result arg of recvfrom() */ + char buf[8192]; + ssize_t len; + +restart: + len = recvfrom(ifpx_irq.fd, buf, sizeof(buf), 0, + (struct sockaddr *)&addr, &addr_len); + if (len < 0) { + if (errno == EINTR) { + IFPX_LOG(DEBUG, "recvfrom() interrupted"); + goto restart; + } + IFPX_LOG(ERR, "Failed to read netlink msg: %ld (errno %d)", + len, errno); + return; + } + if (addr_len != sizeof(addr)) { + IFPX_LOG(ERR, "Invalid netlink addr size: %d", addr_len); + return; + } + IFPX_LOG(DEBUG, "Read %lu bytes (buf %lu) from %u/%u", len, + sizeof(buf), addr.nl_pid, addr.nl_groups); + + for (h = (struct nlmsghdr *)buf; NLMSG_OK(h, len); + h = NLMSG_NEXT(h, len)) { + IFPX_LOG(DEBUG, "Recv msg: %u (%u/%u/%u seq/flags/pid)", + h->nlmsg_type, h->nlmsg_seq, h->nlmsg_flags, + h->nlmsg_pid); + + switch (h->nlmsg_type) { + case RTM_NEWLINK: + case RTM_DELLINK: + handle_link(h); + break; + case RTM_NEWADDR: + case RTM_DELADDR: + handle_addr(h, h->nlmsg_type == RTM_DELADDR); + break; + case RTM_NEWROUTE: + case RTM_DELROUTE: + 
handle_route(h, h->nlmsg_type == RTM_DELROUTE); + break; + case RTM_NEWNEIGH: + case RTM_DELNEIGH: + handle_neigh(h, h->nlmsg_type == RTM_DELNEIGH); + break; + } + + /* If this is a reply for global request then follow up with + * additional requests and notify about finish. + */ + if (h->nlmsg_pid == ifpx_pid && (h->nlmsg_seq >> 8) == 0 && + h->nlmsg_type == NLMSG_DONE) { + if ((h->nlmsg_seq & 0xFF) == RTM_GETLINK) + request_info(RTM_GETADDR, 0); + else if ((h->nlmsg_seq & 0xFF) == RTM_GETADDR) + request_info(RTM_GETROUTE, 0); + else if ((h->nlmsg_seq & 0xFF) == RTM_GETROUTE) + request_info(RTM_GETNEIGH, 0); + else { + struct rte_ifpx_event ev = { + .type = RTE_IFPX_CFG_DONE + }; + + RTE_ASSERT((h->nlmsg_seq & 0xFF) == + RTM_GETNEIGH); + rte_spinlock_lock(&ifpx_lock); + ifpx_notify_event(&ev, NULL); + rte_spinlock_unlock(&ifpx_lock); + } + } + } + IFPX_LOG(DEBUG, "Finished msg loop: %ld bytes left", len); +} + +static +int nlink_listen(void) +{ + struct sockaddr_nl addr = { + .nl_family = AF_NETLINK, + .nl_pid = 0, + }; + socklen_t addr_len = sizeof(addr); + int ret; + + if (ifpx_irq.fd != -1) { + rte_errno = EBUSY; + return -1; + } + + addr.nl_groups = 1 << (RTNLGRP_LINK-1) + | 1 << (RTNLGRP_NEIGH-1) + | 1 << (RTNLGRP_IPV4_IFADDR-1) + | 1 << (RTNLGRP_IPV6_IFADDR-1) + | 1 << (RTNLGRP_IPV4_ROUTE-1) + | 1 << (RTNLGRP_IPV6_ROUTE-1); + + ifpx_irq.fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, + NETLINK_ROUTE); + if (ifpx_irq.fd == -1) { + IFPX_LOG(ERR, "Failed to create netlink socket: %d", errno); + goto error; + } + /* Starting with kernel 4.19 you can request dump for a specific + * interface and kernel will filter out and send only relevant info. + * Otherwise NLM_F_DUMP will generate info for all interfaces and you + * need to filter them yourself. 
+ */ +#ifdef NETLINK_DUMP_STRICT_CHK + ret = 1; /* use this var also as an input param */ + ret = setsockopt(ifpx_irq.fd, SOL_NETLINK, NETLINK_DUMP_STRICT_CHK, + &ret, sizeof(ret)); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to set socket option: %d", errno); + goto error; + } +#endif + + ret = bind(ifpx_irq.fd, (struct sockaddr *)&addr, addr_len); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to bind socket: %d", errno); + goto error; + } + ret = getsockname(ifpx_irq.fd, (struct sockaddr *)&addr, &addr_len); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to get socket addr: %d", errno); + goto error; + } else { + ifpx_pid = addr.nl_pid; + IFPX_LOG(DEBUG, "Assigned port ID: %u", addr.nl_pid); + } + + ret = rte_intr_callback_register(&ifpx_irq, if_proxy_intr_callback, + NULL); + if (ret == 0) + return 0; + +error: + rte_errno = errno; + if (ifpx_irq.fd != -1) { + close(ifpx_irq.fd); + ifpx_irq.fd = -1; + } + return -1; +} + +static +int nlink_close(void) +{ + int ec; + + if (ifpx_irq.fd < 0) + return -EBADFD; + + do + ec = rte_intr_callback_unregister(&ifpx_irq, + if_proxy_intr_callback, NULL); + while (ec == -EAGAIN); /* unlikely but possible - at least I think so */ + + close(ifpx_irq.fd); + ifpx_irq.fd = -1; + ifpx_pid = 0; + + return 0; +} + +static +void nlink_get_info(int if_index) +{ + if (ifpx_irq.fd != -1) + request_info(RTM_GETLINK, if_index); +} + +struct ifpx_platform_callbacks ifpx_platform = { + .init = NULL, + .listen = nlink_listen, + .close = nlink_close, + .get_info = nlink_get_info, +}; diff --git a/lib/librte_if_proxy/meson.build b/lib/librte_if_proxy/meson.build new file mode 100644 index 000000000..f0c1a6e15 --- /dev/null +++ b/lib/librte_if_proxy/meson.build @@ -0,0 +1,19 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(C) 2020 Marvell International Ltd. 
+ +# Currently only implemented on Linux +if not is_linux + build = false + reason = 'only supported on linux' +endif + +version = 1 +allow_experimental_apis = true + +deps += ['ethdev'] +sources = files('if_proxy_common.c') +headers = files('rte_if_proxy.h') + +if is_linux + sources += files('linux/if_proxy.c') +endif diff --git a/lib/librte_if_proxy/rte_if_proxy.h b/lib/librte_if_proxy/rte_if_proxy.h new file mode 100644 index 000000000..e620319b3 --- /dev/null +++ b/lib/librte_if_proxy/rte_if_proxy.h @@ -0,0 +1,561 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ + +#ifndef _RTE_IF_PROXY_H_ +#define _RTE_IF_PROXY_H_ + +/** + * @file + * RTE IF Proxy library + * + * The IF Proxy library allows for monitoring of system network configuration + * and configuration of DPDK ports by using usual system utilities (like the + * ones from iproute2 package). + * + * It is based on the notion of "proxy interface" which actually can be any DPDK + * port which is also visible to the system - that is it has non-zero 'if_index' + * field in 'rte_eth_dev_info' structure. + * + * If application doesn't have any such port (or doesn't want to use it for + * proxy) it can create one by calling: + * + * proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT); + * + * This function is just a wrapper that constructs valid 'devargs' string based + * on the proxy type chosen (currently Tap or KNI) and creates the interface by + * calling rte_ifpx_proxy_create_by_devarg(). + * + * Once one has DPDK port capable of being proxy one can bind target DPDK port + * to it by calling: + * + * rte_ifpx_port_bind(port_id, proxy_id); + * + * This binding is a logical one - there is no automatic packet forwarding + * between port and its proxy since the library doesn't know the structure of + * application's packet processing. It remains the application's responsibility + * to forward the packets from/to proxy port (by calling the usual DPDK RX/TX + * burst API). 
However when the library notes some change to the proxy interface it + * will simply call appropriate callback with 'port_id' of the DPDK port that is + * bound to this proxy interface. The binding can be 1 to many - that is many + * ports can point to one proxy - in that case registered callbacks will be + * called for every bound port. + * + * The callbacks that are used for notifications are described by the + * 'rte_ifpx_callbacks' structure and they are registered by calling: + * + * rte_ifpx_callbacks_register(&cbs); + * + * Finally the application should call: + * + * rte_ifpx_listen(); + * + * which will query system for present network configuration and start listening + * to its changes. + */ + +#include <rte_eal.h> +#include <rte_ethdev.h> + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * Enum naming the type of proxy to create. + * + * @see rte_ifpx_proxy_create() + */ +enum rte_ifpx_proxy_type { + RTE_IFPX_DEFAULT, /**< Use default proxy type for given arch. */ + RTE_IFPX_TAP, /**< Use Tap based port for proxy. */ + RTE_IFPX_KNI /**< Use KNI based port for proxy. */ +}; + +/** + * Create DPDK port that can serve as an interface proxy. + * + * This function is just a wrapper around rte_ifpx_proxy_create_by_devarg() that + * constructs its 'devarg' argument based on type of proxy requested. + * + * @param type + * A type of proxy to create. + * + * @return + * DPDK port id on success, RTE_MAX_ETHPORTS otherwise. + * + * @see enum rte_ifpx_proxy_type + * @see rte_ifpx_proxy_create_by_devarg() + */ +__rte_experimental +uint16_t rte_ifpx_proxy_create(enum rte_ifpx_proxy_type type); + +/** + * Create DPDK port that can serve as an interface proxy. + * + * @param devarg + * A string passed to rte_dev_probe() to create proxy port. + * + * @return + * DPDK port id on success, RTE_MAX_ETHPORTS otherwise. + */ +__rte_experimental +uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg); + +/** + * Remove DPDK proxy port. 
+ * + * In addition to removing the proxy port the bindings (if any) are cleared. + * + * @param proxy_id + * Port id of the proxy that should be removed. + * + * @return + * 0 on success, negative on error. + */ +__rte_experimental +int rte_ifpx_proxy_destroy(uint16_t proxy_id); + +/** + * The rte_ifpx_event_type enum lists all possible event types that can be + * signaled by this library. To learn what events are supported on your + * platform call rte_ifpx_events_available(). + * + * NOTE - do not reorder these enums freely, their values need to correspond to + * the order of the callbacks in struct rte_ifpx_callbacks. + */ +enum rte_ifpx_event_type { + RTE_IFPX_MAC_CHANGE, /**< @see struct rte_ifpx_mac_change */ + RTE_IFPX_MTU_CHANGE, /**< @see struct rte_ifpx_mtu_change */ + RTE_IFPX_LINK_CHANGE, /**< @see struct rte_ifpx_link_change */ + RTE_IFPX_ADDR_ADD, /**< @see struct rte_ifpx_addr_change */ + RTE_IFPX_ADDR_DEL, /**< @see struct rte_ifpx_addr_change */ + RTE_IFPX_ADDR6_ADD, /**< @see struct rte_ifpx_addr6_change */ + RTE_IFPX_ADDR6_DEL, /**< @see struct rte_ifpx_addr6_change */ + RTE_IFPX_ROUTE_ADD, /**< @see struct rte_ifpx_route_change */ + RTE_IFPX_ROUTE_DEL, /**< @see struct rte_ifpx_route_change */ + RTE_IFPX_ROUTE6_ADD, /**< @see struct rte_ifpx_route6_change */ + RTE_IFPX_ROUTE6_DEL, /**< @see struct rte_ifpx_route6_change */ + RTE_IFPX_NEIGH_ADD, /**< @see struct rte_ifpx_neigh_change */ + RTE_IFPX_NEIGH_DEL, /**< @see struct rte_ifpx_neigh_change */ + RTE_IFPX_NEIGH6_ADD, /**< @see struct rte_ifpx_neigh6_change */ + RTE_IFPX_NEIGH6_DEL, /**< @see struct rte_ifpx_neigh6_change */ + RTE_IFPX_CFG_DONE, /**< This event is a lib specific event - it is + * signaled when initial network configuration + * query is finished and has no event data. + */ + RTE_IFPX_NUM_EVENTS, +}; + +/** + * Get the bit mask of implemented events/callbacks for this platform. 
+ * + * @return + * Bit mask of events/callbacks implemented: each event type can be tested by + * checking bit (1 << ev) where 'ev' is one of the rte_ifpx_event_type enum + * values. + * @see enum rte_ifpx_event_type + */ +__rte_experimental +uint64_t rte_ifpx_events_available(void); + +/** + * The rte_ifpx_event defines structure used to pass notification event to + * application. Each event type has its own dedicated inner structure - these + * structures are also used when using callbacks notifications. + */ +struct rte_ifpx_event { + enum rte_ifpx_event_type type; + union { + /** Structure used to pass notification about MAC change of the + * proxy interface. + * @see RTE_IFPX_MAC_CHANGE + */ + struct rte_ifpx_mac_change { + uint16_t port_id; + struct rte_ether_addr mac; + } mac_change; + /** Structure used to pass notification about MTU change. + * @see RTE_IFPX_MTU_CHANGE + */ + struct rte_ifpx_mtu_change { + uint16_t port_id; + uint16_t mtu; + } mtu_change; + /** Structure used to pass notification about link going + * up/down. + * @see RTE_IFPX_LINK_CHANGE + */ + struct rte_ifpx_link_change { + uint16_t port_id; + int is_up; + } link_change; + /** Structure used to pass notification about IPv4 address being + * added/removed. All IPv4 addresses reported by this library + * are in host order. + * @see RTE_IFPX_ADDR_ADD + * @see RTE_IFPX_ADDR_DEL + */ + struct rte_ifpx_addr_change { + uint16_t port_id; + uint32_t ip; + } addr_change; + /** Structure used to pass notification about IPv6 address being + * added/removed. + * @see RTE_IFPX_ADDR6_ADD + * @see RTE_IFPX_ADDR6_DEL + */ + struct rte_ifpx_addr6_change { + uint16_t port_id; + uint8_t ip[16]; + } addr6_change; + /** Structure used to pass notification about IPv4 route being + * added/removed. 
+ * @see RTE_IFPX_ROUTE_ADD + * @see RTE_IFPX_ROUTE_DEL + */ + struct rte_ifpx_route_change { + uint16_t port_id; + uint8_t depth; + uint32_t ip; + uint32_t gateway; + } route_change; + /** Structure used to pass notification about IPv6 route being + * added/removed. + * @see RTE_IFPX_ROUTE6_ADD + * @see RTE_IFPX_ROUTE6_DEL + */ + struct rte_ifpx_route6_change { + uint16_t port_id; + uint8_t depth; + uint8_t ip[16]; + uint8_t gateway[16]; + } route6_change; + /** Structure used to pass notification about IPv4 neighbour + * info changes. + * @see RTE_IFPX_NEIGH_ADD + * @see RTE_IFPX_NEIGH_DEL + */ + struct rte_ifpx_neigh_change { + uint16_t port_id; + struct rte_ether_addr mac; + uint32_t ip; + } neigh_change; + /** Structure used to pass notification about IPv6 neighbour + * info changes. + * @see RTE_IFPX_NEIGH6_ADD + * @see RTE_IFPX_NEIGH6_DEL + */ + struct rte_ifpx_neigh6_change { + uint16_t port_id; + struct rte_ether_addr mac; + uint8_t ip[16]; + } neigh6_change; + /* This structure is used internally - to abstract common parts + * of proxy/port related events and to be able to refer to this + * union without giving it a name. + */ + struct { + uint16_t port_id; + } data; + }; +}; + +/** + * This library can deliver notification about network configuration changes + * either by the use of registered callbacks and/or by queueing change events to + * configured notification queues. The logic used is: + * 1. If there is callback registered for given event type it is called. In + * case of many ports to one proxy binding, this callback is called for every + * port bound. + * 2. If this callback returns non-zero value (for any of ports in case of + * many-1 bindings) the handling of an event is considered as complete. + * 3. Otherwise the event is added to each configured event queue. The event is + * allocated with malloc() so after dequeueing and handling the application + * should deallocate it with free(). 
+ * + * This dual notification mechanism is meant to provide some flexibility to + * application writer. For example, if you store your data in a single writer/ + * many readers coherent data structure you could just update this structure + * from the callback. If you keep separate copy per lcore/port you could make + * some common preparations (if applicable) in the callback, return 0 and use + * notification queues to pick up the change and update data structures. Or you + * could skip the callbacks altogether and just use notification queues - and + * configure them at the level appropriate for your application design (one + * global / one per lcore / one per port ...). + */ + +/** + * Add notification queue to the list of queues. + * + * @param r + * Ring used for queueing of notification events - application can assume that + * there is only one producer. + * @return + * 0 on success, negative otherwise. + */ +int rte_ifpx_queue_add(struct rte_ring *r); + +/** + * Remove notification queue from the list of queues. + * + * @param r + * Notification ring used for queueing of notification events (previously + * added via rte_ifpx_queue_add()). + * @return + * 0 on success, negative otherwise. + */ +int rte_ifpx_queue_remove(struct rte_ring *r); + +/** + * This structure groups the callbacks that might be called as notification + * events for changing network configuration. Not every platform might + * implement all of them and you can query the availability with + * rte_ifpx_events_available() function. + * @see rte_ifpx_events_available() + * @see rte_ifpx_callbacks_register() + */ +struct rte_ifpx_callbacks { + int (*mac_change)(const struct rte_ifpx_mac_change *event); + /**< Callback for notification about MAC change of the proxy interface. + * This callback (as all other port related callbacks) is called for + * each port (with its port_id as a first argument) bound to the proxy + * interface for which change has been observed. 
+ * @see struct rte_ifpx_mac_change + * @return non-zero if event handling is finished + */ + int (*mtu_change)(const struct rte_ifpx_mtu_change *event); + /**< Callback for notification about MTU change. + * @see struct rte_ifpx_mtu_change + * @return non-zero if event handling is finished + */ + int (*link_change)(const struct rte_ifpx_link_change *event); + /**< Callback for notification about link going up/down. + * @see struct rte_ifpx_link_change + * @return non-zero if event handling is finished + */ + int (*addr_add)(const struct rte_ifpx_addr_change *event); + /**< Callback for notification about IPv4 address being added. + * @see struct rte_ifpx_addr_change + * @return non-zero if event handling is finished + */ + int (*addr_del)(const struct rte_ifpx_addr_change *event); + /**< Callback for notification about IPv4 address removal. + * @see struct rte_ifpx_addr_change + * @return non-zero if event handling is finished + */ + int (*addr6_add)(const struct rte_ifpx_addr6_change *event); + /**< Callback for notification about IPv6 address being added. + * @see struct rte_ifpx_addr6_change + */ + int (*addr6_del)(const struct rte_ifpx_addr6_change *event); + /**< Callback for notification about IPv6 address removal. + * @see struct rte_ifpx_addr6_change + * @return non-zero if event handling is finished + */ + /* Please note that "route" callbacks might be also called when user + * adds address to the interface (that is in addition to address related + * callbacks). + */ + int (*route_add)(const struct rte_ifpx_route_change *event); + /**< Callback for notification about IPv4 route being added. + * @see struct rte_ifpx_route_change + * @return non-zero if event handling is finished + */ + int (*route_del)(const struct rte_ifpx_route_change *event); + /**< Callback for notification about IPv4 route removal. 
+ * @see struct rte_ifpx_route_change + * @return non-zero if event handling is finished + */ + int (*route6_add)(const struct rte_ifpx_route6_change *event); + /**< Callback for notification about IPv6 route being added. + * @see struct rte_ifpx_route6_change + * @return non-zero if event handling is finished + */ + int (*route6_del)(const struct rte_ifpx_route6_change *event); + /**< Callback for notification about IPv6 route removal. + * @see struct rte_ifpx_route6_change + * @return non-zero if event handling is finished + */ + int (*neigh_add)(const struct rte_ifpx_neigh_change *event); + /**< Callback for notification about IPv4 neighbour being added. + * @see struct rte_ifpx_neigh_change + * @return non-zero if event handling is finished + */ + int (*neigh_del)(const struct rte_ifpx_neigh_change *event); + /**< Callback for notification about IPv4 neighbour removal. + * @see struct rte_ifpx_neigh_change + * @return non-zero if event handling is finished + */ + int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event); + /**< Callback for notification about IPv6 neighbour being added. + * @see struct rte_ifpx_neigh6_change + */ + int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event); + /**< Callback for notification about IPv6 neighbour removal. + * @see struct rte_ifpx_neigh6_change + * @return non-zero if event handling is finished + */ + int (*cfg_done)(void); + /**< Lib specific callback - called when initial network configuration + * query is finished. + * @return non-zero if event handling is finished + */ +}; + +/** + * Register proxy callbacks. + * + * This function registers callbacks to be called upon appropriate network + * event notification. + * + * @param cbs + * Set of callbacks that will be called. The library does not take any + * ownership of the pointer passed - the callbacks are stored internally. + * + * @return + * 0 on success, negative otherwise. 
+ */ +__rte_experimental +int rte_ifpx_callbacks_register(const struct rte_ifpx_callbacks *cbs); + +/** + * Unregister proxy callbacks. + * + * This function unregisters callbacks previously registered with + * rte_ifpx_callbacks_register(). + */ +__rte_experimental +void rte_ifpx_callbacks_unregister(void); + +/** + * Bind the port to its proxy. + * + * After calling this function all network configuration of the proxy (and its + * changes) will be passed to given port by calling registered callbacks with + * 'port_id' as an argument. + * + * Note: since both arguments are of the same type in order to not mix them and + * ease remembering the order the first one is kept the same for bind/unbind. + * + * @param port_id + * Id of the port to be bound. + * @param proxy_id + * Id of the proxy the port needs to be bound to. + * @return + * 0 on success, negative on error. + */ +__rte_experimental +int rte_ifpx_port_bind(uint16_t port_id, uint16_t proxy_id); + +/** + * Unbind the port from its proxy. + * + * After calling this function registered callbacks will no longer be called for + * this port (but they might be called for other ports in one to many binding + * scenario). + * + * @param port_id + * Id of the port to unbind. + * @return + * 0 on success, negative on error. + */ +__rte_experimental +int rte_ifpx_port_unbind(uint16_t port_id); + +/** + * Get the system network configuration and start listening to its changes. + * + * @return + * 0 on success, negative otherwise. + */ +__rte_experimental +int rte_ifpx_listen(void); + +/** + * Remove all bindings/callbacks and stop listening to network configuration. + * + * @return + * 0 on success, negative otherwise. + */ +__rte_experimental +int rte_ifpx_close(void); + +/** + * Get the id of the proxy the port is bound to.
+ * + * @param port_id + * Id of the port for which to get proxy. + * @return + * Port id of the proxy on success, RTE_MAX_ETHPORTS on error. + */ +__rte_experimental +uint16_t rte_ifpx_proxy_get(uint16_t port_id); + +/** + * Test for port acting as a proxy. + * + * @param port_id + * Id of the port. + * @return + * 1 if port acts as a proxy, 0 otherwise. + */ +static inline +int rte_ifpx_is_proxy(uint16_t port_id) +{ + return rte_ifpx_proxy_get(port_id) == port_id; +} + +/** + * Get the ids of the ports bound to the proxy. + * + * @param proxy_id + * Id of the proxy for which to get ports. + * @param ports + * Array where to store the port ids. + * @param num + * Size of the 'ports' array. + * @return + * The number of ports bound to given proxy. Note that bound ports are filled + * in 'ports' array up to its size but the return value is always the total + * number of ports bound - so you can make call first with NULL/0 to query for + * the size of the buffer to create or call it with the buffer you have and + * later check if it was large enough. + */ +__rte_experimental +unsigned int rte_ifpx_port_get(uint16_t proxy_id, + uint16_t *ports, unsigned int num); + +/** + * The structure containing some properties of the proxy interface. + */ +struct rte_ifpx_info { + unsigned int if_index; /* entry valid iff if_index != 0 */ + uint16_t mtu; + struct rte_ether_addr mac; + char if_name[RTE_ETH_NAME_MAX_LEN]; +}; + +/** + * Get the properties of the proxy interface. Argument can be either id of the + * proxy or an id of a port that is bound to it. + * + * @param port_id + * Id of the port (or proxy) for which to get proxy properties. + * @return + * Pointer to the proxy information structure. 
+ */ +__rte_experimental +const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id); + +#ifdef __cplusplus +} +#endif + +#endif /* _RTE_IF_PROXY_H_ */ diff --git a/lib/librte_if_proxy/rte_if_proxy_version.map b/lib/librte_if_proxy/rte_if_proxy_version.map new file mode 100644 index 000000000..e2093137d --- /dev/null +++ b/lib/librte_if_proxy/rte_if_proxy_version.map @@ -0,0 +1,19 @@ +EXPERIMENTAL { + global: + + rte_ifpx_proxy_create; + rte_ifpx_proxy_create_by_devarg; + rte_ifpx_proxy_destroy; + rte_ifpx_events_available; + rte_ifpx_callbacks_register; + rte_ifpx_callbacks_unregister; + rte_ifpx_port_bind; + rte_ifpx_port_unbind; + rte_ifpx_listen; + rte_ifpx_close; + rte_ifpx_proxy_get; + rte_ifpx_port_get; + rte_ifpx_info_get; + + local: *; +}; diff --git a/lib/meson.build b/lib/meson.build index 0af3efab2..c913b33dd 100644 --- a/lib/meson.build +++ b/lib/meson.build @@ -19,7 +19,7 @@ libraries = [ 'acl', 'bbdev', 'bitratestats', 'cfgfile', 'compressdev', 'cryptodev', 'distributor', 'efd', 'eventdev', - 'gro', 'gso', 'ip_frag', 'jobstats', + 'gro', 'gso', 'if_proxy', 'ip_frag', 'jobstats', 'kni', 'latencystats', 'lpm', 'member', 'power', 'pdump', 'rawdev', 'rcu', 'rib', 'reorder', 'sched', 'security', 'stack', 'vhost', -- 2.17.1 ^ permalink raw reply [flat|nested] 64+ messages in thread
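The return-value convention documented for rte_ifpx_port_get() in the header (fill at most 'num' entries, always return the total number of bound ports) lends itself to a two-step query: ask for the count first, then fetch the ids. The following is only a minimal sketch of that pattern, not code from the patch. It does not link against DPDK; DEMO_MAX_PORTS, demo_ports[], demo_reset(), demo_bind(), demo_port_get() and demo_bound_ports() are hypothetical stand-ins for RTE_MAX_ETHPORTS, the library's port table and rte_ifpx_port_get().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins: DEMO_MAX_PORTS plays the role of RTE_MAX_ETHPORTS
 * and demo_ports[] the role of the library's port -> proxy table.
 */
#define DEMO_MAX_PORTS 8

static uint16_t demo_ports[DEMO_MAX_PORTS];

/* Mark every port as unbound (DEMO_MAX_PORTS means "no proxy"). */
static void demo_reset(void)
{
	unsigned int i;

	for (i = 0; i < DEMO_MAX_PORTS; ++i)
		demo_ports[i] = DEMO_MAX_PORTS;
}

/* Bind a port to a proxy; the proxy entry maps to its own id. */
static void demo_bind(uint16_t port_id, uint16_t proxy_id)
{
	demo_ports[proxy_id] = proxy_id;
	demo_ports[port_id] = proxy_id;
}

/* Same contract as documented for rte_ifpx_port_get(): store at most 'num'
 * ids in 'ports' but always return the total number of bound ports.
 */
static unsigned int demo_port_get(uint16_t proxy_id, uint16_t *ports,
				  unsigned int num)
{
	unsigned int p, cnt = 0;

	for (p = 0; p < DEMO_MAX_PORTS; ++p) {
		/* Skip ports of other proxies and the proxy entry itself. */
		if (demo_ports[p] != proxy_id || p == proxy_id)
			continue;
		++cnt;
		if (ports && num > 0) {
			*ports++ = (uint16_t)p;
			--num;
		}
	}
	return cnt;
}

/* Two-step query: get the count, allocate, then fetch the ids.
 * Returns a malloc'ed array (caller frees) or NULL when nothing is bound
 * or the allocation fails; '*count' holds the number of valid entries.
 */
static uint16_t *demo_bound_ports(uint16_t proxy_id, unsigned int *count)
{
	uint16_t *ports;

	assert(count != NULL);
	*count = demo_port_get(proxy_id, NULL, 0);
	if (*count == 0)
		return NULL;
	ports = malloc(*count * sizeof(*ports));
	if (ports == NULL) {
		*count = 0;
		return NULL;
	}
	demo_port_get(proxy_id, ports, *count);
	return ports;
}
```

A caller holding a fixed-size buffer can skip the first step, pass the buffer directly, and compare the return value with the buffer size to detect truncation.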
* Re: [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library 2020-03-06 16:41 ` [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library Andrzej Ostruszka @ 2020-03-31 12:36 ` Harman Kalra 2020-03-31 15:37 ` Andrzej Ostruszka [C] 2020-04-01 5:29 ` Varghese, Vipin 1 sibling, 1 reply; 64+ messages in thread From: Harman Kalra @ 2020-03-31 12:36 UTC (permalink / raw) To: Andrzej Ostruszka; +Cc: dev, Thomas Monjalon On Fri, Mar 06, 2020 at 05:41:01PM +0100, Andrzej Ostruszka wrote: > This library allows to designate ports visible to the system (such as > Tun/Tap or KNI) as port representors serving as proxies for other DPDK > ports. When such a proxy is configured this library initially queries > network configuration from the system and later monitors its changes. > > The information gathered is passed to the application either via a set > of user registered callbacks or as an event added to the configured > notification queue (or a combination of these two mechanisms). This way > user can use normal network utilities (like those from the iproute2 > suite) to configure DPDK ports. 
> > Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> > --- > MAINTAINERS | 3 + > config/common_base | 5 + > config/common_linux | 1 + > lib/Makefile | 2 + > .../common/include/rte_eal_interrupts.h | 2 + > lib/librte_eal/linux/eal/eal_interrupts.c | 14 +- > lib/librte_if_proxy/Makefile | 29 + > lib/librte_if_proxy/if_proxy_common.c | 494 +++++++++++++++ > lib/librte_if_proxy/if_proxy_priv.h | 97 +++ > lib/librte_if_proxy/linux/Makefile | 4 + > lib/librte_if_proxy/linux/if_proxy.c | 552 +++++++++++++++++ > lib/librte_if_proxy/meson.build | 19 + > lib/librte_if_proxy/rte_if_proxy.h | 561 ++++++++++++++++++ > lib/librte_if_proxy/rte_if_proxy_version.map | 19 + > lib/meson.build | 2 +- > 15 files changed, 1799 insertions(+), 5 deletions(-) > create mode 100644 lib/librte_if_proxy/Makefile > create mode 100644 lib/librte_if_proxy/if_proxy_common.c > create mode 100644 lib/librte_if_proxy/if_proxy_priv.h > create mode 100644 lib/librte_if_proxy/linux/Makefile > create mode 100644 lib/librte_if_proxy/linux/if_proxy.c > create mode 100644 lib/librte_if_proxy/meson.build > create mode 100644 lib/librte_if_proxy/rte_if_proxy.h > create mode 100644 lib/librte_if_proxy/rte_if_proxy_version.map > > diff --git a/MAINTAINERS b/MAINTAINERS > index f4e0ed8e0..aec7326ca 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -1469,6 +1469,9 @@ F: examples/bpf/ > F: app/test/test_bpf.c > F: doc/guides/prog_guide/bpf_lib.rst > > +IF Proxy - EXPERIMENTAL > +M: Andrzej Ostruszka <aostruszka@marvell.com> > +F: lib/librte_if_proxy/ > > Test Applications > ----------------- > diff --git a/config/common_base b/config/common_base > index 7ca2f28b1..dcc0a0650 100644 > --- a/config/common_base > +++ b/config/common_base > @@ -1075,6 +1075,11 @@ CONFIG_RTE_LIBRTE_BPF_ELF=n > # > CONFIG_RTE_LIBRTE_IPSEC=y > > +# > +# Compile librte_if_proxy > +# > +CONFIG_RTE_LIBRTE_IF_PROXY=n > + > # > # Compile the test application > # > diff --git a/config/common_linux b/config/common_linux > index 
816810671..1244eb0ae 100644 > --- a/config/common_linux > +++ b/config/common_linux > @@ -16,6 +16,7 @@ CONFIG_RTE_LIBRTE_VHOST_NUMA=y > CONFIG_RTE_LIBRTE_VHOST_POSTCOPY=n > CONFIG_RTE_LIBRTE_PMD_VHOST=y > CONFIG_RTE_LIBRTE_IFC_PMD=y > +CONFIG_RTE_LIBRTE_IF_PROXY=y > CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y > CONFIG_RTE_LIBRTE_PMD_MEMIF=y > CONFIG_RTE_LIBRTE_PMD_SOFTNIC=y > diff --git a/lib/Makefile b/lib/Makefile > index 46b91ae1a..6a20806f1 100644 > --- a/lib/Makefile > +++ b/lib/Makefile > @@ -118,6 +118,8 @@ DIRS-$(CONFIG_RTE_LIBRTE_TELEMETRY) += librte_telemetry > DEPDIRS-librte_telemetry := librte_eal librte_metrics librte_ethdev > DIRS-$(CONFIG_RTE_LIBRTE_RCU) += librte_rcu > DEPDIRS-librte_rcu := librte_eal > +DIRS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += librte_if_proxy > +DEPDIRS-librte_if_proxy := librte_eal librte_ethdev > > ifeq ($(CONFIG_RTE_EXEC_ENV_LINUX),y) > DIRS-$(CONFIG_RTE_LIBRTE_KNI) += librte_kni > diff --git a/lib/librte_eal/common/include/rte_eal_interrupts.h b/lib/librte_eal/common/include/rte_eal_interrupts.h > index 773a34a42..296a3853d 100644 > --- a/lib/librte_eal/common/include/rte_eal_interrupts.h > +++ b/lib/librte_eal/common/include/rte_eal_interrupts.h > @@ -36,6 +36,8 @@ enum rte_intr_handle_type { > RTE_INTR_HANDLE_VDEV, /**< virtual device */ > RTE_INTR_HANDLE_DEV_EVENT, /**< device event handle */ > RTE_INTR_HANDLE_VFIO_REQ, /**< VFIO request handle */ > + RTE_INTR_HANDLE_NETLINK, /**< netlink notification handle */ > + > RTE_INTR_HANDLE_MAX /**< count of elements */ > }; > > diff --git a/lib/librte_eal/linux/eal/eal_interrupts.c b/lib/librte_eal/linux/eal/eal_interrupts.c > index cb8e10709..16236a8c4 100644 > --- a/lib/librte_eal/linux/eal/eal_interrupts.c > +++ b/lib/librte_eal/linux/eal/eal_interrupts.c > @@ -680,6 +680,9 @@ rte_intr_enable(const struct rte_intr_handle *intr_handle) > break; > /* not used at this moment */ > case RTE_INTR_HANDLE_ALARM: > +#if RTE_LIBRTE_IF_PROXY > + case RTE_INTR_HANDLE_NETLINK: > +#endif > return -1; > 
#ifdef VFIO_PRESENT > case RTE_INTR_HANDLE_VFIO_MSIX: > @@ -796,6 +799,9 @@ rte_intr_disable(const struct rte_intr_handle *intr_handle) > break; > /* not used at this moment */ > case RTE_INTR_HANDLE_ALARM: > +#if RTE_LIBRTE_IF_PROXY > + case RTE_INTR_HANDLE_NETLINK: > +#endif > return -1; > #ifdef VFIO_PRESENT > case RTE_INTR_HANDLE_VFIO_MSIX: > @@ -889,12 +895,12 @@ eal_intr_process_interrupts(struct epoll_event *events, int nfds) > break; > #endif > #endif > - case RTE_INTR_HANDLE_VDEV: > case RTE_INTR_HANDLE_EXT: > - bytes_read = 0; > - call = true; > - break; > + case RTE_INTR_HANDLE_VDEV: > case RTE_INTR_HANDLE_DEV_EVENT: > +#if RTE_LIBRTE_IF_PROXY > + case RTE_INTR_HANDLE_NETLINK: > +#endif > bytes_read = 0; > call = true; > break; > diff --git a/lib/librte_if_proxy/Makefile b/lib/librte_if_proxy/Makefile > new file mode 100644 > index 000000000..43cb702a2 > --- /dev/null > +++ b/lib/librte_if_proxy/Makefile > @@ -0,0 +1,29 @@ > +# SPDX-License-Identifier: BSD-3-Clause > +# Copyright(C) 2020 Marvell International Ltd. 
> + > +include $(RTE_SDK)/mk/rte.vars.mk > + > +# library name > +LIB = librte_if_proxy.a > + > +CFLAGS += -DALLOW_EXPERIMENTAL_API > +CFLAGS += -O3 > +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) > +LDLIBS += -lrte_eal -lrte_ethdev > + > +EXPORT_MAP := rte_if_proxy_version.map > + > +LIBABIVER := 1 > + > +# all source are stored in SRCS-y > +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) := if_proxy_common.c > + > +SYSDIR := $(patsubst "%app",%,$(CONFIG_RTE_EXEC_ENV)) > +include $(SRCDIR)/$(SYSDIR)/Makefile > + > +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += $(addprefix $(SYSDIR)/,$(SRCS)) > + > +# install this header file > +SYMLINK-$(CONFIG_RTE_LIBRTE_IF_PROXY)-include := rte_if_proxy.h > + > +include $(RTE_SDK)/mk/rte.lib.mk > diff --git a/lib/librte_if_proxy/if_proxy_common.c b/lib/librte_if_proxy/if_proxy_common.c > new file mode 100644 > index 000000000..230727d0c > --- /dev/null > +++ b/lib/librte_if_proxy/if_proxy_common.c > @@ -0,0 +1,494 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(C) 2020 Marvell International Ltd. > + */ > + > +#include <if_proxy_priv.h> > +#include <rte_string_fns.h> > + > + > +/* Definitions of data mentioned in if_proxy_priv.h and local ones. */ > +int ifpx_log_type; > + > +uint16_t ifpx_ports[RTE_MAX_ETHPORTS]; > + > +rte_spinlock_t ifpx_lock = RTE_SPINLOCK_INITIALIZER; > + > +struct ifpx_proxies_head ifpx_proxies = TAILQ_HEAD_INITIALIZER(ifpx_proxies); > + > +struct ifpx_queue_node { > + TAILQ_ENTRY(ifpx_queue_node) elem; > + uint16_t state; > + struct rte_ring *r; > +}; > +static > +TAILQ_HEAD(ifpx_queues_head, ifpx_queue_node) ifpx_queues = > + TAILQ_HEAD_INITIALIZER(ifpx_queues); > + > +/* All function pointers have the same size - so use this one to typecast > + * different callbacks in rte_ifpx_callbacks and test their presence in a > + * generic way. 
> + */ > +union cb_ptr_t { > + int (*f_ptr)(void*); /* type for normal event notification */ > + int (*cfg_done)(void); /* lib notification for finished config */ > +}; > +union { > + struct rte_ifpx_callbacks cbs; > + union cb_ptr_t funcs[RTE_IFPX_NUM_EVENTS]; > +} ifpx_callbacks; > + > +uint64_t rte_ifpx_events_available(void) > +{ > + /* All events are supported on Linux. */ > + return (1ULL << RTE_IFPX_NUM_EVENTS) - 1; > +} > + > +uint16_t rte_ifpx_proxy_create(enum rte_ifpx_proxy_type type) > +{ > + char devargs[16] = { '\0' }; > + int dev_cnt = 0, nlen; > + uint16_t port_id; > + > + switch (type) { > + case RTE_IFPX_DEFAULT: > + case RTE_IFPX_TAP: > + nlen = strlcpy(devargs, "net_tap", sizeof(devargs)); > + break; > + case RTE_IFPX_KNI: > + nlen = strlcpy(devargs, "net_kni", sizeof(devargs)); > + break; > + default: > + IFPX_LOG(ERR, "Unknown proxy type: %d", type); > + return RTE_MAX_ETHPORTS; > + } > + > + RTE_ETH_FOREACH_DEV(port_id) { > + if (strcmp(rte_eth_devices[port_id].device->driver->name, > + devargs) == 0) > + ++dev_cnt; > + } > + snprintf(devargs+nlen, sizeof(devargs)-nlen, "%d", dev_cnt); > + > + return rte_ifpx_proxy_create_by_devarg(devargs); > +} > + > +uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg) > +{ > + uint16_t port_id = RTE_MAX_ETHPORTS; > + struct rte_dev_iterator iter; > + > + if (rte_dev_probe(devarg) < 0) { > + IFPX_LOG(ERR, "Failed to create proxy port %s\n", devarg); > + return RTE_MAX_ETHPORTS; > + } > + > + if (rte_eth_iterator_init(&iter, devarg) == 0) { > + port_id = rte_eth_iterator_next(&iter); > + if (port_id != RTE_MAX_ETHPORTS) > + rte_eth_iterator_cleanup(&iter); > + } > + > + return port_id; > +} > + > +int ifpx_proxy_destroy(struct ifpx_proxy_node *px) > +{ > + unsigned int i; > + uint16_t proxy_id = px->proxy_id; > + > + TAILQ_REMOVE(&ifpx_proxies, px, elem); > + free(px); > + > + /* Clear any bindings for this proxy. 
*/ > + for (i = 0; i < RTE_DIM(ifpx_ports); ++i) { > + if (ifpx_ports[i] == proxy_id) { > + if (i == proxy_id) /* this entry is for proxy itself */ > + ifpx_ports[i] = RTE_MAX_ETHPORTS; > + else > + rte_ifpx_port_unbind(i); > + } > + } > + > + return rte_dev_remove(rte_eth_devices[proxy_id].device); > +} > + > +int rte_ifpx_proxy_destroy(uint16_t proxy_id) > +{ > + struct ifpx_proxy_node *px; > + int ec = 0; > + > + rte_spinlock_lock(&ifpx_lock); > + TAILQ_FOREACH(px, &ifpx_proxies, elem) { > + if (px->proxy_id == proxy_id) > + break; > + } > + if (!px) { > + ec = -EINVAL; > + goto exit; > + } > + if (px->state & IN_USE) > + px->state |= DEL_PENDING; > + else > + ec = ifpx_proxy_destroy(px); > +exit: > + rte_spinlock_unlock(&ifpx_lock); > + return ec; > +} > + > +int rte_ifpx_queue_add(struct rte_ring *r) > +{ > + struct ifpx_queue_node *node; > + int ec = 0; > + > + if (!r) > + return -EINVAL; > + > + rte_spinlock_lock(&ifpx_lock); > + TAILQ_FOREACH(node, &ifpx_queues, elem) { > + if (node->r == r) { > + ec = -EEXIST; > + goto exit; > + } > + } > + > + node = malloc(sizeof(*node)); > + if (!node) { > + ec = -ENOMEM; > + goto exit; > + } > + > + node->r = r; > + TAILQ_INSERT_TAIL(&ifpx_queues, node, elem); > +exit: > + rte_spinlock_unlock(&ifpx_lock); > + > + return ec; > +} > + > +int rte_ifpx_queue_remove(struct rte_ring *r) > +{ > + struct ifpx_queue_node *node, *next; > + int ec = -EINVAL; > + > + if (!r) > + return ec; > + > + rte_spinlock_lock(&ifpx_lock); > + for (node = TAILQ_FIRST(&ifpx_queues); node; node = next) { > + next = TAILQ_NEXT(node, elem); > + if (node->r != r) > + continue; > + TAILQ_REMOVE(&ifpx_queues, node, elem); > + free(node); > + ec = 0; > + break; > + } > + rte_spinlock_unlock(&ifpx_lock); > + > + return ec; > +} > + > +int rte_ifpx_port_bind(uint16_t port_id, uint16_t proxy_id) > +{ > + struct rte_eth_dev_info proxy_eth_info; > + struct ifpx_proxy_node *px; > + int ec; > + > + if (port_id >= RTE_MAX_ETHPORTS || proxy_id >=
RTE_MAX_ETHPORTS || > + /* port is a proxy */ > + ifpx_ports[port_id] == port_id) { > + IFPX_LOG(ERR, "Invalid port_id: %d", port_id); > + return -EINVAL; > + } > + > + /* Do automatic rebinding but issue a warning since this is not > + * considered to be a valid behaviour. > + */ > + if (ifpx_ports[port_id] != RTE_MAX_ETHPORTS) { > + IFPX_LOG(WARNING, "Port already bound: %d -> %d", port_id, > + ifpx_ports[port_id]); > + } > + > + /* Search for existing proxy - if not found add one to the list. */ > + rte_spinlock_lock(&ifpx_lock); > + TAILQ_FOREACH(px, &ifpx_proxies, elem) { > + if (px->proxy_id == proxy_id) > + break; > + } > + if (!px) { > + ec = rte_eth_dev_info_get(proxy_id, &proxy_eth_info); > + if (ec < 0 || proxy_eth_info.if_index == 0) { > + IFPX_LOG(ERR, "Invalid proxy: %d", proxy_id); > + rte_spinlock_unlock(&ifpx_lock); > + return ec < 0 ? ec : -EINVAL; > + } > + px = malloc(sizeof(*px)); > + if (!px) { > + rte_spinlock_unlock(&ifpx_lock); > + return -ENOMEM; > + } > + px->proxy_id = proxy_id; > + px->info.if_index = proxy_eth_info.if_index; > + rte_eth_dev_get_mtu(proxy_id, &px->info.mtu); > + rte_eth_macaddr_get(proxy_id, &px->info.mac); > + memset(px->info.if_name, 0, sizeof(px->info.if_name)); > + TAILQ_INSERT_TAIL(&ifpx_proxies, px, elem); > + ifpx_ports[proxy_id] = proxy_id; > + } > + rte_spinlock_unlock(&ifpx_lock); > + ifpx_ports[port_id] = proxy_id; > + > + /* Add proxy MAC to the port - since port will often just forward > + * packets from the proxy/system they will be sent with proxy MAC as > + * src. In order to pass communication in other direction we should be > + * accepting packets with proxy MAC as dst. 
> + */ > + rte_eth_dev_mac_addr_add(port_id, &px->info.mac, 0); > + > + if (ifpx_platform.get_info) > + ifpx_platform.get_info(px->info.if_index); > + > + return 0; > +} > + > +int rte_ifpx_port_unbind(uint16_t port_id) > +{ > + if (port_id >= RTE_MAX_ETHPORTS || > + ifpx_ports[port_id] == RTE_MAX_ETHPORTS || > + /* port is a proxy */ > + ifpx_ports[port_id] == port_id) > + return -EINVAL; > + > + ifpx_ports[port_id] = RTE_MAX_ETHPORTS; > + /* Proxy without any port bound is OK - that is the state of the proxy > + * that has just been created, and it can still report routing > + * information. So we do not even check if this is the case. > + */ > + > + return 0; > +} > + > +int rte_ifpx_callbacks_register(const struct rte_ifpx_callbacks *cbs) > +{ > + if (!cbs) > + return -EINVAL; > + > + rte_spinlock_lock(&ifpx_lock); > + ifpx_callbacks.cbs = *cbs; > + rte_spinlock_unlock(&ifpx_lock); > + > + return 0; > +} > + > +void rte_ifpx_callbacks_unregister(void) > +{ > + rte_spinlock_lock(&ifpx_lock); > + memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs)); > + rte_spinlock_unlock(&ifpx_lock); > +} > + > +uint16_t rte_ifpx_proxy_get(uint16_t port_id) > +{ > + if (port_id >= RTE_MAX_ETHPORTS) > + return RTE_MAX_ETHPORTS; > + > + return ifpx_ports[port_id]; > +} > + > +unsigned int rte_ifpx_port_get(uint16_t proxy_id, > + uint16_t *ports, unsigned int num) > +{ > + unsigned int p, cnt = 0; > + > + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { > + if (ifpx_ports[p] == proxy_id && ifpx_ports[p] != p) { > + ++cnt; > + if (ports && num > 0) { > + *ports++ = p; > + --num; > + } > + } > + } > + return cnt; > +} > + > +const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id) > +{ > + struct ifpx_proxy_node *px; > + > + if (port_id >= RTE_MAX_ETHPORTS || > + ifpx_ports[port_id] == RTE_MAX_ETHPORTS) > + return NULL; > + > + rte_spinlock_lock(&ifpx_lock); > + TAILQ_FOREACH(px, &ifpx_proxies, elem) { > + if (px->proxy_id == ifpx_ports[port_id]) > + break; > + } > + 
rte_spinlock_unlock(&ifpx_lock); > + RTE_ASSERT(px && "Internal IF Proxy library error"); > + > + return &px->info; > +} > + > +static > +void queue_event(const struct rte_ifpx_event *ev, struct rte_ring *r) > +{ > + struct rte_ifpx_event *e = malloc(sizeof(*ev)); > + > + if (!e) { > + IFPX_LOG(ERR, "Failed to allocate event!"); > + return; > + } > + RTE_ASSERT(r); > + > + *e = *ev; > + rte_ring_sp_enqueue(r, e); > +} > + > +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px) > +{ > + struct ifpx_queue_node *q; > + int done = 0; > + uint16_t p, proxy_id; > + > + if (px) { > + if (px->state & DEL_PENDING) > + return; > + proxy_id = px->proxy_id; > + RTE_ASSERT(proxy_id != RTE_MAX_ETHPORTS); > + px->state |= IN_USE; > + } else > + proxy_id = RTE_MAX_ETHPORTS; > + > + RTE_ASSERT(ev); > + /* This function is expected to be called with a lock held. */ > + RTE_ASSERT(rte_spinlock_trylock(&ifpx_lock) == 0); > + > + if (ifpx_callbacks.funcs[ev->type].f_ptr) { > + union cb_ptr_t cb = ifpx_callbacks.funcs[ev->type]; > + > + /* Drop the lock for the time of callback call. */ > + rte_spinlock_unlock(&ifpx_lock); > + if (px) { > + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { > + if (ifpx_ports[p] != proxy_id || > + ifpx_ports[p] == p) > + continue; > + ev->data.port_id = p; > + done = cb.f_ptr(&ev->data) || done;

Since callbacks are handled as DPDK interrupts, I hope no event can get lost. We cannot afford to lose a route change event, as the kernel might not send it again.

> + } > + } else { > + RTE_ASSERT(ev->type == RTE_IFPX_CFG_DONE); > + done = cb.cfg_done(); > + } > + rte_spinlock_lock(&ifpx_lock); > + } > + if (done) > + goto exit; > + > + /* Event not "consumed" yet so try to notify via queues.
*/ > + TAILQ_FOREACH(q, &ifpx_queues, elem) { > + if (px) { > + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { > + if (ifpx_ports[p] != proxy_id || > + ifpx_ports[p] == p) > + continue; > + /* Set the port_id - the remaining params should > + * be filled before calling this function. > + */ > + ev->data.port_id = p; > + queue_event(ev, q->r); > + } > + } else > + queue_event(ev, q->r); > + } > +exit: > + if (px) > + px->state &= ~IN_USE; > +} > + > +void ifpx_cleanup_proxies(void) > +{ > + struct ifpx_proxy_node *px, *next; > + for (px = TAILQ_FIRST(&ifpx_proxies); px; px = next) { > + next = TAILQ_NEXT(px, elem); > + if (px->state & DEL_PENDING) > + ifpx_proxy_destroy(px); > + } > +} > + > +int rte_ifpx_listen(void) > +{ > + int ec; > + > + if (!ifpx_platform.listen) > + return -ENOTSUP; > + > + ec = ifpx_platform.listen(); > + if (ec == 0 && ifpx_platform.get_info) > + ifpx_platform.get_info(0);

nlink_get_info calls request_info with an if_index. Passing 0 might be fine in the current scenario, but a valid index should be passed to get_info.

> + > + return ec; > +} > + > +int rte_ifpx_close(void) > +{ > + struct ifpx_proxy_node *px; > + struct ifpx_queue_node *q; > + unsigned int p; > + int ec = 0; > + > + if (ifpx_platform.close) { > + ec = ifpx_platform.close(); > + if (ec != 0) > + IFPX_LOG(ERR, "Platform 'close' callback failed."); > + } > + > + rte_spinlock_lock(&ifpx_lock); > + /* Remove queues. */ > + while (!TAILQ_EMPTY(&ifpx_queues)) { > + q = TAILQ_FIRST(&ifpx_queues); > + TAILQ_REMOVE(&ifpx_queues, q, elem); > + free(q); > + } > + > + /* Clear callbacks. */ > + memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs)); > + > + /* Unbind ports. */ > + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { > + if (ifpx_ports[p] == RTE_MAX_ETHPORTS) > + continue; > + if (ifpx_ports[p] == p) > + /* port is a proxy - just clear entry */ > + ifpx_ports[p] = RTE_MAX_ETHPORTS; > + else > + rte_ifpx_port_unbind(p); > + } > + > + /* Clear proxies.
*/ > + while (!TAILQ_EMPTY(&ifpx_proxies)) { > + px = TAILQ_FIRST(&ifpx_proxies); > + TAILQ_REMOVE(&ifpx_proxies, px, elem); > + free(px); > + } > + > + rte_spinlock_unlock(&ifpx_lock); > + > + return ec; > +} > + > +RTE_INIT(if_proxy_init) > +{ > + unsigned int i; > + for (i = 0; i < RTE_DIM(ifpx_ports); ++i) > + ifpx_ports[i] = RTE_MAX_ETHPORTS; > + > + ifpx_log_type = rte_log_register("lib.if_proxy"); > + if (ifpx_log_type >= 0) > + rte_log_set_level(ifpx_log_type, RTE_LOG_WARNING); > + > + if (ifpx_platform.init) > + ifpx_platform.init(); > +} > diff --git a/lib/librte_if_proxy/if_proxy_priv.h b/lib/librte_if_proxy/if_proxy_priv.h > new file mode 100644 > index 000000000..2fbf9127a > --- /dev/null > +++ b/lib/librte_if_proxy/if_proxy_priv.h > @@ -0,0 +1,97 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(C) 2020 Marvell International Ltd. > + */ > +#ifndef _IF_PROXY_PRIV_H_ > +#define _IF_PROXY_PRIV_H_ > + > +#include <rte_if_proxy.h> > +#include <rte_spinlock.h> > + > +extern int ifpx_log_type; > +#define IFPX_LOG(level, fmt, args...) \ > + rte_log(RTE_LOG_ ## level, ifpx_log_type, "%s(): " fmt "\n", \ > + __func__, ##args) > + > +/* Table keeping mapping between ports and their proxies. */ > +extern > +uint16_t ifpx_ports[RTE_MAX_ETHPORTS]; > + > +/* Callbacks and proxies are kept in linked lists. Since this library is really > + * a slow/config path we guard them with a lock - and only one for all of them > + * should be enough. We don't expect a need to protect other data structures - > + * e.g. data for given port is expected to be accessed/modified from single thread.
> + */ > +extern rte_spinlock_t ifpx_lock; > + > +enum ifpx_node_status { > + IN_USE = 1U << 0, > + DEL_PENDING = 1U << 1, > +}; > + > +/* List of configured proxies */ > +struct ifpx_proxy_node { > + TAILQ_ENTRY(ifpx_proxy_node) elem; > + uint16_t proxy_id; > + uint16_t state; > + struct rte_ifpx_info info; > +}; > +extern > +TAILQ_HEAD(ifpx_proxies_head, ifpx_proxy_node) ifpx_proxies; > + > +/* This function should be called by the implementation whenever it notices > + * change in the network configuration. The arguments are: > + * - ev : pointer to filled event data structure (all fields are expected to be > + * filled, with the exception of 'port_id' for all proxy/port related > + * events: this function clones the event notification for each bound port > + * and fills 'port_id' appropriately). > + * - px : proxy node when given event is proxy/port related, otherwise pass NULL > + */ > +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px); > + > +/* This function should be called by the implementation whenever it is done with > + * notification about network configuration change. It is only really needed > + * for the case of callback based API - from the callback user might attempt > + * to remove callbacks/proxies. Removing of callbacks is handled by the > + * ifpx_notify_event() function above, however only implementation really knows > + * when notification for given proxy is finished so it is a duty of it to call > + * this function to cleanup all proxies that have been marked for deletion. > + */ > +void ifpx_cleanup_proxies(void); > + > +/* This is the internal function removing the proxy from the list. It is > + * related to the notification function above and intended to be used by the > + * platform implementation for the case of callback based API. > + * During notification via callback the internal lock is released so that > + * operation would not deadlock on an attempt to take a lock.
However > + * modification (destruction) is not really performed - instead the > + * callbacks/proxies are marked as "to be deleted". > + * Handling of callbacks that are "to be deleted" is done by the > + * ifpx_notify_event() function itself however it cannot delete the proxies (in > + * particular the proxy passed as an argument) since they might still be referred > + * to by the calling function. So it is a responsibility of the platform > + * implementation to check after calling notification function if there are any > + * proxies to be removed and use ifpx_proxy_destroy() to actually release them. > + */ > +int ifpx_proxy_destroy(struct ifpx_proxy_node *px); > + > +/* Every implementation should provide definition of this structure: > + * - init : called during library initialization (NULL when not needed) > + * - listen : this function should start service listening to the network > + * configuration events/changes, > + * - close : this function should close the service started by listen() > + * - get_info : this function should query system for current configuration of > + * interface with index 'if_index'. After successful initialization of > + * listening service this function is called with 0 as an argument. In that > + * case configuration of all ports should be obtained - and when this > + * procedure completes a RTE_IFPX_CFG_DONE event should be signaled via > + * ifpx_notify_event(). > + */ > +extern > +struct ifpx_platform_callbacks { > + void (*init)(void); > + int (*listen)(void); > + int (*close)(void); > + void (*get_info)(int if_index); > +} ifpx_platform; > + > +#endif /* _IF_PROXY_PRIV_H_ */ > diff --git a/lib/librte_if_proxy/linux/Makefile b/lib/librte_if_proxy/linux/Makefile > new file mode 100644 > index 000000000..275b7e1e3 > --- /dev/null > +++ b/lib/librte_if_proxy/linux/Makefile > @@ -0,0 +1,4 @@ > +# SPDX-License-Identifier: BSD-3-Clause > +# Copyright(C) 2020 Marvell International Ltd.
> + > +SRCS += if_proxy.c > diff --git a/lib/librte_if_proxy/linux/if_proxy.c b/lib/librte_if_proxy/linux/if_proxy.c > new file mode 100644 > index 000000000..bf851c096 > --- /dev/null > +++ b/lib/librte_if_proxy/linux/if_proxy.c > @@ -0,0 +1,552 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(C) 2020 Marvell International Ltd. > + */ > +#include <if_proxy_priv.h> > +#include <rte_interrupts.h> > +#include <rte_string_fns.h> > + > +#include <stdbool.h> > +#include <unistd.h> > +#include <errno.h> > +#include <sys/socket.h> > +#include <linux/rtnetlink.h> > +#include <linux/if.h> > + > +static > +struct rte_intr_handle ifpx_irq = { > + .type = RTE_INTR_HANDLE_NETLINK, > + .fd = -1, > +}; > + > +static > +unsigned int ifpx_pid; > + > +static > +int request_info(int type, int index) > +{ > + static rte_spinlock_t send_lock = RTE_SPINLOCK_INITIALIZER; > + struct info_get { > + struct nlmsghdr h; > + union { > + struct ifinfomsg ifm; > + struct ifaddrmsg ifa; > + struct rtmsg rtm; > + struct ndmsg ndm; > + } __rte_aligned(NLMSG_ALIGNTO); > + } info_req; > + int ret; > + > + memset(&info_req, 0, sizeof(info_req)); > + /* First byte of these messages is family, so just make sure that this > + * memset is enough to get all families. 
> + */ > + RTE_ASSERT(AF_UNSPEC == 0); > + > + info_req.h.nlmsg_pid = ifpx_pid; > + info_req.h.nlmsg_type = type; > + info_req.h.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP; > + info_req.h.nlmsg_len = offsetof(struct info_get, ifm); > + > + switch (type) { > + case RTM_GETLINK: > + info_req.h.nlmsg_len += sizeof(info_req.ifm); > + info_req.ifm.ifi_index = index; > + break; > + case RTM_GETADDR: > + info_req.h.nlmsg_len += sizeof(info_req.ifa); > + info_req.ifa.ifa_index = index; > + break; > + case RTM_GETROUTE: > + info_req.h.nlmsg_len += sizeof(info_req.rtm); > + break; > + case RTM_GETNEIGH: > + info_req.h.nlmsg_len += sizeof(info_req.ndm); > + break; > + default: > + IFPX_LOG(WARNING, "Unhandled message type: %d", type); > + return -EINVAL; > + } > + /* Store request type (and if it is global or link specific) in 'seq'. > + * Later it is used during handling of reply to continue requesting of > + * information dump from system - if needed. > + */ > + info_req.h.nlmsg_seq = index << 8 | type; > + > + IFPX_LOG(DEBUG, "\tRequesting msg %d for: %u", type, index); > + > + rte_spinlock_lock(&send_lock); > + ret = send(ifpx_irq.fd, &info_req, info_req.h.nlmsg_len, 0); > + if (ret < 0) { > + IFPX_LOG(ERR, "Failed to send netlink msg: %d", errno); > + rte_errno = errno; > + } > + rte_spinlock_unlock(&send_lock); > + > + return ret; > +} > + > +static > +void handle_link(const struct nlmsghdr *h) > +{ > + const struct ifinfomsg *ifi = NLMSG_DATA(h); > + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi)); > + const struct rtattr *attrs[IFLA_MAX+1] = { NULL }; > + const struct rtattr *attr; > + struct ifpx_proxy_node *px; > + struct rte_ifpx_event ev; > + > + IFPX_LOG(DEBUG, "\tLink action (%u): %u, 0x%x/0x%x (flags/changed)", > + ifi->ifi_index, h->nlmsg_type, ifi->ifi_flags, > + ifi->ifi_change); > + > + rte_spinlock_lock(&ifpx_lock); > + TAILQ_FOREACH(px, &ifpx_proxies, elem) { > + if (px->info.if_index == (unsigned int)ifi->ifi_index) > + break; > + } > + > + /* Drop 
messages that are not associated with any proxy */ > + if (!px) > + goto exit; > + /* When message is a reply to request for specific interface then keep > + * it only when it contains info for this interface. > + */ > + if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 && > + (h->nlmsg_seq >> 8) != (unsigned)ifi->ifi_index) > + goto exit; > + > + for (attr = IFLA_RTA(ifi); RTA_OK(attr, alen); > + attr = RTA_NEXT(attr, alen)) { > + if (attr->rta_type > IFLA_MAX) > + continue; > + attrs[attr->rta_type] = attr; > + } > + > + if (ifi->ifi_change & IFF_UP) { > + ev.type = RTE_IFPX_LINK_CHANGE; > + ev.link_change.is_up = ifi->ifi_flags & IFF_UP; > + ifpx_notify_event(&ev, px); > + } > + if (attrs[IFLA_MTU]) { > + uint16_t mtu = *(const int *)RTA_DATA(attrs[IFLA_MTU]); > + if (mtu != px->info.mtu) { > + px->info.mtu = mtu; > + ev.type = RTE_IFPX_MTU_CHANGE; > + ev.mtu_change.mtu = mtu; > + ifpx_notify_event(&ev, px); > + } > + } > + if (attrs[IFLA_ADDRESS]) { > + const struct rte_ether_addr *mac = > + RTA_DATA(attrs[IFLA_ADDRESS]); > + > + RTE_ASSERT(RTA_PAYLOAD(attrs[IFLA_ADDRESS]) == > + RTE_ETHER_ADDR_LEN); > + if (memcmp(mac, &px->info.mac, RTE_ETHER_ADDR_LEN) != 0) { > + rte_ether_addr_copy(mac, &px->info.mac); > + ev.type = RTE_IFPX_MAC_CHANGE; > + rte_ether_addr_copy(mac, &ev.mac_change.mac); > + ifpx_notify_event(&ev, px); > + } > + } > + if (h->nlmsg_pid == ifpx_pid) { > + RTE_ASSERT((h->nlmsg_seq & 0xFF) == RTM_GETLINK); > + /* If this is reply for specific link request (not initial > + * global dump) then follow up with address request, otherwise > + * just store the interface name. 
> + */ > + if (h->nlmsg_seq >> 8) > + request_info(RTM_GETADDR, ifi->ifi_index); > + else if (!px->info.if_name[0] && attrs[IFLA_IFNAME]) > + strlcpy(px->info.if_name, RTA_DATA(attrs[IFLA_IFNAME]), > + sizeof(px->info.if_name)); > + } > + > + ifpx_cleanup_proxies(); > +exit: > + rte_spinlock_unlock(&ifpx_lock); > +} > + > +static > +void handle_addr(const struct nlmsghdr *h, bool needs_del) > +{ > + const struct ifaddrmsg *ifa = NLMSG_DATA(h); > + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifa)); > + const struct rtattr *attrs[IFA_MAX+1] = { NULL }; > + const struct rtattr *attr; > + struct ifpx_proxy_node *px; > + struct rte_ifpx_event ev; > + const uint8_t *ip; > + > + IFPX_LOG(DEBUG, "\tAddr action (%u): %u, family: %u", > + ifa->ifa_index, h->nlmsg_type, ifa->ifa_family); > + > + rte_spinlock_lock(&ifpx_lock); > + TAILQ_FOREACH(px, &ifpx_proxies, elem) { > + if (px->info.if_index == ifa->ifa_index) > + break; > + } > + > + /* Drop messages that are not associated with any proxy */ > + if (!px) > + goto exit; > + /* When message is a reply to request for specific interface then keep > + * it only when it contains info for this interface. > + */ > + if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 && > + (h->nlmsg_seq >> 8) != ifa->ifa_index) > + goto exit; > + > + for (attr = IFA_RTA(ifa); RTA_OK(attr, alen); > + attr = RTA_NEXT(attr, alen)) { > + if (attr->rta_type > IFA_MAX) > + continue; > + attrs[attr->rta_type] = attr; > + } > + > + if (attrs[IFA_ADDRESS]) { > + ip = RTA_DATA(attrs[IFA_ADDRESS]); > + if (ifa->ifa_family == AF_INET) { > + ev.type = needs_del ? RTE_IFPX_ADDR_DEL > + : RTE_IFPX_ADDR_ADD; > + ev.addr_change.ip = > + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); > + } else { > + ev.type = needs_del ? 
RTE_IFPX_ADDR6_DEL > + : RTE_IFPX_ADDR6_ADD; > + memcpy(ev.addr6_change.ip, ip, 16); > + } > + ifpx_notify_event(&ev, px); > + ifpx_cleanup_proxies(); > + } > +exit: > + rte_spinlock_unlock(&ifpx_lock); > +} > + > +static > +void handle_route(const struct nlmsghdr *h, bool needs_del) > +{ > + const struct rtmsg *r = NLMSG_DATA(h); > + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*r)); > + const struct rtattr *attrs[RTA_MAX+1] = { NULL }; > + const struct rtattr *attr; > + struct rte_ifpx_event ev; > + struct ifpx_proxy_node *px = NULL; > + const uint8_t *ip; > + > + IFPX_LOG(DEBUG, "\tRoute action: %u, family: %u", > + h->nlmsg_type, r->rtm_family); > + > + for (attr = RTM_RTA(r); RTA_OK(attr, alen); > + attr = RTA_NEXT(attr, alen)) { > + if (attr->rta_type > RTA_MAX) > + continue; > + attrs[attr->rta_type] = attr; > + } > + > + memset(&ev, 0, sizeof(ev)); > + ev.type = RTE_IFPX_NUM_EVENTS; > + > + rte_spinlock_lock(&ifpx_lock); > + if (attrs[RTA_OIF]) { > + int if_index = *((int32_t*)RTA_DATA(attrs[RTA_OIF])); > + > + if (if_index > 0) { > + TAILQ_FOREACH(px, &ifpx_proxies, elem) { > + if (px->info.if_index == (uint32_t)if_index) > + break; > + } > + } > + } > + /* We are only interested in routes related to the proxy interfaces and > + * we need to have dst - otherwise skip the message. > + */ > + if (!px || !attrs[RTA_DST]) > + goto exit; > + > + ip = RTA_DATA(attrs[RTA_DST]); > + /* This is common to both IPv4/6. */ > + ev.route_change.depth = r->rtm_dst_len; > + if (r->rtm_family == AF_INET) { > + ev.type = needs_del ? RTE_IFPX_ROUTE_DEL > + : RTE_IFPX_ROUTE_ADD; > + ev.route_change.ip = > + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); > + } else { > + ev.type = needs_del ? 
RTE_IFPX_ROUTE6_DEL > + : RTE_IFPX_ROUTE6_ADD; > + memcpy(ev.route6_change.ip, ip, 16); > + } > + if (attrs[RTA_GATEWAY]) { > + ip = RTA_DATA(attrs[RTA_GATEWAY]); > + if (r->rtm_family == AF_INET) > + ev.route_change.gateway = > + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); > + else > + memcpy(ev.route6_change.gateway, ip, 16); > + } > + > + ifpx_notify_event(&ev, px); > + /* Let's check for proxies to remove here too - just in case somebody > + * removed the non-proxy related callback. > + */ > + ifpx_cleanup_proxies(); > +exit: > + rte_spinlock_unlock(&ifpx_lock); > +} > + > +/* Link, addr and route related messages seem to have this macro defined but not > + * neighbour one. Define one if it is missing - const qualifiers added just to > + * silence compiler - for some reason it is not needed in equivalent macros for > + * other messages and here compiler is complaining about (char*) cast on pointer > + * to const. > + */ > +#ifndef NDA_RTA > +#define NDA_RTA(r) ((const struct rtattr*)(((const char*)(r)) + \ > + NLMSG_ALIGN(sizeof(struct ndmsg)))) > +#endif > + > +static > +void handle_neigh(const struct nlmsghdr *h, bool needs_del) > +{ > + const struct ndmsg *n = NLMSG_DATA(h); > + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*n)); > + const struct rtattr *attrs[NDA_MAX+1] = { NULL }; > + const struct rtattr *attr; > + struct ifpx_proxy_node *px; > + struct rte_ifpx_event ev; > + const uint8_t *ip; > + > + IFPX_LOG(DEBUG, "\tNeighbour action: %u, family: %u, state: %u, if: %d", > + h->nlmsg_type, n->ndm_family, n->ndm_state, n->ndm_ifindex); > + > + for (attr = NDA_RTA(n); RTA_OK(attr, alen); > + attr = RTA_NEXT(attr, alen)) { > + if (attr->rta_type > NDA_MAX) > + continue; > + attrs[attr->rta_type] = attr; > + } > + > + memset(&ev, 0, sizeof(ev)); > + ev.type = RTE_IFPX_NUM_EVENTS; > + > + rte_spinlock_lock(&ifpx_lock); > + TAILQ_FOREACH(px, &ifpx_proxies, elem) { > + if (px->info.if_index == (unsigned)n->ndm_ifindex) > + break; > + } > + /* We need only subset of 
neighbourhood related to proxy interfaces. > + * lladdr seems to be needed only for adding new entry - modifications > + * (also reported via RTM_NEWLINK) and deletion include only dst. > + */ > + if (!px || !attrs[NDA_DST] || (!needs_del && !attrs[NDA_LLADDR])) > + goto exit; > + > + ip = RTA_DATA(attrs[NDA_DST]); > + if (n->ndm_family == AF_INET) { > + ev.type = needs_del ? RTE_IFPX_NEIGH_DEL > + : RTE_IFPX_NEIGH_ADD; > + ev.neigh_change.ip = > + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); > + } else { > + ev.type = needs_del ? RTE_IFPX_NEIGH6_DEL > + : RTE_IFPX_NEIGH6_ADD; > + memcpy(ev.neigh6_change.ip, ip, 16); > + } > + if (attrs[NDA_LLADDR]) > + rte_ether_addr_copy(RTA_DATA(attrs[NDA_LLADDR]), > + &ev.neigh_change.mac); > + > + ifpx_notify_event(&ev, px); > + /* Let's check for proxies to remove here too - just in case somebody > + * removed the non-proxy related callback. > + */ > + ifpx_cleanup_proxies(); > +exit: > + rte_spinlock_unlock(&ifpx_lock); > +} > + > +static > +void if_proxy_intr_callback(void *arg __rte_unused) > +{ > + struct nlmsghdr *h; > + struct sockaddr_nl addr; > + socklen_t addr_len; > + char buf[8192]; > + ssize_t len; > + > +restart: > + len = recvfrom(ifpx_irq.fd, buf, sizeof(buf), 0, > + (struct sockaddr *)&addr, &addr_len); > + if (len < 0) { > + if (errno == EINTR) { > + IFPX_LOG(DEBUG, "recvmsg() interrupted"); > + goto restart; > + } > + IFPX_LOG(ERR, "Failed to read netlink msg: %ld (errno %d)", > + len, errno); > + return; > + } > + if (addr_len != sizeof(addr)) { > + IFPX_LOG(ERR, "Invalid netlink addr size: %d", addr_len); > + return; > + } > + IFPX_LOG(DEBUG, "Read %lu bytes (buf %lu) from %u/%u", len, > + sizeof(buf), addr.nl_pid, addr.nl_groups); > + > + for (h = (struct nlmsghdr *)buf; NLMSG_OK(h, len); > + h = NLMSG_NEXT(h, len)) { > + IFPX_LOG(DEBUG, "Recv msg: %u (%u/%u/%u seq/flags/pid)", > + h->nlmsg_type, h->nlmsg_seq, h->nlmsg_flags, > + h->nlmsg_pid); > + > + switch (h->nlmsg_type) { > + case RTM_NEWLINK: > + case 
RTM_DELLINK:
> +			handle_link(h);
> +			break;
> +		case RTM_NEWADDR:
> +		case RTM_DELADDR:
> +			handle_addr(h, h->nlmsg_type == RTM_DELADDR);
> +			break;
> +		case RTM_NEWROUTE:
> +		case RTM_DELROUTE:
> +			handle_route(h, h->nlmsg_type == RTM_DELROUTE);
> +			break;
> +		case RTM_NEWNEIGH:
> +		case RTM_DELNEIGH:
> +			handle_neigh(h, h->nlmsg_type == RTM_DELNEIGH);
> +			break;
> +		}
> +
> +		/* If this is a reply for global request then follow up with
> +		 * additional requests and notify about finish.
> +		 */
> +		if (h->nlmsg_pid == ifpx_pid && (h->nlmsg_seq >> 8) == 0 &&
> +			h->nlmsg_type == NLMSG_DONE) {

Sorry, but in what scenario will the flow reach here?

> +			if ((h->nlmsg_seq & 0xFF) == RTM_GETLINK)
> +				request_info(RTM_GETADDR, 0);
> +			else if ((h->nlmsg_seq & 0xFF) == RTM_GETADDR)
> +				request_info(RTM_GETROUTE, 0);
> +			else if ((h->nlmsg_seq & 0xFF) == RTM_GETROUTE)
> +				request_info(RTM_GETNEIGH, 0);
> +			else {
> +				struct rte_ifpx_event ev = {
> +					.type = RTE_IFPX_CFG_DONE
> +				};
> +
> +				RTE_ASSERT((h->nlmsg_seq & 0xFF) ==
> +						RTM_GETNEIGH);
> +				rte_spinlock_lock(&ifpx_lock);
> +				ifpx_notify_event(&ev, NULL);
> +				rte_spinlock_unlock(&ifpx_lock);
> +			}
> +		}
> +	}
> +	IFPX_LOG(DEBUG, "Finished msg loop: %ld bytes left", len);
> +}
> +
> +static
> +int nlink_listen(void)
> +{
> +	struct sockaddr_nl addr = {
> +		.nl_family = AF_NETLINK,
> +		.nl_pid = 0,
> +	};
> +	socklen_t addr_len = sizeof(addr);
> +	int ret;
> +
> +	if (ifpx_irq.fd != -1) {
> +		rte_errno = EBUSY;
> +		return -1;
> +	}
> +
> +	addr.nl_groups = 1 << (RTNLGRP_LINK-1)
> +			| 1 << (RTNLGRP_NEIGH-1)
> +			| 1 << (RTNLGRP_IPV4_IFADDR-1)
> +			| 1 << (RTNLGRP_IPV6_IFADDR-1)
> +			| 1 << (RTNLGRP_IPV4_ROUTE-1)
> +			| 1 << (RTNLGRP_IPV6_ROUTE-1);
> +
> +	ifpx_irq.fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC,
> +				NETLINK_ROUTE);
> +	if (ifpx_irq.fd == -1) {
> +		IFPX_LOG(ERR, "Failed to create netlink socket: %d", errno);
> +		goto error;
> +	}
> +	/* Starting with kernel 4.19 you can request dump for a specific
> +	 *
interface and kernel will filter out and send only relevant info. > + * Otherwise NLM_F_DUMP will generate info for all interfaces and you > + * need to filter them yourself. > + */ > +#ifdef NETLINK_DUMP_STRICT_CHK > + ret = 1; /* use this var also as an input param */ > + ret = setsockopt(ifpx_irq.fd, SOL_SOCKET, NETLINK_DUMP_STRICT_CHK, > + &ret, sizeof(ret)); > + if (ret < 0) { > + IFPX_LOG(ERR, "Failed to set socket option: %d", errno); > + goto error; > + } > +#endif > + > + ret = bind(ifpx_irq.fd, (struct sockaddr *)&addr, addr_len); > + if (ret < 0) { > + IFPX_LOG(ERR, "Failed to bind socket: %d", errno); > + goto error; > + } > + ret = getsockname(ifpx_irq.fd, (struct sockaddr *)&addr, &addr_len); > + if (ret < 0) { > + IFPX_LOG(ERR, "Failed to get socket addr: %d", errno); > + goto error; > + } else { > + ifpx_pid = addr.nl_pid; > + IFPX_LOG(DEBUG, "Assigned port ID: %u", addr.nl_pid); > + } > + > + ret = rte_intr_callback_register(&ifpx_irq, if_proxy_intr_callback, > + NULL); > + if (ret == 0) > + return 0; > + > +error: > + rte_errno = errno; > + if (ifpx_irq.fd != -1) { > + close(ifpx_irq.fd); > + ifpx_irq.fd = -1; > + } > + return -1; > +} > + > +static > +int nlink_close(void) > +{ > + int ec; > + > + if (ifpx_irq.fd < 0) > + return -EBADFD; > + > + do > + ec = rte_intr_callback_unregister(&ifpx_irq, > + if_proxy_intr_callback, NULL); > + while (ec == -EAGAIN); /* unlikely but possible - at least I think so */ > + > + close(ifpx_irq.fd); > + ifpx_irq.fd = -1; > + ifpx_pid = 0; > + > + return 0; > +} > + > +static > +void nlink_get_info(int if_index) > +{ > + if (ifpx_irq.fd != -1) > + request_info(RTM_GETLINK, if_index); > +} > + > +struct ifpx_platform_callbacks ifpx_platform = { > + .init = NULL, > + .listen = nlink_listen, > + .close = nlink_close, > + .get_info = nlink_get_info, > +}; > diff --git a/lib/librte_if_proxy/meson.build b/lib/librte_if_proxy/meson.build > new file mode 100644 > index 000000000..f0c1a6e15 > --- /dev/null > +++ 
b/lib/librte_if_proxy/meson.build > @@ -0,0 +1,19 @@ > +# SPDX-License-Identifier: BSD-3-Clause > +# Copyright(C) 2020 Marvell International Ltd. > + > +# Currently only implemented on Linux > +if not is_linux > + build = false > + reason = 'only supported on linux' > +endif > + > +version = 1 > +allow_experimental_apis = true > + > +deps += ['ethdev'] > +sources = files('if_proxy_common.c') > +headers = files('rte_if_proxy.h') > + > +if is_linux > + sources += files('linux/if_proxy.c') > +endif > diff --git a/lib/librte_if_proxy/rte_if_proxy.h b/lib/librte_if_proxy/rte_if_proxy.h > new file mode 100644 > index 000000000..e620319b3 > --- /dev/null > +++ b/lib/librte_if_proxy/rte_if_proxy.h > @@ -0,0 +1,561 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(C) 2020 Marvell International Ltd. > + */ > + > +#ifndef _RTE_IF_PROXY_H_ > +#define _RTE_IF_PROXY_H_ > + > +/** > + * @file > + * RTE IF Proxy library > + * > + * The IF Proxy library allows for monitoring of system network configuration > + * and configuration of DPDK ports by using usual system utilities (like the > + * ones from iproute2 package). > + * > + * It is based on the notion of "proxy interface" which actually can be any DPDK > + * port which is also visible to the system - that is it has non-zero 'if_index' > + * field in 'rte_eth_dev_info' structure. > + * > + * If application doesn't have any such port (or doesn't want to use it for > + * proxy) it can create one by calling: > + * > + * proxy_id = rte_ifpx_create(RTE_IFPX_DEFAULT); > + * > + * This function is just a wrapper that constructs valid 'devargs' string based > + * on the proxy type chosen (currently Tap or KNI) and creates the interface by > + * calling rte_ifpx_dev_create(). > + * > + * Once one has DPDK port capable of being proxy one can bind target DPDK port > + * to it by calling. 
> + * > + * rte_ifpx_port_bind(port_id, proxy_id); > + * > + * This binding is a logical one - there is no automatic packet forwarding > + * between port and it's proxy since the library doesn't know the structure of > + * application's packet processing. It remains application responsibility to > + * forward the packets from/to proxy port (by calling the usual DPDK RX/TX burst > + * API). However when the library notes some change to the proxy interface it > + * will simply call appropriate callback with 'port_id' of the DPDK port that is > + * bound to this proxy interface. The binding can be 1 to many - that is many > + * ports can point to one proxy - in that case registered callbacks will be > + * called for every bound port. > + * > + * The callbacks that are used for notifications are described by the > + * 'rte_ifpx_callbacks' structure and they are registered by calling: > + * > + * rte_ifpx_callbacks_register(&cbs); > + * > + * Finally the application should call: > + * > + * rte_ifpx_listen(); > + * > + * which will query system for present network configuration and start listening > + * to its changes. > + */ > + > +#include <rte_eal.h> > +#include <rte_ethdev.h> > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +/** > + * Enum naming the type of proxy to create. > + * > + * @see rte_ifpx_create() > + */ > +enum rte_ifpx_proxy_type { > + RTE_IFPX_DEFAULT, /**< Use default proxy type for given arch. */ > + RTE_IFPX_TAP, /**< Use Tap based port for proxy. */ > + RTE_IFPX_KNI /**< Use KNI based port for proxy. */ > +}; > + > +/** > + * Create DPDK port that can serve as an interface proxy. > + * > + * This function is just a wrapper around rte_ifpx_create_by_devarg() that > + * constructs its 'devarg' argument based on type of proxy requested. > + * > + * @param type > + * A type of proxy to create. > + * > + * @return > + * DPDK port id on success, RTE_MAX_ETHPORTS otherwise. 
> + * > + * @see enum rte_ifpx_type > + * @see rte_ifpx_create_by_devarg() > + */ > +__rte_experimental > +uint16_t rte_ifpx_proxy_create(enum rte_ifpx_proxy_type type); > + > +/** > + * Create DPDK port that can serve as an interface proxy. > + * > + * @param devarg > + * A string passed to rte_dev_probe() to create proxy port. > + * > + * @return > + * DPDK port id on success, RTE_MAX_ETHPORTS otherwise. > + */ > +__rte_experimental > +uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg); > + > +/** > + * Remove DPDK proxy port. > + * > + * In addition to removing the proxy port the bindings (if any) are cleared. > + * > + * @param proxy_id > + * Port id of the proxy that should be removed. > + * > + * @return > + * 0 on success, negative on error. > + */ > +__rte_experimental > +int rte_ifpx_proxy_destroy(uint16_t proxy_id); > + > +/** > + * The rte_ifpx_event_type enum lists all possible event types that can be > + * signaled by this library. To learn what events are supported on your > + * platform call rte_ifpx_events_available(). > + * > + * NOTE - do not reorder these enums freely, their values need to correspond to > + * the order of the callbacks in struct rte_ifpx_callbacks. 
> + */ > +enum rte_ifpx_event_type { > + RTE_IFPX_MAC_CHANGE, /**< @see struct rte_ifpx_mac_change */ > + RTE_IFPX_MTU_CHANGE, /**< @see struct rte_ifpx_mtu_change */ > + RTE_IFPX_LINK_CHANGE, /**< @see struct rte_ifpx_link_change */ > + RTE_IFPX_ADDR_ADD, /**< @see struct rte_ifpx_addr_change */ > + RTE_IFPX_ADDR_DEL, /**< @see struct rte_ifpx_addr_change */ > + RTE_IFPX_ADDR6_ADD, /**< @see struct rte_ifpx_addr6_change */ > + RTE_IFPX_ADDR6_DEL, /**< @see struct rte_ifpx_addr6_change */ > + RTE_IFPX_ROUTE_ADD, /**< @see struct rte_ifpx_route_change */ > + RTE_IFPX_ROUTE_DEL, /**< @see struct rte_ifpx_route_change */ > + RTE_IFPX_ROUTE6_ADD, /**< @see struct rte_ifpx_route6_change */ > + RTE_IFPX_ROUTE6_DEL, /**< @see struct rte_ifpx_route6_change */ > + RTE_IFPX_NEIGH_ADD, /**< @see struct rte_ifpx_neigh_change */ > + RTE_IFPX_NEIGH_DEL, /**< @see struct rte_ifpx_neigh_change */ > + RTE_IFPX_NEIGH6_ADD, /**< @see struct rte_ifpx_neigh6_change */ > + RTE_IFPX_NEIGH6_DEL, /**< @see struct rte_ifpx_neigh6_change */ > + RTE_IFPX_CFG_DONE, /**< This event is a lib specific event - it is > + * signaled when initial network configuration > + * query is finished and has no event data. > + */ > + RTE_IFPX_NUM_EVENTS, > +}; > + > +/** > + * Get the bit mask of implemented events/callbacks for this platform. > + * > + * @return > + * Bit mask of events/callbacks implemented: each event type can be tested by > + * checking bit (1 << ev) where 'ev' is one of the rte_ifpx_event_type enum > + * values. > + * @see enum rte_ifpx_event_type > + */ > +__rte_experimental > +uint64_t rte_ifpx_events_available(void); > + > +/** > + * The rte_ifpx_event defines structure used to pass notification event to > + * application. Each event type has its own dedicated inner structure - these > + * structures are also used when using callbacks notifications. 
> + */ > +struct rte_ifpx_event { > + enum rte_ifpx_event_type type; > + union { > + /** Structure used to pass notification about MAC change of the > + * proxy interface. > + * @see RTE_IFPX_MAC_CHANGE > + */ > + struct rte_ifpx_mac_change { > + uint16_t port_id; > + struct rte_ether_addr mac; > + } mac_change; > + /** Structure used to pass notification about MTU change. > + * @see RTE_IFPX_MTU_CHANGE > + */ > + struct rte_ifpx_mtu_change { > + uint16_t port_id; > + uint16_t mtu; > + } mtu_change; > + /** Structure used to pass notification about link going > + * up/down. > + * @see RTE_IFPX_LINK_CHANGE > + */ > + struct rte_ifpx_link_change { > + uint16_t port_id; > + int is_up; > + } link_change; > + /** Structure used to pass notification about IPv4 address being > + * added/removed. All IPv4 addresses reported by this library > + * are in host order. > + * @see RTE_IFPX_ADDR_ADD > + * @see RTE_IFPX_ADDR_DEL > + */ > + struct rte_ifpx_addr_change { > + uint16_t port_id; > + uint32_t ip; > + } addr_change; > + /** Structure used to pass notification about IPv6 address being > + * added/removed. > + * @see RTE_IFPX_ADDR6_ADD > + * @see RTE_IFPX_ADDR6_DEL > + */ > + struct rte_ifpx_addr6_change { > + uint16_t port_id; > + uint8_t ip[16]; > + } addr6_change; > + /** Structure used to pass notification about IPv4 route being > + * added/removed. > + * @see RTE_IFPX_ROUTE_ADD > + * @see RTE_IFPX_ROUTE_DEL > + */ > + struct rte_ifpx_route_change { > + uint16_t port_id; > + uint8_t depth; > + uint32_t ip; > + uint32_t gateway; > + } route_change; > + /** Structure used to pass notification about IPv6 route being > + * added/removed. > + * @see RTE_IFPX_ROUTE6_ADD > + * @see RTE_IFPX_ROUTE6_DEL > + */ > + struct rte_ifpx_route6_change { > + uint16_t port_id; > + uint8_t depth; > + uint8_t ip[16]; > + uint8_t gateway[16]; > + } route6_change; > + /** Structure used to pass notification about IPv4 neighbour > + * info changes. 
> + * @see RTE_IFPX_NEIGH_ADD > + * @see RTE_IFPX_NEIGH_DEL > + */ > + struct rte_ifpx_neigh_change { > + uint16_t port_id; > + struct rte_ether_addr mac; > + uint32_t ip; > + } neigh_change; > + /** Structure used to pass notification about IPv6 neighbour > + * info changes. > + * @see RTE_IFPX_NEIGH6_ADD > + * @see RTE_IFPX_NEIGH6_DEL > + */ > + struct rte_ifpx_neigh6_change { > + uint16_t port_id; > + struct rte_ether_addr mac; > + uint8_t ip[16]; > + } neigh6_change; > + /* This structure is used internally - to abstract common parts > + * of proxy/port related events and to be able to refer to this > + * union without giving it a name. > + */ > + struct { > + uint16_t port_id; > + } data; > + }; > +}; > + > +/** > + * This library can deliver notification about network configuration changes > + * either by the use of registered callbacks and/or by queueing change events to > + * configured notification queues. The logic used is: > + * 1. If there is callback registered for given event type it is called. In > + * case of many ports to one proxy binding, this callback is called for every > + * port bound. > + * 2. If this callback returns non-zero value (for any of ports in case of > + * many-1 bindings) the handling of an event is considered as complete. > + * 3. Otherwise the event is added to each configured event queue. The event is > + * allocated with malloc() so after dequeueing and handling the application > + * should deallocate it with free(). > + * > + * This dual notification mechanism is meant to provide some flexibility to > + * application writer. For example, if you store your data in a single writer/ > + * many readers coherent data structure you could just update this structure > + * from the callback. If you keep separate copy per lcore/port you could make > + * some common preparations (if applicable) in the callback, return 0 and use > + * notification queues to pick up the change and update data structures. 
Or you > + * could skip the callbacks altogether and just use notification queues - and > + * configure them at the level appropriate for your application design (one > + * global / one per lcore / one per port ...). > + */ > + > +/** > + * Add notification queue to the list of queues. > + * > + * @param r > + * Ring used for queueing of notification events - application can assume that > + * there is only one producer. > + * @return > + * 0 on success, negative otherwise. > + */ > +int rte_ifpx_queue_add(struct rte_ring *r); > + > +/** > + * Remove notification queue from the list of queues. > + * > + * @param r > + * Notification ring used for queueing of notification events (previously > + * added via rte_ifpx_queue_add()). > + * @return > + * 0 on success, negative otherwise. > + */ > +int rte_ifpx_queue_remove(struct rte_ring *r); > + > +/** > + * This structure groups the callbacks that might be called as a notification > + * events for changing network configuration. Not every platform might > + * implement all of them and you can query the availability with > + * rte_ifpx_callbacks_available() function. > + * @see rte_ifpx_events_available() > + * @see rte_ifpx_callbacks_register() > + */ > +struct rte_ifpx_callbacks { > + int (*mac_change)(const struct rte_ifpx_mac_change *event); > + /**< Callback for notification about MAC change of the proxy interface. > + * This callback (as all other port related callbacks) is called for > + * each port (with its port_id as a first argument) bound to the proxy > + * interface for which change has been observed. > + * @see struct rte_ifpx_mac_change > + * @return non-zero if event handling is finished > + */ > + int (*mtu_change)(const struct rte_ifpx_mtu_change *event); > + /**< Callback for notification about MTU change. 
> + * @see struct rte_ifpx_mtu_change > + * @return non-zero if event handling is finished > + */ > + int (*link_change)(const struct rte_ifpx_link_change *event); > + /**< Callback for notification about link going up/down. > + * @see struct rte_ifpx_link_change > + * @return non-zero if event handling is finished > + */ > + int (*addr_add)(const struct rte_ifpx_addr_change *event); > + /**< Callback for notification about IPv4 address being added. > + * @see struct rte_ifpx_addr_change > + * @return non-zero if event handling is finished > + */ > + int (*addr_del)(const struct rte_ifpx_addr_change *event); > + /**< Callback for notification about IPv4 address removal. > + * @see struct rte_ifpx_addr_change > + * @return non-zero if event handling is finished > + */ > + int (*addr6_add)(const struct rte_ifpx_addr6_change *event); > + /**< Callback for notification about IPv6 address being added. > + * @see struct rte_ifpx_addr6_change > + */ > + int (*addr6_del)(const struct rte_ifpx_addr6_change *event); > + /**< Callback for notification about IPv4 address removal. > + * @see struct rte_ifpx_addr6_change > + * @return non-zero if event handling is finished > + */ > + /* Please note that "route" callbacks might be also called when user > + * adds address to the interface (that is in addition to address related > + * callbacks). > + */ > + int (*route_add)(const struct rte_ifpx_route_change *event); > + /**< Callback for notification about IPv4 route being added. > + * @see struct rte_ifpx_route_change > + * @return non-zero if event handling is finished > + */ > + int (*route_del)(const struct rte_ifpx_route_change *event); > + /**< Callback for notification about IPv4 route removal. > + * @see struct rte_ifpx_route_change > + * @return non-zero if event handling is finished > + */ > + int (*route6_add)(const struct rte_ifpx_route6_change *event); > + /**< Callback for notification about IPv6 route being added. 
> + * @see struct rte_ifpx_route6_change > + * @return non-zero if event handling is finished > + */ > + int (*route6_del)(const struct rte_ifpx_route6_change *event); > + /**< Callback for notification about IPv6 route removal. > + * @see struct rte_ifpx_route6_change > + * @return non-zero if event handling is finished > + */ > + int (*neigh_add)(const struct rte_ifpx_neigh_change *event); > + /**< Callback for notification about IPv4 neighbour being added. > + * @see struct rte_ifpx_neigh_change > + * @return non-zero if event handling is finished > + */ > + int (*neigh_del)(const struct rte_ifpx_neigh_change *event); > + /**< Callback for notification about IPv4 neighbour removal. > + * @see struct rte_ifpx_neigh_change > + * @return non-zero if event handling is finished > + */ > + int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event); > + /**< Callback for notification about IPv6 neighbour being added. > + * @see struct rte_ifpx_neigh_change > + */ > + int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event); > + /**< Callback for notification about IPv6 neighbour removal. > + * @see struct rte_ifpx_neigh_change > + * @return non-zero if event handling is finished > + */ > + int (*cfg_done)(void); > + /**< Lib specific callback - called when initial network configuration > + * query is finished. > + * @return non-zero if event handling is finished > + */ > +}; > + > +/** > + * Register proxy callbacks. > + * > + * This function registers callbacks to be called upon appropriate network > + * event notification. > + * > + * @param cbs > + * Set of callbacks that will be called. The library does not take any > + * ownership of the pointer passed - the callbacks are stored internally. > + * > + * @return > + * 0 on success, negative otherwise. > + */ > +__rte_experimental > +int rte_ifpx_callbacks_register(const struct rte_ifpx_callbacks *cbs); > + > +/** > + * Unregister proxy callbacks. 
> + * > + * This function unregisters callbacks previously registered with > + * rte_ifpx_callbacks_register(). > + * > + * @param cbs > + * Handle/pointer returned on previous callback registration. > + * > + * @return > + * 0 on success, negative otherwise. > + */ > +__rte_experimental > +void rte_ifpx_callbacks_unregister(void); > + > +/** > + * Bind the port to its proxy. > + * > + * After calling this function all network configuration of the proxy (and it's > + * changes) will be passed to given port by calling registered callbacks with > + * 'port_id' as an argument. > + * > + * Note: since both arguments are of the same type in order to not mix them and > + * ease remembering the order the first one is kept the same for bind/unbind. > + * > + * @param port_id > + * Id of the port to be bound. > + * @param proxy_id > + * Id of the proxy the port needs to be bound to. > + * @return > + * 0 on success, negative on error. > + */ > +__rte_experimental > +int rte_ifpx_port_bind(uint16_t port_id, uint16_t proxy_id); > + > +/** > + * Unbind the port from its proxy. > + * > + * After calling this function registered callbacks will no longer be called for > + * this port (but they might be called for other ports in one to many binding > + * scenario). > + * > + * @param port_id > + * Id of the port to unbind. > + * @return > + * 0 on success, negative on error. > + */ > +__rte_experimental > +int rte_ifpx_port_unbind(uint16_t port_id); > + > +/** > + * Get the system network configuration and start listening to its changes. > + * > + * @return > + * 0 on success, negative otherwise. > + */ > +__rte_experimental > +int rte_ifpx_listen(void); > + > +/** > + * Remove all bindings/callbacks and stop listening to network configuration. > + * > + * @return > + * 0 on success, negative otherwise. > + */ > +__rte_experimental > +int rte_ifpx_close(void); > + > +/** > + * Get the id of the proxy the port is bound to. 
> + * > + * @param port_id > + * Id of the port for which to get proxy. > + * @return > + * Port id of the proxy on success, RTE_MAX_ETHPORTS on error. > + */ > +__rte_experimental > +uint16_t rte_ifpx_proxy_get(uint16_t port_id); > + > +/** > + * Test for port acting as a proxy. > + * > + * @param port_id > + * Id of the port. > + * @return > + * 1 if port acts as a proxy, 0 otherwise. > + */ > +static inline > +int rte_ifpx_is_proxy(uint16_t port_id) > +{ > + return rte_ifpx_proxy_get(port_id) == port_id; > +} > + > +/** > + * Get the ids of the ports bound to the proxy. > + * > + * @param proxy_id > + * Id of the proxy for which to get ports. > + * @param ports > + * Array where to store the port ids. > + * @param num > + * Size of the 'ports' array. > + * @return > + * The number of ports bound to given proxy. Note that bound ports are filled > + * in 'ports' array up to its size but the return value is always the total > + * number of ports bound - so you can make call first with NULL/0 to query for > + * the size of the buffer to create or call it with the buffer you have and > + * later check if it was large enough. > + */ > +__rte_experimental > +unsigned int rte_ifpx_port_get(uint16_t proxy_id, > + uint16_t *ports, unsigned int num); > + > +/** > + * The structure containing some properties of the proxy interface. > + */ > +struct rte_ifpx_info { > + unsigned int if_index; /* entry valid iff if_index != 0 */ > + uint16_t mtu; > + struct rte_ether_addr mac; > + char if_name[RTE_ETH_NAME_MAX_LEN]; > +}; > + > +/** > + * Get the properties of the proxy interface. Argument can be either id of the > + * proxy or an id of a port that is bound to it. > + * > + * @param port_id > + * Id of the port (or proxy) for which to get proxy properties. > + * @return > + * Pointer to the proxy information structure. 
> + */ > +__rte_experimental > +const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id); > + > +#ifdef __cplusplus > +} > +#endif > + > +#endif /* _RTE_IF_PROXY_H_ */ > diff --git a/lib/librte_if_proxy/rte_if_proxy_version.map b/lib/librte_if_proxy/rte_if_proxy_version.map > new file mode 100644 > index 000000000..e2093137d > --- /dev/null > +++ b/lib/librte_if_proxy/rte_if_proxy_version.map > @@ -0,0 +1,19 @@ > +EXPERIMENTAL { > + global: > + > + rte_ifpx_proxy_create; > + rte_ifpx_proxy_create_by_devarg; > + rte_ifpx_proxy_destroy; > + rte_ifpx_events_available; > + rte_ifpx_callbacks_register; > + rte_ifpx_callbacks_unregister; > + rte_ifpx_port_bind; > + rte_ifpx_port_unbind; > + rte_ifpx_listen; > + rte_ifpx_close; > + rte_ifpx_proxy_get; > + rte_ifpx_port_get; > + rte_ifpx_info_get; > + > + local: *; > +}; > diff --git a/lib/meson.build b/lib/meson.build > index 0af3efab2..c913b33dd 100644 > --- a/lib/meson.build > +++ b/lib/meson.build > @@ -19,7 +19,7 @@ libraries = [ > 'acl', 'bbdev', 'bitratestats', 'cfgfile', > 'compressdev', 'cryptodev', > 'distributor', 'efd', 'eventdev', > - 'gro', 'gso', 'ip_frag', 'jobstats', > + 'gro', 'gso', 'if_proxy', 'ip_frag', 'jobstats', > 'kni', 'latencystats', 'lpm', 'member', > 'power', 'pdump', 'rawdev', > 'rcu', 'rib', 'reorder', 'sched', 'security', 'stack', 'vhost', > -- > 2.17.1 > ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library 2020-03-31 12:36 ` Harman Kalra @ 2020-03-31 15:37 ` Andrzej Ostruszka [C] 0 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka [C] @ 2020-03-31 15:37 UTC (permalink / raw) To: Harman Kalra, Andrzej Ostruszka [C]; +Cc: dev, Thomas Monjalon On 3/31/20 2:36 PM, Harman Kalra wrote: > On Fri, Mar 06, 2020 at 05:41:01PM +0100, Andrzej Ostruszka wrote: [...] >> +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px) >> +{ >> + struct ifpx_queue_node *q; >> + int done = 0; >> + uint16_t p, proxy_id; >> + >> + if (px) { >> + if (px->state & DEL_PENDING) >> + return; >> + proxy_id = px->proxy_id; >> + RTE_ASSERT(proxy_id != RTE_MAX_ETHPORTS); >> + px->state |= IN_USE; >> + } else >> + proxy_id = RTE_MAX_ETHPORTS; >> + >> + RTE_ASSERT(ev); >> + /* This function is expected to be called with a lock held. */ >> + RTE_ASSERT(rte_spinlock_trylock(&ifpx_lock) == 0); >> + >> + if (ifpx_callbacks.funcs[ev->type].f_ptr) { >> + union cb_ptr_t cb = ifpx_callbacks.funcs[ev->type]; >> + >> + /* Drop the lock for the time of callback call. */ >> + rte_spinlock_unlock(&ifpx_lock); >> + if (px) { >> + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { >> + if (ifpx_ports[p] != proxy_id || >> + ifpx_ports[p] == p) >> + continue; >> + ev->data.port_id = p; >> + done = cb.f_ptr(&ev->data) || done; > Since callback are handled as DPDK interrupts, hope there is no event > which gets lost. Cannot afford to loose a route change event as kernel > might not send it again. We have some protection against this in form of netlink socket buffer. In general, callbacks (as noted previously by Morten) can't block so this should be fine - we might need to play around with SO_RCVBUF socket option of the netlink socket but so far I have not experienced any problem. 
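For anyone who wants to experiment with the SO_RCVBUF tuning mentioned above, here is a minimal sketch. It is not part of the posted patch, and the helper names rcvbuf_set()/rcvbuf_get() are made up for illustration. It only uses the standard setsockopt()/getsockopt() calls, which work the same way on a netlink socket as on any other socket. Keep in mind that Linux doubles the requested value for bookkeeping and caps it at net.core.rmem_max, and that an overflowing netlink socket reports ENOBUFS on receive, so the reader would still need to handle that case (for example by requesting a fresh dump).

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel for a receive buffer of 'bytes'.
 * Returns 0 on success, -1 on error (errno set by setsockopt()).
 */
static int rcvbuf_set(int fd, int bytes)
{
	return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}

/* Hypothetical helper: read back the size the kernel actually granted.
 * Returns the size in bytes, or -1 on error.
 */
static int rcvbuf_get(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, &len) < 0)
		return -1;
	return val;
}
```

In the library this would presumably be applied to ifpx_irq.fd right after socket() succeeds in nlink_listen(); reading the value back is worthwhile because the granted size can differ from the requested one.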
> >> + } >> + } else { >> + RTE_ASSERT(ev->type == RTE_IFPX_CFG_DONE); >> + done = cb.cfg_done(); >> + } >> + rte_spinlock_lock(&ifpx_lock); >> + } >> + if (done) >> + goto exit; >> + >> + /* Event not "consumed" yet so try to notify via queues. */ >> + TAILQ_FOREACH(q, &ifpx_queues, elem) { >> + if (px) { >> + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { >> + if (ifpx_ports[p] != proxy_id || >> + ifpx_ports[p] == p) >> + continue; >> + /* Set the port_id - the remaining params should >> + * be filled before calling this function. >> + */ >> + ev->data.port_id = p; >> + queue_event(ev, q->r); >> + } >> + } else >> + queue_event(ev, q->r); >> + } >> +exit: >> + if (px) >> + px->state &= ~IN_USE; >> +} >> + >> +void ifpx_cleanup_proxies(void) >> +{ >> + struct ifpx_proxy_node *px, *next; >> + for (px = TAILQ_FIRST(&ifpx_proxies); px; px = next) { >> + next = TAILQ_NEXT(px, elem); >> + if (px->state & DEL_PENDING) >> + ifpx_proxy_destroy(px); >> + } >> +} >> + >> +int rte_ifpx_listen(void) >> +{ >> + int ec; >> + >> + if (!ifpx_platform.listen) >> + return -ENOTSUP; >> + >> + ec = ifpx_platform.listen(); >> + if (ec == 0 && ifpx_platform.get_info) >> + ifpx_platform.get_info(0); > nlink_get_info calls request_info with a if_index, passing 0 might > be good in current scenario but valid index should be passed to > get_info. 0 is an invalid if_index (on Windows too) so I've used it to encode "all interfaces". This is related to your next comment. So I'll expand on this there. [...] 
>> +static >> +int request_info(int type, int index) >> +{ >> + static rte_spinlock_t send_lock = RTE_SPINLOCK_INITIALIZER; >> + struct info_get { >> + struct nlmsghdr h; >> + union { >> + struct ifinfomsg ifm; >> + struct ifaddrmsg ifa; >> + struct rtmsg rtm; >> + struct ndmsg ndm; >> + } __rte_aligned(NLMSG_ALIGNTO); >> + } info_req; >> + int ret; >> + >> + memset(&info_req, 0, sizeof(info_req)); >> + /* First byte of these messages is family, so just make sure that this >> + * memset is enough to get all families. >> + */ >> + RTE_ASSERT(AF_UNSPEC == 0); >> + >> + info_req.h.nlmsg_pid = ifpx_pid; >> + info_req.h.nlmsg_type = type; >> + info_req.h.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP; >> + info_req.h.nlmsg_len = offsetof(struct info_get, ifm); >> + >> + switch (type) { >> + case RTM_GETLINK: >> + info_req.h.nlmsg_len += sizeof(info_req.ifm); >> + info_req.ifm.ifi_index = index; >> + break; >> + case RTM_GETADDR: >> + info_req.h.nlmsg_len += sizeof(info_req.ifa); >> + info_req.ifa.ifa_index = index; >> + break; >> + case RTM_GETROUTE: >> + info_req.h.nlmsg_len += sizeof(info_req.rtm); >> + break; >> + case RTM_GETNEIGH: >> + info_req.h.nlmsg_len += sizeof(info_req.ndm); >> + break; >> + default: >> + IFPX_LOG(WARNING, "Unhandled message type: %d", type); >> + return -EINVAL; >> + } >> + /* Store request type (and if it is global or link specific) in 'seq'. >> + * Later it is used during handling of reply to continue requesting of >> + * information dump from system - if needed. >> + */ >> + info_req.h.nlmsg_seq = index << 8 | type; >> + >> + IFPX_LOG(DEBUG, "\tRequesting msg %d for: %u", type, index); >> + >> + rte_spinlock_lock(&send_lock); >> + ret = send(ifpx_irq.fd, &info_req, info_req.h.nlmsg_len, 0); >> + if (ret < 0) { >> + IFPX_LOG(ERR, "Failed to send netlink msg: %d", errno); >> + rte_errno = errno; >> + } >> + rte_spinlock_unlock(&send_lock); >> + >> + return ret; >> +} [...] 
>> +static >> +void if_proxy_intr_callback(void *arg __rte_unused) >> +{ >> + struct nlmsghdr *h; >> + struct sockaddr_nl addr; >> + socklen_t addr_len; >> + char buf[8192]; >> + ssize_t len; >> + >> +restart: >> + len = recvfrom(ifpx_irq.fd, buf, sizeof(buf), 0, >> + (struct sockaddr *)&addr, &addr_len); >> + if (len < 0) { >> + if (errno == EINTR) { >> + IFPX_LOG(DEBUG, "recvmsg() interrupted"); >> + goto restart; >> + } >> + IFPX_LOG(ERR, "Failed to read netlink msg: %ld (errno %d)", >> + len, errno); >> + return; >> + } >> + if (addr_len != sizeof(addr)) { >> + IFPX_LOG(ERR, "Invalid netlink addr size: %d", addr_len); >> + return; >> + } >> + IFPX_LOG(DEBUG, "Read %lu bytes (buf %lu) from %u/%u", len, >> + sizeof(buf), addr.nl_pid, addr.nl_groups); >> + >> + for (h = (struct nlmsghdr *)buf; NLMSG_OK(h, len); >> + h = NLMSG_NEXT(h, len)) { >> + IFPX_LOG(DEBUG, "Recv msg: %u (%u/%u/%u seq/flags/pid)", >> + h->nlmsg_type, h->nlmsg_seq, h->nlmsg_flags, >> + h->nlmsg_pid); >> + >> + switch (h->nlmsg_type) { >> + case RTM_NEWLINK: >> + case RTM_DELLINK: >> + handle_link(h); >> + break; >> + case RTM_NEWADDR: >> + case RTM_DELADDR: >> + handle_addr(h, h->nlmsg_type == RTM_DELADDR); >> + break; >> + case RTM_NEWROUTE: >> + case RTM_DELROUTE: >> + handle_route(h, h->nlmsg_type == RTM_DELROUTE); >> + break; >> + case RTM_NEWNEIGH: >> + case RTM_DELNEIGH: >> + handle_neigh(h, h->nlmsg_type == RTM_DELNEIGH); >> + break; >> + } >> + >> + /* If this is a reply for global request then follow up with >> + * additional requests and notify about finish. >> + */ >> + if (h->nlmsg_pid == ifpx_pid && (h->nlmsg_seq >> 8) == 0 && >> + h->nlmsg_type == NLMSG_DONE) { > Sorry, but in what scenario will the flow reach here. OK. So let me describe the initialization flow on Linux (the only available implementation right now). When we start listening we first request dumping of the whole configuration. We call get_info(0). 
Again, this '0' is an invalid if_index, so it is used as the "all interfaces" value. This index is written in the Netlink msg headers and is coupled with possible filtering of messages on the kernel side (see the comment in nlink_listen() below). When we request info we always use the REQUEST|DUMP flags, but on newer kernels there is an option (when if_index is non-zero) to send out only the information for that interface instead of dumping all info. In addition, the index is encoded in nlmsg_seq.

So there are different types of info we get from the kernel: link/address/routing/neighbouring. Instead of requesting them all at once I do that sequentially, and in get_info() I start with a request for link info. The code that you asked about above is a check that:
- this message is a direct reply to us (pid)
- and it is a reply to a global request (index = 0)
- and this is the last part of a multi-segment message (this is how Linux dumps info: it sends a couple of messages with an additional "DONE" msg at the end).

And the logic below is sequencing LINK->ADDR->ROUTE->NEIGH-> we are done, so notify the user about that. That way we have at most one active "transaction" with the kernel.
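To make the reply check and the request sequencing described above easier to follow, here is a small standalone sketch of the nlmsg_seq packing and the LINK->ADDR->ROUTE->NEIGH chain. The helper names (seq_encode(), seq_index(), seq_type(), next_dump_type()) are hypothetical and do not appear in the patch; only the "index << 8 | type" layout and the RTM_GET* constants are taken from the quoted code. The sketch uses unsigned arithmetic, whereas the patch shifts a plain int.

```c
#include <stdint.h>
#include <linux/rtnetlink.h>

/* Pack the request the same way request_info() does: the low 8 bits
 * carry the RTM_GET* type, the remaining bits carry the if_index
 * (0 meaning "all interfaces").
 */
static inline uint32_t seq_encode(uint32_t if_index, uint8_t type)
{
	return (if_index << 8) | type;
}

/* Recover the interface index from a sequence number. */
static inline uint32_t seq_index(uint32_t seq)
{
	return seq >> 8;
}

/* Recover the request type from a sequence number. */
static inline uint8_t seq_type(uint32_t seq)
{
	return seq & 0xFF;
}

/* Next request type in the global dump chain, or 0 when the chain is
 * finished and the "configuration done" event should be raised.
 */
static inline int next_dump_type(uint8_t type)
{
	switch (type) {
	case RTM_GETLINK:
		return RTM_GETADDR;
	case RTM_GETADDR:
		return RTM_GETROUTE;
	case RTM_GETROUTE:
		return RTM_GETNEIGH;
	default:
		return 0;
	}
}
```

With helpers like these the check in the interrupt callback would read roughly as "seq_index(h->nlmsg_seq) == 0 and type is NLMSG_DONE, so request next_dump_type(seq_type(h->nlmsg_seq)) or notify when it returns 0".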
>> + if ((h->nlmsg_seq & 0xFF) == RTM_GETLINK) >> + request_info(RTM_GETADDR, 0); >> + else if ((h->nlmsg_seq & 0xFF) == RTM_GETADDR) >> + request_info(RTM_GETROUTE, 0); >> + else if ((h->nlmsg_seq & 0xFF) == RTM_GETROUTE) >> + request_info(RTM_GETNEIGH, 0); >> + else { >> + struct rte_ifpx_event ev = { >> + .type = RTE_IFPX_CFG_DONE >> + }; >> + >> + RTE_ASSERT((h->nlmsg_seq & 0xFF) == >> + RTM_GETNEIGH); >> + rte_spinlock_lock(&ifpx_lock); >> + ifpx_notify_event(&ev, NULL); >> + rte_spinlock_unlock(&ifpx_lock); >> + } >> + } >> + } >> + IFPX_LOG(DEBUG, "Finished msg loop: %ld bytes left", len); >> +} >> + >> +static >> +int nlink_listen(void) >> +{ >> + struct sockaddr_nl addr = { >> + .nl_family = AF_NETLINK, >> + .nl_pid = 0, >> + }; >> + socklen_t addr_len = sizeof(addr); >> + int ret; >> + >> + if (ifpx_irq.fd != -1) { >> + rte_errno = EBUSY; >> + return -1; >> + } >> + >> + addr.nl_groups = 1 << (RTNLGRP_LINK-1) >> + | 1 << (RTNLGRP_NEIGH-1) >> + | 1 << (RTNLGRP_IPV4_IFADDR-1) >> + | 1 << (RTNLGRP_IPV6_IFADDR-1) >> + | 1 << (RTNLGRP_IPV4_ROUTE-1) >> + | 1 << (RTNLGRP_IPV6_ROUTE-1); >> + >> + ifpx_irq.fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, >> + NETLINK_ROUTE); >> + if (ifpx_irq.fd == -1) { >> + IFPX_LOG(ERR, "Failed to create netlink socket: %d", errno); >> + goto error; >> + } >> + /* Starting with kernel 4.19 you can request dump for a specific >> + * interface and kernel will filter out and send only relevant info. >> + * Otherwise NLM_F_DUMP will generate info for all interfaces and you >> + * need to filter them yourself. 
>> + */ >> +#ifdef NETLINK_DUMP_STRICT_CHK >> + ret = 1; /* use this var also as an input param */ >> + ret = setsockopt(ifpx_irq.fd, SOL_SOCKET, NETLINK_DUMP_STRICT_CHK, >> + &ret, sizeof(ret)); >> + if (ret < 0) { >> + IFPX_LOG(ERR, "Failed to set socket option: %d", errno); >> + goto error; >> + } >> +#endif >> + >> + ret = bind(ifpx_irq.fd, (struct sockaddr *)&addr, addr_len); >> + if (ret < 0) { >> + IFPX_LOG(ERR, "Failed to bind socket: %d", errno); >> + goto error; >> + } >> + ret = getsockname(ifpx_irq.fd, (struct sockaddr *)&addr, &addr_len); >> + if (ret < 0) { >> + IFPX_LOG(ERR, "Failed to get socket addr: %d", errno); >> + goto error; >> + } else { >> + ifpx_pid = addr.nl_pid; >> + IFPX_LOG(DEBUG, "Assigned port ID: %u", addr.nl_pid); >> + } >> + >> + ret = rte_intr_callback_register(&ifpx_irq, if_proxy_intr_callback, >> + NULL); >> + if (ret == 0) >> + return 0; >> + >> +error: >> + rte_errno = errno; >> + if (ifpx_irq.fd != -1) { >> + close(ifpx_irq.fd); >> + ifpx_irq.fd = -1; >> + } >> + return -1; >> +} [...] If you are playing with this library (running test case or the exemplary application) and would like to have better view what is going on you can add "--log=lib.if_proxy:debug" to the arguments list. Thanks for taking a look at this. The more people do this the better this should be. E.g. explaining initialization flow to you made me realize that the there is another case where I request info which is not handled well - normally user should bind the proxies and start listening. But if for some reason user binds proxy later, during listening, I request info for that particular interface but implementation will request only link level and will not follow with request for other info. I will fix this in the next version. With regards Andrzej Ostruszka ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library 2020-03-06 16:41 ` [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library Andrzej Ostruszka 2020-03-31 12:36 ` Harman Kalra @ 2020-04-01 5:29 ` Varghese, Vipin 2020-04-01 20:08 ` Andrzej Ostruszka [C] 1 sibling, 1 reply; 64+ messages in thread From: Varghese, Vipin @ 2020-04-01 5:29 UTC (permalink / raw) To: Andrzej Ostruszka, dev, Thomas Monjalon snipped > diff --git a/lib/librte_if_proxy/Makefile b/lib/librte_if_proxy/Makefile > new file mode 100644 > index 000000000..43cb702a2 > --- /dev/null > +++ b/lib/librte_if_proxy/Makefile > @@ -0,0 +1,29 @@ > +# SPDX-License-Identifier: BSD-3-Clause > +# Copyright(C) 2020 Marvell International Ltd. > + > +include $(RTE_SDK)/mk/rte.vars.mk > + > +# library name > +LIB = librte_if_proxy.a > + > +CFLAGS += -DALLOW_EXPERIMENTAL_API > +CFLAGS += -O3 > +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) > +LDLIBS += -lrte_eal -lrte_ethdev > + > +EXPORT_MAP := rte_if_proxy_version.map > + > +LIBABIVER := 1 > + > +# all source are stored in SRCS-y > +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) := if_proxy_common.c > + > +SYSDIR := $(patsubst "%app",%,$(CONFIG_RTE_EXEC_ENV)) > +include $(SRCDIR)/$(SYSDIR)/Makefile > + Should there be check `ifeq ($(CONFIG_RTE_LIBRTE_KNI),y)` and `ifeq ($(CONFIG_RTE_LIBRTE_TAP),y)`? > +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += $(addprefix $(SYSDIR)/,$(SRCS)) > + > +# install this header file > +SYMLINK-$(CONFIG_RTE_LIBRTE_IF_PROXY)-include := rte_if_proxy.h > + > +include $(RTE_SDK)/mk/rte.lib.mk Snipped > + > +uint64_t rte_ifpx_events_available(void) > +{ > + /* All events are supported on Linux. */ > + return (1ULL << RTE_IFPX_NUM_EVENTS) - 1; Should we give the available from the used count? > +} > + Snipped > + > +void rte_ifpx_callbacks_unregister(void) > +{ > + rte_spinlock_lock(&ifpx_lock); > + memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs)); What would happen to pending events, are agreeing to drop all? 
> + rte_spinlock_unlock(&ifpx_lock); > +} > + > +uint16_t rte_ifpx_proxy_get(uint16_t port_id) > +{ > + if (port_id >= RTE_MAX_ETHPORTS) > + return RTE_MAX_ETHPORTS; In the init function, the default value is set with RTE_MAX_ETHPORTS. Will there be a scenario port_id can be greater? > + > + return ifpx_ports[port_id]; > +} > + > +unsigned int rte_ifpx_port_get(uint16_t proxy_id, > + uint16_t *ports, unsigned int num) > +{ > + unsigned int p, cnt = 0; > + > + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { > + if (ifpx_ports[p] == proxy_id && ifpx_ports[p] != p) { > + ++cnt; > + if (ports && num > 0) { > + *ports++ = p; > + --num; > + } > + } > + } Application can dynamically ports to DPDK. if this is correct, will this require lock to make this thread safe? > + return cnt; > +} > + > +const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id) > +{ > + struct ifpx_proxy_node *px; > + > + if (port_id >= RTE_MAX_ETHPORTS || > + ifpx_ports[port_id] == RTE_MAX_ETHPORTS) > + return NULL; > + > + rte_spinlock_lock(&ifpx_lock); > + TAILQ_FOREACH(px, &ifpx_proxies, elem) { > + if (px->proxy_id == ifpx_ports[port_id]) > + break; > + } > + rte_spinlock_unlock(&ifpx_lock); > + RTE_ASSERT(px && "Internal IF Proxy library error"); Can you help me understand the assert logic with const string? > + > + return &px->info; > +} > + > +static > +void queue_event(const struct rte_ifpx_event *ev, struct rte_ring *r) > +{ > + struct rte_ifpx_event *e = malloc(sizeof(*ev)); Is there specific reason not to use rte_malloc? 
> + > + if (!e) { > + IFPX_LOG(ERR, "Failed to allocate event!"); > + return; > + } > + RTE_ASSERT(r); > + > + *e = *ev; > + rte_ring_sp_enqueue(r, e); > +} > + > +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px) > +{ > + struct ifpx_queue_node *q; > + int done = 0; > + uint16_t p, proxy_id; > + > + if (px) { > + if (px->state & DEL_PENDING) > + return; > + proxy_id = px->proxy_id; > + RTE_ASSERT(proxy_id != RTE_MAX_ETHPORTS); > + px->state |= IN_USE; > + } else > + proxy_id = RTE_MAX_ETHPORTS; > + > + RTE_ASSERT(ev); > + /* This function is expected to be called with a lock held. */ > + RTE_ASSERT(rte_spinlock_trylock(&ifpx_lock) == 0); > + > + if (ifpx_callbacks.funcs[ev->type].f_ptr) { > + union cb_ptr_t cb = ifpx_callbacks.funcs[ev->type]; > + > + /* Drop the lock for the time of callback call. */ > + rte_spinlock_unlock(&ifpx_lock); > + if (px) { > + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { > + if (ifpx_ports[p] != proxy_id || > + ifpx_ports[p] == p) > + continue; > + ev->data.port_id = p; > + done = cb.f_ptr(&ev->data) || done; > + } > + } else { > + RTE_ASSERT(ev->type == RTE_IFPX_CFG_DONE); > + done = cb.cfg_done(); > + } > + rte_spinlock_lock(&ifpx_lock); > + } > + if (done) > + goto exit; > + > + /* Event not "consumed" yet so try to notify via queues. */ Is there a chance when trying to use queues the events are consumed by method above by listener? > + TAILQ_FOREACH(q, &ifpx_queues, elem) { > + if (px) { > + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { > + if (ifpx_ports[p] != proxy_id || > + ifpx_ports[p] == p) > + continue; > + /* Set the port_id - the remaining params > should > + * be filled before calling this function. > + */ > + ev->data.port_id = p; > + queue_event(ev, q->r); > + } > + } else > + queue_event(ev, q->r); > + } > +exit: > + if (px) > + px->state &= ~IN_USE; > +} Snipped > + > +RTE_INIT(if_proxy_init) > +{ > + unsigned int i; Is IF_PROXY supported for vdev also? 
> + for (i = 0; i < RTE_DIM(ifpx_ports); ++i) > + ifpx_ports[i] = RTE_MAX_ETHPORTS; > + > + ifpx_log_type = rte_log_register("lib.if_proxy"); > + if (ifpx_log_type >= 0) > + rte_log_set_level(ifpx_log_type, RTE_LOG_WARNING); > + > + if (ifpx_platform.init) > + ifpx_platform.init(); > +} Snipped > +SRCS += if_proxy.c > diff --git a/lib/librte_if_proxy/linux/if_proxy.c > b/lib/librte_if_proxy/linux/if_proxy.c > new file mode 100644 > index 000000000..bf851c096 > --- /dev/null > +++ b/lib/librte_if_proxy/linux/if_proxy.c > @@ -0,0 +1,552 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(C) 2020 Marvell International Ltd. > + */ Assuming all the events are executed `if and only if` the current process if Primary? If it is secondary for physical interface certain `rte_eth_api` will fail. Can we have check the events are processed for primary only? Snipped > diff --git a/lib/librte_if_proxy/meson.build b/lib/librte_if_proxy/meson.build > new file mode 100644 > index 000000000..f0c1a6e15 > --- /dev/null > +++ b/lib/librte_if_proxy/meson.build > @@ -0,0 +1,19 @@ > +# SPDX-License-Identifier: BSD-3-Clause > +# Copyright(C) 2020 Marvell International Ltd. > + > +# Currently only implemented on Linux > +if not is_linux > + build = false > + reason = 'only supported on linux' > +endif > + > +version = 1 > +allow_experimental_apis = true > + > +deps += ['ethdev'] > +sources = files('if_proxy_common.c') > +headers = files('rte_if_proxy.h') Does the if_proxy have dependency on TAP and KNI. Should not we add check as ` if dpdk_conf.has('RTE_LIBRTE_KNI')` and ` if dpdk_conf.has('RTE_LIBRTE_TAP')`? > + > +if is_linux > + sources += files('linux/if_proxy.c') > +endif Snipped ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library 2020-04-01 5:29 ` Varghese, Vipin @ 2020-04-01 20:08 ` Andrzej Ostruszka [C] 2020-04-08 3:04 ` Varghese, Vipin 0 siblings, 1 reply; 64+ messages in thread From: Andrzej Ostruszka [C] @ 2020-04-01 20:08 UTC (permalink / raw) To: Varghese, Vipin, dev, Thomas Monjalon First of all thank you Vipin for taking a look at this. On 4/1/20 7:29 AM, Varghese, Vipin wrote: > snipped >> diff --git a/lib/librte_if_proxy/Makefile b/lib/librte_if_proxy/Makefile >> new file mode 100644 >> index 000000000..43cb702a2 >> --- /dev/null >> +++ b/lib/librte_if_proxy/Makefile >> @@ -0,0 +1,29 @@ >> +# SPDX-License-Identifier: BSD-3-Clause >> +# Copyright(C) 2020 Marvell International Ltd. >> + >> +include $(RTE_SDK)/mk/rte.vars.mk >> + >> +# library name >> +LIB = librte_if_proxy.a >> + >> +CFLAGS += -DALLOW_EXPERIMENTAL_API >> +CFLAGS += -O3 >> +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) >> +LDLIBS += -lrte_eal -lrte_ethdev >> + >> +EXPORT_MAP := rte_if_proxy_version.map >> + >> +LIBABIVER := 1 >> + >> +# all source are stored in SRCS-y >> +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) := if_proxy_common.c >> + >> +SYSDIR := $(patsubst "%app",%,$(CONFIG_RTE_EXEC_ENV)) >> +include $(SRCDIR)/$(SYSDIR)/Makefile >> + > Should there be check `ifeq ($(CONFIG_RTE_LIBRTE_KNI),y)` and `ifeq ($(CONFIG_RTE_LIBRTE_TAP),y)`? Might not be necessary. While it is true that if you want to create proxy via this lib, then currently it is only KNI or TAP. However any DPDK port can act as a proxy - as long as it is visible to the system and reports non-zero if_index in its dev_info. However it is true that if we allow building of if_proxy even if TAP/KNI is not enabled then I should add conditionals to the proxy creation function that would show some meaningful warning when they are not enabled. Will take a look at this. 
>> +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += $(addprefix $(SYSDIR)/,$(SRCS)) >> + >> +# install this header file >> +SYMLINK-$(CONFIG_RTE_LIBRTE_IF_PROXY)-include := rte_if_proxy.h >> + >> +include $(RTE_SDK)/mk/rte.lib.mk > Snipped > >> + >> +uint64_t rte_ifpx_events_available(void) >> +{ >> + /* All events are supported on Linux. */ >> + return (1ULL << RTE_IFPX_NUM_EVENTS) - 1; > Should we give the available from the used count? I'm not sure I follow what you wanted to ask. I want to return bitmask with each bit being lit for every event type. I could go with or'ing of all (1ULL << RTE_IFPX_MAC_CHANGE) | (1ULL << RTE_IFPX_MTU_CHANGE) ... but deemed that this would be simpler. >> +} >> + > > Snipped > >> + >> +void rte_ifpx_callbacks_unregister(void) >> +{ >> + rte_spinlock_lock(&ifpx_lock); >> + memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs)); > What would happen to pending events, are agreeing to drop all? ifpx_events_notify() is called under the same lock. So either someone calls this unregister and then notify will not find any callback or the other way. Note that notify drops the lock for the time of callback call (to allow modifications from the callback) but the pointer is first copied - so the behaviour would be as if the unregister was called later. I'm not sure I answered your question though - if not then please ask again with some more details. >> + rte_spinlock_unlock(&ifpx_lock); >> +} >> + >> +uint16_t rte_ifpx_proxy_get(uint16_t port_id) >> +{ >> + if (port_id >= RTE_MAX_ETHPORTS) >> + return RTE_MAX_ETHPORTS; > In the init function, the default value is set with RTE_MAX_ETHPORTS. Will there be a scenario port_id can be greater? Here port_id is an input from user - (s)he can make an error. Internally this should never happen. 
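The locking behaviour described above for unregistering callbacks (copy the pointer under the lock, drop the lock for the duration of the call) can be illustrated with a simplified, standalone sketch. This is not the library code: it uses a C11 atomic_flag as a stand-in for rte_spinlock_t, a single callback slot instead of the per-event table, and notify() takes the lock itself whereas ifpx_notify_event() expects the caller to hold it. All names below are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef int (*event_cb_t)(int port_id);	/* hypothetical callback type */

static atomic_flag cb_lock = ATOMIC_FLAG_INIT;	/* stand-in for ifpx_lock */
static event_cb_t registered_cb;		/* protected by cb_lock */
static int calls;				/* counts callback invocations */

static void lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&cb_lock,
						 memory_order_acquire))
		; /* spin */
}

static void lock_release(void)
{
	atomic_flag_clear_explicit(&cb_lock, memory_order_release);
}

static void cb_register(event_cb_t cb)
{
	lock_acquire();
	registered_cb = cb;
	lock_release();
}

static void cb_unregister(void)
{
	lock_acquire();
	registered_cb = NULL;
	lock_release();
}

/* Copy the pointer under the lock, then call it with the lock dropped,
 * so that the callback itself may register/unregister without spinning
 * forever on a lock its own caller holds.
 * Returns the callback result, or 0 when nothing is registered.
 */
static int notify(int port_id)
{
	event_cb_t cb;

	lock_acquire();
	cb = registered_cb;
	lock_release();

	return cb ? cb(port_id) : 0;
}

/* Example callback that removes itself while it is being called. */
static int self_removing_cb(int port_id)
{
	(void)port_id;
	++calls;
	cb_unregister();
	return 1;
}
```

If an unregister races with a notification, the already copied pointer is still called once, which matches the "as if the unregister was called later" behaviour; a later notification finds no callback.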
>> + >> + return ifpx_ports[port_id]; >> +} >> + >> +unsigned int rte_ifpx_port_get(uint16_t proxy_id, >> + uint16_t *ports, unsigned int num) >> +{ >> + unsigned int p, cnt = 0; >> + >> + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { >> + if (ifpx_ports[p] == proxy_id && ifpx_ports[p] != p) { >> + ++cnt; >> + if (ports && num > 0) { >> + *ports++ = p; >> + --num; >> + } >> + } >> + } > Application can dynamically ports to DPDK. if this is correct, will this require lock to make this thread safe? This is a good point. Currently ifpx_ports is not protected by the lock. Since this is a slow/control path I'll go with moving this under lock instead of trying to make this lockless. >> + return cnt; >> +} >> + >> +const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id) >> +{ >> + struct ifpx_proxy_node *px; >> + >> + if (port_id >= RTE_MAX_ETHPORTS || >> + ifpx_ports[port_id] == RTE_MAX_ETHPORTS) >> + return NULL; >> + >> + rte_spinlock_lock(&ifpx_lock); >> + TAILQ_FOREACH(px, &ifpx_proxies, elem) { >> + if (px->proxy_id == ifpx_ports[port_id]) >> + break; >> + } >> + rte_spinlock_unlock(&ifpx_lock); >> + RTE_ASSERT(px && "Internal IF Proxy library error"); > > Can you help me understand the assert logic with const string? This is a practice sometimes used to have a meaningful error message printed (together with an expression) while assertion fires. The value of expression does not depend on this string but the expression is "stringified" in macro and printed on console so that way you can add some message to the condition being checked. I think this is the only public function where I've used this - all internal ASSERTS have no message so I might drop it here if you want. >> + >> + return &px->info; >> +} >> + >> +static >> +void queue_event(const struct rte_ifpx_event *ev, struct rte_ring *r) >> +{ >> + struct rte_ifpx_event *e = malloc(sizeof(*ev)); > Is there specific reason not to use rte_malloc? 
Not really - that was actually a question of mine recently on this list. This is a slow/control path, so maybe we should save hugepage memory for the fast path? I have no strong opinion here and can switch to rte_malloc() if that is thought as a better option. >> + >> + if (!e) { >> + IFPX_LOG(ERR, "Failed to allocate event!"); >> + return; >> + } >> + RTE_ASSERT(r); >> + >> + *e = *ev; >> + rte_ring_sp_enqueue(r, e); >> +} >> + >> +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px) >> +{ >> + struct ifpx_queue_node *q; >> + int done = 0; >> + uint16_t p, proxy_id; >> + >> + if (px) { >> + if (px->state & DEL_PENDING) >> + return; >> + proxy_id = px->proxy_id; >> + RTE_ASSERT(proxy_id != RTE_MAX_ETHPORTS); >> + px->state |= IN_USE; >> + } else >> + proxy_id = RTE_MAX_ETHPORTS; >> + >> + RTE_ASSERT(ev); >> + /* This function is expected to be called with a lock held. */ >> + RTE_ASSERT(rte_spinlock_trylock(&ifpx_lock) == 0); >> + >> + if (ifpx_callbacks.funcs[ev->type].f_ptr) { >> + union cb_ptr_t cb = ifpx_callbacks.funcs[ev->type]; >> + >> + /* Drop the lock for the time of callback call. */ >> + rte_spinlock_unlock(&ifpx_lock); >> + if (px) { >> + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { >> + if (ifpx_ports[p] != proxy_id || >> + ifpx_ports[p] == p) >> + continue; >> + ev->data.port_id = p; >> + done = cb.f_ptr(&ev->data) || done; >> + } >> + } else { >> + RTE_ASSERT(ev->type == RTE_IFPX_CFG_DONE); >> + done = cb.cfg_done(); >> + } >> + rte_spinlock_lock(&ifpx_lock); >> + } >> + if (done) >> + goto exit; >> + >> + /* Event not "consumed" yet so try to notify via queues. */ > > Is there a chance when trying to use queues the events are consumed by method above by listener? 
This is fully under control of application - if application wants certain events to be notified by the queues then either it should not register callback for that event type or, if it registers, then this callback should not return non-zero value (just do some common preparation or something like that). >> + TAILQ_FOREACH(q, &ifpx_queues, elem) { >> + if (px) { >> + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { >> + if (ifpx_ports[p] != proxy_id || >> + ifpx_ports[p] == p) >> + continue; >> + /* Set the port_id - the remaining params >> should >> + * be filled before calling this function. >> + */ >> + ev->data.port_id = p; >> + queue_event(ev, q->r); >> + } >> + } else >> + queue_event(ev, q->r); >> + } >> +exit: >> + if (px) >> + px->state &= ~IN_USE; >> +} > > Snipped > >> + >> +RTE_INIT(if_proxy_init) >> +{ >> + unsigned int i; > > Is IF_PROXY supported for vdev also? I'm not sure I understand the question here. Any port can be bound to a proxy (vdev or not) and any port visible to system (having non-zero if_index in dev_info) can be used as a proxy. Does that answers your question? If not please explain. >> + for (i = 0; i < RTE_DIM(ifpx_ports); ++i) >> + ifpx_ports[i] = RTE_MAX_ETHPORTS; >> + >> + ifpx_log_type = rte_log_register("lib.if_proxy"); >> + if (ifpx_log_type >= 0) >> + rte_log_set_level(ifpx_log_type, RTE_LOG_WARNING); >> + >> + if (ifpx_platform.init) >> + ifpx_platform.init(); >> +} > Snipped > >> +SRCS += if_proxy.c >> diff --git a/lib/librte_if_proxy/linux/if_proxy.c >> b/lib/librte_if_proxy/linux/if_proxy.c >> new file mode 100644 >> index 000000000..bf851c096 >> --- /dev/null >> +++ b/lib/librte_if_proxy/linux/if_proxy.c >> @@ -0,0 +1,552 @@ >> +/* SPDX-License-Identifier: BSD-3-Clause >> + * Copyright(C) 2020 Marvell International Ltd. >> + */ > > Assuming all the events are executed `if and only if` the current process if Primary? If it is secondary for physical interface certain `rte_eth_api` will fail. 
Can we have check the events are processed for primary only? Yes that was my assumption however at the moment I'm using: - rte_eth_iterator_init/next/cleanup() - rte_eth_dev_info_get() - rte_eth_dev_get_mtu() - rte_eth_macaddr_get() - rte_eth_dev_mac_addr_add() - rte_dev_probe/remove() Is there a problem with these? If it is, then I'll think about adding check for secondary. > Snipped > >> diff --git a/lib/librte_if_proxy/meson.build b/lib/librte_if_proxy/meson.build >> new file mode 100644 >> index 000000000..f0c1a6e15 >> --- /dev/null >> +++ b/lib/librte_if_proxy/meson.build >> @@ -0,0 +1,19 @@ >> +# SPDX-License-Identifier: BSD-3-Clause >> +# Copyright(C) 2020 Marvell International Ltd. >> + >> +# Currently only implemented on Linux >> +if not is_linux >> + build = false >> + reason = 'only supported on linux' >> +endif >> + >> +version = 1 >> +allow_experimental_apis = true >> + >> +deps += ['ethdev'] >> +sources = files('if_proxy_common.c') >> +headers = files('rte_if_proxy.h') > > Does the if_proxy have dependency on TAP and KNI. Should not we add check as ` if dpdk_conf.has('RTE_LIBRTE_KNI')` and ` if dpdk_conf.has('RTE_LIBRTE_TAP')`? This is the same as for Makefile - I think I'll go with allowing it to build but adding conditionals in proxy creation. However if you and/or others think it would be better to skip build then I will adapt. >> + >> +if is_linux >> + sources += files('linux/if_proxy.c') >> +endif > > Snipped > Thanks for reviewing this. With regards Andrzej ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library 2020-04-01 20:08 ` Andrzej Ostruszka [C] @ 2020-04-08 3:04 ` Varghese, Vipin 2020-04-08 18:13 ` Andrzej Ostruszka [C] 0 siblings, 1 reply; 64+ messages in thread From: Varghese, Vipin @ 2020-04-08 3:04 UTC (permalink / raw) To: Andrzej Ostruszka [C], dev, Thomas Monjalon Hi Andrzej, Thanks for the reply. Please find explanations for some of the queries snipped > >> +uint64_t rte_ifpx_events_available(void) { > >> + /* All events are supported on Linux. */ > >> + return (1ULL << RTE_IFPX_NUM_EVENTS) - 1; > > Should we give the available from the used count? > > I'm not sure I follow what you wanted to ask. I want to return bitmask with > each bit being lit for every event type. I could go with or'ing of all (1ULL << > RTE_IFPX_MAC_CHANGE) | (1ULL << RTE_IFPX_MTU_CHANGE) ... > but deemed that this would be simpler. I assume the function `rte_ifpx_events_available` returns current available events. That is at time t0, if we have used 3 events the return of function will give back ` return ((1ULL << RTE_IFPX_NUM_EVENTS) - 1 - ifpx_consumed_events);`. Snipped > > > >> + > >> +void rte_ifpx_callbacks_unregister(void) > >> +{ > >> + rte_spinlock_lock(&ifpx_lock); > >> + memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs)); > > What would happen to pending events, are agreeing to drop all? > > ifpx_events_notify() is called under the same lock. So either someone calls this > unregister and then notify will not find any callback or the other way. Note > that notify drops the lock for the time of callback call (to allow modifications > from the callback) but the pointer is first copied - so the behaviour would be as > if the unregister was called later. > > I'm not sure I answered your question though - if not then please ask again > with some more details. Let us assume we have 3 callbacks to service for event_a namely cb-1, cb-2, and cb-3. So tail-list cb-1->cb-2->cb3, the user invoked unregister. 
What will happen to the 3 events? Should we finish the 3 callback handler and then remove.

snipped

> > Assuming all the events are executed `if and only if` the current process if
> Primary? If it is secondary for physical interface certain `rte_eth_api` will fail.
> Can we have check the events are processed for primary only?
>
> Yes that was my assumption however at the moment I'm using:
> - rte_eth_iterator_init/next/cleanup()
> - rte_eth_dev_info_get()
> - rte_eth_dev_get_mtu()
> - rte_eth_macaddr_get()
> - rte_eth_dev_mac_addr_add()
> - rte_dev_probe/remove()
>
> Is there a problem with these? If it is, then I'll think about adding check for
> secondary.

Based on my limited testing with PF and VF, certain functions works and other do not. In case of TUN PMD set/get mac_addr is not present.

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library
  2020-04-08 3:04 ` Varghese, Vipin
@ 2020-04-08 18:13 ` Andrzej Ostruszka [C]
  0 siblings, 0 replies; 64+ messages in thread
From: Andrzej Ostruszka [C] @ 2020-04-08 18:13 UTC (permalink / raw)
  To: Varghese, Vipin, dev, Thomas Monjalon

On 4/8/20 5:04 AM, Varghese, Vipin wrote:
> Hi Andrzej,
>
> Thanks for the reply. Please find explanations for some of the queries
>
> snipped
>>>> +uint64_t rte_ifpx_events_available(void) {
>>>> +	/* All events are supported on Linux. */
>>>> +	return (1ULL << RTE_IFPX_NUM_EVENTS) - 1;
>>> Should we give the available from the used count?
>>
>> I'm not sure I follow what you wanted to ask. I want to return bitmask with
>> each bit being lit for every event type. I could go with or'ing of all (1ULL <<
>> RTE_IFPX_MAC_CHANGE) | (1ULL << RTE_IFPX_MTU_CHANGE) ...
>> but deemed that this would be simpler.
>
> I assume the function `rte_ifpx_events_available` returns current available events. That is at time t0, if we have used 3 events the return of function will give back ` return ((1ULL << RTE_IFPX_NUM_EVENTS) - 1 - ifpx_consumed_events);`.

It returns events available on given platform - static thing, dependent
on the implementation of IF Proxy (currently only Linux supported though
- and it supports all events defined so far).

>>>> +
>>>> +void rte_ifpx_callbacks_unregister(void)
>>>> +{
>>>> +	rte_spinlock_lock(&ifpx_lock);
>>>> +	memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs));
>>> What would happen to pending events, are agreeing to drop all?
>>
>> ifpx_events_notify() is called under the same lock. So either someone calls this
>> unregister and then notify will not find any callback or the other way. Note
>> that notify drops the lock for the time of callback call (to allow modifications
>> from the callback) but the pointer is first copied - so the behaviour would be as
>> if the unregister was called later.
>>
>> I'm not sure I answered your question though - if not then please ask again
>> with some more details.
>
> Let us assume we have 3 callbacks to service for event_a namely cb-1, cb-2, and cb-3. So tail-list cb-1->cb-2->cb3, the user invoked unregister. What will happen to the 3 events? Should we finish the 3 callback handler and then remove.

Hhhmmm, have you been reviewing latest version? With the introduction
of event queues there is now only one global set of callbacks (no list),
so only 1 callback for each possible event type.

>>> Assuming all the events are executed `if and only if` the current process if
>> Primary? If it is secondary for physical interface certain `rte_eth_api` will fail.
>> Can we have check the events are processed for primary only?
>>
>> Yes that was my assumption however at the moment I'm using:
>> - rte_eth_iterator_init/next/cleanup()
>> - rte_eth_dev_info_get()
>> - rte_eth_dev_get_mtu()
>> - rte_eth_macaddr_get()
>> - rte_eth_dev_mac_addr_add()
>> - rte_dev_probe/remove()
>>
>> Is there a problem with these? If it is, then I'll think about adding check for
>> secondary.
> Based on my limited testing with PF and VF, certain functions works and other do not. In case of TUN PMD set/get mac_addr is not present.

TUN is not being used (for that reason) - only TAP. I could add check
for PRIMARY, but that way I would be artificially excluding cases where
that would work without the change. So for now I intend to leave things
like they are and address the actual problem (if it pops up).

Note also that I'm not checking errors for the mac_get/set so if given
functionality is not supported nothing will happen.

With regards
Andrzej Ostruszka

^ permalink raw reply [flat|nested] 64+ messages in thread
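
Editorial aside on the exchange above: the question was whether
rte_ifpx_events_available() reports a static set of supported event
types or a running count of remaining events. The answer given is the
former. Below is a minimal sketch of that idiom only. The enum and the
demo_ names are hypothetical stand-ins, not the library's definitions;
just the `(1ULL << N) - 1` expression is taken from the quoted patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event types; the real list lives in rte_if_proxy.h. */
enum demo_event_type {
	DEMO_MAC_CHANGE,
	DEMO_MTU_CHANGE,
	DEMO_LINK_CHANGE,
	DEMO_NUM_EVENTS	/* must stay last: number of event types */
};

/* Mask with one bit set for every event type the platform supports.
 * It depends only on the implementation, not on runtime state.
 */
static uint64_t demo_events_available(void)
{
	return (1ULL << DEMO_NUM_EVENTS) - 1;
}

/* Non-zero when the given event type is present in the mask. */
static int demo_event_supported(uint64_t mask, enum demo_event_type t)
{
	return (mask & (1ULL << t)) != 0;
}
```

With three event types the mask is 0x7, and it stays 0x7 no matter how
many events have already been delivered.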
* [dpdk-dev] [PATCH 2/4] if_proxy: add library documentation
  2020-03-06 16:41 [dpdk-dev] [PATCH 0/4] Introduce IF proxy library Andrzej Ostruszka
  2020-03-06 16:41 ` [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library Andrzej Ostruszka
@ 2020-03-06 16:41 ` Andrzej Ostruszka
  2020-03-06 16:41 ` [dpdk-dev] [PATCH 3/4] if_proxy: add simple functionality test Andrzej Ostruszka
  ` (6 subsequent siblings)
  8 siblings, 0 replies; 64+ messages in thread
From: Andrzej Ostruszka @ 2020-03-06 16:41 UTC (permalink / raw)
  To: dev, Thomas Monjalon, John McNamara, Marko Kovacevic

This commit adds documentation of IF Proxy library.

Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com>
---
 MAINTAINERS                            |   1 +
 doc/guides/prog_guide/if_proxy_lib.rst | 142 +++++++++++++++++++++++++
 doc/guides/prog_guide/index.rst        |   1 +
 3 files changed, 144 insertions(+)
 create mode 100644 doc/guides/prog_guide/if_proxy_lib.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index aec7326ca..3854d7661 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1472,6 +1472,7 @@ F: doc/guides/prog_guide/bpf_lib.rst
 IF Proxy - EXPERIMENTAL
 M: Andrzej Ostruszka <aostruszka@marvell.com>
 F: lib/librte_if_proxy/
+F: doc/guides/prog_guide/if_proxy_lib.rst

 Test Applications
 -----------------
diff --git a/doc/guides/prog_guide/if_proxy_lib.rst b/doc/guides/prog_guide/if_proxy_lib.rst
new file mode 100644
index 000000000..f0b9ed70d
--- /dev/null
+++ b/doc/guides/prog_guide/if_proxy_lib.rst
@@ -0,0 +1,142 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+   Copyright(C) 2020 Marvell International Ltd.
+
+.. _IF_Proxy_Library:
+
+IF Proxy Library
+================
+
+When a network interface is assigned to DPDK it usually disappears from
+the system and user loses ability to configure it via typical
+configuration tools.
+There are basically two options to deal with this situation:
+
+- configure it via command line arguments and/or load configuration
+  from some file,
+- add support for live configuration via some IPC mechanism.
+ +The first option is static and the second one requires some work to add +communication loop (e.g. separate thread listening/communicating on +a socket). + +This library adds a possibility to configure DPDK ports by using normal +configuration utilities (e.g. from iproute2 suite). +It requires user to configure additional DPDK ports that are visible to +the system (such as Tap or KNI - actually any port that has valid +`if_index` in ``struct rte_eth_dev_info`` will do) and designate them as +a port representor (a proxy) in the system. + +Let's see typical intended usage by an example. +Suppose that you have application that handles traffic on two ports (in +the white list below):: + + ./app -w 00:14.0 -w 00:16.0 --vdev=net_tap0 --vdev=net_tap1 + +So in addition to the "regular" ports you need to configure proxy ports. +These proxy ports can be created via a command line (like above) or from +within the application (e.g. by using `rte_ifpx_proxy_create()` +function). + +When you have proxy ports you need to bind them to the "regular" ports:: + + rte_ifpx_port_bind(port0, proxy0); + rte_ifpx_port_bind(port1, proxy1); + +This binding is a logical one - there is no automatic packet forwarding +configured. +This is because library cannot tell upfront what portion of the traffic +received on ports 0/1 should be redirected to the system via proxies and +also it does not know how the application is structured (what packet +processing engines it uses). +Therefore it is application writer responsibility to include proxy ports +into its packet processing and forward appropriate packets between +proxies and ports. +What the library actually does is that it gets network configuration +from the system and listens to its changes. +This information is then matched against `if_index` of the configured +proxies and passed to the application. + +There are two mechanisms via which library passes notifications to the +application. 
+First is the set of global callbacks that user has +to register via:: + + rte_ifpx_callbacks_register(&cbs); + +Here `cbs` is a ``struct rte_ifpx_callbacks`` which has following +members:: + + int (*mac_change)(const struct rte_ifpx_mac_change *event); + int (*mtu_change)(const struct rte_ifpx_mtu_change *event); + int (*link_change)(const struct rte_ifpx_link_change *event); + int (*addr_add)(const struct rte_ifpx_addr_change *event); + int (*addr_del)(const struct rte_ifpx_addr_change *event); + int (*addr6_add)(const struct rte_ifpx_addr6_change *event); + int (*addr6_del)(const struct rte_ifpx_addr6_change *event); + int (*route_add)(const struct rte_ifpx_route_change *event); + int (*route_del)(const struct rte_ifpx_route_change *event); + int (*route6_add)(const struct rte_ifpx_route6_change *event); + int (*route6_del)(const struct rte_ifpx_route6_change *event); + int (*neigh_add)(const struct rte_ifpx_neigh_change *event); + int (*neigh_del)(const struct rte_ifpx_neigh_change *event); + int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event); + int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event); + int (*cfg_done)(void); + +All of them should be self explanatory apart from the last one which is +library specific callback - called when initial network configuration +query is finished. + +So for example when the user issues command:: + + ip link set dev dtap0 mtu 1600 + +then library will call `mtu_change()` callback with MTU change event +having port_id equal to `port0` (id of the port bound to this proxy) and +`mtu` equal to 1600 (``dtap0`` is the default interface name for +``net_tap0``). +Application can simply use `rte_eth_dev_set_mtu()` in this callback. +The same way `rte_eth_dev_default_mac_addr_set()` can be used in +`mac_change()` and `rte_eth_dev_set_link_up/down()` inside the +`link_change()` callback that does dispatch based on `is_up` member of +its `event` argument. 
+ +Please note however that the context in which these callbacks are called +is most probably different from the one in which packets are handled and +it is application writer responsibility to use proper synchronization +mechanisms - if they are needed. + +Second notification mechanism relies on queueing of event notifications +to the configured notification rings. +Application can add queue via:: + + int rte_ifpx_queue_add(struct rte_ring *r); + +This type of notification is used when there is no callback registered +for given type of event or when it is registered but it returns 0. +This way application has following choices: + +- if the data structure that needs to be updated due to notification + is safe to be modified by a single writer (while being used by other + readers) then it can simply do that inside the callback and return + non-zero value to signal end of the event handling + +- otherwise, when there are some common preparation steps that needs + to be done only once, application can register callback that will + perform these steps and return 0 - library will then add an event to + each registered notification queue + +- if the data structures are replicated and there are no common steps + then application can simply skip registering of the callbacks and + configure notification queues (e.g. 1 per each lcore) + +Once we have bindings in place and notification configured, the only +essential part that remains is to get the current network configuration +and start listening to its changes. +This is accomplished via a call to:: + + int rte_ifpx_listen(void); + +From that moment you should see notifications coming to your +application: first ones resulting from querying of current system +configurations and subsequent on the configuration changes. 
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index fb250abf5..2fd5a8c72 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -57,6 +57,7 @@ Programmer's Guide
     metrics_lib
     bpf_lib
     ipsec_lib
+    if_proxy_lib
     source_org
     dev_kit_build_system
     dev_kit_root_make_help
--
2.17.1

^ permalink raw reply [flat|nested] 64+ messages in thread
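
Editorial aside on the documentation patch above: it states that an
event is queued to the notification rings when no callback is
registered for it, or when the callback returns 0, and that a non-zero
return ends the handling. The sketch below models only that dispatch
rule in plain C so it can be compiled without DPDK. Everything in it
(the demo_ types, the fixed-size array standing in for a ring, the
return value of demo_notify) is a hypothetical simplification, not the
library's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified event: only what the sketch needs. */
struct demo_event {
	int type;
	int value;
};

typedef int (*demo_cb)(const struct demo_event *ev);

#define DEMO_QUEUE_CAP 8

/* Stand-in for a notification ring. */
struct demo_queue {
	struct demo_event items[DEMO_QUEUE_CAP];
	size_t count;
};

/* Deliver one event: call the callback if one is registered; if there
 * is none, or it returns 0, copy the event to every queue with room.
 * Returns the number of queues that received the event.
 */
static size_t demo_notify(demo_cb cb, struct demo_queue *queues, size_t nq,
			  const struct demo_event *ev)
{
	size_t i, delivered = 0;

	if (cb != NULL && cb(ev) != 0)
		return 0; /* handled entirely inside the callback */

	for (i = 0; i < nq; i++) {
		if (queues[i].count < DEMO_QUEUE_CAP) {
			queues[i].items[queues[i].count++] = *ev;
			delivered++;
		}
	}
	return delivered;
}

/* Example callbacks: one consumes the event, one only prepares it. */
static int demo_cb_consume(const struct demo_event *ev)
{
	(void)ev;
	return 1;
}

static int demo_cb_prepare(const struct demo_event *ev)
{
	(void)ev;
	return 0;
}
```

The three choices listed in the documentation map onto the three ways
of calling demo_notify: with a consuming callback, with a preparing
callback, and with no callback at all.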
* [dpdk-dev] [PATCH 3/4] if_proxy: add simple functionality test 2020-03-06 16:41 [dpdk-dev] [PATCH 0/4] Introduce IF proxy library Andrzej Ostruszka 2020-03-06 16:41 ` [dpdk-dev] [PATCH 1/4] lib: introduce IF Proxy library Andrzej Ostruszka 2020-03-06 16:41 ` [dpdk-dev] [PATCH 2/4] if_proxy: add library documentation Andrzej Ostruszka @ 2020-03-06 16:41 ` Andrzej Ostruszka 2020-03-06 16:41 ` [dpdk-dev] [PATCH 4/4] if_proxy: add example application Andrzej Ostruszka ` (5 subsequent siblings) 8 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-03-06 16:41 UTC (permalink / raw) To: dev, Thomas Monjalon This commit adds simple test of the library notifications. Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 1 + app/test/Makefile | 5 + app/test/meson.build | 4 + app/test/test_if_proxy.c | 706 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 716 insertions(+) create mode 100644 app/test/test_if_proxy.c diff --git a/MAINTAINERS b/MAINTAINERS index 3854d7661..a92cb7356 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1472,6 +1472,7 @@ F: doc/guides/prog_guide/bpf_lib.rst IF Proxy - EXPERIMENTAL M: Andrzej Ostruszka <aostruszka@marvell.com> F: lib/librte_if_proxy/ +F: app/test/test_if_proxy.c F: doc/guides/prog_guide/if_proxy_lib.rst Test Applications diff --git a/app/test/Makefile b/app/test/Makefile index 1f080d162..dc287f94b 100644 --- a/app/test/Makefile +++ b/app/test/Makefile @@ -231,6 +231,11 @@ SRCS-$(CONFIG_RTE_LIBRTE_BPF) += test_bpf.c SRCS-$(CONFIG_RTE_LIBRTE_RCU) += test_rcu_qsbr.c test_rcu_qsbr_perf.c +ifeq ($(CONFIG_RTE_LIBRTE_IF_PROXY),y) +SRCS-y += test_if_proxy.c +LDLIBS += -lrte_if_proxy +endif + SRCS-$(CONFIG_RTE_LIBRTE_IPSEC) += test_ipsec.c SRCS-$(CONFIG_RTE_LIBRTE_IPSEC) += test_ipsec_sad.c ifeq ($(CONFIG_RTE_LIBRTE_IPSEC),y) diff --git a/app/test/meson.build b/app/test/meson.build index 0a2ce710f..870c3a8bb 100644 --- a/app/test/meson.build +++ b/app/test/meson.build @@ -352,6 +352,10 @@ 
endif if dpdk_conf.has('RTE_LIBRTE_PDUMP') test_deps += 'pdump' endif +if dpdk_conf.has('RTE_LIBRTE_IF_PROXY') + test_deps += 'if_proxy' + test_sources += 'test_if_proxy.c' +endif cflags = machine_args if cc.has_argument('-Wno-format-truncation') diff --git a/app/test/test_if_proxy.c b/app/test/test_if_proxy.c new file mode 100644 index 000000000..acd496126 --- /dev/null +++ b/app/test/test_if_proxy.c @@ -0,0 +1,706 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ + +#include "test.h" + +#include <rte_ethdev.h> +#include <rte_if_proxy.h> +#include <rte_cycles.h> + +#include <string.h> +#include <unistd.h> +#include <signal.h> +#include <net/if.h> +#include <arpa/inet.h> +#include <pthread.h> +#include <time.h> + +/* There are two types of event notifications - one using callbacks and one + * using event queues (rings). We'll test them both and this "bool" will govern + * the type of API to use. + */ +static int use_callbacks = 1; +static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; +static pthread_cond_t cond = PTHREAD_COND_INITIALIZER; + +static struct rte_ring *ev_queue; + +enum net_event_mask { + INITIALIZED = 1U << RTE_IFPX_CFG_DONE, + LINK_CHANGED = 1U << RTE_IFPX_LINK_CHANGE, + MAC_CHANGED = 1U << RTE_IFPX_MAC_CHANGE, + MTU_CHANGED = 1U << RTE_IFPX_MTU_CHANGE, + ADDR_ADD = 1U << RTE_IFPX_ADDR_ADD, + ADDR_DEL = 1U << RTE_IFPX_ADDR_DEL, + ROUTE_ADD = 1U << RTE_IFPX_ROUTE_ADD, + ROUTE_DEL = 1U << RTE_IFPX_ROUTE_DEL, + ADDR6_ADD = 1U << RTE_IFPX_ADDR6_ADD, + ADDR6_DEL = 1U << RTE_IFPX_ADDR6_DEL, + ROUTE6_ADD = 1U << RTE_IFPX_ROUTE6_ADD, + ROUTE6_DEL = 1U << RTE_IFPX_ROUTE6_DEL, + NEIGH_ADD = 1U << RTE_IFPX_NEIGH_ADD, + NEIGH_DEL = 1U << RTE_IFPX_NEIGH_DEL, + NEIGH6_ADD = 1U << RTE_IFPX_NEIGH6_ADD, + NEIGH6_DEL = 1U << RTE_IFPX_NEIGH6_DEL, +}; + +static unsigned int state; + +static struct { + struct rte_ether_addr mac_addr; + uint16_t port_id, mtu; + struct in_addr ipv4, route4; + struct in6_addr ipv6, 
route6; + uint16_t depth4, depth6; + int is_up; +} net_cfg; + +static +int unlock_notify(unsigned int op) +{ + /* the mutex is expected to be locked on entry */ + RTE_VERIFY(pthread_mutex_trylock(&mutex) == EBUSY); + state |= op; + + pthread_mutex_unlock(&mutex); + return pthread_cond_signal(&cond); +} + +static +void handle_event(struct rte_ifpx_event *ev); + +static +int wait_for(unsigned int op_mask, unsigned int sec) +{ + int ec; + + if (use_callbacks) { + struct timespec time; + + ec = pthread_mutex_trylock(&mutex); + /* the mutex is expected to be locked on entry */ + RTE_VERIFY(ec == EBUSY); + + ec = 0; + clock_gettime(CLOCK_REALTIME, &time); + time.tv_sec += sec; + + while ((state & op_mask) != op_mask && ec == 0) + ec = pthread_cond_timedwait(&cond, &mutex, &time); + } else { + uint64_t deadline; + struct rte_ifpx_event *ev; + + ec = 0; + deadline = rte_get_timer_cycles() + sec * rte_get_timer_hz(); + + while ((state & op_mask) != op_mask) { + if (rte_get_timer_cycles() >= deadline) { + ec = ETIMEDOUT; + break; + } + if (rte_ring_dequeue(ev_queue, (void**)&ev) == 0) + handle_event(ev); + } + } + + return ec; +} + +static +int expect(unsigned int op_mask, const char *fmt, ...) +#if __GNUC__ + __attribute__((format(printf, 2, 3))); +#endif + +static +int expect(unsigned int op_mask, const char *fmt, ...) +{ + char cmd[128]; + va_list args; + int ret; + + state &= ~op_mask; + va_start(args, fmt); + vsnprintf(cmd, sizeof(cmd), fmt, args); + va_end(args); + ret = system(cmd); + if (ret == 0) + /* IPv6 address notifications seem to need that long delay. 
*/ + return wait_for(op_mask, 2); + return ret; +} + +static +int mac_change(const struct rte_ifpx_mac_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(MAC_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int mtu_change(const struct rte_ifpx_mtu_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->mtu == net_cfg.mtu) { + unlock_notify(MTU_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int link_change(const struct rte_ifpx_link_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->is_up == net_cfg.is_up) { + /* Special case for testing of callbacks modification from + * inside of callback: we catch putting link down (the last + * operation in test) and remove callbacks registered. 
+ */ + if (!ev->is_up) { + rte_ifpx_callbacks_unregister(); + } + unlock_notify(LINK_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr_add(const struct rte_ifpx_addr_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->ip == net_cfg.ipv4.s_addr) { + unlock_notify(ADDR_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr_del(const struct rte_ifpx_addr_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->ip == net_cfg.ipv4.s_addr) { + unlock_notify(ADDR_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr6_add(const struct rte_ifpx_addr6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(ADDR6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr6_del(const struct rte_ifpx_addr6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(ADDR6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route_add(const struct rte_ifpx_route_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth4 == ev->depth && net_cfg.route4.s_addr == ev->ip) { + unlock_notify(ROUTE_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route_del(const struct rte_ifpx_route_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth4 == ev->depth && net_cfg.route4.s_addr == ev->ip) { + unlock_notify(ROUTE_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route6_add(const struct rte_ifpx_route6_change *ev) +{ + pthread_mutex_lock(&mutex); + 
RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth6 == ev->depth && + /* don't check for trailing zeros */ + memcmp(ev->ip, net_cfg.route6.s6_addr, ev->depth/8) == 0) { + unlock_notify(ROUTE6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route6_del(const struct rte_ifpx_route6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth6 == ev->depth && + /* don't check for trailing zeros */ + memcmp(ev->ip, net_cfg.route6.s6_addr, ev->depth/8) == 0) { + unlock_notify(ROUTE6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh_add(const struct rte_ifpx_neigh_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.ipv4.s_addr == ev->ip && + memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(NEIGH_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh_del(const struct rte_ifpx_neigh_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.ipv4.s_addr == ev->ip) { + unlock_notify(NEIGH_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh6_add(const struct rte_ifpx_neigh6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0 && + memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(NEIGH6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh6_del(const struct rte_ifpx_neigh6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(NEIGH6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int cfg_done(void) +{ + 
pthread_mutex_lock(&mutex); + unlock_notify(INITIALIZED); + return 1; +} + +static +void handle_event(struct rte_ifpx_event *ev) +{ + if (ev->type != RTE_IFPX_CFG_DONE) + RTE_VERIFY(ev->data.port_id == net_cfg.port_id); + + /* If params do not match what we expect just free the event. */ + switch (ev->type) { + case RTE_IFPX_MAC_CHANGE: + if (memcmp(ev->mac_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_MTU_CHANGE: + if (ev->mtu_change.mtu != net_cfg.mtu) + goto exit; + break; + case RTE_IFPX_LINK_CHANGE: + if (ev->link_change.is_up != net_cfg.is_up) + goto exit; + break; + case RTE_IFPX_ADDR_ADD: + if (ev->addr_change.ip != net_cfg.ipv4.s_addr) + goto exit; + break; + case RTE_IFPX_ADDR_DEL: + if (ev->addr_change.ip != net_cfg.ipv4.s_addr) + goto exit; + break; + case RTE_IFPX_ADDR6_ADD: + if (memcmp(ev->addr6_change.ip, net_cfg.ipv6.s6_addr, + 16) != 0) + goto exit; + break; + case RTE_IFPX_ADDR6_DEL: + if (memcmp(ev->addr6_change.ip, net_cfg.ipv6.s6_addr, + 16) != 0) + goto exit; + break; + case RTE_IFPX_ROUTE_ADD: + if (net_cfg.depth4 != ev->route_change.depth || + net_cfg.route4.s_addr != ev->route_change.ip) + goto exit; + break; + case RTE_IFPX_ROUTE_DEL: + if (net_cfg.depth4 != ev->route_change.depth || + net_cfg.route4.s_addr != ev->route_change.ip) + goto exit; + break; + case RTE_IFPX_ROUTE6_ADD: + if (net_cfg.depth6 != ev->route6_change.depth || + /* don't check for trailing zeros */ + memcmp(ev->route6_change.ip, net_cfg.route6.s6_addr, + ev->route6_change.depth/8) != 0) + goto exit; + break; + case RTE_IFPX_ROUTE6_DEL: + if (net_cfg.depth6 != ev->route6_change.depth || + /* don't check for trailing zeros */ + memcmp(ev->route6_change.ip, net_cfg.route6.s6_addr, + ev->route6_change.depth/8) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH_ADD: + if (net_cfg.ipv4.s_addr != ev->neigh_change.ip || + memcmp(ev->neigh_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, 
RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH_DEL: + if (net_cfg.ipv4.s_addr != ev->neigh_change.ip) + goto exit; + break; + case RTE_IFPX_NEIGH6_ADD: + if (memcmp(ev->neigh6_change.ip, + net_cfg.ipv6.s6_addr, 16) != 0 || + memcmp(ev->neigh6_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH6_DEL: + if (memcmp(ev->neigh6_change.ip, net_cfg.ipv6.s6_addr, 16) != 0) + goto exit; + break; + case RTE_IFPX_CFG_DONE: + break; + default: + RTE_VERIFY(0 && "Unhandled event type"); + } + + state |= 1U << ev->type; +exit: + free(ev); +} + +static +struct rte_ifpx_callbacks cbs = { + .mac_change = mac_change, + .mtu_change = mtu_change, + .link_change = link_change, + .addr_add = addr_add, + .addr_del = addr_del, + .addr6_add = addr6_add, + .addr6_del = addr6_del, + .route_add = route_add, + .route_del = route_del, + .route6_add = route6_add, + .route6_del = route6_del, + .neigh_add = neigh_add, + .neigh_del = neigh_del, + .neigh6_add = neigh6_add, + .neigh6_del = neigh6_del, + /* lib specific callback */ + .cfg_done = cfg_done, +}; + +static +int test_notifications(const struct rte_ifpx_info *pinfo) +{ + char mac_buf[RTE_ETHER_ADDR_FMT_SIZE]; + int ec; + + /* Test link up notification. */ + net_cfg.is_up = 1; + ec = expect(LINK_CHANGED, "ip link set dev %s up", pinfo->if_name); + if (ec != 0) { + printf("Failed to notify about link going up\n"); + return ec; + } + + /* Test for MAC changes notification. */ + rte_eth_random_addr(net_cfg.mac_addr.addr_bytes); + rte_ether_format_addr(mac_buf, sizeof(mac_buf), &net_cfg.mac_addr); + ec = expect(MAC_CHANGED, "ip link set dev %s address %s", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notification about mac change\n"); + return ec; + } + + /* Test for MTU changes notification. 
*/ + net_cfg.mtu = pinfo->mtu + 100; + ec = expect(MTU_CHANGED, "ip link set dev %s mtu %d", + pinfo->if_name, net_cfg.mtu); + if (ec != 0) { + printf("Missing/wrong notification about mtu change\n"); + return ec; + } + + /* Test for adding of IPv4 address - using address from TEST-2 pool. + * This test is specific to linux netlink behaviour - after adding + * address we get both notification about address being added and new + * route. So I check both. + */ + net_cfg.ipv4.s_addr = RTE_IPV4(198, 51, 100, 14); + net_cfg.route4.s_addr = net_cfg.ipv4.s_addr; + net_cfg.depth4 = 32; + ec = expect(ADDR_ADD | ROUTE_ADD, "ip addr add 198.51.100.14 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 address add\n"); + return ec; + } + + /* Test for IPv4 address removal. See comment above for 'addr add'. */ + ec = expect(ADDR_DEL | ROUTE_DEL, "ip addr del 198.51.100.14/32 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 address del\n"); + return ec; + } + + /* Test for adding IPv4 route. */ + net_cfg.route4.s_addr = RTE_IPV4(198, 51, 100, 0); + net_cfg.depth4 = 24; + ec = expect(ROUTE_ADD, "ip route add 198.51.100.0/24 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 route add\n"); + return ec; + } + + /* Test for IPv4 route removal. */ + ec = expect(ROUTE_DEL, "ip route del 198.51.100.0/24 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 route del\n"); + return ec; + } + + /* Test for neighbour addresses notifications. 
*/ + rte_eth_random_addr(net_cfg.mac_addr.addr_bytes); + rte_ether_format_addr(mac_buf, sizeof(mac_buf), &net_cfg.mac_addr); + + ec = expect(NEIGH_ADD, + "ip neigh add 198.51.100.14 dev %s lladdr %s nud noarp", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 neighbour add\n"); + return ec; + } + + ec = expect(NEIGH_DEL, "ip neigh del 198.51.100.14 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 neighbour del\n"); + return ec; + } + + /* Now the same for IPv6 - with address from "documentation pool". */ + inet_pton(AF_INET6, "2001:db8::dead:beef", net_cfg.ipv6.s6_addr); + /* This is specific to linux netlink behaviour - after adding address + * we get both notification about address being added and new route. + * So I wait for both. + */ + memcpy(net_cfg.route6.s6_addr, net_cfg.ipv6.s6_addr, 16); + net_cfg.depth6 = 128; + ec = expect(ADDR6_ADD | ROUTE6_ADD, + "ip addr add 2001:db8::dead:beef dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 address add\n"); + return ec; + } + + /* See comment above for 'addr6 add'. 
*/ + ec = expect(ADDR6_DEL | ROUTE6_DEL, + "ip addr del 2001:db8::dead:beef/128 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 address del\n"); + return ec; + } + + net_cfg.depth6 = 96; + ec = expect(ROUTE6_ADD, "ip route add 2001:db8::dead:0/96 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 route add\n"); + return ec; + } + + ec = expect(ROUTE6_DEL, "ip route del 2001:db8::dead:0/96 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 route del\n"); + return ec; + } + + ec = expect(NEIGH6_ADD, + "ip neigh add 2001:db8::dead:beef dev %s lladdr %s nud noarp", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 neighbour add\n"); + return ec; + } + + ec = expect(NEIGH6_DEL, "ip neigh del 2001:db8::dead:beef dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 neighbour del\n"); + return ec; + } + + /* Finally put link down and test for notification. */ + net_cfg.is_up = 0; + ec = expect(LINK_CHANGED, "ip link set dev %s down", pinfo->if_name); + if (ec != 0) { + printf("Failed to notify about link going down\n"); + return ec; + } + + return 0; +} + +static +int test_if_proxy(void) +{ + int ec; + const struct rte_ifpx_info *pinfo; + uint16_t proxy_id; + + state = 0; + memset(&net_cfg, 0, sizeof(net_cfg)); + + if (rte_eth_dev_count_avail() == 0) { + printf("Run this test with at least one port configured\n"); + return 1; + } + /* User the first port available. */ + RTE_ETH_FOREACH_DEV(net_cfg.port_id) + break; + proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT); + RTE_VERIFY(proxy_id != RTE_MAX_ETHPORTS); + rte_ifpx_port_bind(net_cfg.port_id, proxy_id); + rte_ifpx_callbacks_register(&cbs); + rte_ifpx_listen(); + + /* Let's start with callback based API. 
*/ + use_callbacks = 1; + pthread_mutex_lock(&mutex); + ec = wait_for(INITIALIZED, 2); + if (ec != 0) { + printf("Failed to obtain network configuration\n"); + goto exit; + } + pinfo = rte_ifpx_info_get(net_cfg.port_id); + RTE_VERIFY(pinfo); + + /* Make sure the link is down. */ + net_cfg.is_up = 0; + ec = expect(LINK_CHANGED, "ip link set dev %s down", pinfo->if_name); + RTE_VERIFY(ec == ETIMEDOUT || ec == 0); + + ec = test_notifications(pinfo); + if (ec != 0) { + printf("Failed test with callback based API\n"); + goto exit; + } + /* Switch to event queue based API and repeat tests. */ + use_callbacks = 0; + ev_queue = rte_ring_create("IFPX-events", 16, SOCKET_ID_ANY, + RING_F_SP_ENQ | RING_F_SC_DEQ); + ec = rte_ifpx_queue_add(ev_queue); + if (ec != 0) { + printf("Failed to add a notification queue\n"); + goto exit; + } + ec = test_notifications(pinfo); + if (ec != 0) { + printf("Failed test with event queue based API\n"); + goto exit; + } + +exit: + pthread_mutex_unlock(&mutex); + /* Proxy ports are not owned by the lib. Internal references to them + * are cleared on close, but the ports are not destroyed so we need to + * do that explicitly. + */ + rte_ifpx_proxy_destroy(proxy_id); + rte_ifpx_close(); + /* Queue is removed from the lib by rte_ifpx_close() - here we just + * free it. + */ + rte_ring_free(ev_queue); + ev_queue = NULL; + + return ec; +} + +REGISTER_TEST_COMMAND(if_proxy_autotest, test_if_proxy) -- 2.17.1 ^ permalink raw reply [flat|nested] 64+ messages in thread
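
Editorial aside on the test patch above: in callback mode wait_for()
blocks on a condition variable until every bit of the expected mask is
set in `state` or an absolute deadline passes. The sketch below isolates
that pattern with POSIX threads only. It is a simplified variant under
stated assumptions: the demo_ names are hypothetical, and unlike the
test (which keeps the mutex held by the test thread between calls) it
takes and releases the mutex inside each function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t demo_cond = PTHREAD_COND_INITIALIZER;
static unsigned int demo_state;

/* Notifier side: record the bits and wake the waiter. */
static void demo_signal(unsigned int bits)
{
	pthread_mutex_lock(&demo_mutex);
	demo_state |= bits;
	pthread_cond_signal(&demo_cond);
	pthread_mutex_unlock(&demo_mutex);
}

/* Waiter side: block until all bits in mask are set or sec seconds
 * elapse. Returns 0 if the loop ends because the bits are set,
 * otherwise the error reported by pthread_cond_timedwait()
 * (ETIMEDOUT on timeout).
 */
static int demo_wait_for(unsigned int mask, unsigned int sec)
{
	struct timespec deadline;
	int ec = 0;

	/* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME time
	 * for a default-initialized condition variable.
	 */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += sec;

	pthread_mutex_lock(&demo_mutex);
	/* Re-check the predicate after every wakeup (spurious or not). */
	while ((demo_state & mask) != mask && ec == 0)
		ec = pthread_cond_timedwait(&demo_cond, &demo_mutex,
					    &deadline);
	pthread_mutex_unlock(&demo_mutex);
	return ec;
}
```

Computing the deadline once, outside the loop, keeps the total wait
bounded even when the thread is woken several times before the mask is
complete.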
* [dpdk-dev] [PATCH 4/4] if_proxy: add example application 2020-03-06 16:41 [dpdk-dev] [PATCH 0/4] Introduce IF proxy library Andrzej Ostruszka ` (2 preceding siblings ...) 2020-03-06 16:41 ` [dpdk-dev] [PATCH 3/4] if_proxy: add simple functionality test Andrzej Ostruszka @ 2020-03-06 16:41 ` Andrzej Ostruszka 2020-03-06 17:17 ` [dpdk-dev] [PATCH 0/4] Introduce IF proxy library Andrzej Ostruszka ` (4 subsequent siblings) 8 siblings, 0 replies; 64+ messages in thread
From: Andrzej Ostruszka @ 2020-03-06 16:41 UTC (permalink / raw) To: dev, Thomas Monjalon

Add an example application showing possible library usage. This is a simplified version of l3fwd where:
- many performance improvements have been removed in order to simplify the logic and put the focus on the proxy library usage,
- the configuration of forwarding has to be done by the user (using typical system tools on the proxy ports); these changes are passed to the application via library notifications.

It is meant to show how you can update some data from callbacks (routing - see note below) and how data that is replicated (e.g. kept per lcore) can be updated via event queueing (here the neighbour info).

Note: This example assumes that LPM tables can be updated by a single writer while being used by others. To the best of the author's knowledge this is the case (by preliminary code inspection) but DPDK does not make such a promise. Obviously, upon a change, there will be a transient period (when some IPs will still be directed to the old destination) but that is expected.

Note also that in some cases you might need to tweak your system configuration to see the effects. For example, you send a Gratuitous ARP to a DPDK port and expect the neighbour tables in the application to be updated, but this does not happen. The packet is passed to the kernel, but the kernel might drop it. Please check /proc/sys/net/ipv4/conf/dtap0/arp_accept and related configuration options ('dtap0' here is just the name of your proxy port).
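
The split between callback updates and queued updates described above can be sketched in plain C. This is not code from the patch and it does not use the DPDK API: it assumes single-threaded execution, and every name in it (neigh_event, ev_queue, on_route_add, on_neigh_add, drain_events) is hypothetical. Shared state is modified directly in the notification callback, while replicated state is updated by copying the event into one queue per worker, which each worker drains on its own schedule.

```c
#include <stdint.h>
#include <string.h>

#define N_WORKERS  2
#define QUEUE_SIZE 8    /* hypothetical capacity, power of two */
#define MAX_NEIGH  16

/* Hypothetical event: a neighbour (IP -> MAC) update. */
struct neigh_event {
	uint32_t ip;
	uint8_t mac[6];
};

/* Toy queue, one per worker.  Single-threaded illustration only:
 * a real SPSC queue needs atomics/barriers (or a ready-made ring).
 */
struct ev_queue {
	struct neigh_event slots[QUEUE_SIZE];
	unsigned head;  /* count of events written */
	unsigned tail;  /* count of events read */
};

/* Replicated data: every worker owns a private neighbour table. */
struct worker {
	struct ev_queue q;
	uint32_t neigh_ip[MAX_NEIGH];
	uint8_t neigh_mac[MAX_NEIGH][6];
	unsigned n_neigh;
};

static struct worker workers[N_WORKERS];

/* Shared data: stands in for a routing table with a single writer. */
static uint32_t shared_route_count;

static int ev_enqueue(struct ev_queue *q, const struct neigh_event *ev)
{
	if (q->head - q->tail == QUEUE_SIZE)
		return -1;      /* queue full, event dropped */
	q->slots[q->head % QUEUE_SIZE] = *ev;
	q->head++;
	return 0;
}

static int ev_dequeue(struct ev_queue *q, struct neigh_event *ev)
{
	if (q->head == q->tail)
		return -1;      /* queue empty */
	*ev = q->slots[q->tail % QUEUE_SIZE];
	q->tail++;
	return 0;
}

/* Callback path: the shared table is updated in place. */
static void on_route_add(void)
{
	shared_route_count++;
}

/* Event path: fan a copy of the event out to every worker queue.
 * Returns the number of queues that accepted it.
 */
static int on_neigh_add(const struct neigh_event *ev)
{
	int delivered = 0;

	for (int i = 0; i < N_WORKERS; i++)
		if (ev_enqueue(&workers[i].q, ev) == 0)
			delivered++;
	return delivered;
}

/* Worker path: apply all pending events to the private copy.
 * Returns the number of events taken off the queue.
 */
static unsigned drain_events(struct worker *w)
{
	struct neigh_event ev;
	unsigned handled = 0;

	while (ev_dequeue(&w->q, &ev) == 0) {
		if (w->n_neigh < MAX_NEIGH) {
			w->neigh_ip[w->n_neigh] = ev.ip;
			memcpy(w->neigh_mac[w->n_neigh], ev.mac, 6);
			w->n_neigh++;
		}
		handled++;
	}
	return handled;
}
```

In the example application the same roles are played by the registered callbacks (routes and addresses) and by the per-lcore event rings drained from the main loop (neighbour entries). The sketch only illustrates why a worker's copy stays unchanged until that worker drains its own queue.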
Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 1 + examples/Makefile | 1 + examples/l3fwd-ifpx/Makefile | 60 ++ examples/l3fwd-ifpx/l3fwd.c | 1123 +++++++++++++++++++++++++++++++ examples/l3fwd-ifpx/l3fwd.h | 98 +++ examples/l3fwd-ifpx/main.c | 729 ++++++++++++++++++++ examples/l3fwd-ifpx/meson.build | 11 + examples/meson.build | 2 +- 8 files changed, 2024 insertions(+), 1 deletion(-) create mode 100644 examples/l3fwd-ifpx/Makefile create mode 100644 examples/l3fwd-ifpx/l3fwd.c create mode 100644 examples/l3fwd-ifpx/l3fwd.h create mode 100644 examples/l3fwd-ifpx/main.c create mode 100644 examples/l3fwd-ifpx/meson.build This patch depends on: http://patches.dpdk.org/patch/66214/ diff --git a/MAINTAINERS b/MAINTAINERS index a92cb7356..79355f9eb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1472,6 +1472,7 @@ F: doc/guides/prog_guide/bpf_lib.rst IF Proxy - EXPERIMENTAL M: Andrzej Ostruszka <aostruszka@marvell.com> F: lib/librte_if_proxy/ +F: examples/l3fwd-ifpx/ F: app/test/test_if_proxy.c F: doc/guides/prog_guide/if_proxy_lib.rst diff --git a/examples/Makefile b/examples/Makefile index feff79784..a8cb02a6c 100644 --- a/examples/Makefile +++ b/examples/Makefile @@ -81,6 +81,7 @@ else $(info vm_power_manager requires libvirt >= 0.9.3) endif endif +DIRS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += l3fwd-ifpx DIRS-y += eventdev_pipeline diff --git a/examples/l3fwd-ifpx/Makefile b/examples/l3fwd-ifpx/Makefile new file mode 100644 index 000000000..68eefeb75 --- /dev/null +++ b/examples/l3fwd-ifpx/Makefile @@ -0,0 +1,60 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(c) 2020 Marvell International Ltd. 
+ +# binary name +APP = l3fwd + +# all source are stored in SRCS-y +SRCS-y := main.c l3fwd.c + +# Build using pkg-config variables if possible +ifeq ($(shell pkg-config --exists libdpdk && echo 0),0) + +all: shared +.PHONY: shared static +shared: build/$(APP)-shared + ln -sf $(APP)-shared build/$(APP) +static: build/$(APP)-static + ln -sf $(APP)-static build/$(APP) + +PKGCONF ?= pkg-config + +PC_FILE := $(shell $(PKGCONF) --path libdpdk 2>/dev/null) +CFLAGS += -DALLOW_EXPERIMENTAL_API -O3 $(shell $(PKGCONF) --cflags libdpdk) +LDFLAGS_SHARED = $(shell $(PKGCONF) --libs libdpdk) +LDFLAGS_STATIC = -Wl,-Bstatic $(shell $(PKGCONF) --static --libs libdpdk) + +build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build + $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED) + +build/$(APP)-static: $(SRCS-y) Makefile $(PC_FILE) | build + $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_STATIC) + +build: + @mkdir -p $@ + +.PHONY: clean +clean: + rm -f build/$(APP) build/$(APP)-static build/$(APP)-shared + test -d build && rmdir -p build || true + +else # Build using legacy build system + +ifeq ($(RTE_SDK),) +$(error "Please define RTE_SDK environment variable") +endif + +# Default target, detect a build directory, by looking for a path with a .config +RTE_TARGET ?= $(notdir $(abspath $(dir $(firstword $(wildcard $(RTE_SDK)/*/.config))))) + +include $(RTE_SDK)/mk/rte.vars.mk + +CFLAGS += -DALLOW_EXPERIMENTAL_API + +CFLAGS += -I$(SRCDIR) +CFLAGS += -O3 $(USER_FLAGS) +CFLAGS += $(WERROR_FLAGS) +LDLIBS += -lrte_if_proxy -lrte_ethdev -lrte_eal + +include $(RTE_SDK)/mk/rte.extapp.mk +endif diff --git a/examples/l3fwd-ifpx/l3fwd.c b/examples/l3fwd-ifpx/l3fwd.c new file mode 100644 index 000000000..0f5a06f9f --- /dev/null +++ b/examples/l3fwd-ifpx/l3fwd.c @@ -0,0 +1,1123 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. 
+ */ + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <inttypes.h> +#include <sys/types.h> +#include <string.h> +#include <sys/queue.h> +#include <stdarg.h> +#include <errno.h> +#include <getopt.h> +#include <sys/socket.h> +#include <arpa/inet.h> + +#include <rte_debug.h> +#include <rte_ether.h> +#include <rte_cycles.h> +#include <rte_malloc.h> +#include <rte_mbuf.h> +#include <rte_ip.h> + +#include <rte_jhash.h> +//#include <rte_hash_crc.h> + +#include <rte_tcp.h> +#include <rte_udp.h> +#include <rte_lpm.h> +#include <rte_lpm6.h> +#include <rte_if_proxy.h> + +#include "l3fwd.h" + +#define DO_RFC_1812_CHECKS + +#define IPV4_L3FWD_LPM_MAX_RULES 1024 +#define IPV4_L3FWD_LPM_NUMBER_TBL8S (1 << 8) +#define IPV6_L3FWD_LPM_MAX_RULES 1024 +#define IPV6_L3FWD_LPM_NUMBER_TBL8S (1 << 16) + +static volatile bool ifpx_ready; + +/* ethernet addresses of ports */ +static +union lladdr_t port_mac[RTE_MAX_ETHPORTS]; + +static struct rte_lpm *ipv4_routes; +static struct rte_lpm6 *ipv6_routes; + +static +struct ipv4_gateway { + uint16_t port; + union lladdr_t lladdr; + uint32_t ip; +} ipv4_gateways[128]; + +static +struct ipv6_gateway { + uint16_t port; + union lladdr_t lladdr; + uint8_t ip[16]; +} ipv6_gateways[128]; + +/* The lowest 2 bits of next hop (which is 24/21 bit for IPv4/6) are reserved to + * encode: + * 00 -> host route: higher bits of next hop are port id and dst MAC should be + * based on dst IP + * 01 -> gateway route: higher bits of next hop are index into gateway array and + * use port and MAC cached there (if no MAC cached yet then search for it + * based on gateway IP) + * 10 -> proxy entry: packet directed to us, just take higher bits as port id of + * proxy and send packet there (without any modification) + * The port id (16 bits) will always fit however this will not work if you + * need more than 2^20 gateways. 
+ */ +enum route_type { + HOST_ROUTE = 0x00, + GW_ROUTE = 0x01, + PROXY_ADDR = 0x02, +}; + +RTE_STD_C11 +_Static_assert(RTE_DIM(ipv4_gateways) <= (1 << 22) && + RTE_DIM(ipv6_gateways) <= (1 << 19), + "Gateway array index has to fit within next_hop with 2 bits reserved"); + +static +uint32_t find_add_gateway(uint16_t port, uint32_t ip) +{ + // FIXME - think how GW slots are released!!! + // Probably on removal of GW route, which means I need to check + // the rule before deleting it. + uint32_t i, idx = -1U; + + for (i = 0; i < RTE_DIM(ipv4_gateways); ++i) { + /* Remember first free slot in case GW is not present. */ + if (idx == -1U && ipv4_gateways[i].ip == 0) + idx = i; + else if (ipv4_gateways[i].ip == ip) + /* For now assume that given GW will be always at the + * same port, so no checking for that + */ + return i; + } + if (idx != -1U) { + ipv4_gateways[idx].port = port; + ipv4_gateways[idx].ip = ip; + /* Since ARP tables are kept per lcore MAC will be updated + * during first lookup. + */ + } + return idx; +} + +static +uint32_t find_add_gateway6(uint16_t port, const uint8_t *ip) +{ + uint32_t i, idx = -1U; + + for (i = 0; i < RTE_DIM(ipv6_gateways); ++i) { + /* Remember first free slot in case GW is not present. */ + if (idx == -1U && ipv6_gateways[i].ip[0] == 0) + idx = i; + else if (memcmp(ipv6_gateways[i].ip, ip, 16) == 0) + /* For now assume that given GW will be always at the + * same port, so no checking for that + */ + return i; + } + if (idx != -1U) { + ipv6_gateways[idx].port = port; + memcpy(ipv6_gateways[idx].ip, ip, 16); + /* Since ARP tables are kept per lcore MAC will be updated + * during first lookup. + */ + } + return idx; +} + +/* Assumptions: + * - Link related changes (MAC/MTU/...) need to be executed once, and it's OK + * to run them from the callback - if this is not the case (e.g.
-EBUSY for + * MTU change, then event notifications need to be used and more sophisticated + * coordination with lcore loops and stopping/starting of the ports: for + * example lcores not receiving on this port just mark it as inactive and stop + * transmitting to it and the one with RX stops the port, sets the MAC, starts + * it and notifies other lcores that it is back). + * - LPM is safe to be modified by one writer, and read by many without any + * locks (it looks to me like this is the case), however upon routing change + * there might be a transient period during which packets are not directed + * according to the new rule. + * - Hash is unsafe to be used that way (and I don't want to turn on relevant + * flags just to exercise queued notifications) so every lcore keeps its + * copy of relevant data. + * Therefore there are callbacks defined for the routing info/address changes + * and remaining ones are handled via events on per lcore basis. + */ +static +int mac_change(const struct rte_ifpx_mac_change *ev) +{ + int i; + struct rte_ether_addr mac_addr; + char buf[RTE_ETHER_ADDR_FMT_SIZE]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(buf, sizeof(buf), &ev->mac); + RTE_LOG(DEBUG, L3FWD, "MAC change for port %d: %s\n", + ev->port_id, buf); + } + // FIXME - use copy because RTE functions don't take const args + rte_ether_addr_copy(&ev->mac, &mac_addr); + i = rte_eth_dev_default_mac_addr_set(ev->port_id, &mac_addr); + if (i == -EOPNOTSUPP) + i = rte_eth_dev_mac_addr_add(ev->port_id, &mac_addr, 0); + if (i < 0) + RTE_LOG(WARNING, L3FWD, "Failed to set MAC address\n"); + else { + port_mac[ev->port_id].mac.addr = ev->mac; + port_mac[ev->port_id].mac.valid = 1; + } + return 1; +} + +#if 0 // FIXME - either remove this or add IP fragmentation +static +int mtu_change(const struct rte_ifpx_mtu_change *ev) +{ + RTE_LOG(DEBUG, L3FWD, "MTU change for port %d: %d\n", + ev->port_id, ev->mtu); + if (rte_eth_dev_set_mtu(ev->port_id,
ev->mtu) < 0) + RTE_LOG(WARNING, L3FWD, "Failed to set MTU\n"); + return 1; +} +#endif + +static +int link_change(const struct rte_ifpx_link_change *ev) +{ + RTE_LOG(DEBUG, L3FWD, "Link change for port %d: %d\n", + ev->port_id, ev->is_up); + if (ev->is_up) { + rte_eth_dev_set_link_up(ev->port_id); + active_port_mask |= 1U << ev->port_id; + } else { + rte_eth_dev_set_link_down(ev->port_id); + active_port_mask &= ~(1U << ev->port_id); + } + active_port_mask &= enabled_port_mask; + // FIXME - shouldn't proxy be marked too?? We only get port notification! + return 1; +} + +static +int addr_add(const struct rte_ifpx_addr_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 address for port %d: %s\n", + ev->port_id, buf); + } + rte_lpm_add(ipv4_routes, ev->ip, 32, + ev->port_id << 2 | PROXY_ADDR); + return 1; +} + +static +int route_add(const struct rte_ifpx_route_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t nh, ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 route for port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + + /* On Linux upon changing of the IP we get notification for both addr + * and route, so just check if we already have addr entry and if so + * then ignore this notification. 
+ */ + if (ev->depth == 32 && + rte_lpm_lookup(ipv4_routes, ev->ip, &nh) == 0 && nh & PROXY_ADDR) + return 1; + + if (ev->gateway) { + nh = find_add_gateway(ev->port_id, ev->gateway); + if (nh != -1U) + rte_lpm_add(ipv4_routes, ev->ip, ev->depth, + nh << 2 | GW_ROUTE); + else + RTE_LOG(WARNING, L3FWD, "No free slot in GW array\n"); + } else + rte_lpm_add(ipv4_routes, ev->ip, ev->depth, + ev->port_id << 2 | HOST_ROUTE); + return 1; +} + +static +int addr_del(const struct rte_ifpx_addr_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 address removed from port %d: %s\n", + ev->port_id, buf); + } + rte_lpm_delete(ipv4_routes, ev->ip, 32); + return 1; +} + +static +int route_del(const struct rte_ifpx_route_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 route removed from port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + rte_lpm_delete(ipv4_routes, ev->ip, ev->depth); + return 1; +} + +static +int addr6_add(const struct rte_ifpx_addr6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 address for port %d: %s\n", + ev->port_id, buf); + } + rte_lpm6_add(ipv6_routes, ev->ip, 128, + ev->port_id << 2 | PROXY_ADDR); + return 1; +} + +static +int route6_add(const struct rte_ifpx_route6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + /* See comment in route_add(). 
*/ + uint32_t nh; + if (ev->depth == 128 && + rte_lpm6_lookup(ipv6_routes, ev->ip, &nh) == 0 && nh & PROXY_ADDR) + return 1; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 route for port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + /* no valid IPv6 address starts with 0x00 */ + if (ev->gateway[0]) { + nh = find_add_gateway6(ev->port_id, ev->gateway); + if (nh != -1U) + rte_lpm6_add(ipv6_routes, ev->ip, ev->depth, + nh << 2 | GW_ROUTE); + else + RTE_LOG(WARNING, L3FWD, "No free slot in GW6 array\n"); + } else + rte_lpm6_add(ipv6_routes, ev->ip, ev->depth, + ev->port_id << 2 | HOST_ROUTE); + return 1; +} + +static +int addr6_del(const struct rte_ifpx_addr6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 address removed from port %d: %s\n", + ev->port_id, buf); + } + rte_lpm6_delete(ipv6_routes, ev->ip, 128); + return 1; +} + +static +int route6_del(const struct rte_ifpx_route6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 route removed from port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + rte_lpm6_delete(ipv6_routes, ev->ip, ev->depth); + return 1; +} + +static +int cfg_done(void) +{ + uint16_t port_id, px; + const struct rte_ifpx_info *pinfo; + + RTE_LOG(DEBUG, L3FWD, "Proxy config finished\n"); + + /* Copy MAC addresses of the proxies - to be used as src MAC during + * forwarding.
+ */ + RTE_ETH_FOREACH_DEV(port_id) { + px = rte_ifpx_proxy_get(port_id); + if (px != RTE_MAX_ETHPORTS && px != port_id) { + pinfo = rte_ifpx_info_get(px); + rte_ether_addr_copy(&pinfo->mac, + &port_mac[port_id].mac.addr); + port_mac[port_id].mac.valid = 1; + } + } + + ifpx_ready = 1; + return 1; +} + +static +struct rte_ifpx_callbacks ifpx_callbacks = { + .mac_change = mac_change, +#if 0 + .mtu_change = mtu_change, +#endif + .link_change = link_change, + .addr_add = addr_add, + .addr_del = addr_del, + .addr6_add = addr6_add, + .addr6_del = addr6_del, + .route_add = route_add, + .route_del = route_del, + .route6_add = route6_add, + .route6_del = route6_del, + .cfg_done = cfg_done, +}; + +int init_if_proxy(void) +{ +// uint32_t ports_checked = 0; +// uint16_t port_id = 0xFFFF; +// struct rte_eth_link link; + char buf[16]; + unsigned i; + +#if 0 + /* Synchronize link statuses with statuses of their proxies. */ + do { + /* enabled_port_mask is non-zero so this loop is safe */ + do + ++port_id; + while ((enabled_port_mask & (1u << port_id)) == 0); + memset(&link, 0, sizeof(link)); + if (rte_eth_link_get_nowait(port_id, &link) < 0) { + RTE_LOG(ERR, L3FWD, + "Failed to get link state for port: %d\n", + port_id); + return -1; + } + if (rte_ifpx_is_proxy(port_id)) { + if (link.link_status == ETH_LINK_UP) + active_port_mask |= 1u << port_id; + } else { + if (active_port_mask & 1 << rte_ifpx_proxy_get(port_id)) { + rte_eth_dev_set_link_up(port_id); + active_port_mask |= 1u << port_id; + } else + rte_eth_dev_set_link_down(port_id); + } + ports_checked |= 1u << port_id; + } while (ports_checked != enabled_port_mask); +#endif + + rte_ifpx_callbacks_register(&ifpx_callbacks); + + RTE_LCORE_FOREACH(i) { + if (lcore_conf[i].n_rx_queue == 0) + continue; + snprintf(buf, sizeof(buf), "IFPX-events_%d", i); + lcore_conf[i].ev_queue = rte_ring_create(buf, 16, SOCKET_ID_ANY, + RING_F_SP_ENQ | RING_F_SC_DEQ); + if (!lcore_conf[i].ev_queue) { + RTE_LOG(ERR, L3FWD, + "Failed to create event 
queue for lcore %d\n", i); + return -1; + } + rte_ifpx_queue_add(lcore_conf[i].ev_queue); + } + + return rte_ifpx_listen(); +} + +void close_if_proxy(void) +{ + unsigned i; + + RTE_LCORE_FOREACH(i) { + if (lcore_conf[i].n_rx_queue == 0) + continue; + rte_ring_free(lcore_conf[i].ev_queue); + } + rte_ifpx_close(); +} + +void wait_for_config_done(void) +{ + while (!ifpx_ready) + rte_delay_ms(100); +} + +#ifdef DO_RFC_1812_CHECKS +static inline +int is_valid_ipv4_pkt(struct rte_ipv4_hdr *pkt, uint32_t link_len) +{ + /* From http://www.rfc-editor.org/rfc/rfc1812.txt section 5.2.2 */ + /* + * 1. The packet length reported by the Link Layer must be large + * enough to hold the minimum length legal IP datagram (20 bytes). + */ + if (link_len < sizeof(struct rte_ipv4_hdr)) + return -1; + + /* 2. The IP checksum must be correct. */ + /* this is checked in H/W */ + + /* + * 3. The IP version number must be 4. If the version number is not 4 + * then the packet may be another version of IP, such as IPng or + * ST-II. + */ + if (((pkt->version_ihl) >> 4) != 4) + return -3; + /* + * 4. The IP header length field must be large enough to hold the + * minimum length legal IP datagram (20 bytes = 5 words). + */ + if ((pkt->version_ihl & 0xf) < 5) + return -4; + + /* + * 5. The IP total length field must be large enough to hold the IP + * datagram header, whose length is specified in the IP header length + * field. 
+ */ + if (rte_cpu_to_be_16(pkt->total_length) < sizeof(struct rte_ipv4_hdr)) + return -5; + + return 0; +} +#endif + +/* Send burst of packets on an output interface */ +static inline +int send_burst(struct lcore_conf *lconf, uint16_t n, uint16_t port) +{ + struct rte_mbuf **m_table; + int ret; + uint16_t queueid; + + queueid = lconf->tx_queue_id[port]; + m_table = (struct rte_mbuf **)lconf->tx_mbufs[port].m_table; + + ret = rte_eth_tx_burst(port, queueid, m_table, n); + if (unlikely(ret < n)) { + do { + rte_pktmbuf_free(m_table[ret]); + } while (++ret < n); + } + + return 0; +} + +/* Enqueue a single packet, and send burst if queue is filled */ +static inline +int send_single_packet(struct lcore_conf *lconf, + struct rte_mbuf *m, uint16_t port) +{ + uint16_t len; + + len = lconf->tx_mbufs[port].len; + lconf->tx_mbufs[port].m_table[len] = m; + len++; + + /* enough pkts to be sent */ + if (unlikely(len == MAX_PKT_BURST)) { + send_burst(lconf, MAX_PKT_BURST, port); + len = 0; + } + + lconf->tx_mbufs[port].len = len; + return 0; +} + +static inline +int ipv4_get_destination(const struct rte_ipv4_hdr *ipv4_hdr, + struct rte_lpm *lpm, uint32_t *next_hop) +{ + return rte_lpm_lookup(lpm, + rte_be_to_cpu_32(ipv4_hdr->dst_addr), + next_hop); +} + +static inline +int ipv6_get_destination(const struct rte_ipv6_hdr *ipv6_hdr, + struct rte_lpm6 *lpm, uint32_t *next_hop) +{ + return rte_lpm6_lookup(lpm, ipv6_hdr->dst_addr, next_hop); +} + +static +uint16_t ipv4_process_pkt(struct lcore_conf *lconf, struct rte_ether_hdr *eth_hdr, + struct rte_ipv4_hdr *ipv4_hdr, uint16_t portid) +{ + union lladdr_t lladdr = { 0 }; + int i; + uint32_t ip, nh; + + /* Here we know that packet is not from proxy - this case is handled + * in the main loop - so if we fail to find destination we will direct + * it to the proxy. 
+ */ + if (ipv4_get_destination(ipv4_hdr, ipv4_routes, &nh) < 0) + return rte_ifpx_proxy_get(portid); + + if (nh & PROXY_ADDR) + return nh >> 2; + + /* Packet not to us so update src/dst MAC. */ + if (nh & GW_ROUTE) { + /* From here on nh is the index into the gateway array; it is + * kept separate from the hash position stored in i. + */ + nh >>= 2; + if (ipv4_gateways[nh].lladdr.mac.valid) + lladdr = ipv4_gateways[nh].lladdr; + else { + i = rte_hash_lookup(lconf->neigh_hash, + &ipv4_gateways[nh].ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh_map[i]; + ipv4_gateways[nh].lladdr = lladdr; + } + nh = ipv4_gateways[nh].port; + } else { + nh >>= 2; + ip = rte_be_to_cpu_32(ipv4_hdr->dst_addr); + i = rte_hash_lookup(lconf->neigh_hash, &ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh_map[i]; + } + + RTE_ASSERT(lladdr.mac.valid); + RTE_ASSERT(port_mac[nh].mac.valid); + /* dst addr */ + *(uint64_t *)&eth_hdr->d_addr = lladdr.val; + /* src addr */ + rte_ether_addr_copy(&port_mac[nh].mac.addr, &eth_hdr->s_addr); + + return nh; +} + +static +uint16_t ipv6_process_pkt(struct lcore_conf *lconf, struct rte_ether_hdr *eth_hdr, + struct rte_ipv6_hdr *ipv6_hdr, uint16_t portid) +{ + union lladdr_t lladdr = { 0 }; + int i; + uint32_t nh; + + /* Here we know that packet is not from proxy - this case is handled + * in the main loop - so if we fail to find destination we will direct + * it to the proxy. + */ + if (ipv6_get_destination(ipv6_hdr, ipv6_routes, &nh) < 0) + return rte_ifpx_proxy_get(portid); + + if (nh & PROXY_ADDR) + return nh >> 2; + + /* Packet not to us so update src/dst MAC.
*/ + if (nh & GW_ROUTE) { + /* From here on nh is the index into the gateway array; it is + * kept separate from the hash position stored in i. + */ + nh >>= 2; + if (ipv6_gateways[nh].lladdr.mac.valid) + lladdr = ipv6_gateways[nh].lladdr; + else { + i = rte_hash_lookup(lconf->neigh6_hash, + ipv6_gateways[nh].ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh6_map[i]; + ipv6_gateways[nh].lladdr = lladdr; + } + nh = ipv6_gateways[nh].port; + } else { + nh >>= 2; + i = rte_hash_lookup(lconf->neigh6_hash, ipv6_hdr->dst_addr); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh6_map[i]; + } + + RTE_ASSERT(lladdr.mac.valid); + /* dst addr */ + *(uint64_t *)&eth_hdr->d_addr = lladdr.val; + /* src addr */ + rte_ether_addr_copy(&port_mac[nh].mac.addr, &eth_hdr->s_addr); + + return nh; +} + +static __rte_always_inline +void l3fwd_lpm_simple_forward(struct rte_mbuf *m, uint16_t portid, + struct lcore_conf *lconf) +{ + struct rte_ether_hdr *eth_hdr; + uint32_t nh; + + eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *); + + if (RTE_ETH_IS_IPV4_HDR(m->packet_type)) { + /* Handle IPv4 headers.*/ + struct rte_ipv4_hdr *ipv4_hdr; + + ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *, + sizeof(*eth_hdr)); + +#ifdef DO_RFC_1812_CHECKS + /* Check to make sure the packet is valid (RFC1812) */ + if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt_len) < 0) { + rte_pktmbuf_free(m); + return; + } +#endif + nh = ipv4_process_pkt(lconf, eth_hdr, ipv4_hdr, portid); + +#ifdef DO_RFC_1812_CHECKS + /* Update time to live and header checksum */ + --(ipv4_hdr->time_to_live); + ++(ipv4_hdr->hdr_checksum); +#endif + } else if (RTE_ETH_IS_IPV6_HDR(m->packet_type)) { + /* Handle IPv6 headers.*/ + struct rte_ipv6_hdr *ipv6_hdr; + + ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *, + sizeof(*eth_hdr)); + + nh = ipv6_process_pkt(lconf, eth_hdr, ipv6_hdr, portid); + } else + /* Unhandled protocol */ + nh = rte_ifpx_proxy_get(portid); + + if (nh >= RTE_MAX_ETHPORTS || (active_port_mask & 1 << nh) == 0) + rte_pktmbuf_free(m); + else + send_single_packet(lconf, m, nh); +} +
+static inline +void l3fwd_send_packets(int nb_rx, struct rte_mbuf **pkts_burst, + uint16_t portid, struct lcore_conf *lconf) +{ + int32_t j; + + /* Prefetch first packets */ + for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *)); + + /* Prefetch and forward already prefetched packets. */ + for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) { + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[ + j + PREFETCH_OFFSET], void *)); + l3fwd_lpm_simple_forward(pkts_burst[j], portid, lconf); + } + + /* Forward remaining prefetched packets */ + for (; j < nb_rx; j++) + l3fwd_lpm_simple_forward(pkts_burst[j], portid, lconf); +} + +static +void handle_neigh_add(struct lcore_conf *lconf, + const struct rte_ifpx_neigh_change *ev) +{ + char mac[RTE_ETHER_ADDR_FMT_SIZE]; + char ip[INET_ADDRSTRLEN]; + int32_t i, a; + + i = rte_hash_add_key(lconf->neigh_hash, &ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to add IPv4 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(mac, sizeof(mac), &ev->mac); + a = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &a, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour update for port %d: %s -> %s@%d\n", + ev->port_id, ip, mac, i); + } + lconf->neigh_map[i].mac.addr = ev->mac; + lconf->neigh_map[i].mac.valid = 1; +} + +static +void handle_neigh_del(struct lcore_conf *lconf, + const struct rte_ifpx_neigh_change *ev) +{ + char ip[INET_ADDRSTRLEN]; + int32_t i, a; + + i = rte_hash_del_key(lconf->neigh_hash, &ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, + "Failed to remove IPv4 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + a = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &a, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour removal for port %d: %s\n", + ev->port_id, ip); + } + lconf->neigh_map[i].val = 0; +} + +static +void handle_neigh6_add(struct 
lcore_conf *lconf, + const struct rte_ifpx_neigh6_change *ev) +{ + char mac[RTE_ETHER_ADDR_FMT_SIZE]; + char ip[INET6_ADDRSTRLEN]; + int32_t i; + + i = rte_hash_add_key(lconf->neigh6_hash, ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to add IPv6 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(mac, sizeof(mac), &ev->mac); + inet_ntop(AF_INET6, ev->ip, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour update for port %d: %s -> %s@%d\n", + ev->port_id, ip, mac, i); + } + lconf->neigh6_map[i].mac.addr = ev->mac; + lconf->neigh6_map[i].mac.valid = 1; +} + +static +void handle_neigh6_del(struct lcore_conf *lconf, + const struct rte_ifpx_neigh6_change *ev) +{ + char ip[INET6_ADDRSTRLEN]; + int32_t i; + + i = rte_hash_del_key(lconf->neigh6_hash, ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to remove IPv6 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour removal for port %d: %s\n", + ev->port_id, ip); + } + lconf->neigh6_map[i].val = 0; +} + +static +void handle_events(struct lcore_conf *lconf) +{ + struct rte_ifpx_event *ev; + + while (rte_ring_dequeue(lconf->ev_queue, (void**)&ev) == 0) { + switch (ev->type) { + case RTE_IFPX_NEIGH_ADD: + handle_neigh_add(lconf, &ev->neigh_change); + break; + case RTE_IFPX_NEIGH_DEL: + handle_neigh_del(lconf, &ev->neigh_change); + break; + case RTE_IFPX_NEIGH6_ADD: + handle_neigh6_add(lconf, &ev->neigh6_change); + break; + case RTE_IFPX_NEIGH6_DEL: + handle_neigh6_del(lconf, &ev->neigh6_change); + break; + default: + RTE_LOG(WARNING, L3FWD, + "Unexpected event: %d\n", ev->type); + } + free(ev); + } +} + +void setup_lpm(void) +{ + struct rte_lpm6_config cfg6; + struct rte_lpm_config cfg4; + + /* create the LPM table */ + cfg4.max_rules = IPV4_L3FWD_LPM_MAX_RULES; + cfg4.number_tbl8s =
IPV4_L3FWD_LPM_NUMBER_TBL8S; + cfg4.flags = 0; + ipv4_routes = rte_lpm_create("IPV4_L3FWD_LPM", SOCKET_ID_ANY, &cfg4); + if (ipv4_routes == NULL) + rte_exit(EXIT_FAILURE, "Unable to create the l3fwd LPM table\n"); + + /* create the LPM6 table */ + cfg6.max_rules = IPV6_L3FWD_LPM_MAX_RULES; + cfg6.number_tbl8s = IPV6_L3FWD_LPM_NUMBER_TBL8S; + cfg6.flags = 0; + ipv6_routes = rte_lpm6_create("IPV6_L3FWD_LPM", SOCKET_ID_ANY, &cfg6); + if (ipv6_routes == NULL) + rte_exit(EXIT_FAILURE, "Unable to create the l3fwd LPM table\n"); +} + +static +uint32_t hash_ipv4(const void *key, uint32_t key_len __rte_unused, + uint32_t init_val) +{ + return rte_jhash_1word(*(const uint32_t*)key, init_val); +// return rte_hash_crc_4byte(*(const uint32_t*)key, init_val); +} + +static +uint32_t hash_ipv6(const void *key, uint32_t key_len __rte_unused, + uint32_t init_val) +{ + return rte_jhash_32b(key, 4, init_val); +// const uint64_t *pk = key; +// init_val = rte_hash_crc_8byte(*pk, init_val); +// return rte_hash_crc_8byte(*(pk+1), init_val); +} + +static +int setup_neigh(struct lcore_conf *lconf) +{ + char buf[16]; + struct rte_hash_parameters ipv4_hparams = { + .name = buf, + .entries = L3FWD_NEIGH_ENTRIES, + .key_len = 4, + .hash_func = hash_ipv4, + .hash_func_init_val = 0, + }; + struct rte_hash_parameters ipv6_hparams = { + .name = buf, + .entries = L3FWD_NEIGH_ENTRIES, + .key_len = 16, + .hash_func = hash_ipv6, + .hash_func_init_val = 0, + }; + + snprintf(buf, sizeof(buf), "neigh_hash-%d", rte_lcore_id()); + lconf->neigh_hash = rte_hash_create(&ipv4_hparams); + snprintf(buf, sizeof(buf), "neigh_map-%d", rte_lcore_id()); + lconf->neigh_map = rte_zmalloc(buf, + L3FWD_NEIGH_ENTRIES*sizeof(*lconf->neigh_map), + 8); + if (lconf->neigh_hash == NULL || lconf->neigh_map == NULL) { + RTE_LOG(ERR, L3FWD, + "Unable to create the l3fwd ARP/IPv4 table (lcore %d)\n", + rte_lcore_id()); + return -1; + } + + snprintf(buf, sizeof(buf), "neigh6_hash-%d", rte_lcore_id()); + lconf->neigh6_hash = 
rte_hash_create(&ipv6_hparams); + snprintf(buf, sizeof(buf), "neigh6_map-%d", rte_lcore_id()); + lconf->neigh6_map = rte_zmalloc(buf, + L3FWD_NEIGH_ENTRIES*sizeof(*lconf->neigh6_map), + 8); + if (lconf->neigh6_hash == NULL || lconf->neigh6_map == NULL) { + RTE_LOG(ERR, L3FWD, + "Unable to create the l3fwd ARP/IPv6 table (lcore %d)\n", + rte_lcore_id()); + return -1; + } + return 0; +} + +int lpm_check_ptype(int portid) +{ + int i, ret; + int ptype_l3_ipv4 = 0, ptype_l3_ipv6 = 0; + uint32_t ptype_mask = RTE_PTYPE_L3_MASK; + + ret = rte_eth_dev_get_supported_ptypes(portid, ptype_mask, NULL, 0); + if (ret <= 0) + return 0; + + uint32_t ptypes[ret]; + + ret = rte_eth_dev_get_supported_ptypes(portid, ptype_mask, ptypes, ret); + for (i = 0; i < ret; ++i) { + if (ptypes[i] & RTE_PTYPE_L3_IPV4) + ptype_l3_ipv4 = 1; + if (ptypes[i] & RTE_PTYPE_L3_IPV6) + ptype_l3_ipv6 = 1; + } + + if (ptype_l3_ipv4 == 0) + RTE_LOG(WARNING, L3FWD, + "port %d cannot parse RTE_PTYPE_L3_IPV4\n", portid); + + if (ptype_l3_ipv6 == 0) + RTE_LOG(WARNING, L3FWD, + "port %d cannot parse RTE_PTYPE_L3_IPV6\n", portid); + + if (ptype_l3_ipv4 && ptype_l3_ipv6) + return 1; + + return 0; + +} + +static inline +void lpm_parse_ptype(struct rte_mbuf *m) +{ + struct rte_ether_hdr *eth_hdr; + uint32_t packet_type = RTE_PTYPE_UNKNOWN; + uint16_t ether_type; + + eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *); + ether_type = eth_hdr->ether_type; + if (ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) + packet_type |= RTE_PTYPE_L3_IPV4_EXT_UNKNOWN; + else if (ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV6)) + packet_type |= RTE_PTYPE_L3_IPV6_EXT_UNKNOWN; + + m->packet_type = packet_type; +} + +uint16_t lpm_cb_parse_ptype(uint16_t port __rte_unused, uint16_t queue __rte_unused, + struct rte_mbuf *pkts[], uint16_t nb_pkts, + uint16_t max_pkts __rte_unused, + void *user_param __rte_unused) +{ + unsigned int i; + + if (unlikely(nb_pkts == 0)) + return nb_pkts; + rte_prefetch0(rte_pktmbuf_mtod(pkts[0], 
struct ether_hdr *)); + for (i = 0; i < (unsigned int) (nb_pkts - 1); ++i) { + rte_prefetch0(rte_pktmbuf_mtod(pkts[i+1], + struct ether_hdr *)); + lpm_parse_ptype(pkts[i]); + } + lpm_parse_ptype(pkts[i]); + + return nb_pkts; +} + +/* main processing loop */ +int lpm_main_loop(void *dummy __rte_unused) +{ + struct rte_mbuf *pkts_burst[MAX_PKT_BURST]; + unsigned lcore_id; + uint64_t prev_tsc, diff_tsc, cur_tsc; + int i, j, nb_rx; + uint16_t portid; + uint8_t queueid; + struct lcore_conf *lconf; + const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) / + US_PER_S * BURST_TX_DRAIN_US; + + prev_tsc = 0; + + lcore_id = rte_lcore_id(); + lconf = &lcore_conf[lcore_id]; + + if (setup_neigh(lconf) < 0) { + RTE_LOG(ERR, L3FWD, "lcore %u failed to setup its ARP tables\n", + lcore_id); + return 0; + } + + if (lconf->n_rx_queue == 0) { + RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id); + return 0; + } + + RTE_LOG(INFO, L3FWD, "entering main loop on lcore %u\n", lcore_id); + + for (i = 0; i < lconf->n_rx_queue; i++) { + + portid = lconf->rx_queue_list[i].port_id; + queueid = lconf->rx_queue_list[i].queue_id; + RTE_LOG(INFO, L3FWD, + " -- lcoreid=%u portid=%u rxqueueid=%hhu\n", + lcore_id, portid, queueid); + } + + while (!force_quit) { + + cur_tsc = rte_rdtsc(); + /* + * TX burst and event queue drain + */ + diff_tsc = cur_tsc - prev_tsc; + if (unlikely(diff_tsc % drain_tsc == 0)) { + + for (i = 0; i < lconf->n_tx_port; ++i) { + portid = lconf->tx_port_id[i]; + if (lconf->tx_mbufs[portid].len == 0) + continue; + send_burst(lconf, + lconf->tx_mbufs[portid].len, + portid); + lconf->tx_mbufs[portid].len = 0; + } + + if (diff_tsc > EV_QUEUE_DRAIN * drain_tsc) { + if (lconf->ev_queue && + !rte_ring_empty(lconf->ev_queue)) + handle_events(lconf); + prev_tsc = cur_tsc; + } + } + + /* + * Read packet from RX queues + */ + for (i = 0; i < lconf->n_rx_queue; ++i) { + portid = lconf->rx_queue_list[i].port_id; + queueid = lconf->rx_queue_list[i].queue_id; + nb_rx = 
rte_eth_rx_burst(portid, queueid, pkts_burst, + MAX_PKT_BURST); + if (nb_rx == 0) + continue; + /* If current queue is from proxy interface then there + * is no need to figure out destination port - just + * forward it to the bound port. + */ + if (unlikely(lconf->rx_queue_list[i].dst_port != + RTE_MAX_ETHPORTS)) { + for (j = 0; j < nb_rx; ++j) + send_single_packet(lconf, pkts_burst[j], + lconf->rx_queue_list[i].dst_port); + } else + l3fwd_send_packets(nb_rx, pkts_burst, portid, lconf); + } + } + + return 0; +} diff --git a/examples/l3fwd-ifpx/l3fwd.h b/examples/l3fwd-ifpx/l3fwd.h new file mode 100644 index 000000000..fc60078c5 --- /dev/null +++ b/examples/l3fwd-ifpx/l3fwd.h @@ -0,0 +1,98 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. + */ + +#ifndef __L3_FWD_H__ +#define __L3_FWD_H__ + +#include <stdbool.h> + +#include <rte_ethdev.h> +#include <rte_log.h> +#include <rte_hash.h> + +#define RTE_LOGTYPE_L3FWD RTE_LOGTYPE_USER1 + +#define MAX_PKT_BURST 32 +#define BURST_TX_DRAIN_US 100 /* TX drain every ~100us */ +#define EV_QUEUE_DRAIN 5 /* Check event queue every 5 TX drains */ + +#define MAX_RX_QUEUE_PER_LCORE 16 + +/* + * Try to avoid TX buffering if we have at least MAX_TX_BURST packets to send. + */ +#define MAX_TX_BURST (MAX_PKT_BURST / 2) + +/* Configure how many packets ahead to prefetch, when reading packets */ +#define PREFETCH_OFFSET 3 + +/* Hash parameters. 
*/ +#ifdef RTE_ARCH_64 +/* default to 4 million hash entries (approx) */ +#define L3FWD_HASH_ENTRIES (1024*1024*4) +#else +/* 32-bit has less address-space for hugepage memory, limit to 1M entries */ +#define L3FWD_HASH_ENTRIES (1024*1024*1) +#endif +#define HASH_ENTRY_NUMBER_DEFAULT 4 +/* Default ARP table size */ +#define L3FWD_NEIGH_ENTRIES 1024 + +union lladdr_t { + uint64_t val; + struct { + struct rte_ether_addr addr; + uint16_t valid; + } mac; +}; + +struct mbuf_table { + uint16_t len; + struct rte_mbuf *m_table[MAX_PKT_BURST]; +}; + +struct lcore_rx_queue { + uint16_t port_id; + uint16_t dst_port; + uint8_t queue_id; +} __rte_cache_aligned; + +struct lcore_conf { + uint16_t n_rx_queue; + struct lcore_rx_queue rx_queue_list[MAX_RX_QUEUE_PER_LCORE]; + uint16_t n_tx_port; + uint16_t tx_port_id[RTE_MAX_ETHPORTS]; + uint16_t tx_queue_id[RTE_MAX_ETHPORTS]; + struct mbuf_table tx_mbufs[RTE_MAX_ETHPORTS]; + struct rte_ring *ev_queue; + union lladdr_t *neigh_map; + struct rte_hash *neigh_hash; + union lladdr_t *neigh6_map; + struct rte_hash *neigh6_hash; +} __rte_cache_aligned; + +extern volatile bool force_quit; + +/* mask of enabled/active ports */ +extern uint32_t enabled_port_mask; +extern uint32_t active_port_mask; + +extern struct lcore_conf lcore_conf[RTE_MAX_LCORE]; + +int init_if_proxy(void); +void close_if_proxy(void); + +void wait_for_config_done(void); + +void setup_lpm(void); + +int lpm_check_ptype(int portid); + +uint16_t +lpm_cb_parse_ptype(uint16_t port, uint16_t queue, struct rte_mbuf *pkts[], + uint16_t nb_pkts, uint16_t max_pkts, void *user_param); + +int lpm_main_loop(__attribute__((unused)) void *dummy); + +#endif /* __L3_FWD_H__ */ diff --git a/examples/l3fwd-ifpx/main.c b/examples/l3fwd-ifpx/main.c new file mode 100644 index 000000000..ba49cae66 --- /dev/null +++ b/examples/l3fwd-ifpx/main.c @@ -0,0 +1,729 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. 
+ */ + +#include <stdlib.h> +#include <stdint.h> +#include <inttypes.h> +#include <sys/types.h> +#include <string.h> +#include <sys/queue.h> +#include <stdarg.h> +#include <errno.h> +#include <getopt.h> +#include <signal.h> +#include <stdbool.h> + +#include <rte_byteorder.h> +#include <rte_memory.h> +#include <rte_memcpy.h> +#include <rte_eal.h> +#include <rte_launch.h> +#include <rte_atomic.h> +#include <rte_cycles.h> +#include <rte_prefetch.h> +#include <rte_lcore.h> +#include <rte_per_lcore.h> +#include <rte_branch_prediction.h> +#include <rte_interrupts.h> +#include <rte_random.h> +#include <rte_debug.h> +#include <rte_ether.h> +#include <rte_ethdev.h> +#include <rte_mempool.h> +#include <rte_mbuf.h> +#include <rte_ip.h> +#include <rte_tcp.h> +#include <rte_udp.h> +#include <rte_string_fns.h> +#include <rte_cpuflags.h> +#include <rte_if_proxy.h> + +#include <cmdline_parse.h> +#include <cmdline_parse_etheraddr.h> + +#include "l3fwd.h" + +/* + * Configurable number of RX/TX ring descriptors + */ +#define RTE_TEST_RX_DESC_DEFAULT 1024 +#define RTE_TEST_TX_DESC_DEFAULT 1024 + +#define MAX_TX_QUEUE_PER_PORT RTE_MAX_ETHPORTS +#define MAX_RX_QUEUE_PER_PORT 128 + +#define MAX_LCORE_PARAMS 1024 + +/* Static global variables used within this file. */ +static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT; +static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT; + +/**< Ports set in promiscuous mode off by default. */ +static int promiscuous_on; + +/* Global variables. 
*/ + +static int parse_ptype; /**< Parse packet type using rx callback, and */ + /**< disabled by default */ + +volatile bool force_quit; + +/* mask of enabled/active ports */ +uint32_t enabled_port_mask; +uint32_t active_port_mask; + +struct lcore_conf lcore_conf[RTE_MAX_LCORE]; + +struct lcore_params { + uint16_t port_id; + uint8_t queue_id; + uint8_t lcore_id; +} __rte_cache_aligned; + +static struct lcore_params lcore_params[MAX_LCORE_PARAMS]; +static struct lcore_params lcore_params_default[] = { + {0, 0, 2}, + {0, 1, 2}, + {0, 2, 2}, + {1, 0, 2}, + {1, 1, 2}, + {1, 2, 2}, + {2, 0, 2}, + {3, 0, 3}, + {3, 1, 3}, +}; + +static uint16_t nb_lcore_params; + +static struct rte_eth_conf port_conf = { + .rxmode = { + .mq_mode = ETH_MQ_RX_RSS, + .max_rx_pkt_len = RTE_ETHER_MAX_LEN, + .split_hdr_size = 0, + .offloads = DEV_RX_OFFLOAD_CHECKSUM, + }, + .rx_adv_conf = { + .rss_conf = { + .rss_key = NULL, + .rss_hf = ETH_RSS_IP, + }, + }, + .txmode = { + .mq_mode = ETH_MQ_TX_NONE, + }, +}; + +static struct rte_mempool *pktmbuf_pool; + +static int +check_lcore_params(void) +{ + uint8_t queue, lcore; + uint16_t i, port_id; + int socketid; + + for (i = 0; i < nb_lcore_params; ++i) { + queue = lcore_params[i].queue_id; + if (queue >= MAX_RX_QUEUE_PER_PORT) { + RTE_LOG(ERR, L3FWD, "Invalid queue number: %hhu\n", + queue); + return -1; + } + lcore = lcore_params[i].lcore_id; + if (!rte_lcore_is_enabled(lcore)) { + RTE_LOG(ERR, L3FWD, "lcore %hhu is not enabled " + "in lcore mask\n", lcore); + return -1; + } + port_id = lcore_params[i].port_id; + if ((enabled_port_mask & (1 << port_id)) == 0) { + RTE_LOG(ERR, L3FWD, "port %u is not enabled " + "in port mask\n", port_id); + return -1; + } + if (!rte_eth_dev_is_valid_port(port_id)) { + RTE_LOG(ERR, L3FWD, "port %u is not present " + "on the board\n", port_id); + return -1; + } + if ((socketid = rte_lcore_to_socket_id(lcore)) != 0) { + RTE_LOG(WARNING, L3FWD, "lcore %hhu is on socket %d with " + "numa off\n", lcore, socketid); + } + 
} + return 0; +} + +static int +add_proxies(void) +{ + uint16_t i, p, port_id, proxy_id; + + for (i = 0, p = nb_lcore_params; i < nb_lcore_params; ++i) { + if (p >= RTE_DIM(lcore_params)) { + RTE_LOG(ERR, L3FWD, "Not enough room in lcore_params " + "to add proxy\n"); + return -1; + } + port_id = lcore_params[i].port_id; + if (rte_ifpx_proxy_get(port_id) != RTE_MAX_ETHPORTS) + continue; + + proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT); + if (proxy_id == RTE_MAX_ETHPORTS) { + RTE_LOG(ERR, L3FWD, "Failed to crate proxy\n"); + return -1; + } + rte_ifpx_port_bind(port_id, proxy_id); + /* mark proxy as enabled - the corresponding port is, since we + * are after checking of lcore_params + */ + enabled_port_mask |= 1 << proxy_id; + lcore_params[p].port_id = proxy_id; + lcore_params[p].lcore_id = lcore_params[i].lcore_id; + lcore_params[p].queue_id = lcore_params[i].queue_id; + ++p; + } + + nb_lcore_params = p; + return 0; +} + +static uint8_t +get_port_n_rx_queues(const uint16_t port) +{ + int queue = -1; + uint16_t i; + + for (i = 0; i < nb_lcore_params; ++i) { + if (lcore_params[i].port_id == port) { + if (lcore_params[i].queue_id == queue+1) + queue = lcore_params[i].queue_id; + else + rte_exit(EXIT_FAILURE, "queue ids of the port %d must be" + " in sequence and must start with 0\n", + lcore_params[i].port_id); + } + } + return (uint8_t)(++queue); +} + +static int +init_lcore_rx_queues(void) +{ + uint16_t i, p, nb_rx_queue; + uint8_t lcore; + struct lcore_rx_queue *rq; + + for (i = 0; i < nb_lcore_params; ++i) { + lcore = lcore_params[i].lcore_id; + nb_rx_queue = lcore_conf[lcore].n_rx_queue; + if (nb_rx_queue >= MAX_RX_QUEUE_PER_LCORE) { + RTE_LOG(ERR, L3FWD, "too many queues (%u) for lcore: %u\n", + (unsigned)nb_rx_queue + 1, (unsigned)lcore); + return -1; + } + rq = &lcore_conf[lcore].rx_queue_list[nb_rx_queue]; + rq->port_id = lcore_params[i].port_id; + rq->queue_id = lcore_params[i].queue_id; + if (rte_ifpx_is_proxy(lcore_params[i].port_id)) { + if 
(rte_ifpx_port_get(lcore_params[i].port_id, &p, 1) > 0) + rq->dst_port = p; + else + RTE_LOG(WARNING, L3FWD, + "Found proxy that has no port bound\n"); + } else + rq->dst_port = RTE_MAX_ETHPORTS; + lcore_conf[lcore].n_rx_queue++; + } + return 0; +} + +/* display usage */ +static void +print_usage(const char *prgname) +{ + fprintf(stderr, "%s [EAL options] --" + " -p PORTMASK" + " [-P]" + " --config (port,queue,lcore)[,(port,queue,lcore)]" + " [--ipv6]" + " [--parse-ptype]" + + " -p PORTMASK: Hexadecimal bitmask of ports to configure\n" + " -P : Enable promiscuous mode\n" + " --config (port,queue,lcore): Rx queue configuration\n" + " --ipv6: Set if running ipv6 packets\n" + " --parse-ptype: Set to use software to analyze packet type\n", + prgname); +} + +static int +parse_portmask(const char *portmask) +{ + char *end = NULL; + unsigned long pm; + + /* parse hexadecimal string */ + pm = strtoul(portmask, &end, 16); + if ((portmask[0] == '\0') || (end == NULL) || (*end != '\0')) + return -1; + + if (pm == 0) + return -1; + + return pm; +} + +static int +parse_config(const char *q_arg) +{ + char s[256]; + const char *p, *p0 = q_arg; + char *end; + enum fieldnames { + FLD_PORT = 0, + FLD_QUEUE, + FLD_LCORE, + _NUM_FLD + }; + unsigned long int_fld[_NUM_FLD]; + char *str_fld[_NUM_FLD]; + int i; + unsigned size; + + nb_lcore_params = 0; + + while ((p = strchr(p0,'(')) != NULL) { + ++p; + if((p0 = strchr(p,')')) == NULL) + return -1; + + size = p0 - p; + if(size >= sizeof(s)) + return -1; + + snprintf(s, sizeof(s), "%.*s", size, p); + if (rte_strsplit(s, sizeof(s), str_fld, _NUM_FLD, ',') != _NUM_FLD) + return -1; + for (i = 0; i < _NUM_FLD; i++){ + errno = 0; + int_fld[i] = strtoul(str_fld[i], &end, 0); + if (errno != 0 || end == str_fld[i] || int_fld[i] > 255) + return -1; + } + if (nb_lcore_params >= MAX_LCORE_PARAMS) { + RTE_LOG(ERR, L3FWD, "exceeded max number of lcore " + "params: %hu\n", nb_lcore_params); + return -1; + } + lcore_params[nb_lcore_params].port_id = + 
(uint8_t)int_fld[FLD_PORT]; + lcore_params[nb_lcore_params].queue_id = + (uint8_t)int_fld[FLD_QUEUE]; + lcore_params[nb_lcore_params].lcore_id = + (uint8_t)int_fld[FLD_LCORE]; + ++nb_lcore_params; + } + return 0; +} + +#define MAX_JUMBO_PKT_LEN 9600 +#define MEMPOOL_CACHE_SIZE 256 + +static const char short_options[] = + "p:" /* portmask */ + "P" /* promiscuous */ + "L" /* enable long prefix match */ + "E" /* enable exact match */ + ; + +#define CMD_LINE_OPT_CONFIG "config" +#define CMD_LINE_OPT_IPV6 "ipv6" +#define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype" +enum { + /* long options mapped to a short option */ + + /* first long only option value must be >= 256, so that we won't + * conflict with short options */ + CMD_LINE_OPT_MIN_NUM = 256, + CMD_LINE_OPT_CONFIG_NUM, + CMD_LINE_OPT_PARSE_PTYPE_NUM, +}; + +static const struct option lgopts[] = { + {CMD_LINE_OPT_CONFIG, 1, 0, CMD_LINE_OPT_CONFIG_NUM}, + {CMD_LINE_OPT_PARSE_PTYPE, 0, 0, CMD_LINE_OPT_PARSE_PTYPE_NUM}, + {NULL, 0, 0, 0} +}; + +/* + * This expression is used to calculate the number of mbufs needed + * depending on user input, taking into account memory for rx and + * tx hardware rings, cache per lcore and mtable per port per lcore. + * RTE_MAX is used to ensure that NB_MBUF never goes below a minimum + * value of 8192 + */ +#define NB_MBUF(nports) RTE_MAX( \ + (nports*nb_rx_queue*nb_rxd + \ + nports*nb_lcores*MAX_PKT_BURST + \ + nports*n_tx_queue*nb_txd + \ + nb_lcores*MEMPOOL_CACHE_SIZE), \ + (unsigned)8192) + +/* Parse the argument given in the command line of the application */ +static int +parse_args(int argc, char **argv) +{ + int opt, ret; + char **argvopt; + int option_index; + char *prgname = argv[0]; + + argvopt = argv; + + /* Error or normal output strings. 
*/ + while ((opt = getopt_long(argc, argvopt, short_options, + lgopts, &option_index)) != EOF) { + + switch (opt) { + /* portmask */ + case 'p': + enabled_port_mask = parse_portmask(optarg); + if (enabled_port_mask == 0) { + RTE_LOG(ERR, L3FWD, "Invalid portmask\n"); + print_usage(prgname); + return -1; + } + break; + + case 'P': + promiscuous_on = 1; + break; + + /* long options */ + case CMD_LINE_OPT_CONFIG_NUM: + ret = parse_config(optarg); + if (ret) { + RTE_LOG(ERR, L3FWD, "Invalid config\n"); + print_usage(prgname); + return -1; + } + break; + + case CMD_LINE_OPT_PARSE_PTYPE_NUM: + RTE_LOG(INFO, L3FWD, "soft parse-ptype is enabled\n"); + parse_ptype = 1; + break; + + default: + print_usage(prgname); + return -1; + } + } + + if (nb_lcore_params == 0) { + memcpy(lcore_params, lcore_params_default, + sizeof(lcore_params_default)); + nb_lcore_params = RTE_DIM(lcore_params_default); + } + + if (optind >= 0) + argv[optind-1] = prgname; + + ret = optind-1; + optind = 1; /* reset getopt lib */ + return ret; +} + +static void +signal_handler(int signum) +{ + if (signum == SIGINT || signum == SIGTERM) { + RTE_LOG(NOTICE, L3FWD, "\n\n" + "Signal %d received, preparing to exit...\n", signum); + force_quit = true; + } +} + +static int +prepare_ptype_parser(uint16_t portid, uint16_t queueid) +{ + if (parse_ptype) { + RTE_LOG(INFO, L3FWD, "Port %d: softly parse packet type info\n", + portid); + if (rte_eth_add_rx_callback(portid, queueid, + lpm_cb_parse_ptype, + NULL)) + return 1; + + RTE_LOG(ERR, L3FWD, "Failed to add rx callback: port=%d\n", + portid); + return 0; + } + + if (lpm_check_ptype(portid)) + return 1; + + RTE_LOG(ERR, L3FWD, "port %d cannot parse packet type, please add --%s\n", + portid, CMD_LINE_OPT_PARSE_PTYPE); + return 0; +} + +int +main(int argc, char **argv) +{ + struct lcore_conf *lconf; + struct rte_eth_dev_info dev_info; + struct rte_eth_txconf *txconf; + int ret; + unsigned nb_ports; + uint32_t nb_mbufs; + uint16_t queueid, portid; + unsigned 
lcore_id; + uint32_t nb_tx_queue, nb_lcores; + uint8_t nb_rx_queue, queue; + + /* init EAL */ + ret = rte_eal_init(argc, argv); + if (ret < 0) + rte_exit(EXIT_FAILURE, "Invalid EAL parameters\n"); + argc -= ret; + argv += ret; + + force_quit = false; + signal(SIGINT, signal_handler); + signal(SIGTERM, signal_handler); + + /* parse application arguments (after the EAL ones) */ + ret = parse_args(argc, argv); + if (ret < 0) + rte_exit(EXIT_FAILURE, "Invalid L3FWD parameters\n"); + + if (check_lcore_params() < 0) + rte_exit(EXIT_FAILURE, "check_lcore_params failed\n"); + + if (add_proxies() < 0) + rte_exit(EXIT_FAILURE, "add_proxies failed\n"); + + ret = init_lcore_rx_queues(); + if (ret < 0) + rte_exit(EXIT_FAILURE, "init_lcore_rx_queues failed\n"); + + nb_ports = rte_eth_dev_count_avail(); + + nb_lcores = rte_lcore_count(); + + /* Initial number of mbufs in pool - the amount required for hardware + * rx/tx rings will be added during configuration of ports. + */ + nb_mbufs = nb_ports * nb_lcores * MAX_PKT_BURST + /* mbuf tables */ + nb_lcores * MEMPOOL_CACHE_SIZE; /* mempool per lcore cache */ + + /* Init the lookup structures. 
*/ + setup_lpm(); + + /* initialize all ports (including proxies) */ + RTE_ETH_FOREACH_DEV(portid) { + struct rte_eth_conf local_port_conf = port_conf; + + /* skip ports that are not enabled */ + if ((enabled_port_mask & (1 << portid)) == 0) { + RTE_LOG(INFO, L3FWD, "Skipping disabled port %d\n", + portid); + continue; + } + + /* init port */ + RTE_LOG(INFO, L3FWD, "Initializing port %d ...\n", portid); + + nb_rx_queue = get_port_n_rx_queues(portid); + nb_tx_queue = nb_lcores; + + ret = rte_eth_dev_info_get(portid, &dev_info); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "Error during getting device (port %u) info: %s\n", + portid, strerror(-ret)); + if (nb_rx_queue > dev_info.max_rx_queues || + nb_tx_queue > dev_info.max_tx_queues) + rte_exit(EXIT_FAILURE, + "Port %d cannot configure enough queues\n", + portid); + + RTE_LOG(INFO, L3FWD, "Creating queues: nb_rxq=%d nb_txq=%u...\n", + nb_rx_queue, nb_tx_queue); + + if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE) + local_port_conf.txmode.offloads |= + DEV_TX_OFFLOAD_MBUF_FAST_FREE; + + local_port_conf.rx_adv_conf.rss_conf.rss_hf &= + dev_info.flow_type_rss_offloads; + if (local_port_conf.rx_adv_conf.rss_conf.rss_hf != + port_conf.rx_adv_conf.rss_conf.rss_hf) { + RTE_LOG(INFO, L3FWD, + "Port %u modified RSS hash function based on hardware support," + "requested:%#"PRIx64" configured:%#"PRIx64"\n", + portid, port_conf.rx_adv_conf.rss_conf.rss_hf, + local_port_conf.rx_adv_conf.rss_conf.rss_hf); + } + + ret = rte_eth_dev_configure(portid, nb_rx_queue, + (uint16_t)nb_tx_queue, &local_port_conf); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "Cannot configure device: err=%d, port=%d\n", + ret, portid); + + ret = rte_eth_dev_adjust_nb_rx_tx_desc(portid, &nb_rxd, + &nb_txd); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "Cannot adjust number of descriptors: err=%d, " + "port=%d\n", ret, portid); + + nb_mbufs += nb_rx_queue * nb_rxd + nb_tx_queue * nb_txd; + /* init one TX queue per couple (lcore,port) */ + queueid = 0; + 
for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + + RTE_LOG(INFO, L3FWD, "\ttxq=%u,%d\n", lcore_id, queueid); + + txconf = &dev_info.default_txconf; + txconf->offloads = local_port_conf.txmode.offloads; + ret = rte_eth_tx_queue_setup(portid, queueid, nb_txd, + SOCKET_ID_ANY, txconf); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_tx_queue_setup: err=%d, " + "port=%d\n", ret, portid); + + lconf = &lcore_conf[lcore_id]; + lconf->tx_queue_id[portid] = queueid; + queueid++; + + lconf->tx_port_id[lconf->n_tx_port] = portid; + lconf->n_tx_port++; + } + RTE_LOG(INFO, L3FWD, "\n"); + } + + /* Init pkt pool. */ + pktmbuf_pool = rte_pktmbuf_pool_create("mbuf_pool", + rte_align32prevpow2(nb_mbufs), MEMPOOL_CACHE_SIZE, + 0, RTE_MBUF_DEFAULT_BUF_SIZE, SOCKET_ID_ANY); + if (pktmbuf_pool == NULL) + rte_exit(EXIT_FAILURE, "Cannot init mbuf pool\n"); + + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + lconf = &lcore_conf[lcore_id]; + RTE_LOG(INFO, L3FWD, "Initializing rx queues on lcore %u ...\n", + lcore_id ); + /* init RX queues */ + for(queue = 0; queue < lconf->n_rx_queue; ++queue) { + struct rte_eth_rxconf rxq_conf; + + portid = lconf->rx_queue_list[queue].port_id; + queueid = lconf->rx_queue_list[queue].queue_id; + + RTE_LOG(INFO, L3FWD, "\trxq=%d,%d\n", portid, queueid); + + ret = rte_eth_dev_info_get(portid, &dev_info); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "Error during getting device (port %u) info: %s\n", + portid, strerror(-ret)); + + rxq_conf = dev_info.default_rxconf; + rxq_conf.offloads = port_conf.rxmode.offloads; + ret = rte_eth_rx_queue_setup(portid, queueid, + nb_rxd, SOCKET_ID_ANY, + &rxq_conf, + pktmbuf_pool); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_rx_queue_setup: err=%d, port=%d\n", + ret, portid); + } + } + + RTE_LOG(INFO, L3FWD, "\n"); + + /* start ports */ + RTE_ETH_FOREACH_DEV(portid) { + if 
((enabled_port_mask & (1 << portid)) == 0) { + continue; + } + /* Start device */ + ret = rte_eth_dev_start(portid); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_dev_start: err=%d, port=%d\n", + ret, portid); + + /* + * If enabled, put device in promiscuous mode. + * This allows IO forwarding mode to forward packets + * to itself through 2 cross-connected ports of the + * target machine. + */ + if (promiscuous_on) { + ret = rte_eth_promiscuous_enable(portid); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "rte_eth_promiscuous_enable: err=%s, port=%u\n", + rte_strerror(-ret), portid); + } + } + /* we've managed to start all enabled ports so active == enabled */ + active_port_mask = enabled_port_mask; + + RTE_LOG(INFO, L3FWD, "\n"); + + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + lconf = &lcore_conf[lcore_id]; + for (queue = 0; queue < lconf->n_rx_queue; ++queue) { + portid = lconf->rx_queue_list[queue].port_id; + queueid = lconf->rx_queue_list[queue].queue_id; + if (prepare_ptype_parser(portid, queueid) == 0) + rte_exit(EXIT_FAILURE, "ptype check fails\n"); + } + } + + if (init_if_proxy() < 0) + rte_exit(EXIT_FAILURE, "Failed to configure proxy lib\n"); + wait_for_config_done(); + + ret = 0; + /* launch per-lcore init on every lcore */ + rte_eal_mp_remote_launch(lpm_main_loop, NULL, CALL_MASTER); + RTE_LCORE_FOREACH_SLAVE(lcore_id) { + if (rte_eal_wait_lcore(lcore_id) < 0) { + ret = -1; + break; + } + } + + /* stop ports */ + RTE_ETH_FOREACH_DEV(portid) { + if ((enabled_port_mask & (1 << portid)) == 0) + continue; + RTE_LOG(INFO, L3FWD, "Closing port %d...", portid); + rte_eth_dev_stop(portid); + rte_eth_dev_close(portid); + rte_log(RTE_LOG_INFO, RTE_LOGTYPE_L3FWD, " Done\n"); + } + + close_if_proxy(); + RTE_LOG(INFO, L3FWD, "Bye...\n"); + + return ret; +} diff --git a/examples/l3fwd-ifpx/meson.build b/examples/l3fwd-ifpx/meson.build new file mode 100644 index 000000000..f0c0920b8 --- 
/dev/null
+++ b/examples/l3fwd-ifpx/meson.build
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2020 Marvell International Ltd.
+
+# meson file, for building this example as part of a main DPDK build.
+#
+# To build this example as a standalone application with an already-installed
+# DPDK instance, use 'make'
+
+allow_experimental_apis = true
+deps += ['hash', 'lpm', 'if_proxy']
+sources = files('l3fwd.c', 'main.c')
diff --git a/examples/meson.build b/examples/meson.build
index 1f2b6f516..319d765eb 100644
--- a/examples/meson.build
+++ b/examples/meson.build
@@ -23,7 +23,7 @@ all_examples = [
 	'l2fwd', 'l2fwd-cat', 'l2fwd-event',
 	'l2fwd-crypto', 'l2fwd-jobstats',
 	'l2fwd-keepalive', 'l3fwd',
-	'l3fwd-acl', 'l3fwd-power',
+	'l3fwd-acl', 'l3fwd-ifpx', 'l3fwd-power',
 	'link_status_interrupt',
 	'multi_process/client_server_mp/mp_client',
 	'multi_process/client_server_mp/mp_server',
-- 
2.17.1

^ permalink raw reply	[flat|nested] 64+ messages in thread
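A note on the drain timing in lpm_main_loop() from this patch: the test
`diff_tsc % drain_tsc == 0` is true only when the cycle difference is an
exact multiple of drain_tsc. The comments in l3fwd.h describe the intent as
"TX drain every ~100us" and "check event queue every 5 TX drains". One way
to express that cadence is sketched below. This is not the patch's code:
`struct drain_state`, `drain_due()` and the DRAIN_* flags are hypothetical
names, and only EV_QUEUE_DRAIN is taken from l3fwd.h.

```c
#include <stdint.h>

#define EV_QUEUE_DRAIN 5 /* as in l3fwd.h: event queue every 5 TX drains */

/* Hypothetical bookkeeping, one instance per lcore. */
struct drain_state {
	uint64_t prev_tsc;   /* timestamp of the last TX drain */
	unsigned int drains; /* TX drains since the last event-queue check */
};

enum { DRAIN_NONE = 0, DRAIN_TX = 1, DRAIN_EVENTS = 2 };

/*
 * Return a bitmask of the work that is due at time cur_tsc.
 * drain_tsc is the TX drain period expressed in TSC cycles.
 */
static int
drain_due(struct drain_state *s, uint64_t cur_tsc, uint64_t drain_tsc)
{
	int due = DRAIN_NONE;

	/* Not enough cycles elapsed since the last drain: nothing to do. */
	if (cur_tsc - s->prev_tsc < drain_tsc)
		return due;

	s->prev_tsc = cur_tsc;
	due |= DRAIN_TX;

	/* Every EV_QUEUE_DRAIN-th TX drain also polls the event queue. */
	if (++s->drains >= EV_QUEUE_DRAIN) {
		s->drains = 0;
		due |= DRAIN_EVENTS;
	}
	return due;
}
```

In a loop like the one in the patch, the caller would flush the per-port TX
tables when DRAIN_TX is set and call handle_events() when DRAIN_EVENTS is set.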
* Re: [dpdk-dev] [PATCH 0/4] Introduce IF proxy library
  2020-03-06 16:41 [dpdk-dev] [PATCH 0/4] Introduce IF proxy library Andrzej Ostruszka
                   ` (3 preceding siblings ...)
  2020-03-06 16:41 ` [dpdk-dev] [PATCH 4/4] if_proxy: add example application Andrzej Ostruszka
@ 2020-03-06 17:17 ` Andrzej Ostruszka
  2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 " Andrzej Ostruszka
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 64+ messages in thread
From: Andrzej Ostruszka @ 2020-03-06 17:17 UTC (permalink / raw)
  To: dev

My apologies - I forgot to run checkpatch on the series. I will correct
the issues it reports in version 2 - in the meantime please overlook these
minor faults and comment on the rest.

With regards
Andrzej Ostruszka

^ permalink raw reply	[flat|nested] 64+ messages in thread
* [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library
  2020-03-06 16:41 [dpdk-dev] [PATCH 0/4] Introduce IF proxy library Andrzej Ostruszka
                   ` (4 preceding siblings ...)
  2020-03-06 17:17 ` [dpdk-dev] [PATCH 0/4] Introduce IF proxy library Andrzej Ostruszka
@ 2020-03-10 11:10 ` Andrzej Ostruszka
  2020-03-10 11:10   ` [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library Andrzej Ostruszka
                     ` (5 more replies)
  2020-04-16 16:11 ` [dpdk-dev] [PATCH " Stephen Hemminger
                   ` (2 subsequent siblings)
  8 siblings, 6 replies; 64+ messages in thread
From: Andrzej Ostruszka @ 2020-03-10 11:10 UTC (permalink / raw)
  To: dev

What is this useful for
=======================

Usually, when an Ethernet port is assigned to DPDK it vanishes from the
system and the user loses the ability to control it via normal
configuration utilities (e.g. those from the iproute2 package). Moreover,
by default a DPDK application is not aware of the network configuration
of the system.

To address both of these issues the application needs to:
- add some command line interface (or other mechanism) allowing for
  control of the port and its configuration
- query the status of the network configuration and monitor its changes

The purpose of this library is to help with both of these tasks (as long
as they remain in the domain of configuration available to the system).
In other words, if a DPDK application has special needs that cannot be
addressed by the normal system configuration utilities, then they need
to be solved by the application itself.

The connection between DPDK and the system is based on the existence of
ports that are visible to both DPDK and the system (like Tap, KNI and
possibly some other drivers). These ports serve as interface proxies.
Let's visualize the action of the library with the following example:

              Linux               |            DPDK
================================================================
                                  |
                                  |    +-------+       +-------+
                                  |    | Port1 |       | Port2 |
"ip link set dev tap1 mtu 1600"   |    +-------+       +-------+
                             |    |        ^              ^ ^
                             |  +------+   | mtu_change   | |
                             `->| Tap1 |---' callback     | |
                                +------+                  | |
"ip addr add 198.51.100.14 \      |                       | |
    dev tap2"                     |                       | |
                             |  +------+                  | |
                             +->| Tap2 |------------------' |
                             |  +------+ addr_add callback  |
"ip route add 198.0.2.0/24 \ |    |                         |
    dev tap2"                |    | route_add callback      |
                             `------------------------------'

So we have two ports, Port1 and Port2, that are not visible to the
system. We create two proxy interfaces (here based on the Tap driver)
and bind the ports to their proxies. When the user issues a command
changing the MTU of the Tap1 interface, the library notes this and calls
the "mtu_change" callback for Port1. Similarly, when the user adds an
IPv4 address to the Tap2 interface, the "addr_add" callback is called
for Port2, and the same happens ("route_add" callback) when a routing
rule pointing to Tap2 is configured.

Apart from callbacks, this library can notify about changes by adding
events to notification queues. See below for more information about
that and for a complete list of available callbacks.

Please note that nothing has been mentioned about forwarding of packets
between the system and DPDK. Since the proxies are normal DPDK ports you
can receive from and send to them via the usual RX/TX burst API.
However, since the library is not aware of the structure of packet
processing used by the application, it cannot forward the packets
automatically - it is the responsibility of the application to include
the proxy ports in its packet processing engine.
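To illustrate the last point: the l3fwd-ifpx example in this thread stores
a `dst_port` with every RX queue and, for queues that belong to a proxy,
sends packets straight to the bound port instead of doing a route lookup.
The sketch below models only that decision in plain C. MAX_ETHPORTS stands
in for RTE_MAX_ETHPORTS, and `pick_tx_port()`, `route_lookup` and
`toy_lookup()` are hypothetical names used for illustration.

```c
#include <stdint.h>

#define MAX_ETHPORTS 32 /* stand-in for RTE_MAX_ETHPORTS ("no port") */

/* Simplified version of the per-queue record used by the example app. */
struct rx_queue {
	uint16_t port_id;  /* port the queue belongs to */
	uint16_t dst_port; /* bound port for proxy queues, else MAX_ETHPORTS */
};

/* Hypothetical routing hook: returns a TX port or MAX_ETHPORTS. */
typedef uint16_t (*route_lookup)(uint32_t dst_ip);

/* Toy lookup for the demo: 10.0.0.0/8 goes out of port 1. */
static uint16_t
toy_lookup(uint32_t dst_ip)
{
	return (dst_ip >> 24) == 10 ? 1 : MAX_ETHPORTS;
}

/*
 * Packets read from a proxy queue bypass routing and go to the bound
 * port; packets from ordinary ports are routed.
 */
static uint16_t
pick_tx_port(const struct rx_queue *rq, uint32_t dst_ip, route_lookup lookup)
{
	if (rq->dst_port != MAX_ETHPORTS)
		return rq->dst_port;
	return lookup(dst_ip);
}
```

What to do when the lookup finds no route (drop, or hand the packet to the
proxy so that the system can decide) is a policy choice left to the
application.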
As mentioned above, the intention of the library is to:
- provide information about the network configuration that allows the
  application to decide what to do with the packets received on DPDK
  ports,
- allow for control of the ports via standard configuration utilities.

Although the library only helps you to identify the proxy for a given
port (and vice versa) and calls the appropriate callbacks, it does open
some interesting possibilities. For example, you can use the proxy ports
to forward packets for protocols that you do not wish to handle in the
DPDK application to the system protocol stack and just listen to the
configuration changes - that way you can "offload" handling of those
protocols to the system.

How to use it
=============

Usage of this library is rather simple. You have to:
1. Create a proxy (if you don't have a port suitable for being a proxy,
   or you have one but do not wish to use it as a proxy).
2. Bind the port to the proxy.
3. Register callbacks and/or event queues.
4. Start listening to the network configuration.

The only mandatory requirement for a DPDK port to be able to act as a
proxy is that it is visible in the system - this is checked during port
to proxy binding by calling rte_eth_dev_info_get() on the proxy port and
inspecting the 'if_index' field (it has to be non-zero). One can create
such a port in the application by calling:

    proxy_id = rte_ifpx_create(RTE_IFPX_DEFAULT);

Upon success this returns the id of the DPDK proxy port created
(RTE_MAX_ETHPORTS on failure). The argument selects the type of proxy
port to create (currently Tap/KNI only). This function is actually just
a wrapper around:

    uint16_t rte_ifpx_create_by_devarg(const char *devarg);

that creates a valid 'devarg' string for the chosen type of proxy. If
you have another driver capable of acting as a proxy, you can call
rte_ifpx_create_by_devarg() directly, passing the appropriate argument.
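The visibility requirement can be pictured with a small stand-alone model.
This is only a sketch of the rule stated above (a proxy must have a
non-zero 'if_index'), not the library's implementation: `struct port_rec`,
`ports_init()`, `port_bind()` and `proxy_get()` are hypothetical, and
MAX_PORTS stands in for RTE_MAX_ETHPORTS as the "no proxy" value.

```c
#include <stdint.h>

#define MAX_PORTS 32 /* stand-in for RTE_MAX_ETHPORTS */

/* Hypothetical per-port record. */
struct port_rec {
	unsigned int if_index; /* 0: port is not visible to the system */
	uint16_t proxy_id;     /* MAX_PORTS while the port is unbound */
};

/* Start with every port invisible to the system and unbound. */
static void
ports_init(struct port_rec ports[MAX_PORTS])
{
	for (uint16_t i = 0; i < MAX_PORTS; ++i) {
		ports[i].if_index = 0;
		ports[i].proxy_id = MAX_PORTS;
	}
}

/* Bind port_id to proxy_id; refuse proxies the system cannot see. */
static int
port_bind(struct port_rec ports[MAX_PORTS], uint16_t port_id,
	  uint16_t proxy_id)
{
	if (port_id >= MAX_PORTS || proxy_id >= MAX_PORTS)
		return -1;
	if (ports[proxy_id].if_index == 0)
		return -1;
	ports[port_id].proxy_id = proxy_id;
	return 0;
}

/* Return the proxy bound to port_id, or MAX_PORTS when there is none. */
static uint16_t
proxy_get(const struct port_rec ports[MAX_PORTS], uint16_t port_id)
{
	return port_id < MAX_PORTS ? ports[port_id].proxy_id : MAX_PORTS;
}
```

Returning the port-count constant for "none" mirrors the failure value
described for proxy creation, so callers need only a single comparison.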
Once you have the ids of both the port and the proxy you can bind the
two via:

    rte_ifpx_port_bind(port_id, proxy_id);

This creates a logical binding - as mentioned above there is no
automatic packet forwarding. With this binding, whenever the user
changes the state of the proxy interface in the system (link up/down,
change of MAC/MTU, add/remove of an IPv4/IPv6 address) you get the
appropriate notification for the bound port.

So far we've mentioned several times that the library calls callbacks.
They are grouped in 'struct rte_ifpx_callbacks' and the user provides
them to the library via:

    rte_ifpx_callbacks_register(&cbs);

It is worth mentioning that the context (lcore/thread) in which these
callbacks are called is implementation defined. It might differ between
platforms, so the application needs to assume that some kind of inter
lcore/thread synchronization/communication is required.

Apart from notification via callbacks, this library also supports
notifying about the changes by adding events to the configured
notification queues. The queues are registered via:

    int rte_ifpx_queue_add(struct rte_ring *r);

and the actual logic used is: if there is a callback registered then it
is called; if it returns non-zero then the event is considered
completed, otherwise the event is added to each configured notification
queue. That way the application can update data structures that are
safe to be modified by a single writer from within the callback, or do
the common preprocessing steps (if any are needed) in the callback,
while data that is replicated can be updated during handling of the
queued events.

Once we have the bindings in place and notification configured, the only
essential part that remains is to get the current network configuration
and start listening to its changes. This is accomplished via a call to:

    rte_ifpx_listen();

And basically this is all one needs to understand how to use this
library.
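The callback-then-queue rule described above can be modelled in a few
lines of plain C. The sketch uses fixed-size arrays instead of rte_ring
and hypothetical names throughout (`struct event`, `struct notifier`,
`notify()`, `consume()`, `pass_on()`), so it shows the decision logic only
and is not the library code.

```c
#include <stddef.h>

#define MAX_QUEUES 4
#define QUEUE_CAP  8

/* Hypothetical event and queue types (the library uses rte_ring). */
struct event { int type; int port_id; };
struct queue { struct event items[QUEUE_CAP]; size_t len; };

typedef int (*event_cb)(const struct event *ev);

struct notifier {
	event_cb cb;                      /* may be NULL */
	struct queue *queues[MAX_QUEUES]; /* registered notification queues */
	size_t nb_queues;
};

/* Example callbacks: non-zero means "event fully handled here". */
static int consume(const struct event *ev) { (void)ev; return 1; }
static int pass_on(const struct event *ev) { (void)ev; return 0; }

/*
 * Call the callback first; only when it is absent or returns zero is
 * the event copied to every registered queue. Returns the number of
 * queues that received the event.
 */
static size_t
notify(struct notifier *n, const struct event *ev)
{
	size_t i, queued = 0;

	if (n->cb != NULL && n->cb(ev) != 0)
		return 0;

	for (i = 0; i < n->nb_queues; ++i) {
		struct queue *q = n->queues[i];

		if (q->len < QUEUE_CAP) { /* a full queue skips the event */
			q->items[q->len++] = *ev;
			++queued;
		}
	}
	return queued;
}
```

With one queue per worker lcore, a callback that returns zero lets every
worker apply the change to its own replicated tables, which is the usage
pattern the paragraph above describes.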
Other less essential parts include:
- ability to query what events are available for a given platform
- getting the mapping between proxy and port
- unbinding the ports from a proxy
- destroying a proxy port
- closing the listening service
- getting basic information about a proxy

Currently available features and implementation
===============================================

The library's API is system independent but it obviously needs some system dependent parts. We provide an example Linux implementation (based on netlink sockets). A very similar implementation is possible for FreeBSD (using PF_ROUTE sockets). A Windows implementation would need to differ considerably (the IP Helper library would probably be of some help).

Here is the list of currently implemented callbacks:

struct rte_ifpx_callbacks {
    int (*mac_change)(const struct rte_ifpx_mac_change *event);
    int (*mtu_change)(const struct rte_ifpx_mtu_change *event);
    int (*link_change)(const struct rte_ifpx_link_change *event);
    int (*addr_add)(const struct rte_ifpx_addr_change *event);
    int (*addr_del)(const struct rte_ifpx_addr_change *event);
    int (*addr6_add)(const struct rte_ifpx_addr6_change *event);
    int (*addr6_del)(const struct rte_ifpx_addr6_change *event);
    int (*route_add)(const struct rte_ifpx_route_change *event);
    int (*route_del)(const struct rte_ifpx_route_change *event);
    int (*route6_add)(const struct rte_ifpx_route6_change *event);
    int (*route6_del)(const struct rte_ifpx_route6_change *event);
    int (*neigh_add)(const struct rte_ifpx_neigh_change *event);
    int (*neigh_del)(const struct rte_ifpx_neigh_change *event);
    int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event);
    int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event);
    int (*cfg_done)(void);
};

They are all rather self-descriptive with the exception of the last one. When the user calls rte_ifpx_listen() the library first queries the system for its current configuration.
That might require several request/reply exchanges between DPDK and the system, and once it is finished this callback is called to let the application know that all info has been gathered.

It is also worth mentioning that while the typical case would be a 1-to-1 mapping between port and proxy, a 1-to-many mapping is also supported. In that case the related callbacks will be called for each port bound to the given proxy interface - it is the application's responsibility to define the semantics of such a mapping (e.g. all changes apply to all ports, or link changes apply to all but others are accepted in "round robin" fashion, or some other logic).

As mentioned above the Linux implementation is based on a netlink socket. This socket is registered as a file descriptor in EAL interrupts (similarly to how EAL alarms are implemented).

What has changed since the RFC
==============================

- Platform dependent parts have been separated into an ifpx_platform structure with callbacks for initialization, getting information about the interface, listening to the changes and closing of the library. That should allow easier reimplementation.
- Notification scheme has been changed - instead of having just callbacks, event queueing is now also available (or a mix of the two).
- Filtering of events only related to the proxy ports - previously all network configuration changes were reported. But a DPDK application doesn't need to know the whole configuration - only the portion related to the proxy ports. If a packet comes that does not match the rules then it can be forwarded via the proxy to the system to decide what to do with it. If that is not desired and such packets should be dropped then a null port can be created with a proxy and e.g. a default route installed on it.
- Removed the previous example which was just printing notifications. Instead added a simplified (stripped of vectorization and other performance improvements) version of l3fwd that should serve as an example of using this library in real applications.

Changes in V2
=============

- Cleaned up checkpatch warnings
- Removed dead/unused code and added gateway clearing in l3fwd-ifpx

With regards
Andrzej Ostruszka

Note: Patch 4 in this series has a dependency on:
https://patchwork.dpdk.org/patch/66492/
so I add this newly proposed tag here:
Depends-on: series-8862

Andrzej Ostruszka (4):
  lib: introduce IF Proxy library
  if_proxy: add library documentation
  if_proxy: add simple functionality test
  if_proxy: add example application

 MAINTAINERS                                  |    6 +
 app/test/Makefile                            |    5 +
 app/test/meson.build                         |    4 +
 app/test/test_if_proxy.c                     |  707 +++++++++++
 config/common_base                           |    5 +
 config/common_linux                          |    1 +
 doc/guides/prog_guide/if_proxy_lib.rst       |  142 +++
 doc/guides/prog_guide/index.rst              |    1 +
 examples/Makefile                            |    1 +
 examples/l3fwd-ifpx/Makefile                 |   60 +
 examples/l3fwd-ifpx/l3fwd.c                  | 1131 +++++++++++++++++
 examples/l3fwd-ifpx/l3fwd.h                  |   98 ++
 examples/l3fwd-ifpx/main.c                   |  740 +++++++++++
 examples/l3fwd-ifpx/meson.build              |   11 +
 examples/meson.build                         |    2 +-
 lib/Makefile                                 |    2 +
 .../common/include/rte_eal_interrupts.h      |    2 +
 lib/librte_eal/linux/eal/eal_interrupts.c    |   14 +-
 lib/librte_if_proxy/Makefile                 |   29 +
 lib/librte_if_proxy/if_proxy_common.c        |  494 +++++++
 lib/librte_if_proxy/if_proxy_priv.h          |   97 ++
 lib/librte_if_proxy/linux/Makefile           |    4 +
 lib/librte_if_proxy/linux/if_proxy.c         |  550 ++++++++
 lib/librte_if_proxy/meson.build              |   19 +
 lib/librte_if_proxy/rte_if_proxy.h           |  561 ++++++++
 lib/librte_if_proxy/rte_if_proxy_version.map |   19 +
 lib/meson.build                              |    2 +-
 27 files changed, 4701 insertions(+), 6 deletions(-)
 create mode 100644 app/test/test_if_proxy.c
 create mode 100644 doc/guides/prog_guide/if_proxy_lib.rst
 create mode 100644 examples/l3fwd-ifpx/Makefile
 create mode 100644 examples/l3fwd-ifpx/l3fwd.c
 create mode 100644 examples/l3fwd-ifpx/l3fwd.h
 create mode 100644 examples/l3fwd-ifpx/main.c
 create mode 100644 examples/l3fwd-ifpx/meson.build
 create mode 100644 lib/librte_if_proxy/Makefile
 create mode 100644 lib/librte_if_proxy/if_proxy_common.c
 create mode 100644 lib/librte_if_proxy/if_proxy_priv.h
 create mode 100644 lib/librte_if_proxy/linux/Makefile
 create mode 100644 lib/librte_if_proxy/linux/if_proxy.c
 create mode 100644 lib/librte_if_proxy/meson.build
 create mode 100644 lib/librte_if_proxy/rte_if_proxy.h
 create mode 100644 lib/librte_if_proxy/rte_if_proxy_version.map

--
2.17.1

* [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library
  2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 " Andrzej Ostruszka
@ 2020-03-10 11:10 ` Andrzej Ostruszka
  2020-07-02  0:34 ` Stephen Hemminger
  2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 2/4] if_proxy: add library documentation Andrzej Ostruszka
  ` (4 subsequent siblings)
  5 siblings, 1 reply; 64+ messages in thread
From: Andrzej Ostruszka @ 2020-03-10 11:10 UTC (permalink / raw)
To: dev, Thomas Monjalon

This library makes it possible to designate ports visible to the system (such as Tun/Tap or KNI) as port representors serving as proxies for other DPDK ports. When such a proxy is configured this library initially queries the network configuration from the system and later monitors its changes.

The information gathered is passed to the application either via a set of user registered callbacks or as an event added to the configured notification queue (or a combination of these two mechanisms). This way the user can use normal network utilities (like those from the iproute2 suite) to configure DPDK ports.

Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 3 + config/common_base | 5 + config/common_linux | 1 + lib/Makefile | 2 + .../common/include/rte_eal_interrupts.h | 2 + lib/librte_eal/linux/eal/eal_interrupts.c | 14 +- lib/librte_if_proxy/Makefile | 29 + lib/librte_if_proxy/if_proxy_common.c | 494 +++++++++++++++ lib/librte_if_proxy/if_proxy_priv.h | 97 +++ lib/librte_if_proxy/linux/Makefile | 4 + lib/librte_if_proxy/linux/if_proxy.c | 550 +++++++++++++++++ lib/librte_if_proxy/meson.build | 19 + lib/librte_if_proxy/rte_if_proxy.h | 561 ++++++++++++++++++ lib/librte_if_proxy/rte_if_proxy_version.map | 19 + lib/meson.build | 2 +- 15 files changed, 1797 insertions(+), 5 deletions(-) create mode 100644 lib/librte_if_proxy/Makefile create mode 100644 lib/librte_if_proxy/if_proxy_common.c create mode 100644 lib/librte_if_proxy/if_proxy_priv.h create mode 100644 lib/librte_if_proxy/linux/Makefile create mode 100644 lib/librte_if_proxy/linux/if_proxy.c create mode 100644 lib/librte_if_proxy/meson.build create mode 100644 lib/librte_if_proxy/rte_if_proxy.h create mode 100644 lib/librte_if_proxy/rte_if_proxy_version.map diff --git a/MAINTAINERS b/MAINTAINERS index f4e0ed8e0..aec7326ca 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1469,6 +1469,9 @@ F: examples/bpf/ F: app/test/test_bpf.c F: doc/guides/prog_guide/bpf_lib.rst +IF Proxy - EXPERIMENTAL +M: Andrzej Ostruszka <aostruszka@marvell.com> +F: lib/librte_if_proxy/ Test Applications ----------------- diff --git a/config/common_base b/config/common_base index 7ca2f28b1..dcc0a0650 100644 --- a/config/common_base +++ b/config/common_base @@ -1075,6 +1075,11 @@ CONFIG_RTE_LIBRTE_BPF_ELF=n # CONFIG_RTE_LIBRTE_IPSEC=y +# +# Compile librte_if_proxy +# +CONFIG_RTE_LIBRTE_IF_PROXY=n + # # Compile the test application # diff --git a/config/common_linux b/config/common_linux index 816810671..1244eb0ae 100644 --- a/config/common_linux +++ b/config/common_linux @@ -16,6 +16,7 @@ 
CONFIG_RTE_LIBRTE_VHOST_NUMA=y CONFIG_RTE_LIBRTE_VHOST_POSTCOPY=n CONFIG_RTE_LIBRTE_PMD_VHOST=y CONFIG_RTE_LIBRTE_IFC_PMD=y +CONFIG_RTE_LIBRTE_IF_PROXY=y CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y CONFIG_RTE_LIBRTE_PMD_MEMIF=y CONFIG_RTE_LIBRTE_PMD_SOFTNIC=y diff --git a/lib/Makefile b/lib/Makefile index 46b91ae1a..6a20806f1 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -118,6 +118,8 @@ DIRS-$(CONFIG_RTE_LIBRTE_TELEMETRY) += librte_telemetry DEPDIRS-librte_telemetry := librte_eal librte_metrics librte_ethdev DIRS-$(CONFIG_RTE_LIBRTE_RCU) += librte_rcu DEPDIRS-librte_rcu := librte_eal +DIRS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += librte_if_proxy +DEPDIRS-librte_if_proxy := librte_eal librte_ethdev ifeq ($(CONFIG_RTE_EXEC_ENV_LINUX),y) DIRS-$(CONFIG_RTE_LIBRTE_KNI) += librte_kni diff --git a/lib/librte_eal/common/include/rte_eal_interrupts.h b/lib/librte_eal/common/include/rte_eal_interrupts.h index 773a34a42..296a3853d 100644 --- a/lib/librte_eal/common/include/rte_eal_interrupts.h +++ b/lib/librte_eal/common/include/rte_eal_interrupts.h @@ -36,6 +36,8 @@ enum rte_intr_handle_type { RTE_INTR_HANDLE_VDEV, /**< virtual device */ RTE_INTR_HANDLE_DEV_EVENT, /**< device event handle */ RTE_INTR_HANDLE_VFIO_REQ, /**< VFIO request handle */ + RTE_INTR_HANDLE_NETLINK, /**< netlink notification handle */ + RTE_INTR_HANDLE_MAX /**< count of elements */ }; diff --git a/lib/librte_eal/linux/eal/eal_interrupts.c b/lib/librte_eal/linux/eal/eal_interrupts.c index cb8e10709..16236a8c4 100644 --- a/lib/librte_eal/linux/eal/eal_interrupts.c +++ b/lib/librte_eal/linux/eal/eal_interrupts.c @@ -680,6 +680,9 @@ rte_intr_enable(const struct rte_intr_handle *intr_handle) break; /* not used at this moment */ case RTE_INTR_HANDLE_ALARM: +#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: +#endif return -1; #ifdef VFIO_PRESENT case RTE_INTR_HANDLE_VFIO_MSIX: @@ -796,6 +799,9 @@ rte_intr_disable(const struct rte_intr_handle *intr_handle) break; /* not used at this moment */ case RTE_INTR_HANDLE_ALARM: 
+#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: +#endif return -1; #ifdef VFIO_PRESENT case RTE_INTR_HANDLE_VFIO_MSIX: @@ -889,12 +895,12 @@ eal_intr_process_interrupts(struct epoll_event *events, int nfds) break; #endif #endif - case RTE_INTR_HANDLE_VDEV: case RTE_INTR_HANDLE_EXT: - bytes_read = 0; - call = true; - break; + case RTE_INTR_HANDLE_VDEV: case RTE_INTR_HANDLE_DEV_EVENT: +#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: +#endif bytes_read = 0; call = true; break; diff --git a/lib/librte_if_proxy/Makefile b/lib/librte_if_proxy/Makefile new file mode 100644 index 000000000..43cb702a2 --- /dev/null +++ b/lib/librte_if_proxy/Makefile @@ -0,0 +1,29 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(C) 2020 Marvell International Ltd. + +include $(RTE_SDK)/mk/rte.vars.mk + +# library name +LIB = librte_if_proxy.a + +CFLAGS += -DALLOW_EXPERIMENTAL_API +CFLAGS += -O3 +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) +LDLIBS += -lrte_eal -lrte_ethdev + +EXPORT_MAP := rte_if_proxy_version.map + +LIBABIVER := 1 + +# all source are stored in SRCS-y +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) := if_proxy_common.c + +SYSDIR := $(patsubst "%app",%,$(CONFIG_RTE_EXEC_ENV)) +include $(SRCDIR)/$(SYSDIR)/Makefile + +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += $(addprefix $(SYSDIR)/,$(SRCS)) + +# install this header file +SYMLINK-$(CONFIG_RTE_LIBRTE_IF_PROXY)-include := rte_if_proxy.h + +include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_if_proxy/if_proxy_common.c b/lib/librte_if_proxy/if_proxy_common.c new file mode 100644 index 000000000..546dc7810 --- /dev/null +++ b/lib/librte_if_proxy/if_proxy_common.c @@ -0,0 +1,494 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ + +#include <if_proxy_priv.h> +#include <rte_string_fns.h> + + +/* Definitions of data mentioned in if_proxy_priv.h and local ones. 
*/ +int ifpx_log_type; + +uint16_t ifpx_ports[RTE_MAX_ETHPORTS]; + +rte_spinlock_t ifpx_lock = RTE_SPINLOCK_INITIALIZER; + +struct ifpx_proxies_head ifpx_proxies = TAILQ_HEAD_INITIALIZER(ifpx_proxies); + +struct ifpx_queue_node { + TAILQ_ENTRY(ifpx_queue_node) elem; + uint16_t state; + struct rte_ring *r; +}; +static +TAILQ_HEAD(ifpx_queues_head, ifpx_queue_node) ifpx_queues = + TAILQ_HEAD_INITIALIZER(ifpx_queues); + +/* All function pointers have the same size - so use this one to typecast + * different callbacks in rte_ifpx_callbacks and test their presence in a + * generic way. + */ +union cb_ptr_t { + int (*f_ptr)(void *ev); /* type for normal event notification */ + int (*cfg_done)(void); /* lib notification for finished config */ +}; +union { + struct rte_ifpx_callbacks cbs; + union cb_ptr_t funcs[RTE_IFPX_NUM_EVENTS]; +} ifpx_callbacks; + +uint64_t rte_ifpx_events_available(void) +{ + /* All events are supported on Linux. */ + return (1ULL << RTE_IFPX_NUM_EVENTS) - 1; +} + +uint16_t rte_ifpx_proxy_create(enum rte_ifpx_proxy_type type) +{ + char devargs[16] = { '\0' }; + int dev_cnt = 0, nlen; + uint16_t port_id; + + switch (type) { + case RTE_IFPX_DEFAULT: + case RTE_IFPX_TAP: + nlen = strlcpy(devargs, "net_tap", sizeof(devargs)); + break; + case RTE_IFPX_KNI: + nlen = strlcpy(devargs, "net_kni", sizeof(devargs)); + break; + default: + IFPX_LOG(ERR, "Unknown proxy type: %d", type); + return RTE_MAX_ETHPORTS; + } + + RTE_ETH_FOREACH_DEV(port_id) { + if (strcmp(rte_eth_devices[port_id].device->driver->name, + devargs) == 0) + ++dev_cnt; + } + snprintf(devargs+nlen, sizeof(devargs)-nlen, "%d", dev_cnt); + + return rte_ifpx_proxy_create_by_devarg(devargs); +} + +uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg) +{ + uint16_t port_id = RTE_MAX_ETHPORTS; + struct rte_dev_iterator iter; + + if (rte_dev_probe(devarg) < 0) { + IFPX_LOG(ERR, "Failed to create proxy port %s\n", devarg); + return RTE_MAX_ETHPORTS; + } + + if (rte_eth_iterator_init(&iter, 
devarg) == 0) { + port_id = rte_eth_iterator_next(&iter); + if (port_id != RTE_MAX_ETHPORTS) + rte_eth_iterator_cleanup(&iter); + } + + return port_id; +} + +int ifpx_proxy_destroy(struct ifpx_proxy_node *px) +{ + unsigned int i; + uint16_t proxy_id = px->proxy_id; + + TAILQ_REMOVE(&ifpx_proxies, px, elem); + free(px); + + /* Clear any bindings for this proxy. */ + for (i = 0; i < RTE_DIM(ifpx_ports); ++i) { + if (ifpx_ports[i] == proxy_id) { + if (i == proxy_id) /* this entry is for proxy itself */ + ifpx_ports[i] = RTE_MAX_ETHPORTS; + else + rte_ifpx_port_unbind(i); + } + } + + return rte_dev_remove(rte_eth_devices[proxy_id].device); +} + +int rte_ifpx_proxy_destroy(uint16_t proxy_id) +{ + struct ifpx_proxy_node *px; + int ec = 0; + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == proxy_id) + break; + } + if (!px) { + ec = -EINVAL; + goto exit; + } + if (px->state & IN_USE) + px->state |= DEL_PENDING; + else + ec = ifpx_proxy_destroy(px); +exit: + rte_spinlock_unlock(&ifpx_lock); + return ec; +} + +int rte_ifpx_queue_add(struct rte_ring *r) +{ + struct ifpx_queue_node *node; + int ec = 0; + + if (!r) + return -EINVAL; + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(node, &ifpx_queues, elem) { + if (node->r == r) { + ec = -EEXIST; + goto exit; + } + } + + node = malloc(sizeof(*node)); + if (!node) { + ec = -ENOMEM; + goto exit; + } + + node->r = r; + TAILQ_INSERT_TAIL(&ifpx_queues, node, elem); +exit: + rte_spinlock_unlock(&ifpx_lock); + + return ec; +} + +int rte_ifpx_queue_remove(struct rte_ring *r) +{ + struct ifpx_queue_node *node, *next; + int ec = -EINVAL; + + if (!r) + return ec; + + rte_spinlock_lock(&ifpx_lock); + for (node = TAILQ_FIRST(&ifpx_queues); node; node = next) { + next = TAILQ_NEXT(node, elem); + if (node->r != r) + continue; + TAILQ_REMOVE(&ifpx_queues, node, elem); + free(node); + ec = 0; + break; + } + rte_spinlock_unlock(&ifpx_lock); + + return ec; +} + +int rte_ifpx_port_bind(uint16_t 
port_id, uint16_t proxy_id) +{ + struct rte_eth_dev_info proxy_eth_info; + struct ifpx_proxy_node *px; + int ec; + + if (port_id >= RTE_MAX_ETHPORTS || proxy_id >= RTE_MAX_ETHPORTS || + /* port is a proxy */ + ifpx_ports[port_id] == port_id) { + IFPX_LOG(ERR, "Invalid port_id: %d", port_id); + return -EINVAL; + } + + /* Do automatic rebinding but issue a warning since this is not + * considered to be a valid behaviour. + */ + if (ifpx_ports[port_id] != RTE_MAX_ETHPORTS) { + IFPX_LOG(WARNING, "Port already bound: %d -> %d", port_id, + ifpx_ports[port_id]); + } + + /* Search for existing proxy - if not found add one to the list. */ + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == proxy_id) + break; + } + if (!px) { + ec = rte_eth_dev_info_get(proxy_id, &proxy_eth_info); + if (ec < 0 || proxy_eth_info.if_index == 0) { + IFPX_LOG(ERR, "Invalid proxy: %d", proxy_id); + rte_spinlock_unlock(&ifpx_lock); + return ec < 0 ? ec : -EINVAL; + } + px = malloc(sizeof(*px)); + if (!px) { + rte_spinlock_unlock(&ifpx_lock); + return -ENOMEM; + } + px->proxy_id = proxy_id; + px->info.if_index = proxy_eth_info.if_index; + rte_eth_dev_get_mtu(proxy_id, &px->info.mtu); + rte_eth_macaddr_get(proxy_id, &px->info.mac); + memset(px->info.if_name, 0, sizeof(px->info.if_name)); + TAILQ_INSERT_TAIL(&ifpx_proxies, px, elem); + ifpx_ports[proxy_id] = proxy_id; + } + rte_spinlock_unlock(&ifpx_lock); + ifpx_ports[port_id] = proxy_id; + + /* Add proxy MAC to the port - since port will often just forward + * packets from the proxy/system they will be sent with proxy MAC as + * src. In order to pass communication in other direction we should be + * accepting packets with proxy MAC as dst. 
+ */ + rte_eth_dev_mac_addr_add(port_id, &px->info.mac, 0); + + if (ifpx_platform.get_info) + ifpx_platform.get_info(px->info.if_index); + + return 0; +} + +int rte_ifpx_port_unbind(uint16_t port_id) +{ + if (port_id >= RTE_MAX_ETHPORTS || + ifpx_ports[port_id] == RTE_MAX_ETHPORTS || + /* port is a proxy */ + ifpx_ports[port_id] == port_id) + return -EINVAL; + + ifpx_ports[port_id] = RTE_MAX_ETHPORTS; + /* Proxy without any port bound is OK - that is the state of the proxy + * that has just been created, and it can still report routing + * information. So we do not even check if this is the case. + */ + + return 0; +} + +int rte_ifpx_callbacks_register(const struct rte_ifpx_callbacks *cbs) +{ + if (!cbs) + return -EINVAL; + + rte_spinlock_lock(&ifpx_lock); + ifpx_callbacks.cbs = *cbs; + rte_spinlock_unlock(&ifpx_lock); + + return 0; +} + +void rte_ifpx_callbacks_unregister(void) +{ + rte_spinlock_lock(&ifpx_lock); + memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs)); + rte_spinlock_unlock(&ifpx_lock); +} + +uint16_t rte_ifpx_proxy_get(uint16_t port_id) +{ + if (port_id >= RTE_MAX_ETHPORTS) + return RTE_MAX_ETHPORTS; + + return ifpx_ports[port_id]; +} + +unsigned int rte_ifpx_port_get(uint16_t proxy_id, + uint16_t *ports, unsigned int num) +{ + unsigned int p, cnt = 0; + + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] == proxy_id && ifpx_ports[p] != p) { + ++cnt; + if (ports && num > 0) { + *ports++ = p; + --num; + } + } + } + return cnt; +} + +const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id) +{ + struct ifpx_proxy_node *px; + + if (port_id >= RTE_MAX_ETHPORTS || + ifpx_ports[port_id] == RTE_MAX_ETHPORTS) + return NULL; + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == ifpx_ports[port_id]) + break; + } + rte_spinlock_unlock(&ifpx_lock); + RTE_ASSERT(px && "Internal IF Proxy library error"); + + return &px->info; +} + +static +void queue_event(const struct rte_ifpx_event *ev, 
struct rte_ring *r) +{ + struct rte_ifpx_event *e = malloc(sizeof(*ev)); + + if (!e) { + IFPX_LOG(ERR, "Failed to allocate event!"); + return; + } + RTE_ASSERT(r); + + *e = *ev; + rte_ring_sp_enqueue(r, e); +} + +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px) +{ + struct ifpx_queue_node *q; + int done = 0; + uint16_t p, proxy_id; + + if (px) { + if (px->state & DEL_PENDING) + return; + proxy_id = px->proxy_id; + RTE_ASSERT(proxy_id != RTE_MAX_ETHPORTS); + px->state |= IN_USE; + } else + proxy_id = RTE_MAX_ETHPORTS; + + RTE_ASSERT(ev); + /* This function is expected to be called with a lock held. */ + RTE_ASSERT(rte_spinlock_trylock(&ifpx_lock) == 0); + + if (ifpx_callbacks.funcs[ev->type].f_ptr) { + union cb_ptr_t cb = ifpx_callbacks.funcs[ev->type]; + + /* Drop the lock for the time of callback call. */ + rte_spinlock_unlock(&ifpx_lock); + if (px) { + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] != proxy_id || + ifpx_ports[p] == p) + continue; + ev->data.port_id = p; + done = cb.f_ptr(&ev->data) || done; + } + } else { + RTE_ASSERT(ev->type == RTE_IFPX_CFG_DONE); + done = cb.cfg_done(); + } + rte_spinlock_lock(&ifpx_lock); + } + if (done) + goto exit; + + /* Event not "consumed" yet so try to notify via queues. */ + TAILQ_FOREACH(q, &ifpx_queues, elem) { + if (px) { + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] != proxy_id || + ifpx_ports[p] == p) + continue; + /* Set the port_id - the remaining params should + * be filled before calling this function. 
+ */ + ev->data.port_id = p; + queue_event(ev, q->r); + } + } else + queue_event(ev, q->r); + } +exit: + if (px) + px->state &= ~IN_USE; +} + +void ifpx_cleanup_proxies(void) +{ + struct ifpx_proxy_node *px, *next; + for (px = TAILQ_FIRST(&ifpx_proxies); px; px = next) { + next = TAILQ_NEXT(px, elem); + if (px->state & DEL_PENDING) + ifpx_proxy_destroy(px); + } +} + +int rte_ifpx_listen(void) +{ + int ec; + + if (!ifpx_platform.listen) + return -ENOTSUP; + + ec = ifpx_platform.listen(); + if (ec == 0 && ifpx_platform.get_info) + ifpx_platform.get_info(0); + + return ec; +} + +int rte_ifpx_close(void) +{ + struct ifpx_proxy_node *px; + struct ifpx_queue_node *q; + unsigned int p; + int ec = 0; + + if (ifpx_platform.close) { + ec = ifpx_platform.close(); + if (ec != 0) + IFPX_LOG(ERR, "Platform 'close' callback failed."); + } + + rte_spinlock_lock(&ifpx_lock); + /* Remove queues. */ + while (!TAILQ_EMPTY(&ifpx_queues)) { + q = TAILQ_FIRST(&ifpx_queues); + TAILQ_REMOVE(&ifpx_queues, q, elem); + free(q); + } + + /* Clear callbacks. */ + memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs)); + + /* Unbind ports. */ + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] == RTE_MAX_ETHPORTS) + continue; + if (ifpx_ports[p] == p) + /* port is a proxy - just clear entry */ + ifpx_ports[p] = RTE_MAX_ETHPORTS; + else + rte_ifpx_port_unbind(p); + } + + /* Clear proxies. 
*/ + while (!TAILQ_EMPTY(&ifpx_proxies)) { + px = TAILQ_FIRST(&ifpx_proxies); + TAILQ_REMOVE(&ifpx_proxies, px, elem); + free(px); + } + + rte_spinlock_unlock(&ifpx_lock); + + return ec; +} + +RTE_INIT(if_proxy_init) +{ + unsigned int i; + for (i = 0; i < RTE_DIM(ifpx_ports); ++i) + ifpx_ports[i] = RTE_MAX_ETHPORTS; + + ifpx_log_type = rte_log_register("lib.if_proxy"); + if (ifpx_log_type >= 0) + rte_log_set_level(ifpx_log_type, RTE_LOG_WARNING); + + if (ifpx_platform.init) + ifpx_platform.init(); +} diff --git a/lib/librte_if_proxy/if_proxy_priv.h b/lib/librte_if_proxy/if_proxy_priv.h new file mode 100644 index 000000000..dd7468891 --- /dev/null +++ b/lib/librte_if_proxy/if_proxy_priv.h @@ -0,0 +1,97 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ +#ifndef _IF_PROXY_PRIV_H_ +#define _IF_PROXY_PRIV_H_ + +#include <rte_if_proxy.h> +#include <rte_spinlock.h> + +extern int ifpx_log_type; +#define IFPX_LOG(level, fmt, args...) \ + rte_log(RTE_LOG_ ## level, ifpx_log_type, "%s(): " fmt "\n", \ + __func__, ##args) + +/* Table keeping mapping between port and their proxies. */ +extern +uint16_t ifpx_ports[RTE_MAX_ETHPORTS]; + +/* Callbacks and proxies are kept in linked lists. Since this library is really + * a slow/config path we guard them with a lock - and only one for all of them + * should be enough. We don't expect a need to protect other data structures - + * e.g. data for given port is expected be accessed/modified from single thread. + */ +extern rte_spinlock_t ifpx_lock; + +enum ifpx_node_status { + IN_USE = 1U << 0, + DEL_PENDING = 1U << 1, +}; + +/* List of configured proxies */ +struct ifpx_proxy_node { + TAILQ_ENTRY(ifpx_proxy_node) elem; + uint16_t proxy_id; + uint16_t state; + struct rte_ifpx_info info; +}; +extern +TAILQ_HEAD(ifpx_proxies_head, ifpx_proxy_node) ifpx_proxies; + +/* This function should be called by the implementation whenever it notices + * change in the network configuration. 
The arguments are: + * - ev : pointer to filled event data structure (all fields are expected to be + * filled, with the exception of 'port_id' for all proxy/port related + * events: this function clones the event notification for each bound port + * and fills 'port_id' appropriately). + * - px : proxy node when given event is proxy/port related, otherwise pass NULL + */ +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px); + +/* This function should be called by the implementation whenever it is done with + * notification about network configuration change. It is only really needed + * for the case of callback based API - from the callback user might attempt + * to remove callbacks/proxies. Removing of callbacks is handled by the + * ifpx_notify_event() function above, however only implementation really knows + * when notification for given proxy is finished so it is a duty of it to call + * this function to cleanup all proxies that have been marked for deletion. + */ +void ifpx_cleanup_proxies(void); + +/* This is the internal function removing the proxy from the list. It is + * related to the notification function above and intended to be used by the + * platform implementation for the case of callback based API. + * During notification via callback the internal lock is released so that + * operation would not deadlock on an attempt to take a lock. However + * modification (destruction) is not really performed - instead the + * callbacks/proxies are marked as "to be deleted". + * Handling of callbacks that are "to be deleted" is done by the + * ifpx_notify_event() function itself however it cannot delete the proxies (in + * particular the proxy passed as an argument) since they might still be + * referred by the calling function. So it is a responsibility of the platform + * implementation to check after calling notification function if there are any + * proxies to be removed and use ifpx_proxy_destroy() to actually release them. 
+ */ +int ifpx_proxy_destroy(struct ifpx_proxy_node *px); + +/* Every implementation should provide definition of this structure: + * - init : called during library initialization (NULL when not needed) + * - listen : this function should start service listening to the network + * configuration events/changes, + * - close : this function should close the service started by listen() + * - get_info : this function should query system for current configuration of + * interface with index 'if_index'. After successful initialization of + * listening service this function is called with 0 as an argument. In that + * case configuration of all ports should be obtained - and when this + * procedure completes a RTE_IFPX_CFG_DONE event should be signaled via + * ifpx_notify_event(). + */ +extern +struct ifpx_platform_callbacks { + void (*init)(void); + int (*listen)(void); + int (*close)(void); + void (*get_info)(int if_index); +} ifpx_platform; + +#endif /* _IF_PROXY_PRIV_H_ */ diff --git a/lib/librte_if_proxy/linux/Makefile b/lib/librte_if_proxy/linux/Makefile new file mode 100644 index 000000000..275b7e1e3 --- /dev/null +++ b/lib/librte_if_proxy/linux/Makefile @@ -0,0 +1,4 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(C) 2020 Marvell International Ltd. + +SRCS += if_proxy.c diff --git a/lib/librte_if_proxy/linux/if_proxy.c b/lib/librte_if_proxy/linux/if_proxy.c new file mode 100644 index 000000000..0204505e3 --- /dev/null +++ b/lib/librte_if_proxy/linux/if_proxy.c @@ -0,0 +1,550 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. 
+ */ +#include <if_proxy_priv.h> +#include <rte_interrupts.h> +#include <rte_string_fns.h> + +#include <stdbool.h> +#include <unistd.h> +#include <errno.h> +#include <sys/socket.h> +#include <linux/rtnetlink.h> +#include <linux/if.h> + +static +struct rte_intr_handle ifpx_irq = { + .type = RTE_INTR_HANDLE_NETLINK, + .fd = -1, +}; + +static +unsigned int ifpx_pid; + +static +int request_info(int type, int index) +{ + static rte_spinlock_t send_lock = RTE_SPINLOCK_INITIALIZER; + struct info_get { + struct nlmsghdr h; + union { + struct ifinfomsg ifm; + struct ifaddrmsg ifa; + struct rtmsg rtm; + struct ndmsg ndm; + } __rte_aligned(NLMSG_ALIGNTO); + } info_req; + int ret; + + memset(&info_req, 0, sizeof(info_req)); + /* First byte of these messages is family, so just make sure that this + * memset is enough to get all families. + */ + RTE_ASSERT(AF_UNSPEC == 0); + + info_req.h.nlmsg_pid = ifpx_pid; + info_req.h.nlmsg_type = type; + info_req.h.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP; + info_req.h.nlmsg_len = offsetof(struct info_get, ifm); + + switch (type) { + case RTM_GETLINK: + info_req.h.nlmsg_len += sizeof(info_req.ifm); + info_req.ifm.ifi_index = index; + break; + case RTM_GETADDR: + info_req.h.nlmsg_len += sizeof(info_req.ifa); + info_req.ifa.ifa_index = index; + break; + case RTM_GETROUTE: + info_req.h.nlmsg_len += sizeof(info_req.rtm); + break; + case RTM_GETNEIGH: + info_req.h.nlmsg_len += sizeof(info_req.ndm); + break; + default: + IFPX_LOG(WARNING, "Unhandled message type: %d", type); + return -EINVAL; + } + /* Store request type (and if it is global or link specific) in 'seq'. + * Later it is used during handling of reply to continue requesting of + * information dump from system - if needed. 
+ */ + info_req.h.nlmsg_seq = index << 8 | type; + + IFPX_LOG(DEBUG, "\tRequesting msg %d for: %u", type, index); + + rte_spinlock_lock(&send_lock); + ret = send(ifpx_irq.fd, &info_req, info_req.h.nlmsg_len, 0); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to send netlink msg: %d", errno); + rte_errno = errno; + } + rte_spinlock_unlock(&send_lock); + + return ret; +} + +static +void handle_link(const struct nlmsghdr *h) +{ + const struct ifinfomsg *ifi = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi)); + const struct rtattr *attrs[IFLA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct ifpx_proxy_node *px; + struct rte_ifpx_event ev; + + IFPX_LOG(DEBUG, "\tLink action (%u): %u, 0x%x/0x%x (flags/changed)", + ifi->ifi_index, h->nlmsg_type, ifi->ifi_flags, + ifi->ifi_change); + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == (unsigned int)ifi->ifi_index) + break; + } + + /* Drop messages that are not associated with any proxy */ + if (!px) + goto exit; + /* When message is a reply to request for specific interface then keep + * it only when it contains info for this interface. 
+	 */
+	if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 &&
+	    (h->nlmsg_seq >> 8) != (unsigned int)ifi->ifi_index)
+		goto exit;
+
+	for (attr = IFLA_RTA(ifi); RTA_OK(attr, alen);
+				   attr = RTA_NEXT(attr, alen)) {
+		if (attr->rta_type > IFLA_MAX)
+			continue;
+		attrs[attr->rta_type] = attr;
+	}
+
+	if (ifi->ifi_change & IFF_UP) {
+		ev.type = RTE_IFPX_LINK_CHANGE;
+		ev.link_change.is_up = ifi->ifi_flags & IFF_UP;
+		ifpx_notify_event(&ev, px);
+	}
+	if (attrs[IFLA_MTU]) {
+		uint16_t mtu = *(const int *)RTA_DATA(attrs[IFLA_MTU]);
+		if (mtu != px->info.mtu) {
+			px->info.mtu = mtu;
+			ev.type = RTE_IFPX_MTU_CHANGE;
+			ev.mtu_change.mtu = mtu;
+			ifpx_notify_event(&ev, px);
+		}
+	}
+	if (attrs[IFLA_ADDRESS]) {
+		const struct rte_ether_addr *mac =
+				RTA_DATA(attrs[IFLA_ADDRESS]);
+
+		RTE_ASSERT(RTA_PAYLOAD(attrs[IFLA_ADDRESS]) ==
+				RTE_ETHER_ADDR_LEN);
+		if (memcmp(mac, &px->info.mac, RTE_ETHER_ADDR_LEN) != 0) {
+			rte_ether_addr_copy(mac, &px->info.mac);
+			ev.type = RTE_IFPX_MAC_CHANGE;
+			rte_ether_addr_copy(mac, &ev.mac_change.mac);
+			ifpx_notify_event(&ev, px);
+		}
+	}
+	if (h->nlmsg_pid == ifpx_pid) {
+		RTE_ASSERT((h->nlmsg_seq & 0xFF) == RTM_GETLINK);
+		/* If this is reply for specific link request (not initial
+		 * global dump) then follow up with address request, otherwise
+		 * just store the interface name.
+		 */
+		if (h->nlmsg_seq >> 8)
+			request_info(RTM_GETADDR, ifi->ifi_index);
+		else if (!px->info.if_name[0] && attrs[IFLA_IFNAME])
+			strlcpy(px->info.if_name, RTA_DATA(attrs[IFLA_IFNAME]),
+				sizeof(px->info.if_name));
+	}
+
+	ifpx_cleanup_proxies();
+exit:
+	rte_spinlock_unlock(&ifpx_lock);
+}
+
+static
+void handle_addr(const struct nlmsghdr *h, bool needs_del)
+{
+	const struct ifaddrmsg *ifa = NLMSG_DATA(h);
+	int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifa));
+	const struct rtattr *attrs[IFA_MAX+1] = { NULL };
+	const struct rtattr *attr;
+	struct ifpx_proxy_node *px;
+	struct rte_ifpx_event ev;
+	const uint8_t *ip;
+
+	IFPX_LOG(DEBUG, "\tAddr action (%u): %u, family: %u",
+		 ifa->ifa_index, h->nlmsg_type, ifa->ifa_family);
+
+	rte_spinlock_lock(&ifpx_lock);
+	TAILQ_FOREACH(px, &ifpx_proxies, elem) {
+		if (px->info.if_index == ifa->ifa_index)
+			break;
+	}
+
+	/* Drop messages that are not associated with any proxy */
+	if (!px)
+		goto exit;
+	/* When message is a reply to request for specific interface then keep
+	 * it only when it contains info for this interface.
+	 */
+	if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 &&
+	    (h->nlmsg_seq >> 8) != ifa->ifa_index)
+		goto exit;
+
+	for (attr = IFA_RTA(ifa); RTA_OK(attr, alen);
+				  attr = RTA_NEXT(attr, alen)) {
+		if (attr->rta_type > IFA_MAX)
+			continue;
+		attrs[attr->rta_type] = attr;
+	}
+
+	if (attrs[IFA_ADDRESS]) {
+		ip = RTA_DATA(attrs[IFA_ADDRESS]);
+		if (ifa->ifa_family == AF_INET) {
+			ev.type = needs_del ? RTE_IFPX_ADDR_DEL
+					    : RTE_IFPX_ADDR_ADD;
+			ev.addr_change.ip =
+					RTE_IPV4(ip[0], ip[1], ip[2], ip[3]);
+		} else {
+			ev.type = needs_del ? RTE_IFPX_ADDR6_DEL
+					    : RTE_IFPX_ADDR6_ADD;
+			memcpy(ev.addr6_change.ip, ip, 16);
+		}
+		ifpx_notify_event(&ev, px);
+		ifpx_cleanup_proxies();
+	}
+exit:
+	rte_spinlock_unlock(&ifpx_lock);
+}
+
+static
+void handle_route(const struct nlmsghdr *h, bool needs_del)
+{
+	const struct rtmsg *r = NLMSG_DATA(h);
+	int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*r));
+	const struct rtattr *attrs[RTA_MAX+1] = { NULL };
+	const struct rtattr *attr;
+	struct rte_ifpx_event ev;
+	struct ifpx_proxy_node *px = NULL;
+	const uint8_t *ip;
+
+	IFPX_LOG(DEBUG, "\tRoute action: %u, family: %u",
+		 h->nlmsg_type, r->rtm_family);
+
+	for (attr = RTM_RTA(r); RTA_OK(attr, alen);
+				attr = RTA_NEXT(attr, alen)) {
+		if (attr->rta_type > RTA_MAX)
+			continue;
+		attrs[attr->rta_type] = attr;
+	}
+
+	memset(&ev, 0, sizeof(ev));
+	ev.type = RTE_IFPX_NUM_EVENTS;
+
+	rte_spinlock_lock(&ifpx_lock);
+	if (attrs[RTA_OIF]) {
+		int if_index = *((int32_t *)RTA_DATA(attrs[RTA_OIF]));
+
+		if (if_index > 0) {
+			TAILQ_FOREACH(px, &ifpx_proxies, elem) {
+				if (px->info.if_index == (uint32_t)if_index)
+					break;
+			}
+		}
+	}
+	/* We are only interested in routes related to the proxy interfaces and
+	 * we need to have dst - otherwise skip the message.
+	 */
+	if (!px || !attrs[RTA_DST])
+		goto exit;
+
+	ip = RTA_DATA(attrs[RTA_DST]);
+	/* This is common to both IPv4/6. */
+	ev.route_change.depth = r->rtm_dst_len;
+	if (r->rtm_family == AF_INET) {
+		ev.type = needs_del ? RTE_IFPX_ROUTE_DEL
+				    : RTE_IFPX_ROUTE_ADD;
+		ev.route_change.ip = RTE_IPV4(ip[0], ip[1], ip[2], ip[3]);
+	} else {
+		ev.type = needs_del ? RTE_IFPX_ROUTE6_DEL
+				    : RTE_IFPX_ROUTE6_ADD;
+		memcpy(ev.route6_change.ip, ip, 16);
+	}
+	if (attrs[RTA_GATEWAY]) {
+		ip = RTA_DATA(attrs[RTA_GATEWAY]);
+		if (r->rtm_family == AF_INET)
+			ev.route_change.gateway =
+					RTE_IPV4(ip[0], ip[1], ip[2], ip[3]);
+		else
+			memcpy(ev.route6_change.gateway, ip, 16);
+	}
+
+	ifpx_notify_event(&ev, px);
+	/* Let's check for proxies to remove here too - just in case somebody
+	 * removed the non-proxy related callback.
+	 */
+	ifpx_cleanup_proxies();
+exit:
+	rte_spinlock_unlock(&ifpx_lock);
+}
+
+/* Link, addr and route related messages seem to have this macro defined but not
+ * neighbour one.  Define one if it is missing - const qualifiers added just to
+ * silence compiler - for some reason it is not needed in equivalent macros for
+ * other messages and here compiler is complaining about (char*) cast on pointer
+ * to const.
+ */
+#ifndef NDA_RTA
+#define NDA_RTA(r) ((const struct rtattr *)(((const char *)(r)) + \
+			NLMSG_ALIGN(sizeof(struct ndmsg))))
+#endif
+
+static
+void handle_neigh(const struct nlmsghdr *h, bool needs_del)
+{
+	const struct ndmsg *n = NLMSG_DATA(h);
+	int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*n));
+	const struct rtattr *attrs[NDA_MAX+1] = { NULL };
+	const struct rtattr *attr;
+	struct ifpx_proxy_node *px;
+	struct rte_ifpx_event ev;
+	const uint8_t *ip;
+
+	IFPX_LOG(DEBUG, "\tNeighbour action: %u, family: %u, state: %u, if: %d",
+		 h->nlmsg_type, n->ndm_family, n->ndm_state, n->ndm_ifindex);
+
+	for (attr = NDA_RTA(n); RTA_OK(attr, alen);
+				attr = RTA_NEXT(attr, alen)) {
+		if (attr->rta_type > NDA_MAX)
+			continue;
+		attrs[attr->rta_type] = attr;
+	}
+
+	memset(&ev, 0, sizeof(ev));
+	ev.type = RTE_IFPX_NUM_EVENTS;
+
+	rte_spinlock_lock(&ifpx_lock);
+	TAILQ_FOREACH(px, &ifpx_proxies, elem) {
+		if (px->info.if_index == (unsigned int)n->ndm_ifindex)
+			break;
+	}
+	/* We need only subset of neighbourhood related to proxy interfaces.
+	 * lladdr seems to be needed only for adding new entry - modifications
+	 * (also reported via RTM_NEWNEIGH) and deletion include only dst.
+	 */
+	if (!px || !attrs[NDA_DST] || (!needs_del && !attrs[NDA_LLADDR]))
+		goto exit;
+
+	ip = RTA_DATA(attrs[NDA_DST]);
+	if (n->ndm_family == AF_INET) {
+		ev.type = needs_del ? RTE_IFPX_NEIGH_DEL
+				    : RTE_IFPX_NEIGH_ADD;
+		ev.neigh_change.ip = RTE_IPV4(ip[0], ip[1], ip[2], ip[3]);
+	} else {
+		ev.type = needs_del ? RTE_IFPX_NEIGH6_DEL
+				    : RTE_IFPX_NEIGH6_ADD;
+		memcpy(ev.neigh6_change.ip, ip, 16);
+	}
+	if (attrs[NDA_LLADDR])
+		rte_ether_addr_copy(RTA_DATA(attrs[NDA_LLADDR]),
+				    &ev.neigh_change.mac);
+
+	ifpx_notify_event(&ev, px);
+	/* Let's check for proxies to remove here too - just in case somebody
+	 * removed the non-proxy related callback.
+	 */
+	ifpx_cleanup_proxies();
+exit:
+	rte_spinlock_unlock(&ifpx_lock);
+}
+
+static
+void if_proxy_intr_callback(void *arg __rte_unused)
+{
+	struct nlmsghdr *h;
+	struct sockaddr_nl addr;
+	socklen_t addr_len = sizeof(addr); /* value-result arg of recvfrom() */
+	char buf[8192];
+	ssize_t len;
+
+restart:
+	len = recvfrom(ifpx_irq.fd, buf, sizeof(buf), 0,
+		       (struct sockaddr *)&addr, &addr_len);
+	if (len < 0) {
+		if (errno == EINTR) {
+			IFPX_LOG(DEBUG, "recvfrom() interrupted");
+			goto restart;
+		}
+		IFPX_LOG(ERR, "Failed to read netlink msg: %zd (errno %d)",
+			 len, errno);
+		return;
+	}
+	if (addr_len != sizeof(addr)) {
+		IFPX_LOG(ERR, "Invalid netlink addr size: %u", addr_len);
+		return;
+	}
+	IFPX_LOG(DEBUG, "Read %zd bytes (buf %zu) from %u/%u", len,
+		 sizeof(buf), addr.nl_pid, addr.nl_groups);
+
+	for (h = (struct nlmsghdr *)buf; NLMSG_OK(h, len);
+					 h = NLMSG_NEXT(h, len)) {
+		IFPX_LOG(DEBUG, "Recv msg: %u (%u/%u/%u seq/flags/pid)",
+			 h->nlmsg_type, h->nlmsg_seq, h->nlmsg_flags,
+			 h->nlmsg_pid);
+
+		switch (h->nlmsg_type) {
+		case RTM_NEWLINK:
+		case RTM_DELLINK:
+			handle_link(h);
+			break;
+		case RTM_NEWADDR:
+		case RTM_DELADDR:
+			handle_addr(h, h->nlmsg_type == RTM_DELADDR);
+			break;
+		case RTM_NEWROUTE:
+		case RTM_DELROUTE:
+			handle_route(h, h->nlmsg_type == RTM_DELROUTE);
+			break;
+		case RTM_NEWNEIGH:
+		case RTM_DELNEIGH:
+			handle_neigh(h, h->nlmsg_type == RTM_DELNEIGH);
+			break;
+		}
+
+		/* If this is a reply for global request then follow up with
+		 * additional requests and notify about finish.
+		 */
+		if (h->nlmsg_pid == ifpx_pid && (h->nlmsg_seq >> 8) == 0 &&
+		    h->nlmsg_type == NLMSG_DONE) {
+			if ((h->nlmsg_seq & 0xFF) == RTM_GETLINK)
+				request_info(RTM_GETADDR, 0);
+			else if ((h->nlmsg_seq & 0xFF) == RTM_GETADDR)
+				request_info(RTM_GETROUTE, 0);
+			else if ((h->nlmsg_seq & 0xFF) == RTM_GETROUTE)
+				request_info(RTM_GETNEIGH, 0);
+			else {
+				struct rte_ifpx_event ev = {
+					.type = RTE_IFPX_CFG_DONE
+				};
+
+				RTE_ASSERT((h->nlmsg_seq & 0xFF) ==
+						RTM_GETNEIGH);
+				rte_spinlock_lock(&ifpx_lock);
+				ifpx_notify_event(&ev, NULL);
+				rte_spinlock_unlock(&ifpx_lock);
+			}
+		}
+	}
+	IFPX_LOG(DEBUG, "Finished msg loop: %zd bytes left", len);
+}
+
+static
+int nlink_listen(void)
+{
+	struct sockaddr_nl addr = {
+		.nl_family = AF_NETLINK,
+		.nl_pid = 0,
+	};
+	socklen_t addr_len = sizeof(addr);
+	int ret;
+
+	if (ifpx_irq.fd != -1) {
+		rte_errno = EBUSY;
+		return -1;
+	}
+
+	addr.nl_groups = 1 << (RTNLGRP_LINK-1)
+			| 1 << (RTNLGRP_NEIGH-1)
+			| 1 << (RTNLGRP_IPV4_IFADDR-1)
+			| 1 << (RTNLGRP_IPV6_IFADDR-1)
+			| 1 << (RTNLGRP_IPV4_ROUTE-1)
+			| 1 << (RTNLGRP_IPV6_ROUTE-1);
+
+	ifpx_irq.fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC,
+				 NETLINK_ROUTE);
+	if (ifpx_irq.fd == -1) {
+		IFPX_LOG(ERR, "Failed to create netlink socket: %d", errno);
+		goto error;
+	}
+	/* Starting with kernel 4.19 you can request dump for a specific
+	 * interface and kernel will filter out and send only relevant info.
+	 * Otherwise NLM_F_DUMP will generate info for all interfaces and you
+	 * need to filter them yourself.
+	 */
+#ifdef NETLINK_DUMP_STRICT_CHK
+	ret = 1; /* use this var also as an input param */
+	ret = setsockopt(ifpx_irq.fd, SOL_SOCKET, NETLINK_DUMP_STRICT_CHK,
+			 &ret, sizeof(ret));
+	if (ret < 0) {
+		IFPX_LOG(ERR, "Failed to set socket option: %d", errno);
+		goto error;
+	}
+#endif
+
+	ret = bind(ifpx_irq.fd, (struct sockaddr *)&addr, addr_len);
+	if (ret < 0) {
+		IFPX_LOG(ERR, "Failed to bind socket: %d", errno);
+		goto error;
+	}
+	ret = getsockname(ifpx_irq.fd, (struct sockaddr *)&addr, &addr_len);
+	if (ret < 0) {
+		IFPX_LOG(ERR, "Failed to get socket addr: %d", errno);
+		goto error;
+	} else {
+		ifpx_pid = addr.nl_pid;
+		IFPX_LOG(DEBUG, "Assigned port ID: %u", addr.nl_pid);
+	}
+
+	ret = rte_intr_callback_register(&ifpx_irq, if_proxy_intr_callback,
+					 NULL);
+	if (ret == 0)
+		return 0;
+
+error:
+	rte_errno = errno;
+	if (ifpx_irq.fd != -1) {
+		close(ifpx_irq.fd);
+		ifpx_irq.fd = -1;
+	}
+	return -1;
+}
+
+static
+int nlink_close(void)
+{
+	int ec;
+
+	if (ifpx_irq.fd < 0)
+		return -EBADFD;
+
+	do
+		ec = rte_intr_callback_unregister(&ifpx_irq,
+						  if_proxy_intr_callback, NULL);
+	while (ec == -EAGAIN); /* unlikely but possible - at least I think so */
+
+	close(ifpx_irq.fd);
+	ifpx_irq.fd = -1;
+	ifpx_pid = 0;
+
+	return 0;
+}
+
+static
+void nlink_get_info(int if_index)
+{
+	if (ifpx_irq.fd != -1)
+		request_info(RTM_GETLINK, if_index);
+}
+
+struct ifpx_platform_callbacks ifpx_platform = {
+	.init = NULL,
+	.listen = nlink_listen,
+	.close = nlink_close,
+	.get_info = nlink_get_info,
+};
diff --git a/lib/librte_if_proxy/meson.build b/lib/librte_if_proxy/meson.build
new file mode 100644
index 000000000..f0c1a6e15
--- /dev/null
+++ b/lib/librte_if_proxy/meson.build
@@ -0,0 +1,19 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2020 Marvell International Ltd.
+
+# Currently only implemented on Linux
+if not is_linux
+	build = false
+	reason = 'only supported on linux'
+endif
+
+version = 1
+allow_experimental_apis = true
+
+deps += ['ethdev']
+sources = files('if_proxy_common.c')
+headers = files('rte_if_proxy.h')
+
+if is_linux
+	sources += files('linux/if_proxy.c')
+endif
diff --git a/lib/librte_if_proxy/rte_if_proxy.h b/lib/librte_if_proxy/rte_if_proxy.h
new file mode 100644
index 000000000..70f701719
--- /dev/null
+++ b/lib/librte_if_proxy/rte_if_proxy.h
@@ -0,0 +1,561 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2020 Marvell International Ltd.
+ */
+
+#ifndef _RTE_IF_PROXY_H_
+#define _RTE_IF_PROXY_H_
+
+/**
+ * @file
+ * RTE IF Proxy library
+ *
+ * The IF Proxy library allows for monitoring of system network configuration
+ * and configuration of DPDK ports by using usual system utilities (like the
+ * ones from iproute2 package).
+ *
+ * It is based on the notion of "proxy interface" which actually can be any DPDK
+ * port which is also visible to the system - that is it has non-zero 'if_index'
+ * field in 'rte_eth_dev_info' structure.
+ *
+ * If application doesn't have any such port (or doesn't want to use it for
+ * proxy) it can create one by calling:
+ *
+ *   proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT);
+ *
+ * This function is just a wrapper that constructs valid 'devargs' string based
+ * on the proxy type chosen (currently Tap or KNI) and creates the interface by
+ * calling rte_ifpx_proxy_create_by_devarg().
+ *
+ * Once one has DPDK port capable of being proxy one can bind target DPDK port
+ * to it by calling:
+ *
+ *   rte_ifpx_port_bind(port_id, proxy_id);
+ *
+ * This binding is a logical one - there is no automatic packet forwarding
+ * between port and its proxy since the library doesn't know the structure of
+ * application's packet processing.  It remains the application's responsibility
+ * to forward the packets from/to proxy port (by calling the usual DPDK RX/TX
+ * burst API).  However when the library notes some change to the proxy
+ * interface it will simply call appropriate callback with 'port_id' of the DPDK
+ * port that is bound to this proxy interface.  The binding can be 1 to many -
+ * that is many ports can point to one proxy - in that case registered callbacks
+ * will be called for every bound port.
+ *
+ * The callbacks that are used for notifications are described by the
+ * 'rte_ifpx_callbacks' structure and they are registered by calling:
+ *
+ *   rte_ifpx_callbacks_register(&cbs);
+ *
+ * Finally the application should call:
+ *
+ *   rte_ifpx_listen();
+ *
+ * which will query system for present network configuration and start listening
+ * to its changes.
+ */
+
+#include <rte_eal.h>
+#include <rte_ethdev.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Enum naming the type of proxy to create.
+ *
+ * @see rte_ifpx_proxy_create()
+ */
+enum rte_ifpx_proxy_type {
+	RTE_IFPX_DEFAULT,	/**< Use default proxy type for given arch. */
+	RTE_IFPX_TAP,		/**< Use Tap based port for proxy. */
+	RTE_IFPX_KNI		/**< Use KNI based port for proxy. */
+};
+
+/**
+ * Create DPDK port that can serve as an interface proxy.
+ *
+ * This function is just a wrapper around rte_ifpx_proxy_create_by_devarg() that
+ * constructs its 'devarg' argument based on type of proxy requested.
+ *
+ * @param type
+ *   A type of proxy to create.
+ *
+ * @return
+ *   DPDK port id on success, RTE_MAX_ETHPORTS otherwise.
+ *
+ * @see enum rte_ifpx_proxy_type
+ * @see rte_ifpx_proxy_create_by_devarg()
+ */
+__rte_experimental
+uint16_t rte_ifpx_proxy_create(enum rte_ifpx_proxy_type type);
+
+/**
+ * Create DPDK port that can serve as an interface proxy.
+ *
+ * @param devarg
+ *   A string passed to rte_dev_probe() to create proxy port.
+ *
+ * @return
+ *   DPDK port id on success, RTE_MAX_ETHPORTS otherwise.
+ */
+__rte_experimental
+uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg);
+
+/**
+ * Remove DPDK proxy port.
+ *
+ * In addition to removing the proxy port the bindings (if any) are cleared.
+ *
+ * @param proxy_id
+ *   Port id of the proxy that should be removed.
+ *
+ * @return
+ *   0 on success, negative on error.
+ */
+__rte_experimental
+int rte_ifpx_proxy_destroy(uint16_t proxy_id);
+
+/**
+ * The rte_ifpx_event_type enum lists all possible event types that can be
+ * signaled by this library.  To learn what events are supported on your
+ * platform call rte_ifpx_events_available().
+ *
+ * NOTE - do not reorder these enums freely, their values need to correspond to
+ * the order of the callbacks in struct rte_ifpx_callbacks.
+ */
+enum rte_ifpx_event_type {
+	RTE_IFPX_MAC_CHANGE,  /**< @see struct rte_ifpx_mac_change */
+	RTE_IFPX_MTU_CHANGE,  /**< @see struct rte_ifpx_mtu_change */
+	RTE_IFPX_LINK_CHANGE, /**< @see struct rte_ifpx_link_change */
+	RTE_IFPX_ADDR_ADD,    /**< @see struct rte_ifpx_addr_change */
+	RTE_IFPX_ADDR_DEL,    /**< @see struct rte_ifpx_addr_change */
+	RTE_IFPX_ADDR6_ADD,   /**< @see struct rte_ifpx_addr6_change */
+	RTE_IFPX_ADDR6_DEL,   /**< @see struct rte_ifpx_addr6_change */
+	RTE_IFPX_ROUTE_ADD,   /**< @see struct rte_ifpx_route_change */
+	RTE_IFPX_ROUTE_DEL,   /**< @see struct rte_ifpx_route_change */
+	RTE_IFPX_ROUTE6_ADD,  /**< @see struct rte_ifpx_route6_change */
+	RTE_IFPX_ROUTE6_DEL,  /**< @see struct rte_ifpx_route6_change */
+	RTE_IFPX_NEIGH_ADD,   /**< @see struct rte_ifpx_neigh_change */
+	RTE_IFPX_NEIGH_DEL,   /**< @see struct rte_ifpx_neigh_change */
+	RTE_IFPX_NEIGH6_ADD,  /**< @see struct rte_ifpx_neigh6_change */
+	RTE_IFPX_NEIGH6_DEL,  /**< @see struct rte_ifpx_neigh6_change */
+	RTE_IFPX_CFG_DONE,    /**< This event is a lib specific event - it is
+			       * signaled when initial network configuration
+			       * query is finished and has no event data.
+			       */
+	RTE_IFPX_NUM_EVENTS,
+};
+
+/**
+ * Get the bit mask of implemented events/callbacks for this platform.
+ *
+ * @return
+ *   Bit mask of events/callbacks implemented: each event type can be tested by
+ *   checking bit (1 << ev) where 'ev' is one of the rte_ifpx_event_type enum
+ *   values.
+ * @see enum rte_ifpx_event_type
+ */
+__rte_experimental
+uint64_t rte_ifpx_events_available(void);
+
+/**
+ * The rte_ifpx_event defines structure used to pass notification event to
+ * application.  Each event type has its own dedicated inner structure - these
+ * structures are also used when using callbacks notifications.
+ */
+struct rte_ifpx_event {
+	enum rte_ifpx_event_type type;
+	union {
+		/** Structure used to pass notification about MAC change of the
+		 * proxy interface.
+		 * @see RTE_IFPX_MAC_CHANGE
+		 */
+		struct rte_ifpx_mac_change {
+			uint16_t port_id;
+			struct rte_ether_addr mac;
+		} mac_change;
+		/** Structure used to pass notification about MTU change.
+		 * @see RTE_IFPX_MTU_CHANGE
+		 */
+		struct rte_ifpx_mtu_change {
+			uint16_t port_id;
+			uint16_t mtu;
+		} mtu_change;
+		/** Structure used to pass notification about link going
+		 * up/down.
+		 * @see RTE_IFPX_LINK_CHANGE
+		 */
+		struct rte_ifpx_link_change {
+			uint16_t port_id;
+			int is_up;
+		} link_change;
+		/** Structure used to pass notification about IPv4 address being
+		 * added/removed.  All IPv4 addresses reported by this library
+		 * are in host order.
+		 * @see RTE_IFPX_ADDR_ADD
+		 * @see RTE_IFPX_ADDR_DEL
+		 */
+		struct rte_ifpx_addr_change {
+			uint16_t port_id;
+			uint32_t ip;
+		} addr_change;
+		/** Structure used to pass notification about IPv6 address being
+		 * added/removed.
+		 * @see RTE_IFPX_ADDR6_ADD
+		 * @see RTE_IFPX_ADDR6_DEL
+		 */
+		struct rte_ifpx_addr6_change {
+			uint16_t port_id;
+			uint8_t ip[16];
+		} addr6_change;
+		/** Structure used to pass notification about IPv4 route being
+		 * added/removed.
+		 * @see RTE_IFPX_ROUTE_ADD
+		 * @see RTE_IFPX_ROUTE_DEL
+		 */
+		struct rte_ifpx_route_change {
+			uint16_t port_id;
+			uint8_t depth;
+			uint32_t ip;
+			uint32_t gateway;
+		} route_change;
+		/** Structure used to pass notification about IPv6 route being
+		 * added/removed.
+		 * @see RTE_IFPX_ROUTE6_ADD
+		 * @see RTE_IFPX_ROUTE6_DEL
+		 */
+		struct rte_ifpx_route6_change {
+			uint16_t port_id;
+			uint8_t depth;
+			uint8_t ip[16];
+			uint8_t gateway[16];
+		} route6_change;
+		/** Structure used to pass notification about IPv4 neighbour
+		 * info changes.
+		 * @see RTE_IFPX_NEIGH_ADD
+		 * @see RTE_IFPX_NEIGH_DEL
+		 */
+		struct rte_ifpx_neigh_change {
+			uint16_t port_id;
+			struct rte_ether_addr mac;
+			uint32_t ip;
+		} neigh_change;
+		/** Structure used to pass notification about IPv6 neighbour
+		 * info changes.
+		 * @see RTE_IFPX_NEIGH6_ADD
+		 * @see RTE_IFPX_NEIGH6_DEL
+		 */
+		struct rte_ifpx_neigh6_change {
+			uint16_t port_id;
+			struct rte_ether_addr mac;
+			uint8_t ip[16];
+		} neigh6_change;
+		/* This structure is used internally - to abstract common parts
+		 * of proxy/port related events and to be able to refer to this
+		 * union without giving it a name.
+		 */
+		struct {
+			uint16_t port_id;
+		} data;
+	};
+};
+
+/**
+ * This library can deliver notification about network configuration changes
+ * either by the use of registered callbacks and/or by queueing change events to
+ * configured notification queues.  The logic used is:
+ * 1. If there is callback registered for given event type it is called.  In
+ *   case of many ports to one proxy binding, this callback is called for every
+ *   port bound.
+ * 2. If this callback returns non-zero value (for any of ports in case of
+ *   many-1 bindings) the handling of an event is considered as complete.
+ * 3. Otherwise the event is added to each configured event queue.  The event is
+ *   allocated with malloc() so after dequeueing and handling the application
+ *   should deallocate it with free().
+ *
+ * This dual notification mechanism is meant to provide some flexibility to
+ * application writer.  For example, if you store your data in a single writer/
+ * many readers coherent data structure you could just update this structure
+ * from the callback.  If you keep separate copy per lcore/port you could make
+ * some common preparations (if applicable) in the callback, return 0 and use
+ * notification queues to pick up the change and update data structures.  Or you
+ * could skip the callbacks altogether and just use notification queues - and
+ * configure them at the level appropriate for your application design (one
+ * global / one per lcore / one per port ...).
+ */
+
+/**
+ * Add notification queue to the list of queues.
+ *
+ * @param r
+ *   Ring used for queueing of notification events - application can assume that
+ *   there is only one producer.
+ * @return
+ *   0 on success, negative otherwise.
+ */
+int rte_ifpx_queue_add(struct rte_ring *r);
+
+/**
+ * Remove notification queue from the list of queues.
+ *
+ * @param r
+ *   Notification ring used for queueing of notification events (previously
+ *   added via rte_ifpx_queue_add()).
+ * @return
+ *   0 on success, negative otherwise.
+ */
+int rte_ifpx_queue_remove(struct rte_ring *r);
+
+/**
+ * This structure groups the callbacks that might be called as notification
+ * events for changing network configuration.  Not every platform might
+ * implement all of them and you can query the availability with
+ * rte_ifpx_events_available() function.
+ * @see rte_ifpx_events_available()
+ * @see rte_ifpx_callbacks_register()
+ */
+struct rte_ifpx_callbacks {
+	int (*mac_change)(const struct rte_ifpx_mac_change *event);
+	/**< Callback for notification about MAC change of the proxy interface.
+	 * This callback (as all other port related callbacks) is called for
+	 * each port (with its port_id as a first argument) bound to the proxy
+	 * interface for which change has been observed.
+	 * @see struct rte_ifpx_mac_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*mtu_change)(const struct rte_ifpx_mtu_change *event);
+	/**< Callback for notification about MTU change.
+	 * @see struct rte_ifpx_mtu_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*link_change)(const struct rte_ifpx_link_change *event);
+	/**< Callback for notification about link going up/down.
+	 * @see struct rte_ifpx_link_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*addr_add)(const struct rte_ifpx_addr_change *event);
+	/**< Callback for notification about IPv4 address being added.
+	 * @see struct rte_ifpx_addr_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*addr_del)(const struct rte_ifpx_addr_change *event);
+	/**< Callback for notification about IPv4 address removal.
+	 * @see struct rte_ifpx_addr_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*addr6_add)(const struct rte_ifpx_addr6_change *event);
+	/**< Callback for notification about IPv6 address being added.
+	 * @see struct rte_ifpx_addr6_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*addr6_del)(const struct rte_ifpx_addr6_change *event);
+	/**< Callback for notification about IPv6 address removal.
+	 * @see struct rte_ifpx_addr6_change
+	 * @return non-zero if event handling is finished
+	 */
+	/* Please note that "route" callbacks might be also called when user
+	 * adds address to the interface (that is in addition to address related
+	 * callbacks).
+	 */
+	int (*route_add)(const struct rte_ifpx_route_change *event);
+	/**< Callback for notification about IPv4 route being added.
+	 * @see struct rte_ifpx_route_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*route_del)(const struct rte_ifpx_route_change *event);
+	/**< Callback for notification about IPv4 route removal.
+	 * @see struct rte_ifpx_route_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*route6_add)(const struct rte_ifpx_route6_change *event);
+	/**< Callback for notification about IPv6 route being added.
+	 * @see struct rte_ifpx_route6_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*route6_del)(const struct rte_ifpx_route6_change *event);
+	/**< Callback for notification about IPv6 route removal.
+	 * @see struct rte_ifpx_route6_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*neigh_add)(const struct rte_ifpx_neigh_change *event);
+	/**< Callback for notification about IPv4 neighbour being added.
+	 * @see struct rte_ifpx_neigh_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*neigh_del)(const struct rte_ifpx_neigh_change *event);
+	/**< Callback for notification about IPv4 neighbour removal.
+	 * @see struct rte_ifpx_neigh_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event);
+	/**< Callback for notification about IPv6 neighbour being added.
+	 * @see struct rte_ifpx_neigh6_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event);
+	/**< Callback for notification about IPv6 neighbour removal.
+	 * @see struct rte_ifpx_neigh6_change
+	 * @return non-zero if event handling is finished
+	 */
+	int (*cfg_done)(void);
+	/**< Lib specific callback - called when initial network configuration
+	 * query is finished.
+	 * @return non-zero if event handling is finished
+	 */
+};
+
+/**
+ * Register proxy callbacks.
+ *
+ * This function registers callbacks to be called upon appropriate network
+ * event notification.
+ *
+ * @param cbs
+ *   Set of callbacks that will be called.  The library does not take any
+ *   ownership of the pointer passed - the callbacks are stored internally.
+ *
+ * @return
+ *   0 on success, negative otherwise.
+ */
+__rte_experimental
+int rte_ifpx_callbacks_register(const struct rte_ifpx_callbacks *cbs);
+
+/**
+ * Unregister proxy callbacks.
+ *
+ * This function unregisters callbacks previously registered with
+ * rte_ifpx_callbacks_register().
+ */
+__rte_experimental
+void rte_ifpx_callbacks_unregister(void);
+
+/**
+ * Bind the port to its proxy.
+ *
+ * After calling this function all network configuration of the proxy (and its
+ * changes) will be passed to given port by calling registered callbacks with
+ * 'port_id' as an argument.
+ *
+ * Note: since both arguments are of the same type in order to not mix them and
+ * ease remembering the order the first one is kept the same for bind/unbind.
+ *
+ * @param port_id
+ *   Id of the port to be bound.
+ * @param proxy_id
+ *   Id of the proxy the port needs to be bound to.
+ * @return
+ *   0 on success, negative on error.
+ */
+__rte_experimental
+int rte_ifpx_port_bind(uint16_t port_id, uint16_t proxy_id);
+
+/**
+ * Unbind the port from its proxy.
+ *
+ * After calling this function registered callbacks will no longer be called for
+ * this port (but they might be called for other ports in one to many binding
+ * scenario).
+ *
+ * @param port_id
+ *   Id of the port to unbind.
+ * @return
+ *   0 on success, negative on error.
+ */
+__rte_experimental
+int rte_ifpx_port_unbind(uint16_t port_id);
+
+/**
+ * Get the system network configuration and start listening to its changes.
+ *
+ * @return
+ *   0 on success, negative otherwise.
+ */
+__rte_experimental
+int rte_ifpx_listen(void);
+
+/**
+ * Remove all bindings/callbacks and stop listening to network configuration.
+ *
+ * @return
+ *   0 on success, negative otherwise.
+ */
+__rte_experimental
+int rte_ifpx_close(void);
+
+/**
+ * Get the id of the proxy the port is bound to.
+ *
+ * @param port_id
+ *   Id of the port for which to get proxy.
+ * @return
+ *   Port id of the proxy on success, RTE_MAX_ETHPORTS on error.
+ */
+__rte_experimental
+uint16_t rte_ifpx_proxy_get(uint16_t port_id);
+
+/**
+ * Test for port acting as a proxy.
+ *
+ * @param port_id
+ *   Id of the port.
+ * @return
+ *   1 if port acts as a proxy, 0 otherwise.
+ */
+static inline
+int rte_ifpx_is_proxy(uint16_t port_id)
+{
+	return rte_ifpx_proxy_get(port_id) == port_id;
+}
+
+/**
+ * Get the ids of the ports bound to the proxy.
+ *
+ * @param proxy_id
+ *   Id of the proxy for which to get ports.
+ * @param ports
+ *   Array where to store the port ids.
+ * @param num
+ *   Size of the 'ports' array.
+ * @return
+ *   The number of ports bound to given proxy.  Note that bound ports are filled
+ *   in 'ports' array up to its size but the return value is always the total
+ *   number of ports bound - so you can make call first with NULL/0 to query for
+ *   the size of the buffer to create or call it with the buffer you have and
+ *   later check if it was large enough.
+ */
+__rte_experimental
+unsigned int rte_ifpx_port_get(uint16_t proxy_id,
+			       uint16_t *ports, unsigned int num);
+
+/**
+ * The structure containing some properties of the proxy interface.
+ */
+struct rte_ifpx_info {
+	unsigned int if_index; /* entry valid iff if_index != 0 */
+	uint16_t mtu;
+	struct rte_ether_addr mac;
+	char if_name[RTE_ETH_NAME_MAX_LEN];
+};
+
+/**
+ * Get the properties of the proxy interface.  Argument can be either id of the
+ * proxy or an id of a port that is bound to it.
+ *
+ * @param port_id
+ *   Id of the port (or proxy) for which to get proxy properties.
+ * @return
+ *   Pointer to the proxy information structure.
+ */
+__rte_experimental
+const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_IF_PROXY_H_ */
diff --git a/lib/librte_if_proxy/rte_if_proxy_version.map b/lib/librte_if_proxy/rte_if_proxy_version.map
new file mode 100644
index 000000000..e2093137d
--- /dev/null
+++ b/lib/librte_if_proxy/rte_if_proxy_version.map
@@ -0,0 +1,19 @@
+EXPERIMENTAL {
+	global:
+
+	rte_ifpx_proxy_create;
+	rte_ifpx_proxy_create_by_devarg;
+	rte_ifpx_proxy_destroy;
+	rte_ifpx_events_available;
+	rte_ifpx_callbacks_register;
+	rte_ifpx_callbacks_unregister;
+	rte_ifpx_port_bind;
+	rte_ifpx_port_unbind;
+	rte_ifpx_listen;
+	rte_ifpx_close;
+	rte_ifpx_proxy_get;
+	rte_ifpx_port_get;
+	rte_ifpx_info_get;
+
+	local: *;
+};
diff --git a/lib/meson.build b/lib/meson.build
index 0af3efab2..c913b33dd 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -19,7 +19,7 @@ libraries = [
 	'acl', 'bbdev', 'bitratestats', 'cfgfile',
 	'compressdev', 'cryptodev',
 	'distributor', 'efd', 'eventdev',
-	'gro', 'gso', 'ip_frag', 'jobstats',
+	'gro', 'gso', 'if_proxy', 'ip_frag', 'jobstats',
 	'kni', 'latencystats', 'lpm', 'member',
 	'power', 'pdump', 'rawdev',
 	'rcu', 'rib', 'reorder', 'sched', 'security', 'stack', 'vhost',
-- 
2.17.1

^ permalink raw reply	[flat|nested] 64+ messages in thread
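Editorial aside on the patch above: request_info() stores the request type and the interface index together in the netlink 'seq' field (index << 8 | type), and the handlers later recover both with '& 0xFF' and '>> 8'. The snippet below is only a minimal sketch of that packing, assuming the request type fits in 8 bits; the helper names (ifpx_seq_pack() and friends) are hypothetical, since the patch open-codes these expressions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers - the patch writes these expressions inline. */

/* Pack request type (low 8 bits) and interface index (upper bits).
 * Index 0 denotes a global dump, as in the patch.
 */
static inline uint32_t ifpx_seq_pack(uint32_t if_index, uint8_t type)
{
	assert(if_index < (1u << 24)); /* index must survive the 8-bit shift */
	return if_index << 8 | type;
}

/* Request type that was stored by ifpx_seq_pack(). */
static inline uint8_t ifpx_seq_type(uint32_t seq)
{
	return seq & 0xFF;
}

/* Interface index that was stored by ifpx_seq_pack(). */
static inline uint32_t ifpx_seq_index(uint32_t seq)
{
	return seq >> 8;
}

/* Non-zero when the reply belongs to a global (not per-link) request. */
static inline int ifpx_seq_is_global(uint32_t seq)
{
	return ifpx_seq_index(seq) == 0;
}
```

One consequence of this layout, visible in the sketch, is that only 24 bits remain for the interface index; that is an observation about the encoding, not something the patch states.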
* Re: [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library
  2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library Andrzej Ostruszka
@ 2020-07-02 0:34 ` Stephen Hemminger
  2020-07-07 20:13 ` Andrzej Ostruszka [C]
  0 siblings, 1 reply; 64+ messages in thread
From: Stephen Hemminger @ 2020-07-02 0:34 UTC (permalink / raw)
  To: Andrzej Ostruszka; +Cc: dev, Thomas Monjalon

I had great hopes for this library, because such code exists in
almost every real world application. But the current version falls
short.

So what this library does is turn a message based protocol (netlink)
into a set of callbacks. Not sure if that is all that useful; processing
messages is often easier.

It would be more useful if the library took the netlink messages and
maintained a set of tables (interfaces, neighbours, routes) using DPDK
infrastructure and did it as efficiently as possible with RCU.

On a real world router with 1M routes, the processing of netlink can be
a significant chore. It has to be done efficiently. Probably needs to use
rte_ctrl_thread() and not overload the interrupt thread.

Also, please don't reinvent netlink parsing. Use a standard library
like libmnl.

Please use doxygen format for API documentation.

> +/* This function should be called by the implementation whenever it notices
> + * change in the network configuration. The arguments are:
> + * - ev : pointer to filled event data structure (all fields are expected to be
> + *     filled, with the exception of 'port_id' for all proxy/port related
> + *     events: this function clones the event notification for each bound port
> + *     and fills 'port_id' appropriately).
> + * - px : proxy node when given event is proxy/port related, otherwise pass NULL
> + */
> +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px);
> +

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library
  2020-07-02 0:34 ` Stephen Hemminger
@ 2020-07-07 20:13 ` Andrzej Ostruszka [C]
  2020-07-08 16:07 ` Morten Brørup
  0 siblings, 1 reply; 64+ messages in thread
From: Andrzej Ostruszka [C] @ 2020-07-07 20:13 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Thomas Monjalon

First of all, please excuse me Stephen for the late reply.

On 02/07/2020 02:34, Stephen Hemminger wrote:
> I had great hopes for this library, because such code exists in
> almost every real world application. But the current version falls
> short.

Sorry to disappoint.

> So what this library does is turn a message based protocol (netlink)
> into a set of callbacks. Not sure if that is all that useful processing
> messages is often easier.

Callbacks or queued notifications. I'm not sure I understand why you
are saying that it would be easier for the application to do everything
on its own when it can now just pass a callback or dequeue a notification
with all essential information extracted. Could you elaborate here?

> It would be more useful if the library took the netlink and maintained
> a set of tables (interfaces, neighbours, routes) using DPDK infrastructure
> and did it as efficiently as possible with RCU.

I'm pretty sure it would, but that's a bit like complaining that a new
toy doesn't have batteries included. And BTW "no batteries" was
a conscious decision. I wanted something basic that would ease
sniffing of the network configuration and IMHO it does that.

I don't have access to an environment where heavy netlink processing
would be required. And apart from that I can imagine that every
application has its own particular needs so it will be difficult to
please everybody. So instead I thought it would be better to come up
with a starting point upon which this thing can grow - with input from
the community/users.

> On a real world router with 1M routes, the processing of netlink can be
> a significant chore. It has to be done efficiently. Probably needs to use
> rte_ctrl_thread() and not overload the interrupt thread.

That is another voice against using the interrupt thread. I acknowledge
your (plural) concerns and will think about changing it. Thanks. I'm
not sure I will manage to make it before .08 though.

> Also, please don't reinvent netlink parsing. Use a standard library
> like libmnl.

I actually checked this and decided that this will not buy me much and
will add an additional dependency. I also checked how other DPDK parts
(there are a couple) and iproute2 utils are handling netlink and decided
to follow their style of manual handling. Could you give a hint here
how hard you want to push for libmnl? As I've mentioned, that was on my
list of options, so if the community does not mind adding a libmnl
dependency and would prefer using it instead of manual parsing then
I might adapt to that.

> Please use doxygen format for API documentation
>
>> +/* This function should be called by the implementation whenever it notices
>> + * change in the network configuration. The arguments are:
>> + * - ev : pointer to filled event data structure (all fields are expected to be
>> + *     filled, with the exception of 'port_id' for all proxy/port related
>> + *     events: this function clones the event notification for each bound port
>> + *     and fills 'port_id' appropriately).
>> + * - px : proxy node when given event is proxy/port related, otherwise pass NULL
>> + */
>> +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px);
>> +

Will do - thank you.

Thank you for taking time to look at this.

With regards
Andrzej Ostruszka

^ permalink raw reply [flat|nested] 64+ messages in thread
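For readers unfamiliar with what the "manual handling" of netlink discussed above involves, here is a minimal sketch. It is not taken from the patch: `link_msg_mtu()` and `build_link_msg()` are hypothetical names, and the code assumes a Linux system with the kernel uapi headers installed. It walks the attributes of one `RTM_NEWLINK` message with the standard `NLMSG_*`/`RTA_*` macros and extracts `IFLA_MTU`; the builder exists only so the parser can be exercised without opening a netlink socket.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Return the MTU carried by an RTM_NEWLINK message, or 0 when the
 * message has another type, is too short, or has no IFLA_MTU. */
static uint32_t
link_msg_mtu(struct nlmsghdr *nh)
{
	struct ifinfomsg *ifi;
	struct rtattr *rta;
	uint32_t mtu;
	int len;

	if (nh->nlmsg_type != RTM_NEWLINK)
		return 0;
	if (nh->nlmsg_len < NLMSG_SPACE(sizeof(*ifi)))
		return 0;
	ifi = NLMSG_DATA(nh);
	/* Attributes start right after the (aligned) ifinfomsg. */
	rta = (struct rtattr *)((char *)ifi + NLMSG_ALIGN(sizeof(*ifi)));
	len = (int)(nh->nlmsg_len - NLMSG_SPACE(sizeof(*ifi)));

	for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
		if (rta->rta_type != IFLA_MTU)
			continue;
		if ((size_t)RTA_PAYLOAD(rta) < sizeof(mtu))
			return 0;
		memcpy(&mtu, RTA_DATA(rta), sizeof(mtu));
		return mtu;
	}
	return 0;
}

/* Illustration helper: build a minimal RTM_NEWLINK message with a
 * single IFLA_MTU attribute in 'buf' (must be 4-byte aligned).
 * Returns NULL when the buffer is too small. */
static struct nlmsghdr *
build_link_msg(void *buf, size_t buflen, int if_index, uint32_t mtu)
{
	struct nlmsghdr *nh = buf;
	struct ifinfomsg *ifi;
	struct rtattr *rta;
	size_t need = NLMSG_SPACE(sizeof(*ifi)) + RTA_SPACE(sizeof(mtu));

	if (buflen < need)
		return NULL;
	memset(buf, 0, need);
	nh->nlmsg_type = RTM_NEWLINK;
	nh->nlmsg_len = NLMSG_SPACE(sizeof(*ifi)) + RTA_LENGTH(sizeof(mtu));
	ifi = NLMSG_DATA(nh);
	ifi->ifi_family = AF_UNSPEC;
	ifi->ifi_index = if_index;
	rta = (struct rtattr *)((char *)nh + NLMSG_SPACE(sizeof(*ifi)));
	rta->rta_type = IFLA_MTU;
	rta->rta_len = RTA_LENGTH(sizeof(mtu));
	memcpy(RTA_DATA(rta), &mtu, sizeof(mtu));
	return nh;
}
```

A library such as libmnl wraps this kind of attribute walk behind helper functions; the trade-off debated in the thread is that convenience against an extra build dependency.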
* Re: [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library 2020-07-07 20:13 ` Andrzej Ostruszka [C] @ 2020-07-08 16:07 ` Morten Brørup 2020-07-09 8:43 ` Andrzej Ostruszka [C] 0 siblings, 1 reply; 64+ messages in thread From: Morten Brørup @ 2020-07-08 16:07 UTC (permalink / raw) To: Andrzej Ostruszka [C], Stephen Hemminger; +Cc: dev, Thomas Monjalon > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Andrzej Ostruszka > [C] > Sent: Tuesday, July 7, 2020 10:14 PM > > First of all, please excuse me Stephen for late reply. > > On 02/07/2020 02:34, Stephen Hemminger wrote: > > I had great hopes for this library, because such code exists in > > almost every real world application. But the current version falls > > short. > > Sorry to disappoint. > > > So what this library does is turn a message based protocol (netlink) > > into a set of callbacks. Not sure if that is all that useful > processing > > messages is often easier. > > Callbacks or queued notifications. I'm not sure I understand why you > are saying that it would be easier for application to do everything on > its own when it can now just pass a callback or dequeue notification > with all essential information extracted. Could you elaborate here? > > > It would be more useful if the library took the netlink and > maintained > > a set of tables (interfaces, neighbours, routes) using DPDK > infrastructure > > and did it as efficiently as possible with RCU. > > I'm pretty sure it would, but that's a bit like complaining about new > toy that it doesn't have batteries included. And BTW "no batteries" > was > a conscientious decision. I wanted something basic that would ease > sniffing of the network configuration and IMHO it does that. > > I don't have access to environment where heavy netlink processing would > be required. And apart from that I can imagine that every application > has its own particular needs so it will be difficult to please > everybody. 
So instead I thought it would be better to come up with a > starting point upon which this thing can grow - with input from the > community/users. > Please pay close attention to Stephen's feedback. It may be harsh, but I think you should go back to scratch and take a top-down approach to the library design. The design process should start with an example application, emulating a router running BGP or similar. Such an application would reveal what an IF proxy library should provide, and how it should be provided. And then Thomas' sidetrack could come into play... A generic library for the best common practice for interaction between the data plane and the control plane... The DPDK documentation doesn't say anything about how to handle this interaction. From Stephen's comment below, I guess that the interrupt thread is not the best way to go. > > On a real world router with 1M routes, the processing of netlink can > be > > a significant chore. It has to be done efficiently. Probably needs to > use > > rte_ctrl_thread() and not overload the interrupt thread. > > That is another voice against using interrupt thread. I acknowledge > your (plural) concerns and will think about changing it. Thanks. I'm > not sure if I manage to make it before .08 though. What is the use of getting it into .08 if it needs serious rewriting and a new API anyway? Following all this negative feedback, I would like to provide some positive feedback too: Take note that we have high hopes for the IP proxy library, and we do recognize that it will be very useful for many applications (if done right). I think it is a great idea, and I appreciate your hard work on it! > > > Also, please don't reinvent netlink parsing. Use a standard library > > like libmnl. > > I actually checked this and decided that this will not buy me much and > will add an additional dependency. 
I also checked how other DPDK parts > (there are couple) and iproute2 utils are handling netlink and decided > to follow their style of manual handling. Could you give a hint here > how hard you want to push for libmnl? As I've mentioned, that was on > my > list of options, so if community does not mind adding libmnl dependency > and would prefer using it instead of manual parsing then I might adapt > to that. > > > Please use doxygen format for API documentation > > > >> +/* This function should be called by the implementation whenever it > notices > >> + * change in the network configuration. The arguments are: > >> + * - ev : pointer to filled event data structure (all fields are > expected to be > >> + * filled, with the exception of 'port_id' for all proxy/port > related > >> + * events: this function clones the event notification for each > bound port > >> + * and fills 'port_id' appropriately). > >> + * - px : proxy node when given event is proxy/port related, > otherwise pass NULL > >> + */ > >> +void ifpx_notify_event(struct rte_ifpx_event *ev, struct > ifpx_proxy_node *px); > >> + > > Will do - thank you. > > Thank you for taking time to look at this. > > With regards > Andrzej Ostruszka ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library
  2020-07-08 16:07 ` Morten Brørup
@ 2020-07-09 8:43 ` Andrzej Ostruszka [C]
  2020-07-22 0:40 ` Thomas Monjalon
  0 siblings, 1 reply; 64+ messages in thread
From: Andrzej Ostruszka [C] @ 2020-07-09 8:43 UTC (permalink / raw)
  To: Morten Brørup, Stephen Hemminger; +Cc: dev, Thomas Monjalon

First of all let me thank you all once again for the time you took
looking at this and summarize your feedback as I understand it.

1. Don't use interrupt thread because in heavy load scenarios this might
cause problems.
2. Provide higher level functionality - so that applications can just use
it instead of maintaining the info on their own.
3. Provide a way to have control over the changes - if I remember
correctly someone wanted apps to be able to reject the changes (meaning
rolling back the changes to the kernel)
4. Don't parse netlink on your own (use e.g. libmnl)
5. Don't use callbacks

These are the specific things that I managed to understand. Have I
missed anything? If so please add this to the list!

To that I say:
Ad1. Agree, will change that.
Ad2. OK, but this can be added later.
Ad3. Currently the lib was meant to be one way but as in the previous
point this can change in the future.
Ad4. As mentioned in the previous mail I judged it not worth adding a
dependency but I'm OK with using it. This might be more relevant when
we address point 3 and can be introduced then, but I can do it now.
Ad5. There are now notification queues and I'm fine with adapting to any
standard of notification/communication that DPDK will agree on.

In addition may I ask your opinion on the changes that are required
before the library can be accepted?

Thank you in advance
With regards
Andrzej Ostruszka

^ permalink raw reply [flat|nested] 64+ messages in thread
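On point 5 (callbacks versus notification queues): the documentation patch of this series (2/4) states that an event is queued when no callback is registered for its type, or when the callback returns 0, and that a non-zero return ends the handling. The sketch below models only that dispatch rule. Every type and function name in it is hypothetical, and plain arrays stand in for the `rte_ring` queues the library uses.

```c
#include <stddef.h>

enum { MAX_QUEUES = 4, QUEUE_DEPTH = 8 };

/* Simplified event; the real library has one struct per event type. */
struct event {
	int type;
	unsigned int port_id;
	unsigned int value;
};

typedef int (*event_cb)(const struct event *ev);

/* Plain array standing in for an rte_ring. */
struct queue {
	struct event slots[QUEUE_DEPTH];
	size_t count;
};

struct notifier {
	event_cb cb; /* NULL when no callback is registered */
	struct queue *queues[MAX_QUEUES];
	size_t num_queues;
};

/* Deliver one event. Returns the number of queues the event was added
 * to, which is 0 when the callback consumed it. */
static size_t
notify(struct notifier *n, const struct event *ev)
{
	size_t i, queued = 0;

	/* A non-zero return from the callback ends the handling. */
	if (n->cb != NULL && n->cb(ev) != 0)
		return 0;

	/* No callback, or it returned 0: copy to every queue with room. */
	for (i = 0; i < n->num_queues; i++) {
		struct queue *q = n->queues[i];

		if (q->count < QUEUE_DEPTH) {
			q->slots[q->count++] = *ev;
			queued++;
		}
	}
	return queued;
}

/* Callback that fully handles the event itself. */
static int
consume_all(const struct event *ev)
{
	(void)ev;
	return 1;
}

/* Callback that only performs common preparation, then lets the
 * event continue to the queues. */
static int
prepare_only(const struct event *ev)
{
	(void)ev;
	return 0;
}
```

The three usage patterns listed in the documentation map onto this directly: register `consume_all`-style callbacks for single-writer data, `prepare_only`-style callbacks for shared preparation followed by per-queue handling, or no callback at all with one queue per lcore.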
* Re: [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library
  2020-07-09 8:43 ` Andrzej Ostruszka [C]
@ 2020-07-22 0:40 ` Thomas Monjalon
  2020-07-22 8:45 ` Jerin Jacob
  0 siblings, 1 reply; 64+ messages in thread
From: Thomas Monjalon @ 2020-07-22 0:40 UTC (permalink / raw)
  To: Andrzej Ostruszka [C]
  Cc: Morten Brørup, Stephen Hemminger, dev, techboard

09/07/2020 10:43, Andrzej Ostruszka [C]:
> First of all let me thank you all once again for the time you took
> looking at this and summarize your feedback as I understand it.
>
> 1. Don't use interrupt thread because in heavy load scenarios this might
> cause problems.

Yes.
For this usage, and for other configuration controls like log/trace,
we should have a new communication channel.
The telemetry and IPC are other examples of communication channels.
Ideally we should keep only one channel.

> 2. Provide higher level functionality - so that application can just use
> it instead of maintaining the info on their own.

As said below, higher layers can come later.

> 3. Provide way to have control over the changes - if I remember
> correctly someone wanted apps to be able to reject the changes (meaning
> rolling back the changes to the kernel)

Yes, the DPDK app must remain in control of any change.

> 4. Don't parse netlink on your own (use e.g. libmnl)
> 5. Don't use callbacks

Not sure about which communication API to use.
It must be discussed.

> These are the specific things that I managed to understand. Have I
> missed anything? If so please add this to the list!
>
> To that I say:
> Ad1. Agree, will change that.
> Ad2. OK, but this can be added later.
> Ad3. Currently the lib was meant to be one way but as in previous point
> this can change in the future.
> Ad4. As mentioned in previous mail I judged it not worthy of adding
> dependency but I'm OK with using it. This might be more relevant when
> we address point 3 and can be introduced then, but I can do it now.

No problem with adding dependencies, especially with meson.

> Ad5. There are now notification queues and I'm fine with adapting to any
> standard of notification/communication that DPDK will agree on.
>
> In addition may I ask your opinion on the changes that are required
> before the library can be accepted?

Very few contributors take time to look at it.
Clearly we want this feature. We really want it,
but we are not able to dedicate enough time for its review (blaming myself).
That's why I Cc the techboard to try a new process:

For such a feature requiring a community design,
and not having enough feedback to progress in a timely manner,
I propose drafting the design in a Technical Board meeting
(a regular or specific one).

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library 2020-07-22 0:40 ` Thomas Monjalon @ 2020-07-22 8:45 ` Jerin Jacob 2020-07-22 8:56 ` Thomas Monjalon 0 siblings, 1 reply; 64+ messages in thread From: Jerin Jacob @ 2020-07-22 8:45 UTC (permalink / raw) To: Thomas Monjalon Cc: Andrzej Ostruszka [C], Morten Brørup, Stephen Hemminger, dpdk-dev, techboard On Wed, Jul 22, 2020 at 6:10 AM Thomas Monjalon <thomas@monjalon.net> wrote: > > In addition may I ask your opinion on the changes that are required > > before the library can be accepted? > > Very few contributors take time to look at it. > Clearly we want this feature. We really want it, > but we are not able to dedicate enough time for its review (blaming myself). > That's why I Cc the techboard to try a new process: > > For such feature requiring a community design, > and not having enough feedback to progress in a timely manner, > I propose drafting the design in a Technical Board meeting > (a regular or specific one). Since the patch series already have the documentation for library[1], example application [2] in addition to the implementation. For everyone's benefit, it would good to know what is the expectation of draft design so that one can create such a document as part of the new process and it can apply for new another library. [1] http://patches.dpdk.org/patch/71953/ [2] http://patches.dpdk.org/patch/71956/ > > ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library 2020-07-22 8:45 ` Jerin Jacob @ 2020-07-22 8:56 ` Thomas Monjalon 2020-07-22 9:09 ` Jerin Jacob 0 siblings, 1 reply; 64+ messages in thread From: Thomas Monjalon @ 2020-07-22 8:56 UTC (permalink / raw) To: Jerin Jacob Cc: Andrzej Ostruszka [C], Morten Brørup, Stephen Hemminger, dpdk-dev, techboard 22/07/2020 10:45, Jerin Jacob: > On Wed, Jul 22, 2020 at 6:10 AM Thomas Monjalon <thomas@monjalon.net> wrote: > > > > In addition may I ask your opinion on the changes that are required > > > before the library can be accepted? > > > > Very few contributors take time to look at it. > > Clearly we want this feature. We really want it, > > but we are not able to dedicate enough time for its review (blaming myself). > > That's why I Cc the techboard to try a new process: > > > > For such feature requiring a community design, > > and not having enough feedback to progress in a timely manner, > > I propose drafting the design in a Technical Board meeting > > (a regular or specific one). > > Since the patch series already have the documentation for library[1], > example application [2] > in addition to the implementation. > For everyone's benefit, it would good to know what is the expectation > of draft design so that one can create > such a document as part of the new process and it can apply for new > another library. > > [1] http://patches.dpdk.org/patch/71953/ > [2] http://patches.dpdk.org/patch/71956/ I don't understand Jerin. What is your question? Are you proposing to include a design document as part of the patches? In my opinion, the cover letter is a good place to explain a design. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library 2020-07-22 8:56 ` Thomas Monjalon @ 2020-07-22 9:09 ` Jerin Jacob 2020-07-22 9:27 ` Thomas Monjalon 0 siblings, 1 reply; 64+ messages in thread From: Jerin Jacob @ 2020-07-22 9:09 UTC (permalink / raw) To: Thomas Monjalon Cc: Andrzej Ostruszka [C], Morten Brørup, Stephen Hemminger, dpdk-dev, techboard On Wed, Jul 22, 2020 at 2:26 PM Thomas Monjalon <thomas@monjalon.net> wrote: > > 22/07/2020 10:45, Jerin Jacob: > > On Wed, Jul 22, 2020 at 6:10 AM Thomas Monjalon <thomas@monjalon.net> wrote: > > > > > > In addition may I ask your opinion on the changes that are required > > > > before the library can be accepted? > > > > > > Very few contributors take time to look at it. > > > Clearly we want this feature. We really want it, > > > but we are not able to dedicate enough time for its review (blaming myself). > > > That's why I Cc the techboard to try a new process: > > > > > > For such feature requiring a community design, > > > and not having enough feedback to progress in a timely manner, > > > I propose drafting the design in a Technical Board meeting > > > (a regular or specific one). > > > > Since the patch series already have the documentation for library[1], > > example application [2] > > in addition to the implementation. > > For everyone's benefit, it would good to know what is the expectation > > of draft design so that one can create > > such a document as part of the new process and it can apply for new > > another library. > > > > [1] http://patches.dpdk.org/patch/71953/ > > [2] http://patches.dpdk.org/patch/71956/ > > I don't understand Jerin. > What is your question? > Are you proposing to include a design document as part of the patches? No > In my opinion, the cover letter is a good place to explain a design. Agree. 
Currently, In the patch series already have the design[1] in the cover letter and documentation[2] [1] http://mails.dpdk.org/archives/dev/2020-June/171070.html [2] http://patches.dpdk.org/patch/71953/ Thomas, I did not understand the expectation of "I propose drafting the design in a Technical Board meeting". You mean, Techboard will define/draft the design? or Are you expecting anything from the patch submitter otherthan participating in the discussion. The description is not clear, Hence I am asking. > > ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library 2020-07-22 9:09 ` Jerin Jacob @ 2020-07-22 9:27 ` Thomas Monjalon 2020-07-22 9:54 ` Jerin Jacob 0 siblings, 1 reply; 64+ messages in thread From: Thomas Monjalon @ 2020-07-22 9:27 UTC (permalink / raw) To: Jerin Jacob Cc: Andrzej Ostruszka [C], Morten Brørup, Stephen Hemminger, dpdk-dev, techboard 22/07/2020 11:09, Jerin Jacob: > On Wed, Jul 22, 2020 at 2:26 PM Thomas Monjalon <thomas@monjalon.net> wrote: > > 22/07/2020 10:45, Jerin Jacob: > > > On Wed, Jul 22, 2020 at 6:10 AM Thomas Monjalon <thomas@monjalon.net> wrote: > > > > > > > > In addition may I ask your opinion on the changes that are required > > > > > before the library can be accepted? > > > > > > > > Very few contributors take time to look at it. > > > > Clearly we want this feature. We really want it, > > > > but we are not able to dedicate enough time for its review (blaming myself). > > > > That's why I Cc the techboard to try a new process: > > > > > > > > For such feature requiring a community design, > > > > and not having enough feedback to progress in a timely manner, > > > > I propose drafting the design in a Technical Board meeting > > > > (a regular or specific one). > > > > > > Since the patch series already have the documentation for library[1], > > > example application [2] > > > in addition to the implementation. > > > For everyone's benefit, it would good to know what is the expectation > > > of draft design so that one can create > > > such a document as part of the new process and it can apply for new > > > another library. > > > > > > [1] http://patches.dpdk.org/patch/71953/ > > > [2] http://patches.dpdk.org/patch/71956/ > > > > I don't understand Jerin. > > What is your question? > > Are you proposing to include a design document as part of the patches? > > No > > > In my opinion, the cover letter is a good place to explain a design. > > Agree. 
> > Currently, In the patch series already have the design[1] in the cover > letter and documentation[2] > [1] > http://mails.dpdk.org/archives/dev/2020-June/171070.html > [2] > http://patches.dpdk.org/patch/71953/ > > Thomas, I did not understand the expectation of "I propose drafting > the design in a Technical Board meeting". > You mean, Techboard will define/draft the design? or Are you expecting > anything from the patch submitter otherthan > participating in the discussion. > The description is not clear, Hence I am asking. OK now I understand what is not clear :) Because this design discussion is not progressing enough on the mailing list, I propose discussing in a techboard meeting with Andrzej, and come to a conclusion about what are the expectations. The goal is to approve the technical direction, so Andrzej can rework while being sure his work will be accepted. Does it make sense? ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library 2020-07-22 9:27 ` Thomas Monjalon @ 2020-07-22 9:54 ` Jerin Jacob 2020-07-23 14:09 ` [dpdk-dev] [EXT] " Andrzej Ostruszka [C] 0 siblings, 1 reply; 64+ messages in thread From: Jerin Jacob @ 2020-07-22 9:54 UTC (permalink / raw) To: Thomas Monjalon Cc: Andrzej Ostruszka [C], Morten Brørup, Stephen Hemminger, dpdk-dev, techboard On Wed, Jul 22, 2020 at 2:57 PM Thomas Monjalon <thomas@monjalon.net> wrote: > > 22/07/2020 11:09, Jerin Jacob: > > On Wed, Jul 22, 2020 at 2:26 PM Thomas Monjalon <thomas@monjalon.net> wrote: > > > 22/07/2020 10:45, Jerin Jacob: > > > > On Wed, Jul 22, 2020 at 6:10 AM Thomas Monjalon <thomas@monjalon.net> wrote: > > > > > > > > > > In addition may I ask your opinion on the changes that are required > > > > > > before the library can be accepted? > > > > > > > > > > Very few contributors take time to look at it. > > > > > Clearly we want this feature. We really want it, > > > > > but we are not able to dedicate enough time for its review (blaming myself). > > > > > That's why I Cc the techboard to try a new process: > > > > > > > > > > For such feature requiring a community design, > > > > > and not having enough feedback to progress in a timely manner, > > > > > I propose drafting the design in a Technical Board meeting > > > > > (a regular or specific one). > > > > > > > > Since the patch series already have the documentation for library[1], > > > > example application [2] > > > > in addition to the implementation. > > > > For everyone's benefit, it would good to know what is the expectation > > > > of draft design so that one can create > > > > such a document as part of the new process and it can apply for new > > > > another library. > > > > > > > > [1] http://patches.dpdk.org/patch/71953/ > > > > [2] http://patches.dpdk.org/patch/71956/ > > > > > > I don't understand Jerin. > > > What is your question? 
> > > Are you proposing to include a design document as part of the patches? > > > > No > > > > > In my opinion, the cover letter is a good place to explain a design. > > > > Agree. > > > > Currently, In the patch series already have the design[1] in the cover > > letter and documentation[2] > > [1] > > http://mails.dpdk.org/archives/dev/2020-June/171070.html > > [2] > > http://patches.dpdk.org/patch/71953/ > > > > Thomas, I did not understand the expectation of "I propose drafting > > the design in a Technical Board meeting". > > You mean, Techboard will define/draft the design? or Are you expecting > > anything from the patch submitter otherthan > > participating in the discussion. > > The description is not clear, Hence I am asking. > > OK now I understand what is not clear :) > > Because this design discussion is not progressing enough on the mailing list, > I propose discussing in a techboard meeting with Andrzej, > and come to a conclusion about what are the expectations. > The goal is to approve the technical direction, > so Andrzej can rework while being sure his work will be accepted. > > Does it make sense? Make sense to me. I will leave @Andrzej Ostruszka to share his preference. > > ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [EXT] Re: [PATCH v2 1/4] lib: introduce IF Proxy library
  2020-07-22 9:54 ` Jerin Jacob
@ 2020-07-23 14:09 ` Andrzej Ostruszka [C]
  0 siblings, 0 replies; 64+ messages in thread
From: Andrzej Ostruszka [C] @ 2020-07-23 14:09 UTC (permalink / raw)
  To: Jerin Jacob, Thomas Monjalon
  Cc: Morten Brørup, Stephen Hemminger, dpdk-dev, techboard

On 7/22/20 11:54 AM, Jerin Jacob wrote:
> On Wed, Jul 22, 2020 at 2:57 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>
>> 22/07/2020 11:09, Jerin Jacob:
>>> On Wed, Jul 22, 2020 at 2:26 PM Thomas Monjalon <thomas@monjalon.net> wrote:
[...]
>> Because this design discussion is not progressing enough on the mailing list,
>> I propose discussing in a techboard meeting with Andrzej,
>> and come to a conclusion about what are the expectations.
>> The goal is to approve the technical direction,
>> so Andrzej can rework while being sure his work will be accepted.
>>
>> Does it make sense?
>
> Make sense to me.
>
> I will leave @Andrzej Ostruszka to share his preference.

Yes this is fine with me, I'd be happy to participate in the next meeting.

With regards
Andrzej

^ permalink raw reply [flat|nested] 64+ messages in thread
* [dpdk-dev] [PATCH v2 2/4] if_proxy: add library documentation 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 " Andrzej Ostruszka 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library Andrzej Ostruszka @ 2020-03-10 11:10 ` Andrzej Ostruszka 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 3/4] if_proxy: add simple functionality test Andrzej Ostruszka ` (3 subsequent siblings) 5 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-03-10 11:10 UTC (permalink / raw) To: dev, Thomas Monjalon, John McNamara, Marko Kovacevic This commit adds documentation of IF Proxy library. Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 1 + doc/guides/prog_guide/if_proxy_lib.rst | 142 +++++++++++++++++++++++++ doc/guides/prog_guide/index.rst | 1 + 3 files changed, 144 insertions(+) create mode 100644 doc/guides/prog_guide/if_proxy_lib.rst diff --git a/MAINTAINERS b/MAINTAINERS index aec7326ca..3854d7661 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1472,6 +1472,7 @@ F: doc/guides/prog_guide/bpf_lib.rst IF Proxy - EXPERIMENTAL M: Andrzej Ostruszka <aostruszka@marvell.com> F: lib/librte_if_proxy/ +F: doc/guides/prog_guide/if_proxy_lib.rst Test Applications ----------------- diff --git a/doc/guides/prog_guide/if_proxy_lib.rst b/doc/guides/prog_guide/if_proxy_lib.rst new file mode 100644 index 000000000..f0b9ed70d --- /dev/null +++ b/doc/guides/prog_guide/if_proxy_lib.rst @@ -0,0 +1,142 @@ +.. SPDX-License-Identifier: BSD-3-Clause + Copyright(C) 2020 Marvell International Ltd. + +.. _IF_Proxy_Library: + +IF Proxy Library +================ + +When a network interface is assigned to DPDK it usually disappears from +the system and the user loses the ability to configure it via typical +configuration tools. +There are basically two options to deal with this situation: + +- configure it via command line arguments and/or load configuration + from some file, +- add support for live configuration via some IPC mechanism.
+ +The first option is static and the second one requires some work to add +communication loop (e.g. separate thread listening/communicating on +a socket). + +This library adds a possibility to configure DPDK ports by using normal +configuration utilities (e.g. from iproute2 suite). +It requires user to configure additional DPDK ports that are visible to +the system (such as Tap or KNI - actually any port that has valid +`if_index` in ``struct rte_eth_dev_info`` will do) and designate them as +a port representor (a proxy) in the system. + +Let's see typical intended usage by an example. +Suppose that you have application that handles traffic on two ports (in +the white list below):: + + ./app -w 00:14.0 -w 00:16.0 --vdev=net_tap0 --vdev=net_tap1 + +So in addition to the "regular" ports you need to configure proxy ports. +These proxy ports can be created via a command line (like above) or from +within the application (e.g. by using `rte_ifpx_proxy_create()` +function). + +When you have proxy ports you need to bind them to the "regular" ports:: + + rte_ifpx_port_bind(port0, proxy0); + rte_ifpx_port_bind(port1, proxy1); + +This binding is a logical one - there is no automatic packet forwarding +configured. +This is because library cannot tell upfront what portion of the traffic +received on ports 0/1 should be redirected to the system via proxies and +also it does not know how the application is structured (what packet +processing engines it uses). +Therefore it is application writer responsibility to include proxy ports +into its packet processing and forward appropriate packets between +proxies and ports. +What the library actually does is that it gets network configuration +from the system and listens to its changes. +This information is then matched against `if_index` of the configured +proxies and passed to the application. + +There are two mechanisms via which library passes notifications to the +application. 
+First is the set of global callbacks that user has +to register via:: + + rte_ifpx_callbacks_register(&cbs); + +Here `cbs` is a ``struct rte_ifpx_callbacks`` which has following +members:: + + int (*mac_change)(const struct rte_ifpx_mac_change *event); + int (*mtu_change)(const struct rte_ifpx_mtu_change *event); + int (*link_change)(const struct rte_ifpx_link_change *event); + int (*addr_add)(const struct rte_ifpx_addr_change *event); + int (*addr_del)(const struct rte_ifpx_addr_change *event); + int (*addr6_add)(const struct rte_ifpx_addr6_change *event); + int (*addr6_del)(const struct rte_ifpx_addr6_change *event); + int (*route_add)(const struct rte_ifpx_route_change *event); + int (*route_del)(const struct rte_ifpx_route_change *event); + int (*route6_add)(const struct rte_ifpx_route6_change *event); + int (*route6_del)(const struct rte_ifpx_route6_change *event); + int (*neigh_add)(const struct rte_ifpx_neigh_change *event); + int (*neigh_del)(const struct rte_ifpx_neigh_change *event); + int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event); + int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event); + int (*cfg_done)(void); + +All of them should be self explanatory apart from the last one which is +library specific callback - called when initial network configuration +query is finished. + +So for example when the user issues command:: + + ip link set dev dtap0 mtu 1600 + +then library will call `mtu_change()` callback with MTU change event +having port_id equal to `port0` (id of the port bound to this proxy) and +`mtu` equal to 1600 (``dtap0`` is the default interface name for +``net_tap0``). +Application can simply use `rte_eth_dev_set_mtu()` in this callback. +The same way `rte_eth_dev_default_mac_addr_set()` can be used in +`mac_change()` and `rte_eth_dev_set_link_up/down()` inside the +`link_change()` callback that does dispatch based on `is_up` member of +its `event` argument. 
+ +Please note however that the context in which these callbacks are called +is most probably different from the one in which packets are handled and +it is the application writer's responsibility to use proper synchronization +mechanisms - if they are needed. + +Second notification mechanism relies on queueing of event notifications +to the configured notification rings. +Application can add queue via:: + + int rte_ifpx_queue_add(struct rte_ring *r); + +This type of notification is used when there is no callback registered +for given type of event or when it is registered but it returns 0. +This way application has following choices: + +- if the data structure that needs to be updated due to notification + is safe to be modified by a single writer (while being used by other + readers) then it can simply do that inside the callback and return + non-zero value to signal end of the event handling + +- otherwise, when there are some common preparation steps that need + to be done only once, application can register callback that will + perform these steps and return 0 - library will then add an event to + each registered notification queue + +- if the data structures are replicated and there are no common steps + then application can simply skip registering of the callbacks and + configure notification queues (e.g. 1 per each lcore) + +Once we have bindings in place and notification configured, the only +essential part that remains is to get the current network configuration +and start listening to its changes. +This is accomplished via a call to:: + + int rte_ifpx_listen(void); + +From that moment you should see notifications coming to your +application: first ones resulting from querying of current system +configurations and subsequent on the configuration changes. 
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst index fb250abf5..2fd5a8c72 100644 --- a/doc/guides/prog_guide/index.rst +++ b/doc/guides/prog_guide/index.rst @@ -57,6 +57,7 @@ Programmer's Guide metrics_lib bpf_lib ipsec_lib + if_proxy_lib source_org dev_kit_build_system dev_kit_root_make_help -- 2.17.1 ^ permalink raw reply [flat|nested] 64+ messages in thread
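A note on the notification rule described in the documentation above (an event is queued when no callback is registered for it, or when the callback returns 0). The following is a minimal, self-contained sketch of that rule in plain C. It is not the library's implementation: every name in it (sketch_event, sketch_queue, sketch_dispatch and so on) is hypothetical, and a fixed-size array stands in for an rte_ring.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; none of these names come
 * from the IF Proxy library.
 */
struct sketch_event {
	int type;
	int value;
};

/* A callback returns non-zero to signal "handled, do not queue". */
typedef int (*sketch_cb)(const struct sketch_event *ev);

#define SKETCH_QUEUE_CAP 8

/* Fixed-size array used here in place of a real notification ring. */
struct sketch_queue {
	struct sketch_event items[SKETCH_QUEUE_CAP];
	size_t count;
};

static int sketch_enqueue(struct sketch_queue *q, const struct sketch_event *ev)
{
	assert(q != NULL && ev != NULL);
	if (q->count == SKETCH_QUEUE_CAP)
		return -1; /* queue full: the event is dropped in this sketch */
	q->items[q->count++] = *ev;
	return 0;
}

/* Returns the number of queues that received a copy of the event. */
static size_t sketch_dispatch(const struct sketch_event *ev, sketch_cb cb,
			      struct sketch_queue *queues, size_t nb_queues)
{
	size_t i, delivered = 0;

	/* A registered callback that returns non-zero ends the handling. */
	if (cb != NULL && cb(ev) != 0)
		return 0;

	/* No callback, or the callback returned 0: fan out to every queue. */
	for (i = 0; i < nb_queues; ++i)
		if (sketch_enqueue(&queues[i], ev) == 0)
			++delivered;
	return delivered;
}

/* Example callbacks: one consumes the event, one only prepares for it. */
static int sketch_consume(const struct sketch_event *ev)
{
	(void)ev;
	return 1;
}

static int sketch_prepare(const struct sketch_event *ev)
{
	(void)ev;
	return 0;
}
```

With a consuming callback (non-zero return) no queue sees the event. With a preparing callback (returns 0), or with no callback at all, each registered queue gets its own copy, which matches the three choices listed in the documentation.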
* [dpdk-dev] [PATCH v2 3/4] if_proxy: add simple functionality test 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 " Andrzej Ostruszka 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 1/4] lib: introduce IF Proxy library Andrzej Ostruszka 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 2/4] if_proxy: add library documentation Andrzej Ostruszka @ 2020-03-10 11:10 ` Andrzej Ostruszka 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 4/4] if_proxy: add example application Andrzej Ostruszka ` (2 subsequent siblings) 5 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-03-10 11:10 UTC (permalink / raw) To: dev, Thomas Monjalon This commit adds simple test of the library notifications. Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 1 + app/test/Makefile | 5 + app/test/meson.build | 4 + app/test/test_if_proxy.c | 707 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 717 insertions(+) create mode 100644 app/test/test_if_proxy.c diff --git a/MAINTAINERS b/MAINTAINERS index 3854d7661..a92cb7356 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1472,6 +1472,7 @@ F: doc/guides/prog_guide/bpf_lib.rst IF Proxy - EXPERIMENTAL M: Andrzej Ostruszka <aostruszka@marvell.com> F: lib/librte_if_proxy/ +F: app/test/test_if_proxy.c F: doc/guides/prog_guide/if_proxy_lib.rst Test Applications diff --git a/app/test/Makefile b/app/test/Makefile index 1f080d162..dc287f94b 100644 --- a/app/test/Makefile +++ b/app/test/Makefile @@ -231,6 +231,11 @@ SRCS-$(CONFIG_RTE_LIBRTE_BPF) += test_bpf.c SRCS-$(CONFIG_RTE_LIBRTE_RCU) += test_rcu_qsbr.c test_rcu_qsbr_perf.c +ifeq ($(CONFIG_RTE_LIBRTE_IF_PROXY),y) +SRCS-y += test_if_proxy.c +LDLIBS += -lrte_if_proxy +endif + SRCS-$(CONFIG_RTE_LIBRTE_IPSEC) += test_ipsec.c SRCS-$(CONFIG_RTE_LIBRTE_IPSEC) += test_ipsec_sad.c ifeq ($(CONFIG_RTE_LIBRTE_IPSEC),y) diff --git a/app/test/meson.build b/app/test/meson.build index 0a2ce710f..870c3a8bb 100644 --- a/app/test/meson.build +++ b/app/test/meson.build @@ -352,6 +352,10 @@ endif if 
dpdk_conf.has('RTE_LIBRTE_PDUMP') test_deps += 'pdump' endif +if dpdk_conf.has('RTE_LIBRTE_IF_PROXY') + test_deps += 'if_proxy' + test_sources += 'test_if_proxy.c' +endif cflags = machine_args if cc.has_argument('-Wno-format-truncation') diff --git a/app/test/test_if_proxy.c b/app/test/test_if_proxy.c new file mode 100644 index 000000000..72ff782b6 --- /dev/null +++ b/app/test/test_if_proxy.c @@ -0,0 +1,707 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ + +#include "test.h" + +#include <rte_ethdev.h> +#include <rte_if_proxy.h> +#include <rte_cycles.h> + +#include <string.h> +#include <unistd.h> +#include <signal.h> +#include <net/if.h> +#include <arpa/inet.h> +#include <pthread.h> +#include <time.h> + +/* There are two types of event notifications - one using callbacks and one + * using event queues (rings). We'll test them both and this "bool" will govern + * the type of API to use. + */ +static int use_callbacks = 1; +static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; +static pthread_cond_t cond = PTHREAD_COND_INITIALIZER; + +static struct rte_ring *ev_queue; + +enum net_event_mask { + INITIALIZED = 1U << RTE_IFPX_CFG_DONE, + LINK_CHANGED = 1U << RTE_IFPX_LINK_CHANGE, + MAC_CHANGED = 1U << RTE_IFPX_MAC_CHANGE, + MTU_CHANGED = 1U << RTE_IFPX_MTU_CHANGE, + ADDR_ADD = 1U << RTE_IFPX_ADDR_ADD, + ADDR_DEL = 1U << RTE_IFPX_ADDR_DEL, + ROUTE_ADD = 1U << RTE_IFPX_ROUTE_ADD, + ROUTE_DEL = 1U << RTE_IFPX_ROUTE_DEL, + ADDR6_ADD = 1U << RTE_IFPX_ADDR6_ADD, + ADDR6_DEL = 1U << RTE_IFPX_ADDR6_DEL, + ROUTE6_ADD = 1U << RTE_IFPX_ROUTE6_ADD, + ROUTE6_DEL = 1U << RTE_IFPX_ROUTE6_DEL, + NEIGH_ADD = 1U << RTE_IFPX_NEIGH_ADD, + NEIGH_DEL = 1U << RTE_IFPX_NEIGH_DEL, + NEIGH6_ADD = 1U << RTE_IFPX_NEIGH6_ADD, + NEIGH6_DEL = 1U << RTE_IFPX_NEIGH6_DEL, +}; + +static unsigned int state; + +static struct { + struct rte_ether_addr mac_addr; + uint16_t port_id, mtu; + struct in_addr ipv4, route4; + struct in6_addr ipv6, route6; + 
uint16_t depth4, depth6; + int is_up; +} net_cfg; + +static +int unlock_notify(unsigned int op) +{ + /* the mutex is expected to be locked on entry */ + RTE_VERIFY(pthread_mutex_trylock(&mutex) == EBUSY); + state |= op; + + pthread_mutex_unlock(&mutex); + return pthread_cond_signal(&cond); +} + +static +void handle_event(struct rte_ifpx_event *ev); + +static +int wait_for(unsigned int op_mask, unsigned int sec) +{ + int ec; + + if (use_callbacks) { + struct timespec time; + + ec = pthread_mutex_trylock(&mutex); + /* the mutex is expected to be locked on entry */ + RTE_VERIFY(ec == EBUSY); + + ec = 0; + clock_gettime(CLOCK_REALTIME, &time); + time.tv_sec += sec; + + while ((state & op_mask) != op_mask && ec == 0) + ec = pthread_cond_timedwait(&cond, &mutex, &time); + } else { + uint64_t deadline; + struct rte_ifpx_event *ev; + + ec = 0; + deadline = rte_get_timer_cycles() + sec * rte_get_timer_hz(); + + while ((state & op_mask) != op_mask) { + if (rte_get_timer_cycles() >= deadline) { + ec = ETIMEDOUT; + break; + } + if (rte_ring_dequeue(ev_queue, (void **)&ev) == 0) + handle_event(ev); + } + } + + return ec; +} + +static +int expect(unsigned int op_mask, const char *fmt, ...) +#if __GNUC__ + __attribute__((format(printf, 2, 3))); +#endif + +static +int expect(unsigned int op_mask, const char *fmt, ...) +{ + char cmd[128]; + va_list args; + int ret; + + state &= ~op_mask; + va_start(args, fmt); + vsnprintf(cmd, sizeof(cmd), fmt, args); + va_end(args); + ret = system(cmd); + if (ret == 0) + /* IPv6 address notifications seem to need that long delay. 
*/ + return wait_for(op_mask, 2); + return ret; +} + +static +int mac_change(const struct rte_ifpx_mac_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(MAC_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int mtu_change(const struct rte_ifpx_mtu_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->mtu == net_cfg.mtu) { + unlock_notify(MTU_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int link_change(const struct rte_ifpx_link_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->is_up == net_cfg.is_up) { + /* Special case for testing of callbacks modification from + * inside of callback: we catch putting link down (the last + * operation in test) and remove callbacks registered. 
+ */ + if (!ev->is_up) + rte_ifpx_callbacks_unregister(); + unlock_notify(LINK_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr_add(const struct rte_ifpx_addr_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->ip == net_cfg.ipv4.s_addr) { + unlock_notify(ADDR_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr_del(const struct rte_ifpx_addr_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->ip == net_cfg.ipv4.s_addr) { + unlock_notify(ADDR_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr6_add(const struct rte_ifpx_addr6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(ADDR6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr6_del(const struct rte_ifpx_addr6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(ADDR6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route_add(const struct rte_ifpx_route_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth4 == ev->depth && net_cfg.route4.s_addr == ev->ip) { + unlock_notify(ROUTE_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route_del(const struct rte_ifpx_route_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth4 == ev->depth && net_cfg.route4.s_addr == ev->ip) { + unlock_notify(ROUTE_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route6_add(const struct rte_ifpx_route6_change *ev) +{ + pthread_mutex_lock(&mutex); + 
RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth6 == ev->depth && + /* don't check for trailing zeros */ + memcmp(ev->ip, net_cfg.route6.s6_addr, ev->depth/8) == 0) { + unlock_notify(ROUTE6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route6_del(const struct rte_ifpx_route6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth6 == ev->depth && + /* don't check for trailing zeros */ + memcmp(ev->ip, net_cfg.route6.s6_addr, ev->depth/8) == 0) { + unlock_notify(ROUTE6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh_add(const struct rte_ifpx_neigh_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.ipv4.s_addr == ev->ip && + memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(NEIGH_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh_del(const struct rte_ifpx_neigh_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.ipv4.s_addr == ev->ip) { + unlock_notify(NEIGH_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh6_add(const struct rte_ifpx_neigh6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0 && + memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(NEIGH6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh6_del(const struct rte_ifpx_neigh6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(NEIGH6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int cfg_done(void) +{ + 
pthread_mutex_lock(&mutex); + unlock_notify(INITIALIZED); + return 1; +} + +static +void handle_event(struct rte_ifpx_event *ev) +{ + if (ev->type != RTE_IFPX_CFG_DONE) + RTE_VERIFY(ev->data.port_id == net_cfg.port_id); + + /* If params do not match what we expect just free the event. */ + switch (ev->type) { + case RTE_IFPX_MAC_CHANGE: + if (memcmp(ev->mac_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_MTU_CHANGE: + if (ev->mtu_change.mtu != net_cfg.mtu) + goto exit; + break; + case RTE_IFPX_LINK_CHANGE: + if (ev->link_change.is_up != net_cfg.is_up) + goto exit; + break; + case RTE_IFPX_ADDR_ADD: + if (ev->addr_change.ip != net_cfg.ipv4.s_addr) + goto exit; + break; + case RTE_IFPX_ADDR_DEL: + if (ev->addr_change.ip != net_cfg.ipv4.s_addr) + goto exit; + break; + case RTE_IFPX_ADDR6_ADD: + if (memcmp(ev->addr6_change.ip, net_cfg.ipv6.s6_addr, + 16) != 0) + goto exit; + break; + case RTE_IFPX_ADDR6_DEL: + if (memcmp(ev->addr6_change.ip, net_cfg.ipv6.s6_addr, + 16) != 0) + goto exit; + break; + case RTE_IFPX_ROUTE_ADD: + if (net_cfg.depth4 != ev->route_change.depth || + net_cfg.route4.s_addr != ev->route_change.ip) + goto exit; + break; + case RTE_IFPX_ROUTE_DEL: + if (net_cfg.depth4 != ev->route_change.depth || + net_cfg.route4.s_addr != ev->route_change.ip) + goto exit; + break; + case RTE_IFPX_ROUTE6_ADD: + if (net_cfg.depth6 != ev->route6_change.depth || + /* don't check for trailing zeros */ + memcmp(ev->route6_change.ip, net_cfg.route6.s6_addr, + ev->route6_change.depth/8) != 0) + goto exit; + break; + case RTE_IFPX_ROUTE6_DEL: + if (net_cfg.depth6 != ev->route6_change.depth || + /* don't check for trailing zeros */ + memcmp(ev->route6_change.ip, net_cfg.route6.s6_addr, + ev->route6_change.depth/8) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH_ADD: + if (net_cfg.ipv4.s_addr != ev->neigh_change.ip || + memcmp(ev->neigh_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, + 
RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH_DEL: + if (net_cfg.ipv4.s_addr != ev->neigh_change.ip) + goto exit; + break; + case RTE_IFPX_NEIGH6_ADD: + if (memcmp(ev->neigh6_change.ip, + net_cfg.ipv6.s6_addr, 16) != 0 || + memcmp(ev->neigh6_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH6_DEL: + if (memcmp(ev->neigh6_change.ip, net_cfg.ipv6.s6_addr, 16) != 0) + goto exit; + break; + case RTE_IFPX_CFG_DONE: + break; + default: + RTE_VERIFY(0 && "Unhandled event type"); + } + + state |= 1U << ev->type; +exit: + free(ev); +} + +static +struct rte_ifpx_callbacks cbs = { + .mac_change = mac_change, + .mtu_change = mtu_change, + .link_change = link_change, + .addr_add = addr_add, + .addr_del = addr_del, + .addr6_add = addr6_add, + .addr6_del = addr6_del, + .route_add = route_add, + .route_del = route_del, + .route6_add = route6_add, + .route6_del = route6_del, + .neigh_add = neigh_add, + .neigh_del = neigh_del, + .neigh6_add = neigh6_add, + .neigh6_del = neigh6_del, + /* lib specific callback */ + .cfg_done = cfg_done, +}; + +static +int test_notifications(const struct rte_ifpx_info *pinfo) +{ + char mac_buf[RTE_ETHER_ADDR_FMT_SIZE]; + int ec; + + /* Test link up notification. */ + net_cfg.is_up = 1; + ec = expect(LINK_CHANGED, "ip link set dev %s up", pinfo->if_name); + if (ec != 0) { + printf("Failed to notify about link going up\n"); + return ec; + } + + /* Test for MAC changes notification. */ + rte_eth_random_addr(net_cfg.mac_addr.addr_bytes); + rte_ether_format_addr(mac_buf, sizeof(mac_buf), &net_cfg.mac_addr); + ec = expect(MAC_CHANGED, "ip link set dev %s address %s", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notification about mac change\n"); + return ec; + } + + /* Test for MTU changes notification. 
*/ + net_cfg.mtu = pinfo->mtu + 100; + ec = expect(MTU_CHANGED, "ip link set dev %s mtu %d", + pinfo->if_name, net_cfg.mtu); + if (ec != 0) { + printf("Missing/wrong notification about mtu change\n"); + return ec; + } + + /* Test for adding of IPv4 address - using address from TEST-2 pool. + * This test is specific to linux netlink behaviour - after adding + * address we get both notification about address being added and new + * route. So I check both. + */ + net_cfg.ipv4.s_addr = RTE_IPV4(198, 51, 100, 14); + net_cfg.route4.s_addr = net_cfg.ipv4.s_addr; + net_cfg.depth4 = 32; + ec = expect(ADDR_ADD | ROUTE_ADD, "ip addr add 198.51.100.14 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 address add\n"); + return ec; + } + + /* Test for IPv4 address removal. See comment above for 'addr add'. */ + ec = expect(ADDR_DEL | ROUTE_DEL, "ip addr del 198.51.100.14/32 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 address del\n"); + return ec; + } + + /* Test for adding IPv4 route. */ + net_cfg.route4.s_addr = RTE_IPV4(198, 51, 100, 0); + net_cfg.depth4 = 24; + ec = expect(ROUTE_ADD, "ip route add 198.51.100.0/24 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 route add\n"); + return ec; + } + + /* Test for IPv4 route removal. */ + ec = expect(ROUTE_DEL, "ip route del 198.51.100.0/24 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 route del\n"); + return ec; + } + + /* Test for neighbour addresses notifications. 
*/ + rte_eth_random_addr(net_cfg.mac_addr.addr_bytes); + rte_ether_format_addr(mac_buf, sizeof(mac_buf), &net_cfg.mac_addr); + + ec = expect(NEIGH_ADD, + "ip neigh add 198.51.100.14 dev %s lladdr %s nud noarp", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 neighbour add\n"); + return ec; + } + + ec = expect(NEIGH_DEL, "ip neigh del 198.51.100.14 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 neighbour del\n"); + return ec; + } + + /* Now the same for IPv6 - with address from "documentation pool". */ + inet_pton(AF_INET6, "2001:db8::dead:beef", net_cfg.ipv6.s6_addr); + /* This is specific to linux netlink behaviour - after adding address + * we get both notification about address being added and new route. + * So I wait for both. + */ + memcpy(net_cfg.route6.s6_addr, net_cfg.ipv6.s6_addr, 16); + net_cfg.depth6 = 128; + ec = expect(ADDR6_ADD | ROUTE6_ADD, + "ip addr add 2001:db8::dead:beef dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 address add\n"); + return ec; + } + + /* See comment above for 'addr6 add'. 
*/ + ec = expect(ADDR6_DEL | ROUTE6_DEL, + "ip addr del 2001:db8::dead:beef/128 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 address del\n"); + return ec; + } + + net_cfg.depth6 = 96; + ec = expect(ROUTE6_ADD, "ip route add 2001:db8::dead:0/96 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 route add\n"); + return ec; + } + + ec = expect(ROUTE6_DEL, "ip route del 2001:db8::dead:0/96 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 route del\n"); + return ec; + } + + ec = expect(NEIGH6_ADD, + "ip neigh add 2001:db8::dead:beef dev %s lladdr %s nud noarp", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 neighbour add\n"); + return ec; + } + + ec = expect(NEIGH6_DEL, "ip neigh del 2001:db8::dead:beef dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 neighbour del\n"); + return ec; + } + + /* Finally put link down and test for notification. */ + net_cfg.is_up = 0; + ec = expect(LINK_CHANGED, "ip link set dev %s down", pinfo->if_name); + if (ec != 0) { + printf("Failed to notify about link going down\n"); + return ec; + } + + return 0; +} + +static +int test_if_proxy(void) +{ + int ec; + const struct rte_ifpx_info *pinfo; + uint16_t proxy_id; + + state = 0; + memset(&net_cfg, 0, sizeof(net_cfg)); + + if (rte_eth_dev_count_avail() == 0) { + printf("Run this test with at least one port configured\n"); + return 1; + } + /* Use the first port available. */ + RTE_ETH_FOREACH_DEV(net_cfg.port_id) + break; + proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT); + RTE_VERIFY(proxy_id != RTE_MAX_ETHPORTS); + rte_ifpx_port_bind(net_cfg.port_id, proxy_id); + rte_ifpx_callbacks_register(&cbs); + rte_ifpx_listen(); + + /* Let's start with callback based API. 
*/ + use_callbacks = 1; + pthread_mutex_lock(&mutex); + ec = wait_for(INITIALIZED, 2); + if (ec != 0) { + printf("Failed to obtain network configuration\n"); + goto exit; + } + pinfo = rte_ifpx_info_get(net_cfg.port_id); + RTE_VERIFY(pinfo); + + /* Make sure the link is down. */ + net_cfg.is_up = 0; + ec = expect(LINK_CHANGED, "ip link set dev %s down", pinfo->if_name); + RTE_VERIFY(ec == ETIMEDOUT || ec == 0); + + ec = test_notifications(pinfo); + if (ec != 0) { + printf("Failed test with callback based API\n"); + goto exit; + } + /* Switch to event queue based API and repeat tests. */ + use_callbacks = 0; + ev_queue = rte_ring_create("IFPX-events", 16, SOCKET_ID_ANY, + RING_F_SP_ENQ | RING_F_SC_DEQ); + ec = rte_ifpx_queue_add(ev_queue); + if (ec != 0) { + printf("Failed to add a notification queue\n"); + goto exit; + } + ec = test_notifications(pinfo); + if (ec != 0) { + printf("Failed test with event queue based API\n"); + goto exit; + } + +exit: + pthread_mutex_unlock(&mutex); + /* Proxy ports are not owned by the lib. Internal references to them + * are cleared on close, but the ports are not destroyed so we need to + * do that explicitly. + */ + rte_ifpx_proxy_destroy(proxy_id); + rte_ifpx_close(); + /* Queue is removed from the lib by rte_ifpx_close() - here we just + * free it. + */ + rte_ring_free(ev_queue); + ev_queue = NULL; + + return ec; +} + +REGISTER_TEST_COMMAND(if_proxy_autotest, test_if_proxy) -- 2.17.1 ^ permalink raw reply [flat|nested] 64+ messages in thread
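A side note on the route6 checks in the test above: they compare only the first depth/8 bytes of the prefix, which is enough for the /96 and /128 cases used there. For a prefix length that is not a multiple of 8, a bit-exact comparison could look like the sketch below. The helper name sk_prefix6_equal is made up for this illustration and is not part of the test or of the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: compares the first `depth` bits of two IPv6
 * addresses given as 16 bytes in network byte order.
 */
static bool sk_prefix6_equal(const uint8_t *a, const uint8_t *b,
			     unsigned int depth)
{
	unsigned int full_bytes = depth / 8;
	unsigned int rem_bits = depth % 8;

	assert(depth <= 128);

	/* Whole bytes covered by the prefix must match exactly. */
	if (memcmp(a, b, full_bytes) != 0)
		return false;

	if (rem_bits != 0) {
		/* The mask keeps the top rem_bits bits of the partial byte. */
		uint8_t mask = (uint8_t)(0xffu << (8 - rem_bits));

		if ((a[full_bytes] & mask) != (b[full_bytes] & mask))
			return false;
	}
	return true;
}
```

When depth is 128 the remainder is zero, so the partial-byte branch is skipped and nothing past the 16th byte is read.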
* [dpdk-dev] [PATCH v2 4/4] if_proxy: add example application 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 " Andrzej Ostruszka ` (2 preceding siblings ...) 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 3/4] if_proxy: add simple functionality test Andrzej Ostruszka @ 2020-03-10 11:10 ` Andrzej Ostruszka 2020-03-25 8:08 ` [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library David Marchand 2020-04-03 21:42 ` Thomas Monjalon 5 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-03-10 11:10 UTC (permalink / raw) To: dev, Thomas Monjalon Add an example application showing possible library usage. This is a simplified version of l3fwd where: - many performance improvements have been removed in order to simplify logic and put focus on the proxy library usage, - the configuration of forwarding has to be done by the user (using typical system tools on proxy ports) - these changes are passed to the application via library notifications. It is meant to show how you can update some data from callbacks (routing - see note below) and how those that are replicated (e.g. kept per lcore) can be updated via event queueing (here neighbouring info). Note: This example assumes that LPM tables can be updated by a single writer while being used by others. To the best of the author's knowledge this is the case (by preliminary code inspection) but DPDK does not make such a promise. Obviously, upon the change, there will be a transient period (when some IPs will be directed still to the old destination) but that is expected. Note also that in some cases you might need to tweak your system configuration to see effects. For example you send Gratuitous ARP to DPDK port and expect neighbour tables to be updated in application which does not happen. The packet will be sent to the kernel but it might drop it, please check /proc/sys/net/ipv4/conf/dtap0/arp_accept and related configuration options ('dtap0' here is just a name of your proxy port). 
Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> Depends-on: series-8862 --- MAINTAINERS | 1 + examples/Makefile | 1 + examples/l3fwd-ifpx/Makefile | 60 ++ examples/l3fwd-ifpx/l3fwd.c | 1131 +++++++++++++++++++++++++++++++ examples/l3fwd-ifpx/l3fwd.h | 98 +++ examples/l3fwd-ifpx/main.c | 740 ++++++++++++++++++++ examples/l3fwd-ifpx/meson.build | 11 + examples/meson.build | 2 +- 8 files changed, 2043 insertions(+), 1 deletion(-) create mode 100644 examples/l3fwd-ifpx/Makefile create mode 100644 examples/l3fwd-ifpx/l3fwd.c create mode 100644 examples/l3fwd-ifpx/l3fwd.h create mode 100644 examples/l3fwd-ifpx/main.c create mode 100644 examples/l3fwd-ifpx/meson.build diff --git a/MAINTAINERS b/MAINTAINERS index a92cb7356..79355f9eb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1472,6 +1472,7 @@ F: doc/guides/prog_guide/bpf_lib.rst IF Proxy - EXPERIMENTAL M: Andrzej Ostruszka <aostruszka@marvell.com> F: lib/librte_if_proxy/ +F: examples/l3fwd-ifpx/ F: app/test/test_if_proxy.c F: doc/guides/prog_guide/if_proxy_lib.rst diff --git a/examples/Makefile b/examples/Makefile index feff79784..a8cb02a6c 100644 --- a/examples/Makefile +++ b/examples/Makefile @@ -81,6 +81,7 @@ else $(info vm_power_manager requires libvirt >= 0.9.3) endif endif +DIRS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += l3fwd-ifpx DIRS-y += eventdev_pipeline diff --git a/examples/l3fwd-ifpx/Makefile b/examples/l3fwd-ifpx/Makefile new file mode 100644 index 000000000..68eefeb75 --- /dev/null +++ b/examples/l3fwd-ifpx/Makefile @@ -0,0 +1,60 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(c) 2020 Marvell International Ltd. 
+ +# binary name +APP = l3fwd + +# all source are stored in SRCS-y +SRCS-y := main.c l3fwd.c + +# Build using pkg-config variables if possible +ifeq ($(shell pkg-config --exists libdpdk && echo 0),0) + +all: shared +.PHONY: shared static +shared: build/$(APP)-shared + ln -sf $(APP)-shared build/$(APP) +static: build/$(APP)-static + ln -sf $(APP)-static build/$(APP) + +PKGCONF ?= pkg-config + +PC_FILE := $(shell $(PKGCONF) --path libdpdk 2>/dev/null) +CFLAGS += -DALLOW_EXPERIMENTAL_API -O3 $(shell $(PKGCONF) --cflags libdpdk) +LDFLAGS_SHARED = $(shell $(PKGCONF) --libs libdpdk) +LDFLAGS_STATIC = -Wl,-Bstatic $(shell $(PKGCONF) --static --libs libdpdk) + +build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build + $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED) + +build/$(APP)-static: $(SRCS-y) Makefile $(PC_FILE) | build + $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_STATIC) + +build: + @mkdir -p $@ + +.PHONY: clean +clean: + rm -f build/$(APP) build/$(APP)-static build/$(APP)-shared + test -d build && rmdir -p build || true + +else # Build using legacy build system + +ifeq ($(RTE_SDK),) +$(error "Please define RTE_SDK environment variable") +endif + +# Default target, detect a build directory, by looking for a path with a .config +RTE_TARGET ?= $(notdir $(abspath $(dir $(firstword $(wildcard $(RTE_SDK)/*/.config))))) + +include $(RTE_SDK)/mk/rte.vars.mk + +CFLAGS += -DALLOW_EXPERIMENTAL_API + +CFLAGS += -I$(SRCDIR) +CFLAGS += -O3 $(USER_FLAGS) +CFLAGS += $(WERROR_FLAGS) +LDLIBS += -lrte_if_proxy -lrte_ethdev -lrte_eal + +include $(RTE_SDK)/mk/rte.extapp.mk +endif diff --git a/examples/l3fwd-ifpx/l3fwd.c b/examples/l3fwd-ifpx/l3fwd.c new file mode 100644 index 000000000..4b457dfad --- /dev/null +++ b/examples/l3fwd-ifpx/l3fwd.c @@ -0,0 +1,1131 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. 
+ */ + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <inttypes.h> +#include <sys/types.h> +#include <string.h> +#include <sys/queue.h> +#include <stdarg.h> +#include <errno.h> +#include <getopt.h> +#include <sys/socket.h> +#include <arpa/inet.h> + +#include <rte_debug.h> +#include <rte_ether.h> +#include <rte_cycles.h> +#include <rte_malloc.h> +#include <rte_mbuf.h> +#include <rte_ip.h> + +#ifndef USE_HASH_CRC +#include <rte_jhash.h> +#else +#include <rte_hash_crc.h> +#endif + +#include <rte_tcp.h> +#include <rte_udp.h> +#include <rte_lpm.h> +#include <rte_lpm6.h> +#include <rte_if_proxy.h> + +#include "l3fwd.h" + +#define DO_RFC_1812_CHECKS + +#define IPV4_L3FWD_LPM_MAX_RULES 1024 +#define IPV4_L3FWD_LPM_NUMBER_TBL8S (1 << 8) +#define IPV6_L3FWD_LPM_MAX_RULES 1024 +#define IPV6_L3FWD_LPM_NUMBER_TBL8S (1 << 16) + +static volatile bool ifpx_ready; + +/* ethernet addresses of ports */ +static +union lladdr_t port_mac[RTE_MAX_ETHPORTS]; + +static struct rte_lpm *ipv4_routes; +static struct rte_lpm6 *ipv6_routes; + +static +struct ipv4_gateway { + uint16_t port; + union lladdr_t lladdr; + uint32_t ip; +} ipv4_gateways[128]; + +static +struct ipv6_gateway { + uint16_t port; + union lladdr_t lladdr; + uint8_t ip[16]; +} ipv6_gateways[128]; + +/* The lowest 2 bits of next hop (which is 24/21 bit for IPv4/6) are reserved to + * encode: + * 00 -> host route: higher bits of next hop are port id and dst MAC should be + * based on dst IP + * 01 -> gateway route: higher bits of next hop are index into gateway array and + * use port and MAC cached there (if no MAC cached yet then search for it + * based on gateway IP) + * 10 -> proxy entry: packet directed to us, just take higher bits as port id of + * proxy and send packet there (without any modification) + * The port id (16 bits) will always fit however this will not work if you + * need more than 2^20 gateways. 
+ */ +enum route_type { + HOST_ROUTE = 0x00, + GW_ROUTE = 0x01, + PROXY_ADDR = 0x02, +}; + +RTE_STD_C11 +_Static_assert(RTE_DIM(ipv4_gateways) <= (1 << 22) && + RTE_DIM(ipv6_gateways) <= (1 << 19), + "Gateway array index has to fit within next_hop with 2 bits reserved"); + +static +uint32_t find_add_gateway(uint16_t port, uint32_t ip) +{ + uint32_t i, idx = -1U; + + for (i = 0; i < RTE_DIM(ipv4_gateways); ++i) { + /* Remember first free slot in case GW is not present. */ + if (idx == -1U && ipv4_gateways[i].ip == 0) + idx = i; + else if (ipv4_gateways[i].ip == ip) + /* For now assume that given GW will be always at the + * same port, so no checking for that + */ + return i; + } + if (idx != -1U) { + ipv4_gateways[idx].port = port; + ipv4_gateways[idx].ip = ip; + /* Since ARP tables are kept per lcore MAC will be updated + * during first lookup. + */ + } + return idx; +} + +static +void clear_gateway(uint32_t ip) +{ + uint32_t i; + + for (i = 0; i < RTE_DIM(ipv4_gateways); ++i) { + if (ipv4_gateways[i].ip == ip) { + ipv4_gateways[i].ip = 0; + ipv4_gateways[i].lladdr.val = 0; + ipv4_gateways[i].port = RTE_MAX_ETHPORTS; + break; + } + } +} + +static +uint32_t find_add_gateway6(uint16_t port, const uint8_t *ip) +{ + uint32_t i, idx = -1U; + + for (i = 0; i < RTE_DIM(ipv6_gateways); ++i) { + /* Remember first free slot in case GW is not present. */ + if (idx == -1U && ipv6_gateways[i].ip[0] == 0) + idx = i; + else if (memcmp(ipv6_gateways[i].ip, ip, 16) == 0) + /* For now assume that given GW will be always at the + * same port, so no checking for that + */ + return i; + } + if (idx != -1U) { + ipv6_gateways[idx].port = port; + memcpy(ipv6_gateways[idx].ip, ip, 16); + /* Since ARP tables are kept per lcore MAC will be updated + * during first lookup. 
+ */ + } + return idx; +} + +static +void clear_gateway6(const uint8_t *ip) +{ + uint32_t i; + + for (i = 0; i < RTE_DIM(ipv6_gateways); ++i) { + if (memcmp(ipv6_gateways[i].ip, ip, 16) == 0) { + memset(&ipv6_gateways[i].ip, 0, 16); + ipv6_gateways[i].lladdr.val = 0; + ipv6_gateways[i].port = RTE_MAX_ETHPORTS; + break; + } + } +} + +/* Assumptions: + * - Link related changes (MAC/MTU/...) need to be executed once, and it's OK + * to run them from the callback - if this is not the case (e.g. -EBUSY for + * MTU change) then event notifications need to be used and more sophisticated + * coordination with lcore loops and stopping/starting of the ports: for + * example lcores not receiving on this port just mark it as inactive and stop + * transmitting to it and the one with RX stops the port, sets the MAC, starts + * it and notifies other lcores that it is back. + * - LPM is safe to be modified by one writer, and read by many without any + * locks (it looks to me like this is the case), however upon routing change + * there might be a transient period during which packets are not directed + * according to new rule. + * - Hash is unsafe to be used that way (and I don't want to turn on relevant + * flags just to exercise queued notifications) so every lcore keeps its + * copy of relevant data. + * Therefore there are callbacks defined for the routing info/address changes + * and remaining ones are handled via events on per lcore basis. 
+ */ +static +int mac_change(const struct rte_ifpx_mac_change *ev) +{ + int i; + struct rte_ether_addr mac_addr; + char buf[RTE_ETHER_ADDR_FMT_SIZE]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(buf, sizeof(buf), &ev->mac); + RTE_LOG(DEBUG, L3FWD, "MAC change for port %d: %s\n", + ev->port_id, buf); + } + /* NOTE - use copy because RTE functions don't take const args */ + rte_ether_addr_copy(&ev->mac, &mac_addr); + i = rte_eth_dev_default_mac_addr_set(ev->port_id, &mac_addr); + if (i == -EOPNOTSUPP) + i = rte_eth_dev_mac_addr_add(ev->port_id, &mac_addr, 0); + if (i < 0) + RTE_LOG(WARNING, L3FWD, "Failed to set MAC address\n"); + else { + port_mac[ev->port_id].mac.addr = ev->mac; + port_mac[ev->port_id].mac.valid = 1; + } + return 1; +} + +static +int link_change(const struct rte_ifpx_link_change *ev) +{ + uint16_t proxy_id = rte_ifpx_proxy_get(ev->port_id); + uint32_t mask; + + /* Mark the proxy too since we get only port notifications. */ + mask = 1U << ev->port_id | 1U << proxy_id; + + RTE_LOG(DEBUG, L3FWD, "Link change for port %d: %d\n", + ev->port_id, ev->is_up); + if (ev->is_up) { + rte_eth_dev_set_link_up(ev->port_id); + active_port_mask |= mask; + } else { + rte_eth_dev_set_link_down(ev->port_id); + active_port_mask &= ~mask; + } + active_port_mask &= enabled_port_mask; + return 1; +} + +static +int addr_add(const struct rte_ifpx_addr_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 address for port %d: %s\n", + ev->port_id, buf); + } + rte_lpm_add(ipv4_routes, ev->ip, 32, + ev->port_id << 2 | PROXY_ADDR); + return 1; +} + +static +int route_add(const struct rte_ifpx_route_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t nh, ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = 
rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 route for port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + + /* On Linux upon changing of the IP we get notification for both addr + * and route, so just check if we already have addr entry and if so + * then ignore this notification. + */ + if (ev->depth == 32 && + rte_lpm_lookup(ipv4_routes, ev->ip, &nh) == 0 && nh & PROXY_ADDR) + return 1; + + if (ev->gateway) { + nh = find_add_gateway(ev->port_id, ev->gateway); + if (nh != -1U) + rte_lpm_add(ipv4_routes, ev->ip, ev->depth, + nh << 2 | GW_ROUTE); + else + RTE_LOG(WARNING, L3FWD, "No free slot in GW array\n"); + } else + rte_lpm_add(ipv4_routes, ev->ip, ev->depth, + ev->port_id << 2 | HOST_ROUTE); + return 1; +} + +static +int addr_del(const struct rte_ifpx_addr_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 address removed from port %d: %s\n", + ev->port_id, buf); + } + rte_lpm_delete(ipv4_routes, ev->ip, 32); + return 1; +} + +static +int route_del(const struct rte_ifpx_route_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 route removed from port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + if (ev->gateway) + clear_gateway(ev->gateway); + rte_lpm_delete(ipv4_routes, ev->ip, ev->depth); + return 1; +} + +static +int addr6_add(const struct rte_ifpx_addr6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 address for port %d: %s\n", + ev->port_id, buf); + } + rte_lpm6_add(ipv6_routes, 
ev->ip, 128, + ev->port_id << 2 | PROXY_ADDR); + return 1; +} + +static +int route6_add(const struct rte_ifpx_route6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + /* See comment in route_add(). */ + uint32_t nh; + if (ev->depth == 128 && + rte_lpm6_lookup(ipv6_routes, ev->ip, &nh) == 0 && nh & PROXY_ADDR) + return 1; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 route for port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + /* no valid IPv6 address starts with 0x00 */ + if (ev->gateway[0]) { + nh = find_add_gateway6(ev->port_id, ev->gateway); + if (nh != -1U) + rte_lpm6_add(ipv6_routes, ev->ip, ev->depth, + nh << 2 | GW_ROUTE); + else + RTE_LOG(WARNING, L3FWD, "No free slot in GW6 array\n"); + } else + rte_lpm6_add(ipv6_routes, ev->ip, ev->depth, + ev->port_id << 2 | HOST_ROUTE); + return 1; +} + +static +int addr6_del(const struct rte_ifpx_addr6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 address removed from port %d: %s\n", + ev->port_id, buf); + } + rte_lpm6_delete(ipv6_routes, ev->ip, 128); + return 1; +} + +static +int route6_del(const struct rte_ifpx_route6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 route removed from port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + if (ev->gateway[0]) + clear_gateway6(ev->gateway); + rte_lpm6_delete(ipv6_routes, ev->ip, ev->depth); + return 1; +} + +static +int cfg_done(void) +{ + uint16_t port_id, px; + const struct rte_ifpx_info *pinfo; + + RTE_LOG(DEBUG, L3FWD, "Proxy config finished\n"); + + /* Copy MAC addresses of the proxies - to be used as src MAC during + * forwarding. 
+ */ + RTE_ETH_FOREACH_DEV(port_id) { + px = rte_ifpx_proxy_get(port_id); + if (px != RTE_MAX_ETHPORTS && px != port_id) { + pinfo = rte_ifpx_info_get(px); + rte_ether_addr_copy(&pinfo->mac, + &port_mac[port_id].mac.addr); + port_mac[port_id].mac.valid = 1; + } + } + + ifpx_ready = 1; + return 1; +} + +static +struct rte_ifpx_callbacks ifpx_callbacks = { + .mac_change = mac_change, +#if 0 + .mtu_change = mtu_change, +#endif + .link_change = link_change, + .addr_add = addr_add, + .addr_del = addr_del, + .addr6_add = addr6_add, + .addr6_del = addr6_del, + .route_add = route_add, + .route_del = route_del, + .route6_add = route6_add, + .route6_del = route6_del, + .cfg_done = cfg_done, +}; + +int init_if_proxy(void) +{ + char buf[16]; + unsigned int i; + + rte_ifpx_callbacks_register(&ifpx_callbacks); + + RTE_LCORE_FOREACH(i) { + if (lcore_conf[i].n_rx_queue == 0) + continue; + snprintf(buf, sizeof(buf), "IFPX-events_%d", i); + lcore_conf[i].ev_queue = rte_ring_create(buf, 16, SOCKET_ID_ANY, + RING_F_SP_ENQ | RING_F_SC_DEQ); + if (!lcore_conf[i].ev_queue) { + RTE_LOG(ERR, L3FWD, + "Failed to create event queue for lcore %d\n", + i); + return -1; + } + rte_ifpx_queue_add(lcore_conf[i].ev_queue); + } + + return rte_ifpx_listen(); +} + +void close_if_proxy(void) +{ + unsigned int i; + + RTE_LCORE_FOREACH(i) { + if (lcore_conf[i].n_rx_queue == 0) + continue; + rte_ring_free(lcore_conf[i].ev_queue); + } + rte_ifpx_close(); +} + +void wait_for_config_done(void) +{ + while (!ifpx_ready) + rte_delay_ms(100); +} + +#ifdef DO_RFC_1812_CHECKS +static inline +int is_valid_ipv4_pkt(struct rte_ipv4_hdr *pkt, uint32_t link_len) +{ + /* From http://www.rfc-editor.org/rfc/rfc1812.txt section 5.2.2 */ + /* + * 1. The packet length reported by the Link Layer must be large + * enough to hold the minimum length legal IP datagram (20 bytes). + */ + if (link_len < sizeof(struct rte_ipv4_hdr)) + return -1; + + /* 2. The IP checksum must be correct. 
*/ + /* this is checked in H/W */ + + /* + * 3. The IP version number must be 4. If the version number is not 4 + * then the packet may be another version of IP, such as IPng or + * ST-II. + */ + if (((pkt->version_ihl) >> 4) != 4) + return -3; + /* + * 4. The IP header length field must be large enough to hold the + * minimum length legal IP datagram (20 bytes = 5 words). + */ + if ((pkt->version_ihl & 0xf) < 5) + return -4; + + /* + * 5. The IP total length field must be large enough to hold the IP + * datagram header, whose length is specified in the IP header length + * field. + */ + if (rte_cpu_to_be_16(pkt->total_length) < sizeof(struct rte_ipv4_hdr)) + return -5; + + return 0; +} +#endif + +/* Send burst of packets on an output interface */ +static inline +int send_burst(struct lcore_conf *lconf, uint16_t n, uint16_t port) +{ + struct rte_mbuf **m_table; + int ret; + uint16_t queueid; + + queueid = lconf->tx_queue_id[port]; + m_table = (struct rte_mbuf **)lconf->tx_mbufs[port].m_table; + + ret = rte_eth_tx_burst(port, queueid, m_table, n); + if (unlikely(ret < n)) { + do { + rte_pktmbuf_free(m_table[ret]); + } while (++ret < n); + } + + return 0; +} + +/* Enqueue a single packet, and send burst if queue is filled */ +static inline +int send_single_packet(struct lcore_conf *lconf, + struct rte_mbuf *m, uint16_t port) +{ + uint16_t len; + + len = lconf->tx_mbufs[port].len; + lconf->tx_mbufs[port].m_table[len] = m; + len++; + + /* enough pkts to be sent */ + if (unlikely(len == MAX_PKT_BURST)) { + send_burst(lconf, MAX_PKT_BURST, port); + len = 0; + } + + lconf->tx_mbufs[port].len = len; + return 0; +} + +static inline +int ipv4_get_destination(const struct rte_ipv4_hdr *ipv4_hdr, + struct rte_lpm *lpm, uint32_t *next_hop) +{ + return rte_lpm_lookup(lpm, + rte_be_to_cpu_32(ipv4_hdr->dst_addr), + next_hop); +} + +static inline +int ipv6_get_destination(const struct rte_ipv6_hdr *ipv6_hdr, + struct rte_lpm6 *lpm, uint32_t *next_hop) +{ + return 
rte_lpm6_lookup(lpm, ipv6_hdr->dst_addr, next_hop); +} + +static +uint16_t ipv4_process_pkt(struct lcore_conf *lconf, + struct rte_ether_hdr *eth_hdr, + struct rte_ipv4_hdr *ipv4_hdr, uint16_t portid) +{ + union lladdr_t lladdr = { 0 }; + int i; + uint32_t ip, nh; + + /* Here we know that packet is not from proxy - this case is handled + * in the main loop - so if we fail to find destination we will direct + * it to the proxy. + */ + if (ipv4_get_destination(ipv4_hdr, ipv4_routes, &nh) < 0) + return rte_ifpx_proxy_get(portid); + + if (nh & PROXY_ADDR) + return nh >> 2; + + /* Packet not to us so update src/dst MAC. */ + if (nh & GW_ROUTE) { + const uint32_t gw = nh >> 2; + if (ipv4_gateways[gw].lladdr.mac.valid) + lladdr = ipv4_gateways[gw].lladdr; + else { + i = rte_hash_lookup(lconf->neigh_hash, + &ipv4_gateways[gw].ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh_map[i]; + ipv4_gateways[gw].lladdr = lladdr; + } + nh = ipv4_gateways[gw].port; + } else { + nh >>= 2; + ip = rte_be_to_cpu_32(ipv4_hdr->dst_addr); + i = rte_hash_lookup(lconf->neigh_hash, &ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh_map[i]; + } + + RTE_ASSERT(lladdr.mac.valid); + RTE_ASSERT(port_mac[nh].mac.valid); + /* dst addr */ + *(uint64_t *)&eth_hdr->d_addr = lladdr.val; + /* src addr */ + rte_ether_addr_copy(&port_mac[nh].mac.addr, &eth_hdr->s_addr); + + return nh; +} + +static +uint16_t ipv6_process_pkt(struct lcore_conf *lconf, + struct rte_ether_hdr *eth_hdr, + struct rte_ipv6_hdr *ipv6_hdr, uint16_t portid) +{ + union lladdr_t lladdr = { 0 }; + int i; + uint32_t nh; + + /* Here we know that packet is not from proxy - this case is handled + * in the main loop - so if we fail to find destination we will direct + * it to the proxy. + */ + if (ipv6_get_destination(ipv6_hdr, ipv6_routes, &nh) < 0) + return rte_ifpx_proxy_get(portid); + + if (nh & PROXY_ADDR) + return nh >> 2; + + /* Packet not to us so update src/dst MAC. 
*/ + if (nh & GW_ROUTE) { + const uint32_t gw = nh >> 2; + if (ipv6_gateways[gw].lladdr.mac.valid) + lladdr = ipv6_gateways[gw].lladdr; + else { + i = rte_hash_lookup(lconf->neigh6_hash, + ipv6_gateways[gw].ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh6_map[i]; + ipv6_gateways[gw].lladdr = lladdr; + } + nh = ipv6_gateways[gw].port; + } else { + nh >>= 2; + i = rte_hash_lookup(lconf->neigh6_hash, ipv6_hdr->dst_addr); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh6_map[i]; + } + + RTE_ASSERT(lladdr.mac.valid); + /* dst addr */ + *(uint64_t *)&eth_hdr->d_addr = lladdr.val; + /* src addr */ + rte_ether_addr_copy(&port_mac[nh].mac.addr, &eth_hdr->s_addr); + + return nh; +} + +static __rte_always_inline +void l3fwd_lpm_simple_forward(struct rte_mbuf *m, uint16_t portid, + struct lcore_conf *lconf) +{ + struct rte_ether_hdr *eth_hdr; + uint32_t nh; + + eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *); + + if (RTE_ETH_IS_IPV4_HDR(m->packet_type)) { + /* Handle IPv4 headers.*/ + struct rte_ipv4_hdr *ipv4_hdr; + + ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *, + sizeof(*eth_hdr)); + +#ifdef DO_RFC_1812_CHECKS + /* Check to make sure the packet is valid (RFC1812) */ + if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt_len) < 0) { + rte_pktmbuf_free(m); + return; + } +#endif + nh = ipv4_process_pkt(lconf, eth_hdr, ipv4_hdr, portid); + +#ifdef DO_RFC_1812_CHECKS + /* Update time to live and header checksum */ + --(ipv4_hdr->time_to_live); + ++(ipv4_hdr->hdr_checksum); +#endif + } else if (RTE_ETH_IS_IPV6_HDR(m->packet_type)) { + /* Handle IPv6 headers.*/ + struct rte_ipv6_hdr *ipv6_hdr; + + ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *, + sizeof(*eth_hdr)); + + nh = ipv6_process_pkt(lconf, eth_hdr, ipv6_hdr, portid); + } else + /* Unhandled protocol */ + nh = rte_ifpx_proxy_get(portid); + + if (nh >= RTE_MAX_ETHPORTS || (active_port_mask & 1 << nh) == 0) + rte_pktmbuf_free(m); + else + send_single_packet(lconf, m, nh); +} + 
+static inline +void l3fwd_send_packets(int nb_rx, struct rte_mbuf **pkts_burst, + uint16_t portid, struct lcore_conf *lconf) +{ + int32_t j; + + /* Prefetch first packets */ + for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *)); + + /* Prefetch and forward already prefetched packets. */ + for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) { + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[ + j + PREFETCH_OFFSET], void *)); + l3fwd_lpm_simple_forward(pkts_burst[j], portid, lconf); + } + + /* Forward remaining prefetched packets */ + for (; j < nb_rx; j++) + l3fwd_lpm_simple_forward(pkts_burst[j], portid, lconf); +} + +static +void handle_neigh_add(struct lcore_conf *lconf, + const struct rte_ifpx_neigh_change *ev) +{ + char mac[RTE_ETHER_ADDR_FMT_SIZE]; + char ip[INET_ADDRSTRLEN]; + int32_t i, a; + + i = rte_hash_add_key(lconf->neigh_hash, &ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to add IPv4 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(mac, sizeof(mac), &ev->mac); + a = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &a, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour update for port %d: %s -> %s@%d\n", + ev->port_id, ip, mac, i); + } + lconf->neigh_map[i].mac.addr = ev->mac; + lconf->neigh_map[i].mac.valid = 1; +} + +static +void handle_neigh_del(struct lcore_conf *lconf, + const struct rte_ifpx_neigh_change *ev) +{ + char ip[INET_ADDRSTRLEN]; + int32_t i, a; + + i = rte_hash_del_key(lconf->neigh_hash, &ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, + "Failed to remove IPv4 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + a = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &a, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour removal for port %d: %s\n", + ev->port_id, ip); + } + lconf->neigh_map[i].val = 0; +} + +static +void handle_neigh6_add(struct 
lcore_conf *lconf, + const struct rte_ifpx_neigh6_change *ev) +{ + char mac[RTE_ETHER_ADDR_FMT_SIZE]; + char ip[INET6_ADDRSTRLEN]; + int32_t i; + + i = rte_hash_add_key(lconf->neigh6_hash, ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to add IPv6 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(mac, sizeof(mac), &ev->mac); + inet_ntop(AF_INET6, ev->ip, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour update for port %d: %s -> %s@%d\n", + ev->port_id, ip, mac, i); + } + lconf->neigh6_map[i].mac.addr = ev->mac; + lconf->neigh6_map[i].mac.valid = 1; +} + +static +void handle_neigh6_del(struct lcore_conf *lconf, + const struct rte_ifpx_neigh6_change *ev) +{ + char ip[INET6_ADDRSTRLEN]; + int32_t i; + + i = rte_hash_del_key(lconf->neigh6_hash, ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to remove IPv6 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour removal for port %d: %s\n", + ev->port_id, ip); + } + lconf->neigh6_map[i].val = 0; +} + +static +void handle_events(struct lcore_conf *lconf) +{ + struct rte_ifpx_event *ev; + + while (rte_ring_dequeue(lconf->ev_queue, (void **)&ev) == 0) { + switch (ev->type) { + case RTE_IFPX_NEIGH_ADD: + handle_neigh_add(lconf, &ev->neigh_change); + break; + case RTE_IFPX_NEIGH_DEL: + handle_neigh_del(lconf, &ev->neigh_change); + break; + case RTE_IFPX_NEIGH6_ADD: + handle_neigh6_add(lconf, &ev->neigh6_change); + break; + case RTE_IFPX_NEIGH6_DEL: + handle_neigh6_del(lconf, &ev->neigh6_change); + break; + default: + RTE_LOG(WARNING, L3FWD, + "Unexpected event: %d\n", ev->type); + } + free(ev); + } +} + +void setup_lpm(void) +{ + struct rte_lpm6_config cfg6; + struct rte_lpm_config cfg4; + + /* create the LPM table */ + cfg4.max_rules = IPV4_L3FWD_LPM_MAX_RULES; + cfg4.number_tbl8s = 
IPV4_L3FWD_LPM_NUMBER_TBL8S; + cfg4.flags = 0; + ipv4_routes = rte_lpm_create("IPV4_L3FWD_LPM", SOCKET_ID_ANY, &cfg4); + if (ipv4_routes == NULL) + rte_exit(EXIT_FAILURE, "Unable to create the l3fwd LPM table\n"); + + /* create the LPM6 table */ + cfg6.max_rules = IPV6_L3FWD_LPM_MAX_RULES; + cfg6.number_tbl8s = IPV6_L3FWD_LPM_NUMBER_TBL8S; + cfg6.flags = 0; + ipv6_routes = rte_lpm6_create("IPV6_L3FWD_LPM", SOCKET_ID_ANY, &cfg6); + if (ipv6_routes == NULL) + rte_exit(EXIT_FAILURE, "Unable to create the l3fwd LPM table\n"); +} + +static +uint32_t hash_ipv4(const void *key, uint32_t key_len __rte_unused, + uint32_t init_val) +{ +#ifndef USE_HASH_CRC + return rte_jhash_1word(*(const uint32_t *)key, init_val); +#else + return rte_hash_crc_4byte(*(const uint32_t *)key, init_val); +#endif +} + +static +uint32_t hash_ipv6(const void *key, uint32_t key_len __rte_unused, + uint32_t init_val) +{ +#ifndef USE_HASH_CRC + return rte_jhash_32b(key, 4, init_val); +#else + const uint64_t *pk = key; + init_val = rte_hash_crc_8byte(*pk, init_val); + return rte_hash_crc_8byte(*(pk+1), init_val); +#endif +} + +static +int setup_neigh(struct lcore_conf *lconf) +{ + char buf[16]; + struct rte_hash_parameters ipv4_hparams = { + .name = buf, + .entries = L3FWD_NEIGH_ENTRIES, + .key_len = 4, + .hash_func = hash_ipv4, + .hash_func_init_val = 0, + }; + struct rte_hash_parameters ipv6_hparams = { + .name = buf, + .entries = L3FWD_NEIGH_ENTRIES, + .key_len = 16, + .hash_func = hash_ipv6, + .hash_func_init_val = 0, + }; + + snprintf(buf, sizeof(buf), "neigh_hash-%d", rte_lcore_id()); + lconf->neigh_hash = rte_hash_create(&ipv4_hparams); + snprintf(buf, sizeof(buf), "neigh_map-%d", rte_lcore_id()); + lconf->neigh_map = rte_zmalloc(buf, + L3FWD_NEIGH_ENTRIES*sizeof(*lconf->neigh_map), + 8); + if (lconf->neigh_hash == NULL || lconf->neigh_map == NULL) { + RTE_LOG(ERR, L3FWD, + "Unable to create the l3fwd ARP/IPv4 table (lcore %d)\n", + rte_lcore_id()); + return -1; + } + + snprintf(buf, 
sizeof(buf), "neigh6_hash-%d", rte_lcore_id()); + lconf->neigh6_hash = rte_hash_create(&ipv6_hparams); + snprintf(buf, sizeof(buf), "neigh6_map-%d", rte_lcore_id()); + lconf->neigh6_map = rte_zmalloc(buf, + L3FWD_NEIGH_ENTRIES*sizeof(*lconf->neigh6_map), + 8); + if (lconf->neigh6_hash == NULL || lconf->neigh6_map == NULL) { + RTE_LOG(ERR, L3FWD, + "Unable to create the l3fwd ARP/IPv6 table (lcore %d)\n", + rte_lcore_id()); + return -1; + } + return 0; +} + +int lpm_check_ptype(int portid) +{ + int i, ret; + int ptype_l3_ipv4 = 0, ptype_l3_ipv6 = 0; + uint32_t ptype_mask = RTE_PTYPE_L3_MASK; + + ret = rte_eth_dev_get_supported_ptypes(portid, ptype_mask, NULL, 0); + if (ret <= 0) + return 0; + + uint32_t ptypes[ret]; + + ret = rte_eth_dev_get_supported_ptypes(portid, ptype_mask, ptypes, ret); + for (i = 0; i < ret; ++i) { + if (ptypes[i] & RTE_PTYPE_L3_IPV4) + ptype_l3_ipv4 = 1; + if (ptypes[i] & RTE_PTYPE_L3_IPV6) + ptype_l3_ipv6 = 1; + } + + if (ptype_l3_ipv4 == 0) + RTE_LOG(WARNING, L3FWD, + "port %d cannot parse RTE_PTYPE_L3_IPV4\n", portid); + + if (ptype_l3_ipv6 == 0) + RTE_LOG(WARNING, L3FWD, + "port %d cannot parse RTE_PTYPE_L3_IPV6\n", portid); + + if (ptype_l3_ipv4 && ptype_l3_ipv6) + return 1; + + return 0; + +} + +static inline +void lpm_parse_ptype(struct rte_mbuf *m) +{ + struct rte_ether_hdr *eth_hdr; + uint32_t packet_type = RTE_PTYPE_UNKNOWN; + uint16_t ether_type; + + eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *); + ether_type = eth_hdr->ether_type; + if (ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) + packet_type |= RTE_PTYPE_L3_IPV4_EXT_UNKNOWN; + else if (ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV6)) + packet_type |= RTE_PTYPE_L3_IPV6_EXT_UNKNOWN; + + m->packet_type = packet_type; +} + +uint16_t lpm_cb_parse_ptype(uint16_t port __rte_unused, + uint16_t queue __rte_unused, + struct rte_mbuf *pkts[], uint16_t nb_pkts, + uint16_t max_pkts __rte_unused, + void *user_param __rte_unused) +{ + unsigned int i; + + if 
(unlikely(nb_pkts == 0)) + return nb_pkts; + rte_prefetch0(rte_pktmbuf_mtod(pkts[0], struct ether_hdr *)); + for (i = 0; i < (unsigned int) (nb_pkts - 1); ++i) { + rte_prefetch0(rte_pktmbuf_mtod(pkts[i+1], + struct ether_hdr *)); + lpm_parse_ptype(pkts[i]); + } + lpm_parse_ptype(pkts[i]); + + return nb_pkts; +} + +/* main processing loop */ +int lpm_main_loop(void *dummy __rte_unused) +{ + struct rte_mbuf *pkts_burst[MAX_PKT_BURST]; + unsigned int lcore_id; + uint64_t prev_tsc, diff_tsc, cur_tsc; + int i, j, nb_rx; + uint16_t portid; + uint8_t queueid; + struct lcore_conf *lconf; + struct lcore_rx_queue *rxq; + const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) / + US_PER_S * BURST_TX_DRAIN_US; + + prev_tsc = 0; + + lcore_id = rte_lcore_id(); + lconf = &lcore_conf[lcore_id]; + + if (setup_neigh(lconf) < 0) { + RTE_LOG(ERR, L3FWD, "lcore %u failed to setup its ARP tables\n", + lcore_id); + return 0; + } + + if (lconf->n_rx_queue == 0) { + RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id); + return 0; + } + + RTE_LOG(INFO, L3FWD, "entering main loop on lcore %u\n", lcore_id); + + for (i = 0; i < lconf->n_rx_queue; i++) { + + portid = lconf->rx_queue_list[i].port_id; + queueid = lconf->rx_queue_list[i].queue_id; + RTE_LOG(INFO, L3FWD, + " -- lcoreid=%u portid=%u rxqueueid=%hhu\n", + lcore_id, portid, queueid); + } + + while (!force_quit) { + + cur_tsc = rte_rdtsc(); + /* + * TX burst and event queue drain + */ + diff_tsc = cur_tsc - prev_tsc; + if (unlikely(diff_tsc % drain_tsc == 0)) { + + for (i = 0; i < lconf->n_tx_port; ++i) { + portid = lconf->tx_port_id[i]; + if (lconf->tx_mbufs[portid].len == 0) + continue; + send_burst(lconf, + lconf->tx_mbufs[portid].len, + portid); + lconf->tx_mbufs[portid].len = 0; + } + + if (diff_tsc > EV_QUEUE_DRAIN * drain_tsc) { + if (lconf->ev_queue && + !rte_ring_empty(lconf->ev_queue)) + handle_events(lconf); + prev_tsc = cur_tsc; + } + } + + /* + * Read packet from RX queues + */ + for (i = 0; i < 
lconf->n_rx_queue; ++i) { + rxq = &lconf->rx_queue_list[i]; + portid = rxq->port_id; + queueid = rxq->queue_id; + nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst, + MAX_PKT_BURST); + if (nb_rx == 0) + continue; + /* If current queue is from proxy interface then there + * is no need to figure out destination port - just + * forward it to the bound port. + */ + if (unlikely(rxq->dst_port != RTE_MAX_ETHPORTS)) { + for (j = 0; j < nb_rx; ++j) + send_single_packet(lconf, pkts_burst[j], + rxq->dst_port); + } else + l3fwd_send_packets(nb_rx, pkts_burst, portid, + lconf); + } + } + + return 0; +} diff --git a/examples/l3fwd-ifpx/l3fwd.h b/examples/l3fwd-ifpx/l3fwd.h new file mode 100644 index 000000000..fc60078c5 --- /dev/null +++ b/examples/l3fwd-ifpx/l3fwd.h @@ -0,0 +1,98 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. + */ + +#ifndef __L3_FWD_H__ +#define __L3_FWD_H__ + +#include <stdbool.h> + +#include <rte_ethdev.h> +#include <rte_log.h> +#include <rte_hash.h> + +#define RTE_LOGTYPE_L3FWD RTE_LOGTYPE_USER1 + +#define MAX_PKT_BURST 32 +#define BURST_TX_DRAIN_US 100 /* TX drain every ~100us */ +#define EV_QUEUE_DRAIN 5 /* Check event queue every 5 TX drains */ + +#define MAX_RX_QUEUE_PER_LCORE 16 + +/* + * Try to avoid TX buffering if we have at least MAX_TX_BURST packets to send. + */ +#define MAX_TX_BURST (MAX_PKT_BURST / 2) + +/* Configure how many packets ahead to prefetch, when reading packets */ +#define PREFETCH_OFFSET 3 + +/* Hash parameters. 
*/ +#ifdef RTE_ARCH_64 +/* default to 4 million hash entries (approx) */ +#define L3FWD_HASH_ENTRIES (1024*1024*4) +#else +/* 32-bit has less address-space for hugepage memory, limit to 1M entries */ +#define L3FWD_HASH_ENTRIES (1024*1024*1) +#endif +#define HASH_ENTRY_NUMBER_DEFAULT 4 +/* Default ARP table size */ +#define L3FWD_NEIGH_ENTRIES 1024 + +union lladdr_t { + uint64_t val; + struct { + struct rte_ether_addr addr; + uint16_t valid; + } mac; +}; + +struct mbuf_table { + uint16_t len; + struct rte_mbuf *m_table[MAX_PKT_BURST]; +}; + +struct lcore_rx_queue { + uint16_t port_id; + uint16_t dst_port; + uint8_t queue_id; +} __rte_cache_aligned; + +struct lcore_conf { + uint16_t n_rx_queue; + struct lcore_rx_queue rx_queue_list[MAX_RX_QUEUE_PER_LCORE]; + uint16_t n_tx_port; + uint16_t tx_port_id[RTE_MAX_ETHPORTS]; + uint16_t tx_queue_id[RTE_MAX_ETHPORTS]; + struct mbuf_table tx_mbufs[RTE_MAX_ETHPORTS]; + struct rte_ring *ev_queue; + union lladdr_t *neigh_map; + struct rte_hash *neigh_hash; + union lladdr_t *neigh6_map; + struct rte_hash *neigh6_hash; +} __rte_cache_aligned; + +extern volatile bool force_quit; + +/* mask of enabled/active ports */ +extern uint32_t enabled_port_mask; +extern uint32_t active_port_mask; + +extern struct lcore_conf lcore_conf[RTE_MAX_LCORE]; + +int init_if_proxy(void); +void close_if_proxy(void); + +void wait_for_config_done(void); + +void setup_lpm(void); + +int lpm_check_ptype(int portid); + +uint16_t +lpm_cb_parse_ptype(uint16_t port, uint16_t queue, struct rte_mbuf *pkts[], + uint16_t nb_pkts, uint16_t max_pkts, void *user_param); + +int lpm_main_loop(__attribute__((unused)) void *dummy); + +#endif /* __L3_FWD_H__ */ diff --git a/examples/l3fwd-ifpx/main.c b/examples/l3fwd-ifpx/main.c new file mode 100644 index 000000000..7f1da5ec2 --- /dev/null +++ b/examples/l3fwd-ifpx/main.c @@ -0,0 +1,740 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. 
+ */ + +#include <stdlib.h> +#include <stdint.h> +#include <inttypes.h> +#include <sys/types.h> +#include <string.h> +#include <sys/queue.h> +#include <stdarg.h> +#include <errno.h> +#include <getopt.h> +#include <signal.h> +#include <stdbool.h> + +#include <rte_byteorder.h> +#include <rte_memory.h> +#include <rte_memcpy.h> +#include <rte_eal.h> +#include <rte_launch.h> +#include <rte_atomic.h> +#include <rte_cycles.h> +#include <rte_prefetch.h> +#include <rte_lcore.h> +#include <rte_per_lcore.h> +#include <rte_branch_prediction.h> +#include <rte_interrupts.h> +#include <rte_random.h> +#include <rte_debug.h> +#include <rte_ether.h> +#include <rte_ethdev.h> +#include <rte_mempool.h> +#include <rte_mbuf.h> +#include <rte_ip.h> +#include <rte_tcp.h> +#include <rte_udp.h> +#include <rte_string_fns.h> +#include <rte_cpuflags.h> +#include <rte_if_proxy.h> + +#include <cmdline_parse.h> +#include <cmdline_parse_etheraddr.h> + +#include "l3fwd.h" + +/* + * Configurable number of RX/TX ring descriptors + */ +#define RTE_TEST_RX_DESC_DEFAULT 1024 +#define RTE_TEST_TX_DESC_DEFAULT 1024 + +#define MAX_TX_QUEUE_PER_PORT RTE_MAX_ETHPORTS +#define MAX_RX_QUEUE_PER_PORT 128 + +#define MAX_LCORE_PARAMS 1024 + +/* Static global variables used within this file. */ +static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT; +static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT; + +/**< Ports set in promiscuous mode off by default. */ +static int promiscuous_on; + +/* Global variables. 
*/ + +static int parse_ptype; /**< Parse packet type using rx callback, and */ + /**< disabled by default */ + +volatile bool force_quit; + +/* mask of enabled/active ports */ +uint32_t enabled_port_mask; +uint32_t active_port_mask; + +struct lcore_conf lcore_conf[RTE_MAX_LCORE]; + +struct lcore_params { + uint16_t port_id; + uint8_t queue_id; + uint8_t lcore_id; +} __rte_cache_aligned; + +static struct lcore_params lcore_params[MAX_LCORE_PARAMS]; +static struct lcore_params lcore_params_default[] = { + {0, 0, 2}, + {0, 1, 2}, + {0, 2, 2}, + {1, 0, 2}, + {1, 1, 2}, + {1, 2, 2}, + {2, 0, 2}, + {3, 0, 3}, + {3, 1, 3}, +}; + +static uint16_t nb_lcore_params; + +static struct rte_eth_conf port_conf = { + .rxmode = { + .mq_mode = ETH_MQ_RX_RSS, + .max_rx_pkt_len = RTE_ETHER_MAX_LEN, + .split_hdr_size = 0, + .offloads = DEV_RX_OFFLOAD_CHECKSUM, + }, + .rx_adv_conf = { + .rss_conf = { + .rss_key = NULL, + .rss_hf = ETH_RSS_IP, + }, + }, + .txmode = { + .mq_mode = ETH_MQ_TX_NONE, + }, +}; + +static struct rte_mempool *pktmbuf_pool; + +static int +check_lcore_params(void) +{ + uint8_t queue, lcore; + uint16_t i, port_id; + int socketid; + + for (i = 0; i < nb_lcore_params; ++i) { + queue = lcore_params[i].queue_id; + if (queue >= MAX_RX_QUEUE_PER_PORT) { + RTE_LOG(ERR, L3FWD, "Invalid queue number: %hhu\n", + queue); + return -1; + } + lcore = lcore_params[i].lcore_id; + if (!rte_lcore_is_enabled(lcore)) { + RTE_LOG(ERR, L3FWD, "lcore %hhu is not enabled " + "in lcore mask\n", lcore); + return -1; + } + port_id = lcore_params[i].port_id; + if ((enabled_port_mask & (1 << port_id)) == 0) { + RTE_LOG(ERR, L3FWD, "port %u is not enabled " + "in port mask\n", port_id); + return -1; + } + if (!rte_eth_dev_is_valid_port(port_id)) { + RTE_LOG(ERR, L3FWD, "port %u is not present " + "on the board\n", port_id); + return -1; + } + socketid = rte_lcore_to_socket_id(lcore); + if (socketid != 0) { + RTE_LOG(WARNING, L3FWD, + "lcore %hhu is on socket %d with numa off\n", + lcore, 
socketid); + } + } + return 0; +} + +static int +add_proxies(void) +{ + uint16_t i, p, port_id, proxy_id; + + for (i = 0, p = nb_lcore_params; i < nb_lcore_params; ++i) { + if (p >= RTE_DIM(lcore_params)) { + RTE_LOG(ERR, L3FWD, "Not enough room in lcore_params " + "to add proxy\n"); + return -1; + } + port_id = lcore_params[i].port_id; + if (rte_ifpx_proxy_get(port_id) != RTE_MAX_ETHPORTS) + continue; + + proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT); + if (proxy_id == RTE_MAX_ETHPORTS) { + RTE_LOG(ERR, L3FWD, "Failed to crate proxy\n"); + return -1; + } + rte_ifpx_port_bind(port_id, proxy_id); + /* mark proxy as enabled - the corresponding port is, since we + * are after checking of lcore_params + */ + enabled_port_mask |= 1 << proxy_id; + lcore_params[p].port_id = proxy_id; + lcore_params[p].lcore_id = lcore_params[i].lcore_id; + lcore_params[p].queue_id = lcore_params[i].queue_id; + ++p; + } + + nb_lcore_params = p; + return 0; +} + +static uint8_t +get_port_n_rx_queues(const uint16_t port) +{ + int queue = -1; + uint16_t i; + + for (i = 0; i < nb_lcore_params; ++i) { + if (lcore_params[i].port_id == port) { + if (lcore_params[i].queue_id == queue+1) + queue = lcore_params[i].queue_id; + else + rte_exit(EXIT_FAILURE, "queue ids of the port %d must be" + " in sequence and must start with 0\n", + lcore_params[i].port_id); + } + } + return (uint8_t)(++queue); +} + +static int +init_lcore_rx_queues(void) +{ + uint16_t i, p, nb_rx_queue; + uint8_t lcore; + struct lcore_rx_queue *rq; + + for (i = 0; i < nb_lcore_params; ++i) { + lcore = lcore_params[i].lcore_id; + nb_rx_queue = lcore_conf[lcore].n_rx_queue; + if (nb_rx_queue >= MAX_RX_QUEUE_PER_LCORE) { + RTE_LOG(ERR, L3FWD, + "too many queues (%u) for lcore: %u\n", + (unsigned int)nb_rx_queue + 1, + (unsigned int)lcore); + return -1; + } + rq = &lcore_conf[lcore].rx_queue_list[nb_rx_queue]; + rq->port_id = lcore_params[i].port_id; + rq->queue_id = lcore_params[i].queue_id; + if (rte_ifpx_is_proxy(rq->port_id)) 
{ + if (rte_ifpx_port_get(rq->port_id, &p, 1) > 0) + rq->dst_port = p; + else + RTE_LOG(WARNING, L3FWD, + "Found proxy that has no port bound\n"); + } else + rq->dst_port = RTE_MAX_ETHPORTS; + lcore_conf[lcore].n_rx_queue++; + } + return 0; +} + +/* display usage */ +static void +print_usage(const char *prgname) +{ + fprintf(stderr, "%s [EAL options] --" + " -p PORTMASK" + " [-P]" + " --config (port,queue,lcore)[,(port,queue,lcore)]" + " [--ipv6]" + " [--parse-ptype]" + + " -p PORTMASK: Hexadecimal bitmask of ports to configure\n" + " -P : Enable promiscuous mode\n" + " --config (port,queue,lcore): Rx queue configuration\n" + " --ipv6: Set if running ipv6 packets\n" + " --parse-ptype: Set to use software to analyze packet type\n", + prgname); +} + +static int +parse_portmask(const char *portmask) +{ + char *end = NULL; + unsigned long pm; + + /* parse hexadecimal string */ + pm = strtoul(portmask, &end, 16); + if ((portmask[0] == '\0') || (end == NULL) || (*end != '\0')) + return -1; + + if (pm == 0) + return -1; + + return pm; +} + +static int +parse_config(const char *q_arg) +{ + char s[256]; + const char *p, *p0 = q_arg; + char *end; + enum fieldnames { + FLD_PORT = 0, + FLD_QUEUE, + FLD_LCORE, + _NUM_FLD + }; + unsigned long int_fld[_NUM_FLD]; + char *str_fld[_NUM_FLD]; + int i; + unsigned int size; + + nb_lcore_params = 0; + + while ((p = strchr(p0, '(')) != NULL) { + ++p; + p0 = strchr(p, ')'); + if (p0 == NULL) + return -1; + + size = p0 - p; + if (size >= sizeof(s)) + return -1; + + snprintf(s, sizeof(s), "%.*s", size, p); + if (rte_strsplit(s, sizeof(s), str_fld, _NUM_FLD, ',') != + _NUM_FLD) + return -1; + for (i = 0; i < _NUM_FLD; i++) { + errno = 0; + int_fld[i] = strtoul(str_fld[i], &end, 0); + if (errno != 0 || end == str_fld[i] || int_fld[i] > 255) + return -1; + } + if (nb_lcore_params >= MAX_LCORE_PARAMS) { + RTE_LOG(ERR, L3FWD, "exceeded max number of lcore " + "params: %hu\n", nb_lcore_params); + return -1; + } + 
lcore_params[nb_lcore_params].port_id = + (uint8_t)int_fld[FLD_PORT]; + lcore_params[nb_lcore_params].queue_id = + (uint8_t)int_fld[FLD_QUEUE]; + lcore_params[nb_lcore_params].lcore_id = + (uint8_t)int_fld[FLD_LCORE]; + ++nb_lcore_params; + } + return 0; +} + +#define MAX_JUMBO_PKT_LEN 9600 +#define MEMPOOL_CACHE_SIZE 256 + +static const char short_options[] = + "p:" /* portmask */ + "P" /* promiscuous */ + "L" /* enable long prefix match */ + "E" /* enable exact match */ + ; + +#define CMD_LINE_OPT_CONFIG "config" +#define CMD_LINE_OPT_IPV6 "ipv6" +#define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype" +enum { + /* long options mapped to a short option */ + + /* first long only option value must be >= 256, so that we won't + * conflict with short options + */ + CMD_LINE_OPT_MIN_NUM = 256, + CMD_LINE_OPT_CONFIG_NUM, + CMD_LINE_OPT_PARSE_PTYPE_NUM, +}; + +static const struct option lgopts[] = { + {CMD_LINE_OPT_CONFIG, 1, 0, CMD_LINE_OPT_CONFIG_NUM}, + {CMD_LINE_OPT_PARSE_PTYPE, 0, 0, CMD_LINE_OPT_PARSE_PTYPE_NUM}, + {NULL, 0, 0, 0} +}; + +/* + * This expression is used to calculate the number of mbufs needed + * depending on user input, taking into account memory for rx and + * tx hardware rings, cache per lcore and mtable per port per lcore. + * RTE_MAX is used to ensure that NB_MBUF never goes below a minimum + * value of 8192 + */ +#define NB_MBUF(nports) RTE_MAX( \ + (nports*nb_rx_queue*nb_rxd + \ + nports*nb_lcores*MAX_PKT_BURST + \ + nports*n_tx_queue*nb_txd + \ + nb_lcores*MEMPOOL_CACHE_SIZE), \ + 8192U) + +/* Parse the argument given in the command line of the application */ +static int +parse_args(int argc, char **argv) +{ + int opt, ret; + char **argvopt; + int option_index; + char *prgname = argv[0]; + + argvopt = argv; + + /* Error or normal output strings. 
*/ + while ((opt = getopt_long(argc, argvopt, short_options, + lgopts, &option_index)) != EOF) { + + switch (opt) { + /* portmask */ + case 'p': + enabled_port_mask = parse_portmask(optarg); + if (enabled_port_mask == 0) { + RTE_LOG(ERR, L3FWD, "Invalid portmask\n"); + print_usage(prgname); + return -1; + } + break; + + case 'P': + promiscuous_on = 1; + break; + + /* long options */ + case CMD_LINE_OPT_CONFIG_NUM: + ret = parse_config(optarg); + if (ret) { + RTE_LOG(ERR, L3FWD, "Invalid config\n"); + print_usage(prgname); + return -1; + } + break; + + case CMD_LINE_OPT_PARSE_PTYPE_NUM: + RTE_LOG(INFO, L3FWD, "soft parse-ptype is enabled\n"); + parse_ptype = 1; + break; + + default: + print_usage(prgname); + return -1; + } + } + + if (nb_lcore_params == 0) { + memcpy(lcore_params, lcore_params_default, + sizeof(lcore_params_default)); + nb_lcore_params = RTE_DIM(lcore_params_default); + } + + if (optind >= 0) + argv[optind-1] = prgname; + + ret = optind-1; + optind = 1; /* reset getopt lib */ + return ret; +} + +static void +signal_handler(int signum) +{ + if (signum == SIGINT || signum == SIGTERM) { + RTE_LOG(NOTICE, L3FWD, + "\n\nSignal %d received, preparing to exit...\n", + signum); + force_quit = true; + } +} + +static int +prepare_ptype_parser(uint16_t portid, uint16_t queueid) +{ + if (parse_ptype) { + RTE_LOG(INFO, L3FWD, "Port %d: softly parse packet type info\n", + portid); + if (rte_eth_add_rx_callback(portid, queueid, + lpm_cb_parse_ptype, + NULL)) + return 1; + + RTE_LOG(ERR, L3FWD, "Failed to add rx callback: port=%d\n", + portid); + return 0; + } + + if (lpm_check_ptype(portid)) + return 1; + + RTE_LOG(ERR, L3FWD, + "port %d cannot parse packet type, please add --%s\n", + portid, CMD_LINE_OPT_PARSE_PTYPE); + return 0; +} + +int +main(int argc, char **argv) +{ + struct lcore_conf *lconf; + struct rte_eth_dev_info dev_info; + struct rte_eth_txconf *txconf; + int ret; + unsigned int nb_ports; + uint32_t nb_mbufs; + uint16_t queueid, portid; + unsigned 
int lcore_id; + uint32_t nb_tx_queue, nb_lcores; + uint8_t nb_rx_queue, queue; + + /* init EAL */ + ret = rte_eal_init(argc, argv); + if (ret < 0) + rte_exit(EXIT_FAILURE, "Invalid EAL parameters\n"); + argc -= ret; + argv += ret; + + force_quit = false; + signal(SIGINT, signal_handler); + signal(SIGTERM, signal_handler); + + /* parse application arguments (after the EAL ones) */ + ret = parse_args(argc, argv); + if (ret < 0) + rte_exit(EXIT_FAILURE, "Invalid L3FWD parameters\n"); + + if (check_lcore_params() < 0) + rte_exit(EXIT_FAILURE, "check_lcore_params failed\n"); + + if (add_proxies() < 0) + rte_exit(EXIT_FAILURE, "add_proxies failed\n"); + + ret = init_lcore_rx_queues(); + if (ret < 0) + rte_exit(EXIT_FAILURE, "init_lcore_rx_queues failed\n"); + + nb_ports = rte_eth_dev_count_avail(); + + nb_lcores = rte_lcore_count(); + + /* Initial number of mbufs in pool - the amount required for hardware + * rx/tx rings will be added during configuration of ports. + */ + nb_mbufs = nb_ports * nb_lcores * MAX_PKT_BURST + /* mbuf tables */ + nb_lcores * MEMPOOL_CACHE_SIZE; /* per lcore cache */ + + /* Init the lookup structures. 
*/ + setup_lpm(); + + /* initialize all ports (including proxies) */ + RTE_ETH_FOREACH_DEV(portid) { + struct rte_eth_conf local_port_conf = port_conf; + + /* skip ports that are not enabled */ + if ((enabled_port_mask & (1 << portid)) == 0) { + RTE_LOG(INFO, L3FWD, "Skipping disabled port %d\n", + portid); + continue; + } + + /* init port */ + RTE_LOG(INFO, L3FWD, "Initializing port %d ...\n", portid); + + nb_rx_queue = get_port_n_rx_queues(portid); + nb_tx_queue = nb_lcores; + + ret = rte_eth_dev_info_get(portid, &dev_info); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "Error during getting device (port %u) info: %s\n", + portid, strerror(-ret)); + if (nb_rx_queue > dev_info.max_rx_queues || + nb_tx_queue > dev_info.max_tx_queues) + rte_exit(EXIT_FAILURE, + "Port %d cannot configure enough queues\n", + portid); + + RTE_LOG(INFO, L3FWD, "Creating queues: nb_rxq=%d nb_txq=%u...\n", + nb_rx_queue, nb_tx_queue); + + if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE) + local_port_conf.txmode.offloads |= + DEV_TX_OFFLOAD_MBUF_FAST_FREE; + + local_port_conf.rx_adv_conf.rss_conf.rss_hf &= + dev_info.flow_type_rss_offloads; + if (local_port_conf.rx_adv_conf.rss_conf.rss_hf != + port_conf.rx_adv_conf.rss_conf.rss_hf) { + RTE_LOG(INFO, L3FWD, + "Port %u modified RSS hash function based on hardware support," + "requested:%#"PRIx64" configured:%#"PRIx64"\n", + portid, port_conf.rx_adv_conf.rss_conf.rss_hf, + local_port_conf.rx_adv_conf.rss_conf.rss_hf); + } + + ret = rte_eth_dev_configure(portid, nb_rx_queue, + (uint16_t)nb_tx_queue, + &local_port_conf); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "Cannot configure device: err=%d, port=%d\n", + ret, portid); + + ret = rte_eth_dev_adjust_nb_rx_tx_desc(portid, &nb_rxd, + &nb_txd); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "Cannot adjust number of descriptors: err=%d, " + "port=%d\n", ret, portid); + + nb_mbufs += nb_rx_queue * nb_rxd + nb_tx_queue * nb_txd; + /* init one TX queue per couple (lcore,port) */ + queueid = 0; 
+ for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + + RTE_LOG(INFO, L3FWD, "\ttxq=%u,%d\n", lcore_id, + queueid); + + txconf = &dev_info.default_txconf; + txconf->offloads = local_port_conf.txmode.offloads; + ret = rte_eth_tx_queue_setup(portid, queueid, nb_txd, + SOCKET_ID_ANY, txconf); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_tx_queue_setup: err=%d, " + "port=%d\n", ret, portid); + + lconf = &lcore_conf[lcore_id]; + lconf->tx_queue_id[portid] = queueid; + queueid++; + + lconf->tx_port_id[lconf->n_tx_port] = portid; + lconf->n_tx_port++; + } + RTE_LOG(INFO, L3FWD, "\n"); + } + + /* Init pkt pool. */ + pktmbuf_pool = rte_pktmbuf_pool_create("mbuf_pool", + rte_align32prevpow2(nb_mbufs), MEMPOOL_CACHE_SIZE, + 0, RTE_MBUF_DEFAULT_BUF_SIZE, SOCKET_ID_ANY); + if (pktmbuf_pool == NULL) + rte_exit(EXIT_FAILURE, "Cannot init mbuf pool\n"); + + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + lconf = &lcore_conf[lcore_id]; + RTE_LOG(INFO, L3FWD, "Initializing rx queues on lcore %u ...\n", + lcore_id); + /* init RX queues */ + for (queue = 0; queue < lconf->n_rx_queue; ++queue) { + struct rte_eth_rxconf rxq_conf; + + portid = lconf->rx_queue_list[queue].port_id; + queueid = lconf->rx_queue_list[queue].queue_id; + + RTE_LOG(INFO, L3FWD, "\trxq=%d,%d\n", portid, queueid); + + ret = rte_eth_dev_info_get(portid, &dev_info); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "Error during getting device (port %u) info: %s\n", + portid, strerror(-ret)); + + rxq_conf = dev_info.default_rxconf; + rxq_conf.offloads = port_conf.rxmode.offloads; + ret = rte_eth_rx_queue_setup(portid, queueid, + nb_rxd, SOCKET_ID_ANY, + &rxq_conf, + pktmbuf_pool); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_rx_queue_setup: err=%d, port=%d\n", + ret, portid); + } + } + + RTE_LOG(INFO, L3FWD, "\n"); + + /* start ports */ + RTE_ETH_FOREACH_DEV(portid) { + if 
((enabled_port_mask & (1 << portid)) == 0) + continue; + + /* Start device */ + ret = rte_eth_dev_start(portid); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_dev_start: err=%d, port=%d\n", + ret, portid); + + /* + * If enabled, put device in promiscuous mode. + * This allows IO forwarding mode to forward packets + * to itself through 2 cross-connected ports of the + * target machine. + */ + if (promiscuous_on) { + ret = rte_eth_promiscuous_enable(portid); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "rte_eth_promiscuous_enable: err=%s, port=%u\n", + rte_strerror(-ret), portid); + } + } + /* we've managed to start all enabled ports so active == enabled */ + active_port_mask = enabled_port_mask; + + RTE_LOG(INFO, L3FWD, "\n"); + + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + lconf = &lcore_conf[lcore_id]; + for (queue = 0; queue < lconf->n_rx_queue; ++queue) { + portid = lconf->rx_queue_list[queue].port_id; + queueid = lconf->rx_queue_list[queue].queue_id; + if (prepare_ptype_parser(portid, queueid) == 0) + rte_exit(EXIT_FAILURE, "ptype check fails\n"); + } + } + + if (init_if_proxy() < 0) + rte_exit(EXIT_FAILURE, "Failed to configure proxy lib\n"); + wait_for_config_done(); + + ret = 0; + /* launch per-lcore init on every lcore */ + rte_eal_mp_remote_launch(lpm_main_loop, NULL, CALL_MASTER); + RTE_LCORE_FOREACH_SLAVE(lcore_id) { + if (rte_eal_wait_lcore(lcore_id) < 0) { + ret = -1; + break; + } + } + + /* stop ports */ + RTE_ETH_FOREACH_DEV(portid) { + if ((enabled_port_mask & (1 << portid)) == 0) + continue; + RTE_LOG(INFO, L3FWD, "Closing port %d...", portid); + rte_eth_dev_stop(portid); + rte_eth_dev_close(portid); + rte_log(RTE_LOG_INFO, RTE_LOGTYPE_L3FWD, " Done\n"); + } + + close_if_proxy(); + RTE_LOG(INFO, L3FWD, "Bye...\n"); + + return ret; +} diff --git a/examples/l3fwd-ifpx/meson.build b/examples/l3fwd-ifpx/meson.build new file mode 100644 index 000000000..f0c0920b8 --- /dev/null 
+++ b/examples/l3fwd-ifpx/meson.build
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2020 Marvell International Ltd.
+
+# meson file, for building this example as part of a main DPDK build.
+#
+# To build this example as a standalone application with an already-installed
+# DPDK instance, use 'make'
+
+allow_experimental_apis = true
+deps += ['hash', 'lpm', 'if_proxy']
+sources = files('l3fwd.c', 'main.c')
diff --git a/examples/meson.build b/examples/meson.build
index 1f2b6f516..319d765eb 100644
--- a/examples/meson.build
+++ b/examples/meson.build
@@ -23,7 +23,7 @@ all_examples = [
 	'l2fwd', 'l2fwd-cat', 'l2fwd-event',
 	'l2fwd-crypto', 'l2fwd-jobstats',
 	'l2fwd-keepalive', 'l3fwd',
-	'l3fwd-acl', 'l3fwd-power',
+	'l3fwd-acl', 'l3fwd-ifpx', 'l3fwd-power',
 	'link_status_interrupt',
 	'multi_process/client_server_mp/mp_client',
 	'multi_process/client_server_mp/mp_server',
-- 
2.17.1

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 " Andrzej Ostruszka ` (3 preceding siblings ...) 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 4/4] if_proxy: add example application Andrzej Ostruszka @ 2020-03-25 8:08 ` David Marchand 2020-03-25 11:11 ` Morten Brørup 2020-03-26 12:41 ` Andrzej Ostruszka 2020-04-03 21:42 ` Thomas Monjalon 5 siblings, 2 replies; 64+ messages in thread From: David Marchand @ 2020-03-25 8:08 UTC (permalink / raw) To: Andrzej Ostruszka; +Cc: dev Hello Andrzej, On Tue, Mar 10, 2020 at 12:11 PM Andrzej Ostruszka <aostruszka@marvell.com> wrote: > > What is this useful for > ======================= > > Usually, when an ethernet port is assigned to DPDK it vanishes from the > system and user looses ability to control it via normal configuration > utilities (e.g. those from iproute2 package). Moreover by default DPDK > application is not aware of the network configuration of the system. > > To address both of these issues application needs to: > - add some command line interface (or other mechanism) allowing for > control of the port and its configuration > - query the status of network configuration and monitor its changes > > The purpose of this library is to help with both of these tasks (as long > as they remain in domain of configuration available to the system). In > other words, if DPDK application has some special needs, that cannot be > addressed by the normal system configuration utilities, then they need > to be solved by the application itself. > > The connection between DPDK and system is based on the existence of > ports that are visible to both DPDK and system (like Tap, KNI and > possibly some other drivers). These ports serve as an interface > proxies. 
> > Let's visualize the action of the library by the following example: > > Linux | DPDK > ============================================================== > | > | +-------+ +-------+ > | | Port1 | | Port2 | > "ip link set dev tap1 mtu 1600" | +-------+ +-------+ > | | ^ ^ ^ > | +------+ | mtu_change | | > `->| Tap1 |---' callback | | > +------+ | | > "ip addr add 198.51.100.14 \ | | | > dev tap2" | | | > | +------+ | | > +->| Tap2 |------------------' | > | +------+ addr_add callback | > "ip route add 198.0.2.0/24 \ | | | > dev tap2" | | route_add callback | > | `---------------------' > > So we have two ports Port1 and Port2 that are not visible to the system. > We create two proxy interfaces (here based on Tap driver) and bind the > ports to their proxies. When user issues a command changing MTU for > Tap1 interface the library notes this and calls "mtu_change" callback > for the Port1. Similarly when user adds an IPv4 address to the Tap2 > interface "addr_add" callback is called for the Port2 and the same > happens for configuration of routing rule pointing to Tap2. Apart from > callbacks this library can notify about changes via adding events to > notification queues. See below for more inforamtion about that and > a complete list of available callbacks. > > Please note that nothing has been mentioned about forwarding of the > packets between system and DPDK. Since the proxies are normal DPDK > ports you can receive/send to them via usual RX/TX burst API. However > since the library is not aware of the structure of packet processing > used by the application it cannot automatically forward the packets - it > is responsibility of the application to include proxy ports into its > packet processing engine. 
> > As mentioned above the intention of the library is to: > - provide information about network configuration that would allow > application to decide what to do with the packets received on DPDK > ports, > - allow for control of the ports via standard configuration utilities > > Although the library only helps you to identify proxy for given port > (and vice versa) and calls appropriate callbacks it does open some > interesting possibilities. For example you can use the proxy ports to > forward packets for protocols that you do not wish to handle in DPDK > application to the system protocol stack and just listen to the > configuration changes - so that way you can "offload" handling of those > protocols to the system. > > How to use it > ============= > > Usage of this library is rather simple. You have to: > 1. Create proxy (if you don't have port suitable for being proxy or you > have one but do not wish to use it as a proxy). > 2. Bind port to proxy. > 3. Register callbacks and/or event queues. > 4. Start listening to the network configuration. > > The only mandatory requirement for DPDK port to be able to act as > a proxy is that it is visible in the system - this is checked during > port to proxy binding by calling rte_eth_dev_info_get() on proxy port > and inspecting 'if_index' field (it has to be non-zero). > One can create such port in the application by calling: > > proxy_id = rte_ifpx_create(RTE_IFPX_DEFAULT); > > Upon success this returns id of DPDK proxy port created > (RTE_MAX_ETHPORTS on failure). The argument selects type of proxy port > to create (currently Tap/KNI only). This function actually is just > a wrapper around: > > uint16_t rte_ifpx_create_by_devarg(const char *devarg); > > creating valid 'devarg' string for the chosen type of proxy. If you have > other driver capable of acting as a proxy you can call > rte_ifpx_create_by_devarg() directly passing appropriate argument. 
> > Once you have id of both port and proxy you can bind the two via: > > rte_ifpx_port_bind(port_id, proxy_id); > > This creates logical binding - as mentioned above there is no automatic > packet forwarding. With this binding whenever user changes the state of > proxy interface in the system (link up/down, change mac/mtu, add/remove > IPv4/IPv6) you get appropriate notification for the bound port. > > So far we've mentioned several times that the library calls callbacks. > They are grouped in 'struct rte_ifpx_callbacks' and user provides them > to the library via: > > rte_ifpx_callbacks_register(&cbs); > > It is worth mentioning that the context (lcore/thread) in which these > callbacks are called is implementation defined. It might differ between > different platforms, so the application needs to assume that some kind > of inter lcore/thread synchronization/communication is required. > > Apart from notification via callbacks this library also supports > notifying about the changes via adding events to the configured > notification queues. The queues are registered via: > > int rte_ifpx_queue_add(struct rte_ring *r); > > and the actual logic used is: if there is callback registered then it is > called, if it returns non-zero then event is considered completed, > otherwise event is added to each configured notification queue. > That way application can update data structures that are safe to be > modified by single writer from within callback or do the common > preprocessing steps (if any needed) in callback and data that is > replicated can be updated during handling of queued events. > > Once we have bindings in place and notification configured, the only > essential part that remains is to get the current network configuration > and start listening to its changes. This is accomplished via a call to: > > rte_ifpx_listen(); > > And basically this is all one needs to understand how to use this > library. 
Other less essential parts include: > - ability to query what events are available for given platform > - getting mapping between proxy and port > - unbinding the ports from proxy > - destroying proxy port > - closing the listening service > - getting basic information about proxy > > > Currently available features and implementation > =============================================== > > The library's API is system independent but it obviously needs some > system dependent parts. We provide exemplary Linux implementation (based > on netlink sockets). Very similar implementation is possible for > FreeBSD (with the usage of PF_ROUTE sockets). Windows implementation > would need to differ much (probably IP Helper library would be of some help). > > Here is the list of currently implemented callbacks: > > struct rte_ifpx_callbacks { > int (*mac_change)(const struct rte_ifpx_mac_change *event); > int (*mtu_change)(const struct rte_ifpx_mtu_change *event); > int (*link_change)(const struct rte_ifpx_link_change *event); > int (*addr_add)(const struct rte_ifpx_addr_change *event); > int (*addr_del)(const struct rte_ifpx_addr_change *event); > int (*addr6_add)(const struct rte_ifpx_addr6_change *event); > int (*addr6_del)(const struct rte_ifpx_addr6_change *event); > int (*route_add)(const struct rte_ifpx_route_change *event); > int (*route_del)(const struct rte_ifpx_route_change *event); > int (*route6_add)(const struct rte_ifpx_route6_change *event); > int (*route6_del)(const struct rte_ifpx_route6_change *event); > int (*neigh_add)(const struct rte_ifpx_neigh_change *event); > int (*neigh_del)(const struct rte_ifpx_neigh_change *event); > int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event); > int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event); > int (*cfg_done)(void); > }; > > They are all rather self-descriptive with the exception of the last one. 
> When the user calls rte_ifpx_listen() the library first queries the > system for its current configuration. That might require several > request/reply exchanges between DPDK and system and once it is finished > this callback is called to let application know that all info has been > gathered. > > It is worth to mention also that while typical case would be a 1-to-1 > mapping between port and proxy, the 1-to-many mapping is also supported. > In that case related callbacks will be called for each port bound to > given proxy interface - it is application responsibility to define > semantic of such mapping (e.g. all changes apply to all ports, or link > changes apply to all but other are accepted in "round robin" fashion, or > some other logic). > > As mentioned above Linux implementation is based on netlink socket. > This socket is registered as file descriptor in EAL interrupts > (similarly to how EAL alarms are implemented). > > What has changed since the RFC > ============================== > > - Platform dependent parts has been separated into a ifpx_platform > structure with callbacks for initialization, getting information about > the interface, listening to the changes and closing of the library. > That should allow easier reimplementation. > > - Notification scheme has been changed - instead of having just > callbacks now event queueing is also available (or a mix of those > two). > > - Filtering of events only related to the proxy ports - previously all > network configuration changes were reported. But DPDK application > doesn't need to know whole configuration - only just portion related > to the proxy ports. If a packet comes that does not match rules then > it can be forwarded via proxy to the system to decide what to do with > it. If that is not desired and such packets should be dropped then > null port can be created with proxy and e.g. default route installed > on it. > > - Removed previous example which was just printing notification. 
>   Instead added a simplified (stripped vectorization and other
>   performance improvements) version of l3fwd that should serve as an
>   example of using this library in real applications.
>
> Changes in V2
> =============
> - Cleaned up checkpatch warnings
> - Removed dead/unused code and added gateway clearing in l3fwd-ifpx

I can see we end up exposing structures for registering callbacks.
Did you consider some ways to avoid exposure of those? (thinking of
ABI maintenance for when this library is promoted to non-experimental).
I can see some canary at the end of an enum, can we do without it?

Is there a problem with merging ifpx support into the existing l3fwd
application rather than introducing a new example?

-- 
David Marchand
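
The notification rule quoted earlier in this message (if a callback is registered it is called; a non-zero return completes the event, otherwise the event is added to every configured queue) can be sketched in plain C. This is only a minimal illustration under stated assumptions, not the library's implementation: every `demo_*` name is hypothetical, and a fixed-size array stands in for the `rte_ring` queues the cover letter mentions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the proposed library uses rte_ifpx_* event
 * structures and rte_ring queues; none of the names below come from it. */
enum demo_event_type { DEMO_MTU_CHANGE, DEMO_ADDR_ADD, DEMO_NUM_EVENTS };

struct demo_event {
	enum demo_event_type type;
	unsigned int port_id;
};

#define DEMO_QUEUE_CAP 8
struct demo_queue {
	struct demo_event items[DEMO_QUEUE_CAP];
	size_t count;
};

typedef int (*demo_cb)(const struct demo_event *ev);

struct demo_ctx {
	demo_cb callbacks[DEMO_NUM_EVENTS]; /* NULL means "not registered" */
	struct demo_queue *queues[4];
	size_t nb_queues;
};

/* Example callback that fully handles the event (non-zero return). */
static int
demo_consume(const struct demo_event *ev)
{
	(void)ev;
	return 1;
}

/* Example callback that only preprocesses and lets the event be queued. */
static int
demo_pass(const struct demo_event *ev)
{
	(void)ev;
	return 0;
}

/* Apply the rule: callback first, queues only if the callback is absent
 * or returned 0. Returns the number of queues the event was copied to. */
static size_t
demo_dispatch(struct demo_ctx *ctx, const struct demo_event *ev)
{
	size_t i, queued = 0;
	demo_cb cb;

	assert(ctx != NULL && ev != NULL);
	assert(ev->type < DEMO_NUM_EVENTS);

	cb = ctx->callbacks[ev->type];
	if (cb != NULL && cb(ev) != 0)
		return 0; /* event considered completed */

	for (i = 0; i < ctx->nb_queues; ++i) {
		struct demo_queue *q = ctx->queues[i];

		if (q->count < DEMO_QUEUE_CAP) {
			q->items[q->count++] = *ev; /* copy by value in this sketch */
			++queued;
		}
	}
	return queued;
}
```

With one queue per lcore, a callback returning 0 gives the "prepare once in the callback, consume later on each lcore" pattern, while a callback returning non-zero keeps the event away from the queues entirely.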
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-03-25 8:08 ` [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library David Marchand @ 2020-03-25 11:11 ` Morten Brørup 2020-03-26 17:42 ` Andrzej Ostruszka 2020-03-26 12:41 ` Andrzej Ostruszka 1 sibling, 1 reply; 64+ messages in thread From: Morten Brørup @ 2020-03-25 11:11 UTC (permalink / raw) To: Andrzej Ostruszka, David Marchand; +Cc: dev Andrzej, > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of David Marchand > Sent: Wednesday, March 25, 2020 9:08 AM > > Hello Andrzej, > > On Tue, Mar 10, 2020 at 12:11 PM Andrzej Ostruszka > <aostruszka@marvell.com> wrote: > > <snip> > > > > What has changed since the RFC > > ============================== > > > > - Platform dependent parts has been separated into a ifpx_platform > > structure with callbacks for initialization, getting information > about > > the interface, listening to the changes and closing of the library. > > That should allow easier reimplementation. > > > > - Notification scheme has been changed - instead of having just > > callbacks now event queueing is also available (or a mix of those > > two). Thank you for adding event queueing! David mentions ABI forward compatibility below. Consider using a dynamically sized generic TLV (type, length, value) message format instead of a big union structure for the events. This would make it easier to extend the list of event types without breaking the ABI. And I am still strongly opposed to the callback method: The callbacks are handled as DPDK interrupts, which are running in a non-DPDK thread, i.e. a running callback may be preempted by some other Linux process. This makes it difficult to implement callbacks correctly. The risk of someone calling a non-thread safe function from a callback is high, e.g. DPDK hash table manipulation (except lookup) is not thread safe. 
Your documentation is far too vague about this:
	Please note however that the context in which these callbacks are
	called is most probably different from the one in which packets are
	handled and it is application writer responsibility to use proper
	synchronization mechanisms - if they are needed.

You need a big fat WARNING about how difficult the DPDK interrupt thread is to
work with. As I described above, it is not "most probably", it is "certainly" a
very different kind of context.

Did you check that the functions you use in your example callbacks are all
thread safe and non-blocking, so they can safely be called from a non-DPDK thread
that may be preempted by another Linux process?

<snip>
>
> I can see we end up exposing structures for registering callbacks.
> Did you consider some ways to avoid exposure of those? (thinking of
> ABI maintenance for when this library will elect to non-experimental).
> I can see some canary at the end of an enum, can we do without it?
>
> Is there a pb with merging ifpx support into the existing l3fwd
> application rather than introduce a new example?
>
>
> --
> David Marchand
>
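
The TLV (type, length, value) idea suggested in this message can be sketched as follows. This is a hedged illustration only: the header layout, the type numbers and every `demo_*` name are assumptions made for the example, not anything defined by the patch set. The point it shows is that a reader can step over a record whose type it does not know, because the length field says how far to skip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical TLV header; host byte order, value bytes follow directly. */
struct demo_tlv_hdr {
	uint16_t type;
	uint16_t length; /* number of value bytes after the header */
};

/* Hypothetical type numbers known to this (older) reader. */
enum { DEMO_TLV_MTU = 1, DEMO_TLV_LINK = 2 };

struct demo_state {
	uint16_t mtu;
	uint8_t link_up;
	unsigned int skipped; /* records with a type this reader does not know */
};

/* Append one record at offset 'off'; returns the offset after it. */
static size_t
demo_tlv_put(uint8_t *buf, size_t cap, size_t off, uint16_t type,
	     const void *value, uint16_t length)
{
	struct demo_tlv_hdr hdr = { type, length };

	assert(off + sizeof(hdr) + length <= cap);
	memcpy(buf + off, &hdr, sizeof(hdr));
	memcpy(buf + off + sizeof(hdr), value, length);
	return off + sizeof(hdr) + length;
}

/* Walk all records; returns 0 on success, -1 on a malformed buffer. */
static int
demo_tlv_walk(const uint8_t *buf, size_t len, struct demo_state *st)
{
	size_t off = 0;

	while (len - off >= sizeof(struct demo_tlv_hdr)) {
		struct demo_tlv_hdr hdr;
		const uint8_t *value;

		/* memcpy avoids unaligned access to the header */
		memcpy(&hdr, buf + off, sizeof(hdr));
		off += sizeof(hdr);
		if (hdr.length > len - off)
			return -1; /* value would run past the buffer */
		value = buf + off;

		switch (hdr.type) {
		case DEMO_TLV_MTU:
			if (hdr.length != sizeof(st->mtu))
				return -1;
			memcpy(&st->mtu, value, sizeof(st->mtu));
			break;
		case DEMO_TLV_LINK:
			if (hdr.length != sizeof(st->link_up))
				return -1;
			st->link_up = value[0];
			break;
		default:
			/* unknown type: the length alone is enough to skip it */
			st->skipped++;
			break;
		}
		off += hdr.length;
	}
	return off == len ? 0 : -1;
}
```

The trade-off raised in the reply that follows still applies: the reader has to interpret the value bytes according to the type, where the union lets it take the address of a typed member instead.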
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library
  2020-03-25 11:11     ` Morten Brørup
@ 2020-03-26 17:42       ` Andrzej Ostruszka
  2020-04-02 13:48         ` Andrzej Ostruszka [C]
  0 siblings, 1 reply; 64+ messages in thread
From: Andrzej Ostruszka @ 2020-03-26 17:42 UTC (permalink / raw)
  To: Morten Brørup, Andrzej Ostruszka, David Marchand; +Cc: dev

On 3/25/20 12:11 PM, Morten Brørup wrote:
[...]
>>> - Notification scheme has been changed - instead of having just
>>>   callbacks now event queueing is also available (or a mix of those
>>>   two).
>
> Thank you for adding event queueing!

That was actually a good input from you - thank you.

> David mentions ABI forward compatibility below.
> Consider using a dynamically sized generic TLV (type, length, value)
> message format instead of a big union structure for the events. This
> would make it easier to extend the list of event types without breaking
> the ABI.

My understanding is that David was talking about registering of
callbacks and you want to extend this to event definition.

So let's focus on one example:

	...
	RTE_IFPX_NEIGH_ADD,
	RTE_IFPX_NEIGH_DEL,
	...

	struct rte_ifpx_neigh_change {
		uint16_t port_id;
		struct rte_ether_addr mac;
		uint32_t ip;
	};

Right now the event is defined as:

	struct rte_ifpx_event {
		enum rte_ifpx_event_type type;
		union {
			...
			struct rte_ifpx_neigh_change neigh_change;
			...
		};
	};

So what the user does is a switch on event->type:

	switch (ev->type) {
	case RTE_IFPX_NEIGH_ADD:
		handle_neigh_add(lconf, &ev->neigh_change);
		break;
	case RTE_IFPX_NEIGH_DEL:
		handle_neigh_del(lconf, &ev->neigh_change);
		break;

How would adding more event types to this union break the ABI? The user
gets an event from the queue (allocated by the lib), checks the type and
casts the pointer past the 'type' to the proper event definition. And when
done with the event simply free()s it (BTW right now it is malloc() not
rte_malloc() - should I change that?).
If the app links against a newer version of the lib then it might get a type which it does not understand/handle, so it should skip it (possibly with a warning). I'm not sure how changing rte_ifpx_event to: struct rte_ifpx_event { enum rte_ifpx_event_type type; int length; uint8_t data[]; }; would help here. The user would need to cast data based on the event type, whereas now he takes the address of the proper union member - and the union is there only to avoid casting. In both cases what is important is that RTE_IFPX_NEIGH_ADD/DEL and "struct rte_ifpx_neigh_change" don't change between versions (new values can be added - or new versions of the previously existing events when trying to make a change). And for the callbacks it is more or less the same - the library will prepare data and call the callback with a pointer to this data. Handling of new event types should be automatic when I implement what David wanted - the lib callback for the new event will simply be NULL, nothing will be called and the application will work without problems. > And I am still strongly opposed to the callback method: Noted - however for now I would like to keep them. I don't have much experience with this library so if they prove to be inadequate then we will remove them. Right now they seem to add some flexibility that I like: - if something should be changed globally and once (and it is safe to do so!) then it can be done from the callback - if something can be prepared once and consumed later by lcores then it can be done in callback and the callback returns 0 so that event is still queued and lcores (under assumption that queues are per lcore) pick up what has been prepared. > The callbacks are handled as DPDK interrupts, which are running in a non-DPDK > thread, i.e. a running callback may be preempted by some other Linux process. > This makes it difficult to implement callbacks correctly. > The risk of someone calling a non-thread safe function from a callback is high, > e.g.
DPDK hash table manipulation (except lookup) is not thread safe. > > Your documentation is far too vague about this: > Please note however that the context in which these callbacks are > called is most probably different from the one in which packets are > handled and it is application writer responsibility to use proper > synchronization mechanisms - if they are needed. > > You need a big fat WARNING about how difficult the DPDK interrupt thread is to > work with. As I described above, it is not "most probably" it is "certainly" a > very different kind of context. OK. Will update in next version. > Did you check that the functions you use in your example callbacks are all > thread safe and non-blocking, so they can safely be called from a non-DPDK thread > that may be preempted by a another Linux process? I believe so. However there is a big question whether my assumption about LPM is correct. I've looked at the code and it looks like it so but I'm not in power to authoritatively declare it. So again, to me LPM looks like safe to be changed by a single writer while being used by multiple readers (with an obvious transient period when rule is being expanded and some IPs might go with an old and some with a new destination). With regards Andrzej Ostruszka ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-03-26 17:42 ` Andrzej Ostruszka @ 2020-04-02 13:48 ` Andrzej Ostruszka [C] 2020-04-03 17:19 ` Thomas Monjalon 0 siblings, 1 reply; 64+ messages in thread From: Andrzej Ostruszka [C] @ 2020-04-02 13:48 UTC (permalink / raw) To: Morten Brørup, David Marchand; +Cc: dev On 3/26/20 6:42 PM, Andrzej Ostruszka wrote: > On 3/25/20 12:11 PM, Morten Brørup wrote: [...] >> And I am still strongly opposed to the callback method: > > Noted - however for now I would like to keep them. I don't have much > experience with this library so if they prove to be inadequate then we > will remove them. Right now they seem to add some flexibility that I like: > - if something should be changed globally and once (and it is safe to do > so!) then it can be done from the callback > - if something can be prepared once and consumed later by lcores then it > can be done in callback and the callback returns 0 so that event is > still queued and lcores (under assumption that queues are per lcore) > pick up what has been prepared. Morten I've been thinking about this a bit and would like to know your (and others) opinion about following proposed enhancement. Right now, how queues are used is left to the application decision (per lcore, per port, ...) - and I intend to keep it that way - but they are "match all". What I mean by that is that (unlike callbacks where you have separate per event type) queue has no chance to be selective. So if someone would like to go with queues only they he would have to coordinate between queues (or their "owners") which one does the handling of an event that supposedly should be handled only once. Let's take this forwarding example - the queues are per lcore and each lcore keeps its own copy of ARP table (hash) so when the change is noticed the event is queued to all registered queue, each lcore updates its own copy and everything is OK. 
However the routing is global (and right now is updated from callback) and if no callback is used for that then the event would be queued to all lcores and the application would need to select the one which does the update. Would it be easier/better to register a queue together with a bitmask of the event types that the given queue accepts? Then during the setup phase the application would select just one queue to handle "global" events and the logic of event handling for lcores should be simpler. Let me know what you think. With regards Andrzej Ostruszka ^ permalink raw reply [flat|nested] 64+ messages in thread
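The bitmask registration proposed in the message above can be sketched as follows. This is a minimal illustration, not the library's API: `queue_register()`, `queues_for_event()`, the `EV_*` names and the fixed-size registry are all hypothetical, and queues are represented by plain integer ids rather than real ring pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event types; the real library defines its own enum. */
enum ev_type { EV_NEIGH_ADD, EV_NEIGH_DEL, EV_ROUTE_ADD, EV_ROUTE_DEL, EV_NUM };

#define EV_BIT(t)  (UINT64_C(1) << (t))
#define EV_ALL     (EV_BIT(EV_NUM) - 1)	/* "match all", today's behaviour */
#define MAX_QUEUES 8

struct queue_reg {
	int queue_id;		/* stand-in for a pointer to the real queue */
	uint64_t accept_mask;	/* one bit per accepted event type */
};

static struct queue_reg regs[MAX_QUEUES];
static size_t nb_regs;

/* Register a queue with the set of event types it accepts. */
int queue_register(int queue_id, uint64_t accept_mask)
{
	if (nb_regs == MAX_QUEUES)
		return -1;
	regs[nb_regs].queue_id = queue_id;
	regs[nb_regs].accept_mask = accept_mask;
	nb_regs++;
	return 0;
}

/* Collect the ids of queues that accept event type t; returns their count. */
size_t queues_for_event(enum ev_type t, int *out, size_t out_len)
{
	size_t n = 0;

	assert(t < EV_NUM);
	for (size_t i = 0; i < nb_regs && n < out_len; i++)
		if (regs[i].accept_mask & EV_BIT(t))
			out[n++] = regs[i].queue_id;
	return n;
}
```

With this shape, an application could register one queue with `EV_ALL` for the "global" routing updates and the remaining per-lcore queues with only the neighbour bits, so a route event is delivered once while ARP changes still reach every lcore.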
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-02 13:48 ` Andrzej Ostruszka [C] @ 2020-04-03 17:19 ` Thomas Monjalon 2020-04-03 19:09 ` Jerin Jacob 2020-04-04 18:30 ` Andrzej Ostruszka [C] 0 siblings, 2 replies; 64+ messages in thread From: Thomas Monjalon @ 2020-04-03 17:19 UTC (permalink / raw) To: Andrzej Ostruszka [C] Cc: Morten Brørup, David Marchand, dev, bruce.richardson, anatoly.burakov 02/04/2020 15:48, Andrzej Ostruszka [C]: > On 3/26/20 6:42 PM, Andrzej Ostruszka wrote: > > On 3/25/20 12:11 PM, Morten Brørup wrote: > [...] > >> And I am still strongly opposed to the callback method: > > > > Noted - however for now I would like to keep them. I don't have much > > experience with this library so if they prove to be inadequate then we > > will remove them. Right now they seem to add some flexibility that I like: > > - if something should be changed globally and once (and it is safe to do > > so!) then it can be done from the callback > > - if something can be prepared once and consumed later by lcores then it > > can be done in callback and the callback returns 0 so that event is > > still queued and lcores (under assumption that queues are per lcore) > > pick up what has been prepared. > > Morten > > I've been thinking about this a bit and would like to know your (and > others) opinion about following proposed enhancement. > > Right now, how queues are used is left to the application decision (per > lcore, per port, ...) - and I intend to keep it that way - but they are > "match all". What I mean by that is that (unlike callbacks where you > have separate per event type) queue has no chance to be selective. > > So if someone would like to go with queues only they he would have to > coordinate between queues (or their "owners") which one does the > handling of an event that supposedly should be handled only once. 
> > Let's take this forwarding example - the queues are per lcore and each > lcore keeps its own copy of ARP table (hash) so when the change is > noticed the event is queued to all registered queue, each lcore updates > its own copy and everything is OK. However the routing is global (and > right now is updated from callback) and if no callback is used for that > then the event would be queued to all lcores and application would need > to select the one which does the update. > > Would that be easier/better to register queue together with a bitmask of > event types that given queue is accepting? Than during setup phase > application would select just one queue to handle "global" events and > the logic of event handling for lcores should be simplier. > > Let me know what you think. I think we want to avoid a complicated design. So let's choose between callback and message queue. I vote for message queue because it can handle any situation, and it allows controlling the context of the event processing. The other reason is that I believe we need message queueing for other purposes in DPDK (ex: multi-process, telemetry). You start thinking about complex message management. And I start thinking about other usages of message queueing. So I think it is the right time to introduce generic messaging in DPDK. Note: the IPC rte_mp should be built on top of such generic messaging. If you agree, we can start a new email thread to better discuss the generic messaging sub-system. I describe here the 3 properties I have in mind: 1/ Message policy One very important rule in DPDK is to leave control to the application. So the messaging policy must be managed by the application via DPDK API. 2/ Message queue It seems we should rely on ZeroMQ. Here is why: http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ 3/ Message format I am not sure whether we can manage with "simple strings", TLV, or should we use something more complex like protobuf? 
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-03 17:19 ` Thomas Monjalon @ 2020-04-03 19:09 ` Jerin Jacob 2020-04-03 21:18 ` Morten Brørup 2020-04-04 18:30 ` Andrzej Ostruszka [C] 1 sibling, 1 reply; 64+ messages in thread From: Jerin Jacob @ 2020-04-03 19:09 UTC (permalink / raw) To: Thomas Monjalon Cc: Andrzej Ostruszka [C], Morten Brørup, David Marchand, dpdk-dev, Richardson, Bruce, Anatoly Burakov On Fri, Apr 3, 2020 at 10:49 PM Thomas Monjalon <thomas@monjalon.net> wrote: > > 02/04/2020 15:48, Andrzej Ostruszka [C]: > > On 3/26/20 6:42 PM, Andrzej Ostruszka wrote: > > > On 3/25/20 12:11 PM, Morten Brørup wrote: > > [...] > > >> And I am still strongly opposed to the callback method: > > > > > > Noted - however for now I would like to keep them. I don't have much > > > experience with this library so if they prove to be inadequate then we > > > will remove them. Right now they seem to add some flexibility that I like: > > > - if something should be changed globally and once (and it is safe to do > > > so!) then it can be done from the callback > > > - if something can be prepared once and consumed later by lcores then it > > > can be done in callback and the callback returns 0 so that event is > > > still queued and lcores (under assumption that queues are per lcore) > > > pick up what has been prepared. > > > > Morten > > > > I've been thinking about this a bit and would like to know your (and > > others) opinion about following proposed enhancement. > > > > Right now, how queues are used is left to the application decision (per > > lcore, per port, ...) - and I intend to keep it that way - but they are > > "match all". What I mean by that is that (unlike callbacks where you > > have separate per event type) queue has no chance to be selective. 
> > > > So if someone would like to go with queues only they he would have to > > coordinate between queues (or their "owners") which one does the > > handling of an event that supposedly should be handled only once. > > > > Let's take this forwarding example - the queues are per lcore and each > > lcore keeps its own copy of ARP table (hash) so when the change is > > noticed the event is queued to all registered queue, each lcore updates > > its own copy and everything is OK. However the routing is global (and > > right now is updated from callback) and if no callback is used for that > > then the event would be queued to all lcores and application would need > > to select the one which does the update. > > > > Would that be easier/better to register queue together with a bitmask of > > event types that given queue is accepting? Than during setup phase > > application would select just one queue to handle "global" events and > > the logic of event handling for lcores should be simplier. > > > > Let me know what you think. > > I think we want to avoid complicate design. > So let's choose between callback and message queue. > I vote for message queue because it can handle any situation, > and it allows to control the context of the event processing. IMO, it should be left to application decision, Application can use either callback or message queue based on their design and I don't think, DPDK needs to enforce certain model. On the upside, Giving two options, the application can choose the right model. The simple use case like updating the global routing table, The callback scheme would be more than enough. The downside of pushing the architecture to message queue would be that application either need to create additional control thread to poll or call select() get the event or in worst case check the message queue emptiness in fastpath. So why to enforce? Thoughts? 
> The other reason is that I believe we need message queueing for > other purposes in DPDK (ex: multi-process, telemetry). As far as I know, telemetry is using a Linux socket for IPC, so I am not sure why we need to standardize the message queue infra, because each use case is different. > > You start thinking about complex message management. > And I start thinking about other usages of message queueing. > So I think it is the right time to introduce a generic messaging in DPDK. > Note: the IPC rte_mp should be built on top of such generic messaging. > > If you agree, we can start a new email thread to better discuss > the generic messaging sub-system. > I describe here the 3 properties I have in mind: > > 1/ Message policy > One very important rule in DPDK is to let the control to the application. > So the messaging policy must be managed by the application via DPDK API. Do you mean send() and recv() should be wrapped in DPDK calls? > > 2/ Message queue > It seems we should rely on ZeroMQ. Here is why: > http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ IMO, ZeroMQ is used for IPC over the network etc. In this case, the purpose is to pass the Netlink message IN THE SAME SYSTEM to the application. Do we need an external library dependency? On the same system or in a multiprocess application, our rte_ring would be more than enough. Right? If not, please enumerate the use case. > > 3/ Message format > I am not sure whether we can manage with "simple strings", TLV, > or should we use something more complex like protobuf? In this use case, we are relaying the Netlink message to the application, at least in the Linux case. I think the message should be similar to a Netlink message and make provision for such a scheme on other OSes. Why reinvent the wheel? > > ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-03 19:09 ` Jerin Jacob @ 2020-04-03 21:18 ` Morten Brørup 2020-04-03 21:57 ` Thomas Monjalon 0 siblings, 1 reply; 64+ messages in thread From: Morten Brørup @ 2020-04-03 21:18 UTC (permalink / raw) To: Jerin Jacob, Thomas Monjalon, Andrzej Ostruszka [C] Cc: David Marchand, dpdk-dev, Richardson, Bruce, Anatoly Burakov > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob > Sent: Friday, April 3, 2020 9:09 PM > > On Fri, Apr 3, 2020 at 10:49 PM Thomas Monjalon <thomas@monjalon.net> > wrote: > > > > 02/04/2020 15:48, Andrzej Ostruszka [C]: > > > On 3/26/20 6:42 PM, Andrzej Ostruszka wrote: > > > > On 3/25/20 12:11 PM, Morten Brørup wrote: > > > [...] > > > >> And I am still strongly opposed to the callback method: > > > > > > > > Noted - however for now I would like to keep them. I don't have > much > > > > experience with this library so if they prove to be inadequate > then we > > > > will remove them. Right now they seem to add some flexibility > that I like: > > > > - if something should be changed globally and once (and it is > safe to do > > > > so!) then it can be done from the callback > > > > - if something can be prepared once and consumed later by lcores > then it > > > > can be done in callback and the callback returns 0 so that > event is > > > > still queued and lcores (under assumption that queues are per > lcore) > > > > pick up what has been prepared. > > > > > > Morten > > > > > > I've been thinking about this a bit and would like to know your > (and > > > others) opinion about following proposed enhancement. > > > > > > Right now, how queues are used is left to the application decision > (per > > > lcore, per port, ...) - and I intend to keep it that way - but they > are > > > "match all". What I mean by that is that (unlike callbacks where > you > > > have separate per event type) queue has no chance to be selective. 
> > > > > > So if someone would like to go with queues only they he would have > to > > > coordinate between queues (or their "owners") which one does the > > > handling of an event that supposedly should be handled only once. > > > > > > Let's take this forwarding example - the queues are per lcore and > each > > > lcore keeps its own copy of ARP table (hash) so when the change is > > > noticed the event is queued to all registered queue, each lcore > updates > > > its own copy and everything is OK. However the routing is global > (and > > > right now is updated from callback) and if no callback is used for > that > > > then the event would be queued to all lcores and application would > need > > > to select the one which does the update. > > > > > > Would that be easier/better to register queue together with a > bitmask of > > > event types that given queue is accepting? Than during setup phase > > > application would select just one queue to handle "global" events > and > > > the logic of event handling for lcores should be simplier. > > > > > > Let me know what you think. > > > > I think we want to avoid complicate design. > > So let's choose between callback and message queue. > > I vote for message queue because it can handle any situation, > > and it allows to control the context of the event processing. > > IMO, it should be left to application decision, Application can use > either callback or > message queue based on their design and I don't think, DPDK needs to > enforce certain model. > On the upside, Giving two options, the application can choose the right > model. > The simple use case like updating the global routing table, The > callback scheme would be more than enough. > The downside of pushing the architecture to message queue would > be that application either need to create additional control thread to > poll or call select() > get the event or in worst case check the message queue emptiness in > fastpath. > So why to enforce? > > Thoughts? 
A message queue would not require an additional control thread. It would use the existing control thread that the application already has. I think you are missing an important point: The application needs to handle all control plane interactions, not just control plane interactions related to the interface proxy library. So the application already has (or needs to add) mechanisms in place for this. E.g. if a control plane event (from the interface proxy library or some other trigger) needs to be distributed across a single or multiple data plane lcores, the application already has (or needs to add) a mechanism for doing it. Adding a specific mechanism only in this library does not help all the other control plane interactions the application needs to handle. Actually it does the opposite: it requires that the application handles events from the interface proxy library in a specific way that is different from the way the application already handles other control plane events. So I'm also voting for simplicity: A single event queue, leaving it up to the application how to handle these events. > > The other reason is that I believe we need message queueing for > > other purposes in DPDK (ex: multi-process, telemetry). > > As far as I know, telemetry is using Linux socket fro IPC, I am not > sure > why do we need to standardize message queue infra? Becasue, each use > case is different. I think Thomas is suggesting that we consider the generic case of interaction with the control plane, as I described above. Not just interaction with the interface proxy events. > > > > You start thinking about complex message management. > > And I start thinking about other usages of message queueing. > > So I think it is the right time to introduce a generic messaging in > DPDK. > > Note: the IPC rte_mp should be built on top of such generic > messaging. > > > > If you agree, we can start a new email thread to better discuss > > the generic messaging sub-system. 
I agree that it should be separated from the interface proxy library. And yes, DPDK is missing a generic framework - or at least a "best practices" description - for interaction between the control plane and the data plane. So far, every DPDK application developer has to come up with his own. > > I describe here the 3 properties I have in mind: > > > > 1/ Message policy > > One very important rule in DPDK is to let the control to the > application. > > So the messaging policy must be managed by the application via DPDK > API. > > Do you mean send() and recv() should be wrapped around DPDK call? > > > > > 2/ Message queue > > It seems we should rely on ZeroMQ. Here is why: > > http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ > > IMO, ZeroMQ used for IPC over network etc. In this case, the purpose is > to pass the Netlink message IN THE SAME SYSTEM to application. > Do you need external library dependency? On the same system or > multiprocess application, our rte_ring would be more than enough. > Right? > If not, please enumerate the use case. > > > > > 3/ Message format > > I am not sure whether we can manage with "simple strings", TLV, > > or should we use something more complex like protobuf? Lean and mean is the way to go. A binary format, please. No more JSON or similar bloated encoding! > > In this use case, we are relying the Netlink message to application at > least > in Linux case. I think the message should be similar to Netlink message > and give > provision for other OS'es such as scheme. > > Why reinvent the wheel? ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-03 21:18 ` Morten Brørup @ 2020-04-03 21:57 ` Thomas Monjalon 2020-04-04 10:18 ` Jerin Jacob 0 siblings, 1 reply; 64+ messages in thread From: Thomas Monjalon @ 2020-04-03 21:57 UTC (permalink / raw) To: Jerin Jacob, Andrzej Ostruszka [C], Morten Brørup Cc: David Marchand, dpdk-dev, Richardson, Bruce, Anatoly Burakov 03/04/2020 23:18, Morten Brørup: > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob > > Thomas Monjalon <thomas@monjalon.net> wrote: > > > 02/04/2020 15:48, Andrzej Ostruszka [C]: > > > > On 3/26/20 6:42 PM, Andrzej Ostruszka wrote: > > > > > On 3/25/20 12:11 PM, Morten Brørup wrote: > > > > [...] > > > > >> And I am still strongly opposed to the callback method: > > > > > > > > > > Noted - however for now I would like to keep them. I don't have > > much > > > > > experience with this library so if they prove to be inadequate > > then we > > > > > will remove them. Right now they seem to add some flexibility > > that I like: > > > > > - if something should be changed globally and once (and it is > > safe to do > > > > > so!) then it can be done from the callback > > > > > - if something can be prepared once and consumed later by lcores > > then it > > > > > can be done in callback and the callback returns 0 so that > > event is > > > > > still queued and lcores (under assumption that queues are per > > lcore) > > > > > pick up what has been prepared. > > > > > > > > Morten > > > > > > > > I've been thinking about this a bit and would like to know your > > (and > > > > others) opinion about following proposed enhancement. > > > > > > > > Right now, how queues are used is left to the application decision > > (per > > > > lcore, per port, ...) - and I intend to keep it that way - but they > > are > > > > "match all". What I mean by that is that (unlike callbacks where > > you > > > > have separate per event type) queue has no chance to be selective. 
> > > > > > > > So if someone would like to go with queues only they he would have > > to > > > > coordinate between queues (or their "owners") which one does the > > > > handling of an event that supposedly should be handled only once. > > > > > > > > Let's take this forwarding example - the queues are per lcore and > > each > > > > lcore keeps its own copy of ARP table (hash) so when the change is > > > > noticed the event is queued to all registered queue, each lcore > > updates > > > > its own copy and everything is OK. However the routing is global > > (and > > > > right now is updated from callback) and if no callback is used for > > that > > > > then the event would be queued to all lcores and application would > > need > > > > to select the one which does the update. > > > > > > > > Would that be easier/better to register queue together with a > > bitmask of > > > > event types that given queue is accepting? Than during setup phase > > > > application would select just one queue to handle "global" events > > and > > > > the logic of event handling for lcores should be simplier. > > > > > > > > Let me know what you think. > > > > > > I think we want to avoid complicate design. > > > So let's choose between callback and message queue. > > > I vote for message queue because it can handle any situation, > > > and it allows to control the context of the event processing. > > > > IMO, it should be left to application decision, Application can use > > either callback or > > message queue based on their design and I don't think, DPDK needs to > > enforce certain model. > > On the upside, Giving two options, the application can choose the right > > model. > > The simple use case like updating the global routing table, The > > callback scheme would be more than enough. 
> > The downside of pushing the architecture to message queue would > > be that application either need to create additional control thread to > > poll or call select() > > get the event or in worst case check the message queue emptiness in > > fastpath. > > So why to enforce? > > > > Thoughts? > > A message queue would not require an additional control thread. It would use the existing control thread that the application already has. > > I think you are missing an important point: > > The application needs to handle all control plane interactions, > not just control plane interactions related to the interface proxy library. Yes this is the point. > So the application already has (or needs to add) mechanisms in place for this. E.g. if a control plane event (from the interface proxy library or some other trigger) needs to be distributed across a single or multiple data plane lcores, the application already has (or needs to add) a mechanism for doing it. Adding a specific mechanism only in this library does not help all the other control plane interactions the application needs to handle. Actually it does the opposite: it requires that the application handles events from the interface proxy library in a specific way that is different from the way the application already handles other control plane events. > > So I'm also voting for simplicity: A single event queue, leaving it up to the application how to handle these events. > > > > The other reason is that I believe we need message queueing for > > > other purposes in DPDK (ex: multi-process, telemetry). > > > > As far as I know, telemetry is using Linux socket fro IPC, I am not > > sure > > why do we need to standardize message queue infra? Becasue, each use > > case is different. > > I think Thomas is suggesting that we consider the generic case of > interaction with the control plane, as I described above. > Not just interaction with the interface proxy events. 
> > > > > > > You start thinking about complex message management. > > > And I start thinking about other usages of message queueing. > > > So I think it is the right time to introduce a generic messaging in > > DPDK. > > > Note: the IPC rte_mp should be built on top of such generic > > messaging. > > > > > > If you agree, we can start a new email thread to better discuss > > > the generic messaging sub-system. > > I agree that it should be separated from the interface proxy library. > > And yes, DPDK is missing a generic framework - or at least a "best practices" description - for interaction between the control plane and the data plane. So far, every DPDK application developer has to come up with his own. > > > > I describe here the 3 properties I have in mind: > > > > > > 1/ Message policy > > > One very important rule in DPDK is to let the control to the > > application. > > > So the messaging policy must be managed by the application via DPDK > > API. > > > > Do you mean send() and recv() should be wrapped around DPDK call? I am thinking about something a bit more complex with handlers registration and default handlers in each DPDK library. > > > 2/ Message queue > > > It seems we should rely on ZeroMQ. Here is why: > > > http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ > > > > IMO, ZeroMQ used for IPC over network etc. In this case, the purpose is > > to pass the Netlink message IN THE SAME SYSTEM to application. > > Do you need external library dependency? On the same system or > > multiprocess application, our rte_ring would be more than enough. > > Right? > > If not, please enumerate the use case. Network communication will allow standardizing a DPDK remote control. With ZeroMQ, it comes for free. > > > 3/ Message format > > > I am not sure whether we can manage with "simple strings", TLV, > > > or should we use something more complex like protobuf? > > Lean and mean is the way to go. A binary format, please. 
> No more JSON or similar bloated encoding! JSON, like other text encodings, has one advantage: it is readable when debugging. But I tend to agree that TLV is probably a good fit. > > In this use case, we are relying the Netlink message to application at > > least > > in Linux case. I think the message should be similar to Netlink message > > and give > > provision for other OS'es such as scheme. > > > > Why reinvent the wheel? I agree, we should not re-encode Netlink. With a TLV format, we can just encapsulate Netlink for the generic channel, and give it a message type to dispatch the message to the right handler. ^ permalink raw reply [flat|nested] 64+ messages in thread
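The TLV encapsulation suggested in the message above (an opaque payload such as a Netlink message, tagged with a type used for dispatch) can be sketched as below. This is only one possible shape, under stated assumptions: `struct tlv_hdr`, `tlv_put()` and `tlv_get()` are hypothetical names, the 16-bit field widths are arbitrary, and fields stay in host byte order because the sketch assumes sender and receiver run on the same system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical TLV header: type selects the handler, length covers value. */
struct tlv_hdr {
	uint16_t type;
	uint16_t length;
};

/* Append one TLV to buf; returns bytes written, or 0 if it does not fit. */
size_t tlv_put(uint8_t *buf, size_t buf_len, uint16_t type,
	       const void *val, uint16_t len)
{
	struct tlv_hdr h = { type, len };

	assert(buf != NULL && (val != NULL || len == 0));
	if (buf_len < sizeof(h) + len)
		return 0;
	memcpy(buf, &h, sizeof(h));
	if (len > 0)
		memcpy(buf + sizeof(h), val, len);	/* payload copied as-is */
	return sizeof(h) + len;
}

/* Parse one TLV from buf; returns bytes consumed, or 0 if truncated. */
size_t tlv_get(const uint8_t *buf, size_t buf_len, struct tlv_hdr *h,
	       const uint8_t **val)
{
	assert(buf != NULL && h != NULL && val != NULL);
	if (buf_len < sizeof(*h))
		return 0;
	memcpy(h, buf, sizeof(*h));
	if (buf_len - sizeof(*h) < h->length)
		return 0;
	*val = buf + sizeof(*h);	/* points into buf, no copy */
	return sizeof(*h) + h->length;
}
```

A receiver would call `tlv_get()` in a loop, look up a handler by `h.type`, and pass it the untouched payload, so the Netlink message itself never has to be re-encoded.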
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-03 21:57 ` Thomas Monjalon @ 2020-04-04 10:18 ` Jerin Jacob 2020-04-10 10:41 ` Morten Brørup 0 siblings, 1 reply; 64+ messages in thread From: Jerin Jacob @ 2020-04-04 10:18 UTC (permalink / raw) To: Thomas Monjalon Cc: Andrzej Ostruszka [C], Morten Brørup, David Marchand, dpdk-dev, Richardson, Bruce, Anatoly Burakov On Sat, Apr 4, 2020 at 3:27 AM Thomas Monjalon <thomas@monjalon.net> wrote: > > 03/04/2020 23:18, Morten Brørup: > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob > > > Thomas Monjalon <thomas@monjalon.net> wrote: > > > > 02/04/2020 15:48, Andrzej Ostruszka [C]: > > > > > On 3/26/20 6:42 PM, Andrzej Ostruszka wrote: > > > > > > On 3/25/20 12:11 PM, Morten Brørup wrote: > > > > > [...] > > > > > >> And I am still strongly opposed to the callback method: > > > > > > > > > > > > Noted - however for now I would like to keep them. I don't have > > > much > > > > > > experience with this library so if they prove to be inadequate > > > then we > > > > > > will remove them. Right now they seem to add some flexibility > > > that I like: > > > > > > - if something should be changed globally and once (and it is > > > safe to do > > > > > > so!) then it can be done from the callback > > > > > > - if something can be prepared once and consumed later by lcores > > > then it > > > > > > can be done in callback and the callback returns 0 so that > > > event is > > > > > > still queued and lcores (under assumption that queues are per > > > lcore) > > > > > > pick up what has been prepared. > > > > > > > > > > Morten > > > > > > > > > > I've been thinking about this a bit and would like to know your > > > (and > > > > > others) opinion about following proposed enhancement. > > > > > > > > > > Right now, how queues are used is left to the application decision > > > (per > > > > > lcore, per port, ...) - and I intend to keep it that way - but they > > > are > > > > > "match all". 
What I mean by that is that (unlike callbacks where > > > you > > > > > have separate per event type) queue has no chance to be selective. > > > > > > > > > > So if someone would like to go with queues only they he would have > > > to > > > > > coordinate between queues (or their "owners") which one does the > > > > > handling of an event that supposedly should be handled only once. > > > > > > > > > > Let's take this forwarding example - the queues are per lcore and > > > each > > > > > lcore keeps its own copy of ARP table (hash) so when the change is > > > > > noticed the event is queued to all registered queue, each lcore > > > updates > > > > > its own copy and everything is OK. However the routing is global > > > (and > > > > > right now is updated from callback) and if no callback is used for > > > that > > > > > then the event would be queued to all lcores and application would > > > need > > > > > to select the one which does the update. > > > > > > > > > > Would that be easier/better to register queue together with a > > > bitmask of > > > > > event types that given queue is accepting? Than during setup phase > > > > > application would select just one queue to handle "global" events > > > and > > > > > the logic of event handling for lcores should be simplier. > > > > > > > > > > Let me know what you think. > > > > > > > > I think we want to avoid complicate design. > > > > So let's choose between callback and message queue. > > > > I vote for message queue because it can handle any situation, > > > > and it allows to control the context of the event processing. > > > > > > IMO, it should be left to application decision, Application can use > > > either callback or > > > message queue based on their design and I don't think, DPDK needs to > > > enforce certain model. > > > On the upside, Giving two options, the application can choose the right > > > model. 
> > > The simple use case like updating the global routing table, The > > > callback scheme would be more than enough. > > > The downside of pushing the architecture to message queue would > > > be that application either need to create additional control thread to > > > poll or call select() > > > get the event or in worst case check the message queue emptiness in > > > fastpath. > > > So why to enforce? > > > > > > Thoughts? > > > > A message queue would not require an additional control thread. It would use the existing control thread that the application already has. Assuming every application has a control thread. > > > > I think you are missing an important point: > > > > The application needs to handle all control plane interactions, > > not just control plane interactions related to the interface proxy library. > > Yes this is the point. OK. I think the following message needs to have a unified message access scheme. 1) RTE_ETH_EVENT_ events registered using rte_eth_dev_callback_register() 2) rte_mp messages 3) telemetry control message for remote control Future: 4) IF proxy library messages 5) adding the trace control message for remote control. Since it is the control plane, slow path traffic without any performance requirement, Generalize the message comes for zero cost. +1 for standardizing the message if every subsystem planning to do the same. > > So the application already has (or needs to add) mechanisms in place for this. E.g. if a control plane event (from the interface proxy library or some other trigger) needs to be distributed across a single or multiple data plane lcores, the application already has (or needs to add) a mechanism for doing it. Adding a specific mechanism only in this library does not help all the other control plane interactions the application needs to handle. 
Actually it does the opposite: it requires that the application handles events from the interface proxy library in a specific way that is different from the way the application already handles other control plane events. > > > > So I'm also voting for simplicity: A single event queue, leaving it up to the application how to handle these events. +1 > > > > > > The other reason is that I believe we need message queueing for > > > > other purposes in DPDK (ex: multi-process, telemetry). > > > > > > As far as I know, telemetry is using Linux socket for IPC, I am not > > > sure > > > why do we need to standardize message queue infra? Because each use > > > case is different. > > ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-04 10:18 ` Jerin Jacob @ 2020-04-10 10:41 ` Morten Brørup 0 siblings, 0 replies; 64+ messages in thread From: Morten Brørup @ 2020-04-10 10:41 UTC (permalink / raw) To: Jerin Jacob, Thomas Monjalon Cc: Andrzej Ostruszka [C], David Marchand, dpdk-dev, Richardson, Bruce, Anatoly Burakov > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob > Sent: Saturday, April 4, 2020 12:19 PM > [...] > Since it is the control plane, slow path traffic without any > performance requirement, This is incorrect. Production equipment certainly has control plane performance requirements! If there were no control plane requirements, adding a single route could take 1 second. Then loading the internet route table of 850k IPv4 routes would take 10 full days. Good luck selling a router with this control plane performance. > Generalize the message comes for zero cost. > +1 for standardizing the message if every subsystem planning to do the > same. +1 for generalizing and standardizing, if it can be done right. ^ permalink raw reply [flat|nested] 64+ messages in thread
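The route-table arithmetic in the message above can be checked with a few lines of C. This is only a back-of-the-envelope helper; the function name is made up for illustration, and the 850k figure and the 1 second per route are taken from the message.

```c
#include <assert.h>

/*
 * Time, in days, needed to load `routes` routes when each insertion
 * costs `seconds_per_route` seconds (86400 seconds per day).
 */
double load_time_days(double routes, double seconds_per_route)
{
	assert(routes >= 0.0 && seconds_per_route >= 0.0);
	return routes * seconds_per_route / 86400.0;
}
```

With 850000 routes at 1 second each this gives 850000 / 86400, a little under 10 days, which matches the estimate in the message.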
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-03 17:19 ` Thomas Monjalon 2020-04-03 19:09 ` Jerin Jacob @ 2020-04-04 18:30 ` Andrzej Ostruszka [C] 2020-04-04 19:58 ` Thomas Monjalon 1 sibling, 1 reply; 64+ messages in thread From: Andrzej Ostruszka [C] @ 2020-04-04 18:30 UTC (permalink / raw) To: Thomas Monjalon Cc: Morten Brørup, David Marchand, dev, bruce.richardson, anatoly.burakov Thomas, I have replied to the other mail, here I just want to confirm, that I'm fine with the proposed "general messaging" which other libraries (IF Proxy including) could utilize. See also below. On 4/3/20 7:19 PM, Thomas Monjalon wrote: > 02/04/2020 15:48, Andrzej Ostruszka [C]: >> On 3/26/20 6:42 PM, Andrzej Ostruszka wrote: [...] >> Would that be easier/better to register queue together with a bitmask of >> event types that given queue is accepting? Than during setup phase >> application would select just one queue to handle "global" events and >> the logic of event handling for lcores should be simplier. >> >> Let me know what you think. > > I think we want to avoid complicate design. > So let's choose between callback and message queue. > I vote for message queue because it can handle any situation, > and it allows to control the context of the event processing. > The other reason is that I believe we need message queueing for > other purposes in DPDK (ex: multi-process, telemetry). > > You start thinking about complex message management. > And I start thinking about other usages of message queueing. > So I think it is the right time to introduce a generic messaging in DPDK. > Note: the IPC rte_mp should be built on top of such generic messaging. Do you have also inter-lcore communication in mind here? Or just "external" world to "some DPDK controller/dispatcher" and how that is passed to other cores is an application writer problem. > If you agree, we can start a new email thread to better discuss > the generic messaging sub-system. 
Yes, let's talk about that.

> I describe here the 3 properties I have in mind:
>
> 1/ Message policy
> One very important rule in DPDK is to let the control to the application.
> So the messaging policy must be managed by the application via DPDK API.
>
> 2/ Message queue
> It seems we should rely on ZeroMQ. Here is why:
> http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ
>
> 3/ Message format
> I am not sure whether we can manage with "simple strings", TLV,
> or should we use something more complex like protobuf?

^ permalink raw reply [flat|nested] 64+ messages in thread
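For reference, the "queue plus bitmask of accepted event types" idea quoted in this message could look roughly like the sketch below. It is not part of the posted patches: `filtered_queue`, `deliver`, `EV_BIT` and the event names are hypothetical, and a plain counter stands in for a real ring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event types. */
enum ev_type { EV_MAC_CHANGE, EV_MTU_CHANGE, EV_ROUTE_ADD, EV_NEIGH_ADD };

#define EV_BIT(t) (UINT32_C(1) << (t))

/* A queue together with the set of event types it accepts. */
struct filtered_queue {
	uint32_t accept_mask;	/* OR of EV_BIT() for accepted types */
	size_t pending;		/* stand-in for the real ring contents */
};

/*
 * Offer one event to every queue; only queues whose mask contains the
 * event type receive it.  Returns the number of queues that took it.
 */
size_t deliver(struct filtered_queue *qs, size_t nb_queues, enum ev_type type)
{
	size_t hits = 0;

	assert((unsigned int)type < 32u);
	for (size_t i = 0; i < nb_queues; ++i) {
		if (qs[i].accept_mask & EV_BIT(type)) {
			qs[i].pending++;
			hits++;
		}
	}
	return hits;
}
```

In the forwarding example from the thread, one queue would accept both route and neighbour events and act as the owner of the global routing table, while the remaining per-lcore queues would accept only neighbour events.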
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-04 18:30 ` Andrzej Ostruszka [C] @ 2020-04-04 19:58 ` Thomas Monjalon 2020-04-10 10:03 ` Morten Brørup 0 siblings, 1 reply; 64+ messages in thread From: Thomas Monjalon @ 2020-04-04 19:58 UTC (permalink / raw) To: Andrzej Ostruszka [C] Cc: Morten Brørup, David Marchand, dev, bruce.richardson, anatoly.burakov 04/04/2020 20:30, Andrzej Ostruszka [C]: > Thomas, > > I have replied to the other mail, here I just want to confirm, that I'm > fine with the proposed "general messaging" which other libraries (IF > Proxy including) could utilize. > > See also below. > > On 4/3/20 7:19 PM, Thomas Monjalon wrote: > > 02/04/2020 15:48, Andrzej Ostruszka [C]: > >> On 3/26/20 6:42 PM, Andrzej Ostruszka wrote: > [...] > >> Would that be easier/better to register queue together with a bitmask of > >> event types that given queue is accepting? Than during setup phase > >> application would select just one queue to handle "global" events and > >> the logic of event handling for lcores should be simplier. > >> > >> Let me know what you think. > > > > I think we want to avoid complicate design. > > So let's choose between callback and message queue. > > I vote for message queue because it can handle any situation, > > and it allows to control the context of the event processing. > > The other reason is that I believe we need message queueing for > > other purposes in DPDK (ex: multi-process, telemetry). > > > > You start thinking about complex message management. > > And I start thinking about other usages of message queueing. > > So I think it is the right time to introduce a generic messaging in DPDK. > > Note: the IPC rte_mp should be built on top of such generic messaging. > > Do you have also inter-lcore communication in mind here? Or just > "external" world to "some DPDK controller/dispatcher" and how that is > passed to other cores is an application writer problem. 
I was thinking at communication with: - DPDK event from random context - secondary process - external application - remote application In all cases, I thought the message receiver would be the master core. But you are probably right that targeting a specific core may be interesting. > > If you agree, we can start a new email thread to better discuss > > the generic messaging sub-system. > > Yes, lets talk about that. Let's start with the requirements (as above). I'll write such email soon. > > I describe here the 3 properties I have in mind: > > > > 1/ Message policy > > One very important rule in DPDK is to let the control to the application. > > So the messaging policy must be managed by the application via DPDK API. > > > > 2/ Message queue > > It seems we should rely on ZeroMQ. Here is why: > > http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ > > > > 3/ Message format > > I am not sure whether we can manage with "simple strings", TLV, > > or should we use something more complex like protobuf? ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-04 19:58 ` Thomas Monjalon @ 2020-04-10 10:03 ` Morten Brørup 2020-04-10 12:28 ` Jerin Jacob 0 siblings, 1 reply; 64+ messages in thread From: Morten Brørup @ 2020-04-10 10:03 UTC (permalink / raw) To: Thomas Monjalon, Andrzej Ostruszka [C] Cc: David Marchand, dev, bruce.richardson, anatoly.burakov > From: Thomas Monjalon [mailto:thomas@monjalon.net] > Sent: Saturday, April 4, 2020 9:58 PM > > 04/04/2020 20:30, Andrzej Ostruszka [C]: > > Thomas, > > > > I have replied to the other mail, here I just want to confirm, that > I'm > > fine with the proposed "general messaging" which other libraries (IF > > Proxy including) could utilize. > > > > See also below. > > > > On 4/3/20 7:19 PM, Thomas Monjalon wrote: > > > 02/04/2020 15:48, Andrzej Ostruszka [C]: > > >> On 3/26/20 6:42 PM, Andrzej Ostruszka wrote: > > [...] > > >> Would that be easier/better to register queue together with a > bitmask of > > >> event types that given queue is accepting? Than during setup > phase > > >> application would select just one queue to handle "global" events > and > > >> the logic of event handling for lcores should be simplier. > > >> > > >> Let me know what you think. > > > > > > I think we want to avoid complicate design. > > > So let's choose between callback and message queue. > > > I vote for message queue because it can handle any situation, > > > and it allows to control the context of the event processing. > > > The other reason is that I believe we need message queueing for > > > other purposes in DPDK (ex: multi-process, telemetry). > > > > > > You start thinking about complex message management. > > > And I start thinking about other usages of message queueing. > > > So I think it is the right time to introduce a generic messaging in > DPDK. > > > Note: the IPC rte_mp should be built on top of such generic > messaging. > > > > Do you have also inter-lcore communication in mind here? 
Or just > > "external" world to "some DPDK controller/dispatcher" and how that is > > passed to other cores is an application writer problem. > > I was thinking at communication with: > - DPDK event from random context > - secondary process > - external application > - remote application > > In all cases, I thought the message receiver would be the master core. That would also be my assumption. > But you are probably right that targeting a specific core may be > interesting. > > > > If you agree, we can start a new email thread to better discuss > > > the generic messaging sub-system. > > > > Yes, lets talk about that. > > Let's start with the requirements (as above). > I'll write such email soon. > > > > > I describe here the 3 properties I have in mind: > > > > > > 1/ Message policy > > > One very important rule in DPDK is to let the control to the > application. > > > So the messaging policy must be managed by the application via DPDK > API. > > > > > > 2/ Message queue > > > It seems we should rely on ZeroMQ. Here is why: > > > http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ > > > > > > 3/ Message format > > > I am not sure whether we can manage with "simple strings", TLV, > > > or should we use something more complex like protobuf? > > > ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-10 10:03 ` Morten Brørup @ 2020-04-10 12:28 ` Jerin Jacob 0 siblings, 0 replies; 64+ messages in thread From: Jerin Jacob @ 2020-04-10 12:28 UTC (permalink / raw) To: Morten Brørup Cc: Thomas Monjalon, Andrzej Ostruszka [C], David Marchand, dpdk-dev, Richardson, Bruce, Anatoly Burakov On Fri, Apr 10, 2020 at 3:33 PM Morten Brørup <mb@smartsharesystems.com> wrote: > > > From: Thomas Monjalon [mailto:thomas@monjalon.net] > > Sent: Saturday, April 4, 2020 9:58 PM > > > > 04/04/2020 20:30, Andrzej Ostruszka [C]: > > > Thomas, > > > > > > I have replied to the other mail, here I just want to confirm, that > > I'm > > > fine with the proposed "general messaging" which other libraries (IF > > > Proxy including) could utilize. > > > > > > See also below. > > > > > > On 4/3/20 7:19 PM, Thomas Monjalon wrote: > > > > 02/04/2020 15:48, Andrzej Ostruszka [C]: > > > >> On 3/26/20 6:42 PM, Andrzej Ostruszka wrote: > > > [...] > > > >> Would that be easier/better to register queue together with a > > bitmask of > > > >> event types that given queue is accepting? Than during setup > > phase > > > >> application would select just one queue to handle "global" events > > and > > > >> the logic of event handling for lcores should be simplier. > > > >> > > > >> Let me know what you think. > > > > > > > > I think we want to avoid complicate design. > > > > So let's choose between callback and message queue. > > > > I vote for message queue because it can handle any situation, > > > > and it allows to control the context of the event processing. > > > > The other reason is that I believe we need message queueing for > > > > other purposes in DPDK (ex: multi-process, telemetry). > > > > > > > > You start thinking about complex message management. > > > > And I start thinking about other usages of message queueing. > > > > So I think it is the right time to introduce a generic messaging in > > DPDK. 
> > > > Note: the IPC rte_mp should be built on top of such generic > > messaging. > > > > > > Do you have also inter-lcore communication in mind here? Or just > > > "external" world to "some DPDK controller/dispatcher" and how that is > > > passed to other cores is an application writer problem. > > > > I was thinking at communication with: > > - DPDK event from random context > > - secondary process > > - external application > > - remote application > > > > In all cases, I thought the message receiver would be the master core. > > That would also be my assumption. IMO, DPDK should not dictate that. The library can give API to for housekeeping. It is up to the application to call on DPDK isolated cores or from the control thread. I think, we need to dictate only it should be used by a single consumer. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-03-25 8:08 ` [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library David Marchand 2020-03-25 11:11 ` Morten Brørup @ 2020-03-26 12:41 ` Andrzej Ostruszka 2020-03-30 19:23 ` Andrzej Ostruszka 1 sibling, 1 reply; 64+ messages in thread From: Andrzej Ostruszka @ 2020-03-26 12:41 UTC (permalink / raw) To: David Marchand, Andrzej Ostruszka; +Cc: dev Thank you David for taking time to look at this. On 3/25/20 9:08 AM, David Marchand wrote: > Hello Andrzej, > > On Tue, Mar 10, 2020 at 12:11 PM Andrzej Ostruszka [...] > I can see we end up exposing structures for registering callbacks. Right. I was thinking more in terms of user convenience so it seemed like a good choice to gather them in one struct and call 'register' once. The fact that the same structure is used to keep them is an implementation choice and this can be decoupled. > Did you consider some ways to avoid exposure of those? (thinking of > ABI maintenance for when this library will elect to non-experimental). I will. So far I used the union for the input since I like when things are well typed :) and there is no need for casting. However I will spend some time on this and will get back to you soon (if you have already something in your head please share). Right now I'm thinking about taking array of callbacks with each entry being ("event type", callback) pair, however need to figure out how to have minimum amount of type casting. > I can see some canary at the end of an enum, can we do without it? I followed discussion on the list about that and have thought about it but deemed that to be not a problem. This enum value is never returned from the library and the event type enum is never taken as an input (only used for event notification). So this is really implementation thing and you are right it would be better to hide it. This might be resolved by itself when I come up with something for the above ABI stability issue. 
> Is there a pb with merging ifpx support into the existing l3fwd > application rather than introduce a new example? I don't see a problem with merging per se. That might be my misunderstanding of what the examples are. I thought that each library can have its own example to show how it is supposed to be used. So decided to have simplified version of l3fwd - and initially I thought about updating l3fwd but it has some non-trivial optimizations and two modes of operations (hash/lpm) so I wanted something simple to just show how to use the library. Don't know what is the reason for this bi-modality of l3fwd: - if this is just a need to show LPM/Hash in use then I can replace that with single mode of l3fwd-ifpx where LPM is used for routing and Hash is used to keep neighbouring info - if this is to show that both LPM and Hash can be used for routing then it would complicate things as these two have different update properties. I assume (but don't have a solid proof for that) that LPM can be updated by a single writer while being used by multiple readers and use this assumption to show how such structures can be updated (Morten please cover your eyes ;-)) from a callback while other can be updated via event queuing. So if the community decides that it would be OK to morph l3fwd to: - strip the bi-modality - use LPM and Hash for different things (not both for routing) then I'm OK with that and will happily do that. Otherwise adding IFPX to l3fwd will end up with two modes with different routing implementation and different update strategies - a bit like two different apps bundled into one and chosen by the command arg. There is also a question of not having FreeBSD and Windows support yet - so things might get complicated. With regards Andrzej Ostruszka ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-03-26 12:41 ` Andrzej Ostruszka @ 2020-03-30 19:23 ` Andrzej Ostruszka 0 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-03-30 19:23 UTC (permalink / raw) To: David Marchand, Andrzej Ostruszka; +Cc: dev

On 3/26/20 1:41 PM, Andrzej Ostruszka wrote:
> Thank you David for taking time to look at this.
>
> On 3/25/20 9:08 AM, David Marchand wrote:
>> Hello Andrzej,
>>
>> On Tue, Mar 10, 2020 at 12:11 PM Andrzej Ostruszka
> [...]
>> I can see we end up exposing structures for registering callbacks.
>
> Right. I was thinking more in terms of user convenience so it seemed
> like a good choice to gather them in one struct and call 'register'
> once. The fact that the same structure is used to keep them is an
> implementation choice and this can be decoupled.
>
>> Did you consider some ways to avoid exposure of those? (thinking of
>> ABI maintenance for when this library will elect to non-experimental).
>
> I will. So far I used the union for the input since I like when things
> are well typed :) and there is no need for casting. However I will
> spend some time on this and will get back to you soon (if you have
> already something in your head please share). Right now I'm thinking
> about taking array of callbacks with each entry being ("event type",
> callback) pair, however need to figure out how to have minimum amount of
> type casting.

David,

I thought about this a bit and here is my proposal.

Define "typeful" callback pointer (public):

    union rte_ifpx_cb_ptr {
        int (*mac_change)(const struct mac_change *ev);
        int (*mtu_change)(const struct mtu_change *ev);
        ...
        int (*cfg_done)(void);
    };

In the implementation, make sure its size is as expected:

    _Static_assert(sizeof(union rte_ifpx_cb_ptr) == sizeof (int(*)(void*)),
                   "Size of callback pointer has to be "
                   "equal to size of function pointer");

Accept as input tagged callbacks (also public type):

    struct rte_ifpx_callback {
        enum rte_ifpx_event_type type;
        union rte_ifpx_cb_ptr callback;
    };

The user would define an array of callbacks:

    struct rte_ifpx_callback callbacks[] = {
        {RTE_IFPX_MAC_CHANGE, {.mac_change = mac_change}},
        {RTE_IFPX_MTU_CHANGE, {.mtu_change = mtu_change}},
        ...
        {RTE_IFPX_CFG_DONE, {.cfg_done = finished}},
    };

and pass it to registration together with its length, like:

    int rte_ifpx_callbacks_register(int len,
                                    const struct rte_ifpx_callback *cbs)
    {
        for (int i = 0; i < len; ++i) {
            switch (cbs[i].type) {
            case RTE_IFPX_MAC_CHANGE:
                priv_cbs.mac_change = cbs[i].callback.mac_change;
                break;
            ...
            }
        }
    }

This way we should be protected from ABI breakage when adding new event
types, and how the callbacks are stored would not be visible to the user.

Let me know what you think about it.

With regards
Andrzej Ostruszka

^ permalink raw reply [flat|nested] 64+ messages in thread
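The proposal above can be exercised as a small standalone program. The sketch below is not the library code: the type and function names follow the mail, but the event payload fields, the private `priv_cbs` layout, the `ifpx_notify_mtu` helper and the example callback are assumptions added so that the fragment compiles on its own without any DPDK header.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed event payloads; the real structures are not shown in the mail. */
struct mac_change { int port_id; unsigned char mac[6]; };
struct mtu_change { int port_id; int mtu; };

enum rte_ifpx_event_type {
	RTE_IFPX_MAC_CHANGE,
	RTE_IFPX_MTU_CHANGE,
	RTE_IFPX_CFG_DONE,
};

/* Public, typed callback pointer as proposed in the mail. */
union rte_ifpx_cb_ptr {
	int (*mac_change)(const struct mac_change *ev);
	int (*mtu_change)(const struct mtu_change *ev);
	int (*cfg_done)(void);
};

_Static_assert(sizeof(union rte_ifpx_cb_ptr) == sizeof(int (*)(void *)),
	       "Size of callback pointer has to be "
	       "equal to size of function pointer");

/* Public tagged callback. */
struct rte_ifpx_callback {
	enum rte_ifpx_event_type type;
	union rte_ifpx_cb_ptr callback;
};

/* Private storage (assumed layout); never exposed to the user. */
static struct {
	int (*mac_change)(const struct mac_change *ev);
	int (*mtu_change)(const struct mtu_change *ev);
	int (*cfg_done)(void);
} priv_cbs;

/* Copy each tagged callback into its private slot; -1 on unknown type. */
int rte_ifpx_callbacks_register(int len, const struct rte_ifpx_callback *cbs)
{
	assert(cbs != NULL || len == 0);
	for (int i = 0; i < len; ++i) {
		switch (cbs[i].type) {
		case RTE_IFPX_MAC_CHANGE:
			priv_cbs.mac_change = cbs[i].callback.mac_change;
			break;
		case RTE_IFPX_MTU_CHANGE:
			priv_cbs.mtu_change = cbs[i].callback.mtu_change;
			break;
		case RTE_IFPX_CFG_DONE:
			priv_cbs.cfg_done = cbs[i].callback.cfg_done;
			break;
		default:
			return -1;
		}
	}
	return 0;
}

/* Hypothetical internal helper: deliver an MTU event to its callback. */
int ifpx_notify_mtu(const struct mtu_change *ev)
{
	if (priv_cbs.mtu_change == NULL)
		return 0;
	return priv_cbs.mtu_change(ev);
}

/* Example user callback: remembers the last MTU it was told about. */
int last_mtu;

int on_mtu_change(const struct mtu_change *ev)
{
	last_mtu = ev->mtu;
	return 1;
}
```

Because the user only ever fills in (type, callback) pairs, a new event type adds an enum value and a union member without changing the size or layout of anything the application has already compiled against.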
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 " Andrzej Ostruszka ` (4 preceding siblings ...) 2020-03-25 8:08 ` [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library David Marchand @ 2020-04-03 21:42 ` Thomas Monjalon 2020-04-04 18:07 ` Andrzej Ostruszka [C] 5 siblings, 1 reply; 64+ messages in thread From: Thomas Monjalon @ 2020-04-03 21:42 UTC (permalink / raw) To: Andrzej Ostruszka; +Cc: dev Hi Andrzej, Thanks for the very good explanations in the cover letter. I have several comments and questions about the design. I think IF proxy is a good idea which should be part of a bigger plan. 10/03/2020 12:10, Andrzej Ostruszka: > What is this useful for > ======================= > > Usually, when an ethernet port is assigned to DPDK it vanishes from the > system and user looses ability to control it via normal configuration > utilities (e.g. those from iproute2 package). Moreover by default DPDK > application is not aware of the network configuration of the system. > > To address both of these issues application needs to: > - add some command line interface (or other mechanism) allowing for > control of the port and its configuration > - query the status of network configuration and monitor its changes > > The purpose of this library is to help with both of these tasks (as long > as they remain in domain of configuration available to the system). In > other words, if DPDK application has some special needs, that cannot be > addressed by the normal system configuration utilities, then they need > to be solved by the application itself. In any case, the application must be in the loop. The application should always remain in control. When querying some information, nothing need to be controlled I guess. But when adjusting some configuration, the application must be able to be notified and decide which change is allowed. Of course, the application might allow being bypassed. 
Currently this rule is not respected in the rte_mp IPC system. I think rte_mp and IF proxy should follow the same path, keeping the primary application process in control. I would like not only secondary process and IF proxy be able to use this control path. It should be generic enough to allow any application (local or remote) be part of the control path, communicating with the DPDK application primary process. As a summary, I propose to target the following goal: implement a user configuration path as a DPDK standard that the application can enable. Do we agree that the exception packet path is out of scope? [...] > We create two proxy interfaces (here based on Tap driver) and bind the > ports to their proxies. When user issues a command changing MTU for > Tap1 interface the library notes this and calls "mtu_change" callback > for the Port1. Similarly when user adds an IPv4 address to the Tap2 > interface "addr_add" callback is called for the Port2 and the same > happens for configuration of routing rule pointing to Tap2. Will it work as well with TC flow configuration converted to rte_flow? > Apart from > callbacks this library can notify about changes via adding events to > notification queues. See below for more inforamtion about that and > a complete list of available callbacks. There is choice between callback in a random context, or a read from a message queue in a controlled context. Second option looks better. > Please note that nothing has been mentioned about forwarding of the > packets between system and DPDK. Since the proxies are normal DPDK > ports you can receive/send to them via usual RX/TX burst API. However > since the library is not aware of the structure of packet processing > used by the application it cannot automatically forward the packets - it > is responsibility of the application to include proxy ports into its > packet processing engine. So IF proxy does nothing special with packets, right? 
> As mentioned above the intention of the library is to: > - provide information about network configuration that would allow > application to decide what to do with the packets received on DPDK > ports, > - allow for control of the ports via standard configuration utilities > > Although the library only helps you to identify proxy for given port > (and vice versa) and calls appropriate callbacks it does open some > interesting possibilities. For example you can use the proxy ports to > forward packets for protocols that you do not wish to handle in DPDK > application to the system protocol stack and just listen to the > configuration changes - so that way you can "offload" handling of those > protocols to the system. Note that when using a bifurcated driver (af_xdp or mlx), the exception path in the kernel is not going through DPDK. Moreover, no proxy is needed for device configuration in such case. [...] > The only mandatory requirement for DPDK port to be able to act as > a proxy is that it is visible in the system - this is checked during > port to proxy binding by calling rte_eth_dev_info_get() on proxy port > and inspecting 'if_index' field (it has to be non-zero). Simple, good :) > This creates logical binding - as mentioned above there is no automatic > packet forwarding. With this binding whenever user changes the state of > proxy interface in the system (link up/down, change mac/mtu, add/remove > IPv4/IPv6) you get appropriate notification for the bound port. When configuring a port via DPDK API, is it mirrored automatically to the kernel device? > So far we've mentioned several times that the library calls callbacks. > They are grouped in 'struct rte_ifpx_callbacks' and user provides them > to the library via: > > rte_ifpx_callbacks_register(&cbs); > > It is worth mentioning that the context (lcore/thread) in which these > callbacks are called is implementation defined. 
It might differ between > different platforms, so the application needs to assume that some kind > of inter lcore/thread synchronization/communication is required. > > Apart from notification via callbacks this library also supports > notifying about the changes via adding events to the configured > notification queues. The queues are registered via: > > int rte_ifpx_queue_add(struct rte_ring *r); > > and the actual logic used is: if there is callback registered then it is > called, if it returns non-zero then event is considered completed, > otherwise event is added to each configured notification queue. > That way application can update data structures that are safe to be > modified by single writer from within callback or do the common > preprocessing steps (if any needed) in callback and data that is > replicated can be updated during handling of queued events. As explained above, the application must control every changes. One issue is thread safety. The simplest model is to manage control path from a single thread in the primary process. If we create an API to allow the application managing the control path from external requests, I think it should be a building block independent of IF proxy. Then IF proxy can plug into this subsystem. It would allow other control path mechanisms to co-exist. [...] > It is worth to mention also that while typical case would be a 1-to-1 > mapping between port and proxy, the 1-to-many mapping is also supported. > In that case related callbacks will be called for each port bound to > given proxy interface - it is application responsibility to define > semantic of such mapping (e.g. all changes apply to all ports, or link > changes apply to all but other are accepted in "round robin" fashion, or > some other logic). I don't get the interest of one-to-many mapping. [...] Thanks for the work. It seems there are some overlaps with telemetry and rte_mp channels. 
The same channel could be used also for dynamic tracing command or for remote control. Would you be OK to extend it to a global control subsystem, having IF proxy plugged in? ^ permalink raw reply [flat|nested] 64+ messages in thread
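The delivery rule quoted from the cover letter in this message (call the callback first; if it returns non-zero the event is done, otherwise queue it to every registered queue) is sketched below. The cover letter registers `struct rte_ring` queues; here a tiny fixed-size array stands in for the ring so that the fragment is self-contained, and every name in it (`evq`, `dispatcher`, `dispatch`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUES 4
#define QUEUE_CAP  8

/* Hypothetical event record. */
struct event { int type; int port_id; };

/* Tiny FIFO used here as a stand-in for struct rte_ring. */
struct evq {
	struct event items[QUEUE_CAP];
	size_t count;
};

typedef int (*event_cb)(const struct event *ev);

struct dispatcher {
	event_cb cb;			/* optional callback */
	struct evq *queues[MAX_QUEUES];	/* registered notification queues */
	size_t nb_queues;
};

/* Register one more notification queue; -1 if the table is full. */
int dispatcher_queue_add(struct dispatcher *d, struct evq *q)
{
	assert(d != NULL && q != NULL);
	if (d->nb_queues == MAX_QUEUES)
		return -1;
	d->queues[d->nb_queues++] = q;
	return 0;
}

/*
 * Deliver one event: a non-zero return from the callback means the event
 * is complete; otherwise a copy goes to every queue that has room.
 * Returns the number of queues that received the event.
 */
size_t dispatch(struct dispatcher *d, const struct event *ev)
{
	size_t delivered = 0;

	assert(d != NULL && ev != NULL);
	if (d->cb != NULL && d->cb(ev) != 0)
		return 0;
	for (size_t i = 0; i < d->nb_queues; ++i) {
		struct evq *q = d->queues[i];

		if (q->count < QUEUE_CAP) {
			q->items[q->count++] = *ev;
			delivered++;
		}
	}
	return delivered;
}

/* Example callback: handles type 1 itself, lets everything else through. */
int consume_type_one(const struct event *ev)
{
	return ev->type == 1;
}
```

This also shows the concern raised in the reply: the callback runs in whatever context calls `dispatch`, while queued events are consumed later in a context chosen by the application.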
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-03 21:42 ` Thomas Monjalon @ 2020-04-04 18:07 ` Andrzej Ostruszka [C] 2020-04-04 19:51 ` Thomas Monjalon 0 siblings, 1 reply; 64+ messages in thread From: Andrzej Ostruszka [C] @ 2020-04-04 18:07 UTC (permalink / raw) To: Thomas Monjalon; +Cc: dev First of all Thomas, thank you for taking time to look at this. Please scroll below for my comments. It looks like we are going to detour a bit for a general config mechanism on top of which we can "rebase" IF Proxy and other libs/apps. Note - since I often like to think in concrete terms I might be switching between general comments and specific examples. On 4/3/20 11:42 PM, Thomas Monjalon wrote: > Hi Andrzej, [...] > 10/03/2020 12:10, Andrzej Ostruszka: [...] >> The purpose of this library is to help with both of these tasks (as long >> as they remain in domain of configuration available to the system). In >> other words, if DPDK application has some special needs, that cannot be >> addressed by the normal system configuration utilities, then they need >> to be solved by the application itself. > > In any case, the application must be in the loop. > The application should always remain in control. OK - so let me try to understand what you mean here on the example of this IF Proxy. I wanted (and that is an explicit goal) to use iproute2 tools to configure DPDK ports. In this context allowing application to have control might mean two things: - application can accept/ignore/deny change requested by the user from the shell in a dynamic way (based on some state/event/...) - application writer have a choice to bind or not, but once the proxy is bound you simply accept the requests - just like you accept user requests in e.g. testpmd shell. So ... > When querying some information, nothing need to be controlled I guess. > But when adjusting some configuration, the application must be able > to be notified and decide which change is allowed. 
> Of course, the application might allow being bypassed. ... it looks like you are talking about the first option ("bypass" is I guess the second option). In the concrete example of IF Proxy that might be a bit problematic. User requests changes on proxy interface kernel accepts them and the DPDK application is just notified about that - has no chance to deny request (say "busy" or "not permitted" to the user). Of course app can ignore it (do nothing in the callback or drop the event) and have a mismatch between port and its proxy. I'm not so sure if this is what you had in mind. > Currently this rule is not respected in the rte_mp IPC system. > I think rte_mp and IF proxy should follow the same path, > keeping the primary application process in control. > > I would like not only secondary process and IF proxy be able to use > this control path. It should be generic enough to allow any application > (local or remote) be part of the control path, communicating with > the DPDK application primary process. That goal sounds indeed like ZMQ. Is the consensus about that already reached? On a general level that sounds good to me - the devil might be in details, like e.g. trying to be simple and generic enough. I'd like also to solicit here input from other members of the community. > As a summary, I propose to target the following goal: > implement a user configuration path as a DPDK standard > that the application can enable. > > Do we agree that the exception packet path is out of scope? Could you rephrase this question? I'm not sure I understand it. If you wanted to say that we should: - implement general config/notification mechanism - rebase IF Proxy upon it (so instead of deciding whether this should be callback/queue it simply uses this new scheme to deliver the change to application) then I'm fine with this. If you meant something else then please explain. > [...] >> We create two proxy interfaces (here based on Tap driver) and bind the >> ports to their proxies. 
When user issues a command changing MTU for
>> Tap1 interface the library notes this and calls "mtu_change" callback
>> for the Port1.  Similarly when user adds an IPv4 address to the Tap2
>> interface "addr_add" callback is called for the Port2 and the same
>> happens for configuration of routing rule pointing to Tap2.
>
> Will it work as well with TC flow configuration converted to rte_flow?

Not at the moment.  But it should be doable, as long as there is a good
mapping between them (I haven't checked).

>> Apart from
>> callbacks this library can notify about changes via adding events to
>> notification queues.  See below for more information about that and
>> a complete list of available callbacks.
>
> There is choice between callback in a random context,
> or a read from a message queue in a controlled context.
> Second option looks better.

Note that a callback can be a simple enqueue to some ring.  From the IF
Proxy implementation point of view this is not much of a difference.
I notice the change, and at that place in the code I can either call a
callback or queue an event.  Since I expect queues to be a popular choice,
support for them is built in, but without it the user could register a
callback that does the enqueuing (one more indirection in the slow path).

Having said that - the only reason callbacks are kept is that (as
mentioned in cover):

- I can easily implement a global action (true - in random context); since
  queues are "match all", each event will be added to all queues and cores
  would have to decide which one of them performs the global action
- I can do single/global preparation before queueing the event

But I guess this is rather a moot point since we are going in the
direction of a general config mechanism which IF Proxy would be using.

>> Please note that nothing has been mentioned about forwarding of the
>> packets between system and DPDK.  Since the proxies are normal DPDK
>> ports you can receive/send to them via usual RX/TX burst API.
However >> since the library is not aware of the structure of packet processing >> used by the application it cannot automatically forward the packets - it >> is responsibility of the application to include proxy ports into its >> packet processing engine. > > So IF proxy does nothing special with packets, right? Correct. >> Although the library only helps you to identify proxy for given port >> (and vice versa) and calls appropriate callbacks it does open some >> interesting possibilities. For example you can use the proxy ports to >> forward packets for protocols that you do not wish to handle in DPDK >> application to the system protocol stack and just listen to the >> configuration changes - so that way you can "offload" handling of those >> protocols to the system. > > Note that when using a bifurcated driver (af_xdp or mlx), > the exception path in the kernel is not going through DPDK. > Moreover, no proxy is needed for device configuration in such case. True for the link level info. But if application would like to have also address/routing/neighbouring info then I guess proxy would be needed. As for the bifurcated drivers - in one version of the library I had an option to bind port to itself. The binding is there only to tell library which if_index is interesting and how to report event (if_index -> port_id) [...] >> This creates logical binding - as mentioned above there is no automatic >> packet forwarding. With this binding whenever user changes the state of >> proxy interface in the system (link up/down, change mac/mtu, add/remove >> IPv4/IPv6) you get appropriate notification for the bound port. > > When configuring a port via DPDK API, is it mirrored automatically > to the kernel device? No it isn't. It's one way at the moment. If we wanted bidirectional then I would have to plug in somewhere in eth_dev to monitor changes to ports and request similar changes to the proxy. [...] 
>> and the actual logic used is: if there is callback registered then it is >> called, if it returns non-zero then event is considered completed, >> otherwise event is added to each configured notification queue. >> That way application can update data structures that are safe to be >> modified by single writer from within callback or do the common >> preprocessing steps (if any needed) in callback and data that is >> replicated can be updated during handling of queued events. > > As explained above, the application must control every changes. > > One issue is thread safety. > The simplest model is to manage control path from a single thread > in the primary process. > > If we create an API to allow the application managing the control path > from external requests, I think it should be a building block > independent of IF proxy. Then IF proxy can plug into this subsystem. > It would allow other control path mechanisms to co-exist. I'm fine with this. > [...] >> It is worth to mention also that while typical case would be a 1-to-1 >> mapping between port and proxy, the 1-to-many mapping is also supported. >> In that case related callbacks will be called for each port bound to >> given proxy interface - it is application responsibility to define >> semantic of such mapping (e.g. all changes apply to all ports, or link >> changes apply to all but other are accepted in "round robin" fashion, or >> some other logic). > > I don't get the interest of one-to-many mapping. That was a request during early version of library - with bridging in mind. However this is an "experimental" part of an "experimental" library - I would not focus on this, as it might get removed if we don't find a real use case for that. > [...] > > Thanks for the work. > It seems there are some overlaps with telemetry and rte_mp channels. > The same channel could be used also for dynamic tracing command > or for remote control. 
> Would you be OK to extend it to a global control subsystem,
> having IF proxy plugged in?

Yes.  I don't have a problem with that - however at the moment the
requirements/design are a bit vague, so let's discuss this more.

With regards
Andrzej Ostruszka
* Re: [dpdk-dev] [PATCH v2 0/4] Introduce IF proxy library 2020-04-04 18:07 ` Andrzej Ostruszka [C] @ 2020-04-04 19:51 ` Thomas Monjalon 0 siblings, 0 replies; 64+ messages in thread From: Thomas Monjalon @ 2020-04-04 19:51 UTC (permalink / raw) To: Andrzej Ostruszka [C]; +Cc: dev 04/04/2020 20:07, Andrzej Ostruszka [C]: > On 4/3/20 11:42 PM, Thomas Monjalon wrote: > > 10/03/2020 12:10, Andrzej Ostruszka: > [...] > >> The purpose of this library is to help with both of these tasks (as long > >> as they remain in domain of configuration available to the system). In > >> other words, if DPDK application has some special needs, that cannot be > >> addressed by the normal system configuration utilities, then they need > >> to be solved by the application itself. > > > > In any case, the application must be in the loop. > > The application should always remain in control. > > OK - so let me try to understand what you mean here on the example of > this IF Proxy. I wanted (and that is an explicit goal) to use iproute2 > tools to configure DPDK ports. In this context allowing application to > have control might mean two things: > > - application can accept/ignore/deny change requested by the user from > the shell in a dynamic way (based on some state/event/...) Yes > - application writer have a choice to bind or not, but once the proxy is > bound you simply accept the requests - just like you accept user > requests in e.g. testpmd shell. No, the application may check each user request before accepting. And on the path, the application may need to adapt based on user request. > So ... > > > When querying some information, nothing need to be controlled I guess. > > But when adjusting some configuration, the application must be able > > to be notified and decide which change is allowed. > > Of course, the application might allow being bypassed. > > ... it looks like you are talking about the first option ("bypass" is I > guess the second option). 
In the concrete example of IF Proxy that > might be a bit problematic. User requests changes on proxy interface > kernel accepts them and the DPDK application is just notified about that > - has no chance to deny request (say "busy" or "not permitted" to the user). The application must return a decision. > Of course app can ignore it (do nothing in the callback or drop the > event) and have a mismatch between port and its proxy. I'm not so sure > if this is what you had in mind. If the change is denied, the proxy may rollback the change in kernel. > > Currently this rule is not respected in the rte_mp IPC system. > > I think rte_mp and IF proxy should follow the same path, > > keeping the primary application process in control. > > > > I would like not only secondary process and IF proxy be able to use > > this control path. It should be generic enough to allow any application > > (local or remote) be part of the control path, communicating with > > the DPDK application primary process. > > That goal sounds indeed like ZMQ. Is the consensus about that already > reached? On a general level that sounds good to me - the devil might be > in details, like e.g. trying to be simple and generic enough. I'd like > also to solicit here input from other members of the community. No there is no consensus. Integrating ZMQ is a very fresh idea. As said, we should open this major topic in a separate email thread. > > As a summary, I propose to target the following goal: > > implement a user configuration path as a DPDK standard > > that the application can enable. > > > > Do we agree that the exception packet path is out of scope? > > Could you rephrase this question? I'm not sure I understand it. 
If you > wanted to say that we should: > > - implement general config/notification mechanism > - rebase IF Proxy upon it (so instead of deciding whether this should be > callback/queue it simply uses this new scheme to deliver the change to > application) Yes this is what I mean in general. > then I'm fine with this. If you meant something else then please explain. I ask for confirmation that IF proxy is not managing an exception datapath. You said the netdev used as proxy can also be used to send/receive packets to/from kernel stack. But it is out of scope of IF proxy features, right? > > [...] > >> We create two proxy interfaces (here based on Tap driver) and bind the > >> ports to their proxies. When user issues a command changing MTU for > >> Tap1 interface the library notes this and calls "mtu_change" callback > >> for the Port1. Similarly when user adds an IPv4 address to the Tap2 > >> interface "addr_add" callback is called for the Port2 and the same > >> happens for configuration of routing rule pointing to Tap2. > > > > Will it work as well with TC flow configuration converted to rte_flow? > > Not at the moment. But should be doable - as long as there is good > mapping between them (I haven't checked). That's an interesting challenge. > >> Apart from > >> callbacks this library can notify about changes via adding events to > >> notification queues. See below for more inforamtion about that and > >> a complete list of available callbacks. > > > > There is choice between callback in a random context, > > or a read from a message queue in a controlled context. > > Second option looks better. > > Note that callback can be a simple enqueue to some ring. From the IF > Proxy implementation point of view - this is not much of a difference. > I notice the change and in that place in code I can either call callback > or queue an event. 
Since I expect queues to be a popular choice its > support is added but without this user could register callback that > would be enqueuing (one more indirection in slow path). > > Having said that - the only reason callbacks are kept is that (as > mentioned in cover): > > - I can easily implement global action (true - in random context), since > queues are "match all" each event will be added to all queues and cores > would have to decide which one of them performs the global action > - I can do single/global preparation before queueing event > > But I guess this is rather a mute point since we are going in the > direction of general config which IF Proxy would be using. > > >> Please note that nothing has been mentioned about forwarding of the > >> packets between system and DPDK. Since the proxies are normal DPDK > >> ports you can receive/send to them via usual RX/TX burst API. However > >> since the library is not aware of the structure of packet processing > >> used by the application it cannot automatically forward the packets - it > >> is responsibility of the application to include proxy ports into its > >> packet processing engine. > > > > So IF proxy does nothing special with packets, right? > > Correct. > > >> Although the library only helps you to identify proxy for given port > >> (and vice versa) and calls appropriate callbacks it does open some > >> interesting possibilities. For example you can use the proxy ports to > >> forward packets for protocols that you do not wish to handle in DPDK > >> application to the system protocol stack and just listen to the > >> configuration changes - so that way you can "offload" handling of those > >> protocols to the system. > > > > Note that when using a bifurcated driver (af_xdp or mlx), > > the exception path in the kernel is not going through DPDK. > > Moreover, no proxy is needed for device configuration in such case. > > True for the link level info. 
But if application would like to have > also address/routing/neighbouring info then I guess proxy would be > needed. As for the bifurcated drivers - in one version of the library I > had an option to bind port to itself. The binding is there only to tell > library which if_index is interesting and how to report event (if_index > -> port_id) Yes you're right, and that's interesting. I wonder whether we should have a flag in the proxy port to mark bifurcated model. > [...] > >> This creates logical binding - as mentioned above there is no automatic > >> packet forwarding. With this binding whenever user changes the state of > >> proxy interface in the system (link up/down, change mac/mtu, add/remove > >> IPv4/IPv6) you get appropriate notification for the bound port. > > > > When configuring a port via DPDK API, is it mirrored automatically > > to the kernel device? > > No it isn't. It's one way at the moment. If we wanted bidirectional > then I would have to plug in somewhere in eth_dev to monitor changes to > ports and request similar changes to the proxy. OK I think it is a gap we need to fill. Bidirectional way looks mandatory to me. Given that the application needs to be aware of the proxy binding, can we have an explicit mechanism to notify the proxy of any change? Or do we want to use something as failsafe PMD to pair the port and its proxy as sub-devices of a main one, dispatching all changes? > [...] > >> and the actual logic used is: if there is callback registered then it is > >> called, if it returns non-zero then event is considered completed, > >> otherwise event is added to each configured notification queue. > >> That way application can update data structures that are safe to be > >> modified by single writer from within callback or do the common > >> preprocessing steps (if any needed) in callback and data that is > >> replicated can be updated during handling of queued events. > > > > As explained above, the application must control every changes. 
> > > > One issue is thread safety. > > The simplest model is to manage control path from a single thread > > in the primary process. > > > > If we create an API to allow the application managing the control path > > from external requests, I think it should be a building block > > independent of IF proxy. Then IF proxy can plug into this subsystem. > > It would allow other control path mechanisms to co-exist. > > I'm fine with this. > > > [...] > >> It is worth to mention also that while typical case would be a 1-to-1 > >> mapping between port and proxy, the 1-to-many mapping is also supported. > >> In that case related callbacks will be called for each port bound to > >> given proxy interface - it is application responsibility to define > >> semantic of such mapping (e.g. all changes apply to all ports, or link > >> changes apply to all but other are accepted in "round robin" fashion, or > >> some other logic). > > > > I don't get the interest of one-to-many mapping. > > That was a request during early version of library - with bridging in > mind. However this is an "experimental" part of an "experimental" > library - I would not focus on this, as it might get removed if we don't > find a real use case for that. If we don't have a real use-case, I suggest to not implement it, but keep some room for it in the design. > > [...] > > > > Thanks for the work. > > It seems there are some overlaps with telemetry and rte_mp channels. > > The same channel could be used also for dynamic tracing command > > or for remote control. > > Would you be OK to extend it to a global control subsystem, > > having IF proxy plugged in? > > Yes. I don't have a problem with that - however at the moment the > requirements/design are a bit vague, so let's discuss this more. Thanks a lot ^ permalink raw reply [flat|nested] 64+ messages in thread
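The proposal in the message above (the application returns a decision, and a denied change may be rolled back in the kernel) could look roughly like the sketch below. This is purely an illustration of the idea under discussion: as described in this thread, the posted library only notifies the application after the kernel has accepted the change. All `demo_*` / `DEMO_*` names are hypothetical, and `demo_kernel_set_mtu()` merely simulates writing a value back to the proxy interface.

```c
/* Hypothetical decision type returned by the application. */
enum demo_decision { DEMO_ACCEPT, DEMO_DENY };

/* Toy model of a port and its proxy: both sides track an MTU. */
struct demo_proxy {
	int kernel_mtu; /* value currently set on the proxy interface */
	int port_mtu;   /* value currently applied to the DPDK port */
};

typedef enum demo_decision (*demo_decide_t)(int old_mtu, int new_mtu);

/* Stand-in for pushing a value back to the kernel (assumption). */
void demo_kernel_set_mtu(struct demo_proxy *p, int mtu)
{
	p->kernel_mtu = mtu;
}

/*
 * Called after the kernel has already accepted the user's request.
 * The application decides; on denial the proxy restores the old value.
 */
enum demo_decision demo_on_mtu_change(struct demo_proxy *p, int new_mtu,
				      demo_decide_t decide)
{
	p->kernel_mtu = new_mtu; /* the kernel side has changed already */

	if (decide(p->port_mtu, new_mtu) == DEMO_DENY) {
		demo_kernel_set_mtu(p, p->port_mtu); /* roll back */
		return DEMO_DENY;
	}
	p->port_mtu = new_mtu; /* accepted: port follows its proxy */
	return DEMO_ACCEPT;
}

/* Example policy: refuse anything above the standard 1500-byte MTU. */
enum demo_decision demo_limit_mtu(int old_mtu, int new_mtu)
{
	(void)old_mtu;
	return new_mtu <= 1500 ? DEMO_ACCEPT : DEMO_DENY;
}

/* Replays "ip link set dev tap1 mtu 1600" against the policy above. */
int demo_rollback_run(void)
{
	struct demo_proxy p = { 1500, 1500 };

	if (demo_on_mtu_change(&p, 1600, demo_limit_mtu) != DEMO_DENY)
		return -1;
	return p.kernel_mtu; /* 1500 again after the rollback */
}
```

One design point a real implementation would probably have to handle: the rollback is itself a configuration change on the proxy interface, so it would likely produce a second notification that must not be treated as a new user request.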
* Re: [dpdk-dev] [PATCH 0/4] Introduce IF proxy library 2020-03-06 16:41 [dpdk-dev] [PATCH 0/4] Introduce IF proxy library Andrzej Ostruszka ` (5 preceding siblings ...) 2020-03-10 11:10 ` [dpdk-dev] [PATCH v2 " Andrzej Ostruszka @ 2020-04-16 16:11 ` Stephen Hemminger 2020-04-16 16:49 ` Jerin Jacob 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 " Andrzej Ostruszka 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 0/4] Introduce IF proxy library Andrzej Ostruszka 8 siblings, 1 reply; 64+ messages in thread From: Stephen Hemminger @ 2020-04-16 16:11 UTC (permalink / raw) To: Andrzej Ostruszka; +Cc: dev On Fri, 6 Mar 2020 17:41:00 +0100 Andrzej Ostruszka <aostruszka@marvell.com> wrote: > What is this useful for > ======================= > > Usually, when an ethernet port is assigned to DPDK it vanishes from the > system and user looses ability to control it via normal configuration > utilities (e.g. those from iproute2 package). Moreover by default DPDK > application is not aware of the network configuration of the system. > > To address both of these issues application needs to: > - add some command line interface (or other mechanism) allowing for > control of the port and its configuration > - query the status of network configuration and monitor its changes > > The purpose of this library is to help with both of these tasks (as long > as they remain in domain of configuration available to the system). In > other words, if DPDK application has some special needs, that cannot be > addressed by the normal system configuration utilities, then they need > to be solved by the application itself. > > The connection between DPDK and system is based on the existence of > ports that are visible to both DPDK and system (like Tap, KNI and > possibly some other drivers). These ports serve as an interface > proxies. 
> > Let's visualize the action of the library by the following example: > > Linux | DPDK > ============================================================== > | > | +-------+ +-------+ > | | Port1 | | Port2 | > "ip link set dev tap1 mtu 1600" | +-------+ +-------+ > | | ^ ^ ^ > | +------+ | mtu_change | | > `->| Tap1 |---' callback | | > +------+ | | > "ip addr add 198.51.100.14 \ | | | > dev tap2" | | | > | +------+ | | > +->| Tap2 |------------------' | > | +------+ addr_add callback | > "ip route add 198.0.2.0/24 \ | | | > dev tap2" | | route_add callback | > | `---------------------' Has anyone investigated solving this in the kernel rather than creating the added overhead of more Linux devices? What I am thinking of is a netlink to userspace interface. The kernel already has File-System-in-Userspace (FUSE) to allow for filesystems. What about having a NUSE (Netlink in userspace)? Then DPDK could have a daemon that is a provider to NUSE. This solution would also benefit other non-DPDK projects like VPP and allow DPDK to integrate with devlink etc. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] Introduce IF proxy library 2020-04-16 16:11 ` [dpdk-dev] [PATCH " Stephen Hemminger @ 2020-04-16 16:49 ` Jerin Jacob 2020-04-16 17:04 ` Stephen Hemminger 2020-04-16 17:12 ` Andrzej Ostruszka [C] 0 siblings, 2 replies; 64+ messages in thread From: Jerin Jacob @ 2020-04-16 16:49 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrzej Ostruszka, dpdk-dev On Thu, Apr 16, 2020 at 9:41 PM Stephen Hemminger <stephen@networkplumber.org> wrote: > > On Fri, 6 Mar 2020 17:41:00 +0100 > Andrzej Ostruszka <aostruszka@marvell.com> wrote: > > > What is this useful for > > ======================= > > > > Usually, when an ethernet port is assigned to DPDK it vanishes from the > > system and user looses ability to control it via normal configuration > > utilities (e.g. those from iproute2 package). Moreover by default DPDK > > application is not aware of the network configuration of the system. > > > > To address both of these issues application needs to: > > - add some command line interface (or other mechanism) allowing for > > control of the port and its configuration > > - query the status of network configuration and monitor its changes > > > > The purpose of this library is to help with both of these tasks (as long > > as they remain in domain of configuration available to the system). In > > other words, if DPDK application has some special needs, that cannot be > > addressed by the normal system configuration utilities, then they need > > to be solved by the application itself. > > > > The connection between DPDK and system is based on the existence of > > ports that are visible to both DPDK and system (like Tap, KNI and > > possibly some other drivers). These ports serve as an interface > > proxies. 
> > > > Let's visualize the action of the library by the following example: > > > > Linux | DPDK > > ============================================================== > > | > > | +-------+ +-------+ > > | | Port1 | | Port2 | > > "ip link set dev tap1 mtu 1600" | +-------+ +-------+ > > | | ^ ^ ^ > > | +------+ | mtu_change | | > > `->| Tap1 |---' callback | | > > +------+ | | > > "ip addr add 198.51.100.14 \ | | | > > dev tap2" | | | > > | +------+ | | > > +->| Tap2 |------------------' | > > | +------+ addr_add callback | > > "ip route add 198.0.2.0/24 \ | | | > > dev tap2" | | route_add callback | > > | `---------------------' > > Has anyone investigated solving this in the kernel rather than > creating the added overhead of more Linux devices? > > What I am thinking of is a netlink to userspace interface. > The kernel already has File-System-in-Userspace (FUSE) to allow > for filesystems. What about having a NUSE (Netlink in userspace)? IMO, there is no issue with the Linux Netlink _userspace_ interface. The goal of IF proxy to abstract the OS differences so that it can work with Linux, FreeBSD, and Windows(if needed). > > Then DPDK could have a daemon that is a provider to NUSE. > This solution would also benefit other non-DPDK projects like VPP > and allow DPDK to integrate with devlink etc. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] Introduce IF proxy library 2020-04-16 16:49 ` Jerin Jacob @ 2020-04-16 17:04 ` Stephen Hemminger 2020-04-16 17:26 ` Andrzej Ostruszka [C] 2020-04-16 17:27 ` Jerin Jacob 2020-04-16 17:12 ` Andrzej Ostruszka [C] 1 sibling, 2 replies; 64+ messages in thread From: Stephen Hemminger @ 2020-04-16 17:04 UTC (permalink / raw) To: Jerin Jacob; +Cc: Andrzej Ostruszka, dpdk-dev On Thu, 16 Apr 2020 22:19:05 +0530 Jerin Jacob <jerinjacobk@gmail.com> wrote: > On Thu, Apr 16, 2020 at 9:41 PM Stephen Hemminger > <stephen@networkplumber.org> wrote: > > > > On Fri, 6 Mar 2020 17:41:00 +0100 > > Andrzej Ostruszka <aostruszka@marvell.com> wrote: > > > > > What is this useful for > > > ======================= > > > > > > Usually, when an ethernet port is assigned to DPDK it vanishes from the > > > system and user looses ability to control it via normal configuration > > > utilities (e.g. those from iproute2 package). Moreover by default DPDK > > > application is not aware of the network configuration of the system. > > > > > > To address both of these issues application needs to: > > > - add some command line interface (or other mechanism) allowing for > > > control of the port and its configuration > > > - query the status of network configuration and monitor its changes > > > > > > The purpose of this library is to help with both of these tasks (as long > > > as they remain in domain of configuration available to the system). In > > > other words, if DPDK application has some special needs, that cannot be > > > addressed by the normal system configuration utilities, then they need > > > to be solved by the application itself. > > > > > > The connection between DPDK and system is based on the existence of > > > ports that are visible to both DPDK and system (like Tap, KNI and > > > possibly some other drivers). These ports serve as an interface > > > proxies. 
> > > > > > Let's visualize the action of the library by the following example: > > > > > > Linux | DPDK > > > ============================================================== > > > | > > > | +-------+ +-------+ > > > | | Port1 | | Port2 | > > > "ip link set dev tap1 mtu 1600" | +-------+ +-------+ > > > | | ^ ^ ^ > > > | +------+ | mtu_change | | > > > `->| Tap1 |---' callback | | > > > +------+ | | > > > "ip addr add 198.51.100.14 \ | | | > > > dev tap2" | | | > > > | +------+ | | > > > +->| Tap2 |------------------' | > > > | +------+ addr_add callback | > > > "ip route add 198.0.2.0/24 \ | | | > > > dev tap2" | | route_add callback | > > > | `---------------------' > > > > Has anyone investigated solving this in the kernel rather than > > creating the added overhead of more Linux devices? > > > > What I am thinking of is a netlink to userspace interface. > > The kernel already has File-System-in-Userspace (FUSE) to allow > > for filesystems. What about having a NUSE (Netlink in userspace)? > > IMO, there is no issue with the Linux Netlink _userspace_ interface. > The goal of IF proxy to abstract the OS differences so that it can > work with Linux, FreeBSD, and Windows(if needed). > > > > > > Then DPDK could have a daemon that is a provider to NUSE. > > This solution would also benefit other non-DPDK projects like VPP > > and allow DPDK to integrate with devlink etc. With the wider use of tap devices like this, it may be a problem for other usages of TAP. If nothing else, having to figure out which tap is which would be error prone. Also, TAP on Windows is only available as an out-of-tree driver from OpenVPN. And the TAP on Windows is quite, limited, deprecated, poorly supported and buggy. There is no standard TAP like interface in Windows. TAP on BSD is different than Linux and has different control functions. Don't remember what the interface notification mechanism is on BSD, it is not netlink. So is IF proxy even going to work on these other OS? 
* Re: [dpdk-dev] [PATCH 0/4] Introduce IF proxy library 2020-04-16 17:04 ` Stephen Hemminger @ 2020-04-16 17:26 ` Andrzej Ostruszka [C] 2020-04-16 17:27 ` Jerin Jacob 1 sibling, 0 replies; 64+ messages in thread From: Andrzej Ostruszka [C] @ 2020-04-16 17:26 UTC (permalink / raw) To: Stephen Hemminger, Jerin Jacob; +Cc: dpdk-dev On 4/16/20 7:04 PM, Stephen Hemminger wrote: > On Thu, 16 Apr 2020 22:19:05 +0530 > Jerin Jacob <jerinjacobk@gmail.com> wrote: > >> On Thu, Apr 16, 2020 at 9:41 PM Stephen Hemminger >> <stephen@networkplumber.org> wrote: [...] >>> Has anyone investigated solving this in the kernel rather than >>> creating the added overhead of more Linux devices? >>> >>> What I am thinking of is a netlink to userspace interface. >>> The kernel already has File-System-in-Userspace (FUSE) to allow >>> for filesystems. What about having a NUSE (Netlink in userspace)? >> >> IMO, there is no issue with the Linux Netlink _userspace_ interface. >> The goal of IF proxy to abstract the OS differences so that it can >> work with Linux, FreeBSD, and Windows(if needed). >> >> >>> >>> Then DPDK could have a daemon that is a provider to NUSE. >>> This solution would also benefit other non-DPDK projects like VPP >>> and allow DPDK to integrate with devlink etc. > > With the wider use of tap devices like this, it may be a problem > for other usages of TAP. If nothing else, having to figure out which > tap is which would be error prone. Stephen, the library does not require TAP - only some DPDK port that is visible to the system (has non-zero if_index). As to the confusion - if we use TAP then it has optional 'iface=...' argument, so we can name those proxy interfaces as 'iface=proxy0' or something like that. This is under control of application (just call ...create_by_devarg() with proper argument). > Also, TAP on Windows is only available as an out-of-tree driver > from OpenVPN. And the TAP on Windows is quite, limited, deprecated, > poorly supported and buggy. 
There is no standard TAP like interface > in Windows. > > TAP on BSD is different than Linux and has different control functions. > Don't remember what the interface notification mechanism is on BSD, > it is not netlink. > > So is IF proxy even going to work on these other OS? No. At the moment only Linux is supported. I don't know much about Windows, it would need some TAP-like driver and implementation would probably make use of "IP Helper" library (some extra thread doing polling?). As for FreeBSD I'm convinced that very similar implementation is possible by using PF_ROUTE sockets. What the library does to help with other platforms is that it defines following structure: /* Every implementation should provide definition of this structure: * - init : called during library initialization (NULL when not needed) * - events : this should return bitmask of supported events (can be * NULL if all defined events are supported by the implementation) * - listen : this function should start service listening to the * network configuration events/changes, * - close : this function should close the service started by listen() * - get_info : this function should query system for current * configuration of interface with index 'if_index'. After * successful initialization of listening service this function is * called with 0 as an argument. In that case configuration of all * ports should be obtained - and when this procedure completes a * RTE_IFPX_CFG_DONE event should be signaled via * ifpx_notify_event(). */ extern struct ifpx_platform_callbacks { void (*init)(void); uint64_t (*events)(void); int (*listen)(void); int (*close)(void); void (*get_info)(int if_index); } ifpx_platform; With regards Andrzej Ostruszka ^ permalink raw reply [flat|nested] 64+ messages in thread
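To make the platform hook table quoted above more concrete, here is a minimal, self-contained sketch of how a port to another OS might fill it in, and how the common code might call through it. Only the structure layout is taken from the message; every demo_* name and the event numbering are hypothetical, and the stubs do no real work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same members as the structure quoted in the message above. */
struct ifpx_platform_callbacks {
	void (*init)(void);
	uint64_t (*events)(void);
	int (*listen)(void);
	int (*close)(void);
	void (*get_info)(int if_index);
};

/* Hypothetical event numbering, for illustration only. */
enum {
	DEMO_EV_MTU_CHANGE,
	DEMO_EV_LINK_CHANGE,
	DEMO_EV_CFG_DONE,
	DEMO_NUM_EVENTS
};

static int demo_listening;       /* 1 while the stub "service" runs */
static int demo_last_query = -1; /* last if_index given to get_info */

/* This stub platform reports MTU changes and "config done" only. */
static uint64_t demo_events(void)
{
	return (1ULL << DEMO_EV_MTU_CHANGE) | (1ULL << DEMO_EV_CFG_DONE);
}

static int demo_listen(void) { demo_listening = 1; return 0; }
static int demo_close(void) { demo_listening = 0; return 0; }
static void demo_get_info(int if_index) { demo_last_query = if_index; }

/* 'init' is left NULL: the comment above allows that when unused. */
struct ifpx_platform_callbacks ifpx_platform = {
	.init = NULL,
	.events = demo_events,
	.listen = demo_listen,
	.close = demo_close,
	.get_info = demo_get_info,
};

/* A NULL 'events' hook means every defined event is supported. */
uint64_t demo_events_available(void)
{
	if (ifpx_platform.events)
		return ifpx_platform.events();
	return (1ULL << DEMO_NUM_EVENTS) - 1;
}

/* Start listening, then ask for the configuration of all interfaces
 * by passing 0, as the comment on get_info describes.
 */
int demo_start(void)
{
	int ec;

	assert(ifpx_platform.listen != NULL);
	if (ifpx_platform.init)
		ifpx_platform.init();
	ec = ifpx_platform.listen();
	if (ec == 0 && ifpx_platform.get_info)
		ifpx_platform.get_info(0);
	return ec;
}
```

A real port would replace the stubs with code that talks to the OS (for instance PF_ROUTE sockets on FreeBSD, as suggested above) and would signal RTE_IFPX_CFG_DONE once the initial query completes.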
* Re: [dpdk-dev] [PATCH 0/4] Introduce IF proxy library 2020-04-16 17:04 ` Stephen Hemminger 2020-04-16 17:26 ` Andrzej Ostruszka [C] @ 2020-04-16 17:27 ` Jerin Jacob 1 sibling, 0 replies; 64+ messages in thread From: Jerin Jacob @ 2020-04-16 17:27 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrzej Ostruszka, dpdk-dev On Thu, Apr 16, 2020 at 10:34 PM Stephen Hemminger <stephen@networkplumber.org> wrote: > > On Thu, 16 Apr 2020 22:19:05 +0530 > Jerin Jacob <jerinjacobk@gmail.com> wrote: > > > On Thu, Apr 16, 2020 at 9:41 PM Stephen Hemminger > > <stephen@networkplumber.org> wrote: > > > > > > On Fri, 6 Mar 2020 17:41:00 +0100 > > > Andrzej Ostruszka <aostruszka@marvell.com> wrote: > > > > > > > What is this useful for > > > > ======================= > > > > > > > > Usually, when an ethernet port is assigned to DPDK it vanishes from the > > > > system and user looses ability to control it via normal configuration > > > > utilities (e.g. those from iproute2 package). Moreover by default DPDK > > > > application is not aware of the network configuration of the system. > > > > > > > > To address both of these issues application needs to: > > > > - add some command line interface (or other mechanism) allowing for > > > > control of the port and its configuration > > > > - query the status of network configuration and monitor its changes > > > > > > > > The purpose of this library is to help with both of these tasks (as long > > > > as they remain in domain of configuration available to the system). In > > > > other words, if DPDK application has some special needs, that cannot be > > > > addressed by the normal system configuration utilities, then they need > > > > to be solved by the application itself. > > > > > > > > The connection between DPDK and system is based on the existence of > > > > ports that are visible to both DPDK and system (like Tap, KNI and > > > > possibly some other drivers). These ports serve as an interface > > > > proxies. 
> > > > > > > > Let's visualize the action of the library by the following example: > > > > > > > > Linux | DPDK > > > > ============================================================== > > > > | > > > > | +-------+ +-------+ > > > > | | Port1 | | Port2 | > > > > "ip link set dev tap1 mtu 1600" | +-------+ +-------+ > > > > | | ^ ^ ^ > > > > | +------+ | mtu_change | | > > > > `->| Tap1 |---' callback | | > > > > +------+ | | > > > > "ip addr add 198.51.100.14 \ | | | > > > > dev tap2" | | | > > > > | +------+ | | > > > > +->| Tap2 |------------------' | > > > > | +------+ addr_add callback | > > > > "ip route add 198.0.2.0/24 \ | | | > > > > dev tap2" | | route_add callback | > > > > | `---------------------' > > > > > > Has anyone investigated solving this in the kernel rather than > > > creating the added overhead of more Linux devices? > > > > > > What I am thinking of is a netlink to userspace interface. > > > The kernel already has File-System-in-Userspace (FUSE) to allow > > > for filesystems. What about having a NUSE (Netlink in userspace)? > > > > IMO, there is no issue with the Linux Netlink _userspace_ interface. > > The goal of IF proxy to abstract the OS differences so that it can > > work with Linux, FreeBSD, and Windows(if needed). > > > > > > > > > > Then DPDK could have a daemon that is a provider to NUSE. > > > This solution would also benefit other non-DPDK projects like VPP > > > and allow DPDK to integrate with devlink etc. > > With the wider use of tap devices like this, it may be a problem > for other usages of TAP. If nothing else, having to figure out which > tap is which would be error prone. > > Also, TAP on Windows is only available as an out-of-tree driver > from OpenVPN. And the TAP on Windows is quite, limited, deprecated, > poorly supported and buggy. There is no standard TAP like interface > in Windows. > > TAP on BSD is different than Linux and has different control functions. 
> Don't remember what the interface notification mechanism is on BSD, > it is not netlink. > > So is IF proxy even going to work on these other OS? I dont know about Windows. BSD has a control interface. The library gives abstraction and public API definitions and driver interface. It is up to the implementer to implement driver API for a specific EAL environment. That would help us to not, directly calling Linux specific interface in the DPDK application. > > > ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] Introduce IF proxy library 2020-04-16 16:49 ` Jerin Jacob 2020-04-16 17:04 ` Stephen Hemminger @ 2020-04-16 17:12 ` Andrzej Ostruszka [C] 2020-04-16 17:19 ` Stephen Hemminger 1 sibling, 1 reply; 64+ messages in thread From: Andrzej Ostruszka [C] @ 2020-04-16 17:12 UTC (permalink / raw) To: Jerin Jacob, Stephen Hemminger; +Cc: dpdk-dev On 4/16/20 6:49 PM, Jerin Jacob wrote: > On Thu, Apr 16, 2020 at 9:41 PM Stephen Hemminger > <stephen@networkplumber.org> wrote: [...] >> Has anyone investigated solving this in the kernel rather than >> creating the added overhead of more Linux devices? >> >> What I am thinking of is a netlink to userspace interface. >> The kernel already has File-System-in-Userspace (FUSE) to allow >> for filesystems. What about having a NUSE (Netlink in userspace)? > > IMO, there is no issue with the Linux Netlink _userspace_ interface. > The goal of IF proxy to abstract the OS differences so that it can > work with Linux, FreeBSD, and Windows(if needed). My understanding of Stephen's question is a bit different - Stephen please correct me if I'm wrong. By the comparison with FUSE he was thinking about providing a "kernel proxy" to userspace-based port/interface, which could be used not only by DPDK but by other too. The answer from me is: no I have not. For two reasons: - that would be Linux only - if we would create such proxy, we would probably end up with tap like driver in the end With regards Andrzej Ostruszka ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] Introduce IF proxy library 2020-04-16 17:12 ` Andrzej Ostruszka [C] @ 2020-04-16 17:19 ` Stephen Hemminger 0 siblings, 0 replies; 64+ messages in thread From: Stephen Hemminger @ 2020-04-16 17:19 UTC (permalink / raw) To: Andrzej Ostruszka [C]; +Cc: Jerin Jacob, dpdk-dev On Thu, 16 Apr 2020 17:12:07 +0000 "Andrzej Ostruszka [C]" <aostruszka@marvell.com> wrote: > On 4/16/20 6:49 PM, Jerin Jacob wrote: > > On Thu, Apr 16, 2020 at 9:41 PM Stephen Hemminger > > <stephen@networkplumber.org> wrote: > [...] > >> Has anyone investigated solving this in the kernel rather than > >> creating the added overhead of more Linux devices? > >> > >> What I am thinking of is a netlink to userspace interface. > >> The kernel already has File-System-in-Userspace (FUSE) to allow > >> for filesystems. What about having a NUSE (Netlink in userspace)? > > > > IMO, there is no issue with the Linux Netlink _userspace_ interface. > > The goal of IF proxy to abstract the OS differences so that it can > > work with Linux, FreeBSD, and Windows(if needed). > > My understanding of Stephen's question is a bit different - Stephen > please correct me if I'm wrong. By the comparison with FUSE he was > thinking about providing a "kernel proxy" to userspace-based > port/interface, which could be used not only by DPDK but by other too. > > The answer from me is: no I have not. For two reasons: > - that would be Linux only > - if we would create such proxy, we would probably end up with tap like > driver in the end > > With regards > Andrzej Ostruszka The point is think of the problem beyond just DPDK. ^ permalink raw reply [flat|nested] 64+ messages in thread
* [dpdk-dev] [PATCH v3 0/4] Introduce IF proxy library 2020-03-06 16:41 [dpdk-dev] [PATCH 0/4] Introduce IF proxy library Andrzej Ostruszka ` (6 preceding siblings ...) 2020-04-16 16:11 ` [dpdk-dev] [PATCH " Stephen Hemminger @ 2020-05-04 8:53 ` Andrzej Ostruszka 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 1/4] lib: introduce IF Proxy library Andrzej Ostruszka ` (3 more replies) 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 0/4] Introduce IF proxy library Andrzej Ostruszka 8 siblings, 4 replies; 64+ messages in thread
From: Andrzej Ostruszka @ 2020-05-04 8:53 UTC (permalink / raw)
To: dev

All

Please find in this patch set an updated version of the IF Proxy library. This version addresses the comments received so far, with some additional minor improvements/changes. It does not change the notification scheme yet, since the discussion about a general DPDK messaging/notification scheme has not started, so please expect yet another version once that crystallizes. I have also received some performance improvements for the example application from Harman (thank you) but decided to hold them. This is because in future the example application might either be merged into the regular l3fwd or be kept separate - depending on the outcome of the discussion.

Changes in V3
=============
- Changed callback registration scheme to make the ABI more robust
- Added new platform callback to provide mask with events available
- All library data access is guarded with a lock
- When port is unbound and proxy has no more ports then it is automatically released

Changes in V2
=============
- Cleaned up checkpatch warnings
- Removed dead/unused code and added gateway clearing in l3fwd-ifpx

What is this useful for
=======================

Usually, when an ethernet port is assigned to DPDK it vanishes from the system and the user loses the ability to control it via normal configuration utilities (e.g. those from the iproute2 package).
Moreover by default DPDK application is not aware of the network configuration of the system. To address both of these issues application needs to: - add some command line interface (or other mechanism) allowing for control of the port and its configuration - query the status of network configuration and monitor its changes The purpose of this library is to help with both of these tasks (as long as they remain in domain of configuration available to the system). In other words, if DPDK application has some special needs, that cannot be addressed by the normal system configuration utilities, then they need to be solved by the application itself. The connection between DPDK and system is based on the existence of ports that are visible to both DPDK and system (like Tap, KNI and possibly some other drivers). These ports serve as an interface proxies. Let's visualize the action of the library by the following example: Linux | DPDK ============================================================== | | +-------+ +-------+ | | Port1 | | Port2 | "ip link set dev tap1 mtu 1600" | +-------+ +-------+ | | ^ ^ ^ | +------+ | mtu_change | | `->| Tap1 |---' callback | | +------+ | | "ip addr add 198.51.100.14 \ | | | dev tap2" | | | | +------+ | | +->| Tap2 |------------------' | | +------+ addr_add callback | "ip route add 198.0.2.0/24 \ | | | dev tap2" | | route_add callback | | `---------------------' So we have two ports Port1 and Port2 that are not visible to the system. We create two proxy interfaces (here based on Tap driver) and bind the ports to their proxies. When user issues a command changing MTU for Tap1 interface the library notes this and calls "mtu_change" callback for the Port1. Similarly when user adds an IPv4 address to the Tap2 interface "addr_add" callback is called for the Port2 and the same happens for configuration of routing rule pointing to Tap2. Apart from callbacks this library can notify about changes via adding events to notification queues. 
See below for more information about that and a complete list of available callbacks.

Please note that nothing has been mentioned about forwarding of the packets between system and DPDK. Since the proxies are normal DPDK ports you can receive/send to them via the usual RX/TX burst API. However, since the library is not aware of the structure of packet processing used by the application, it cannot automatically forward the packets - it is the responsibility of the application to include proxy ports in its packet processing engine.

As mentioned above the intention of the library is to:
- provide information about network configuration that would allow the application to decide what to do with the packets received on DPDK ports,
- allow for control of the ports via standard configuration utilities

Although the library only helps you to identify the proxy for a given port (and vice versa) and calls the appropriate callbacks, it does open some interesting possibilities. For example you can use the proxy ports to forward packets for protocols that you do not wish to handle in the DPDK application to the system protocol stack and just listen to the configuration changes - that way you can "offload" handling of those protocols to the system.

How to use it
=============

Usage of this library is rather simple. You have to:
1. Create proxy (if you don't have a port suitable for being a proxy or you have one but do not wish to use it as a proxy).
2. Bind port to proxy.
3. Register callbacks and/or event queues.
4. Start listening to the network configuration.

The only mandatory requirement for a DPDK port to be able to act as a proxy is that it is visible in the system - this is checked during port to proxy binding by calling rte_eth_dev_info_get() on the proxy port and inspecting the 'if_index' field (it has to be non-zero). One can create such a port in the application by calling:

  proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT);

Upon success this returns the id of the DPDK proxy port created (RTE_MAX_ETHPORTS on failure).
The argument selects the type of proxy port to create (currently Tap/KNI only). This function is actually just a wrapper around:

  uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg);

creating a valid 'devarg' string for the chosen type of proxy. If you have another driver capable of acting as a proxy you can call rte_ifpx_proxy_create_by_devarg() directly, passing the appropriate argument.

Once you have the ids of both port and proxy you can bind the two via:

  rte_ifpx_port_bind(port_id, proxy_id);

This creates a logical binding - as mentioned above there is no automatic packet forwarding. With this binding, whenever the user changes the state of the proxy interface in the system (link up/down, change mac/mtu, add/remove IPv4/IPv6) you get the appropriate notification for the bound port.

So far we've mentioned several times that the library calls callbacks. They are passed as an array of 'struct rte_ifpx_callback' entries and the user provides them to the library via:

  rte_ifpx_callbacks_register(len, cbs);

It is worth mentioning that the context (lcore/thread) in which these callbacks are called is implementation defined. It might differ between platforms, so the application needs to assume that some kind of inter lcore/thread synchronization/communication is required.

Apart from notification via callbacks this library also supports notifying about the changes via adding events to the configured notification queues. The queues are registered via:

  int rte_ifpx_queue_add(struct rte_ring *r);

and the actual logic used is: if there is a callback registered then it is called; if it returns non-zero then the event is considered completed, otherwise the event is added to each configured notification queue. That way the application can update data structures that are safe to be modified by a single writer from within the callback, or do the common preprocessing steps (if any are needed) in the callback, and data that is replicated can be updated during handling of queued events.
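The callback-then-queue rule described above can be sketched as follows. This is a self-contained illustration, not the library code: plain arrays stand in for rte_ring queues, there is a single callback slot instead of one per event type, and all demo_* names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_QUEUES 4
#define DEMO_QUEUE_LEN  8

enum demo_event_type { DEMO_EV_MTU, DEMO_EV_LINK };

struct demo_event {
	enum demo_event_type type;
	int port_id;
};

/* Plain array standing in for a notification ring. */
struct demo_queue {
	struct demo_event ev[DEMO_QUEUE_LEN];
	unsigned int count;
};

typedef int (*demo_cb)(const struct demo_event *ev);

static demo_cb demo_callback; /* single slot, NULL when unset */
static struct demo_queue *demo_queues[DEMO_MAX_QUEUES];
static unsigned int demo_num_queues;

int demo_queue_add(struct demo_queue *q)
{
	if (q == NULL || demo_num_queues == DEMO_MAX_QUEUES)
		return -1;
	demo_queues[demo_num_queues++] = q;
	return 0;
}

void demo_callback_set(demo_cb cb)
{
	demo_callback = cb;
}

/* Call the callback first. A non-zero return means the event is
 * completed; otherwise copy it to every registered queue.
 * Returns the number of queues that received the event.
 */
unsigned int demo_notify(const struct demo_event *ev)
{
	unsigned int i, n = 0;

	assert(ev != NULL);
	if (demo_callback && demo_callback(ev) != 0)
		return 0;

	for (i = 0; i < demo_num_queues; ++i) {
		struct demo_queue *q = demo_queues[i];

		if (q->count < DEMO_QUEUE_LEN) {
			q->ev[q->count++] = *ev;
			++n;
		}
	}
	return n;
}

/* Example callback: handle MTU changes itself, pass the rest on. */
int demo_consume_mtu(const struct demo_event *ev)
{
	return ev->type == DEMO_EV_MTU;
}
```

With demo_consume_mtu installed, an MTU event stops at the callback, while a link event still reaches every queue, which is the split between single-writer updates and replicated data mentioned above.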
Once we have bindings in place and notification configured, the only essential part that remains is to get the current network configuration and start listening to its changes. This is accomplished via a call to:

  rte_ifpx_listen();

And basically this is all one needs to understand how to use this library. Other less essential parts include:
- ability to query what events are available for given platform
- getting mapping between proxy and port
- unbinding the ports from proxy
- destroying proxy port
- closing the listening service
- getting basic information about proxy

Currently available features and implementation
===============================================

The library's API is system independent but it obviously needs some system dependent parts. We provide an example Linux implementation (based on netlink sockets). A very similar implementation is possible for FreeBSD (using PF_ROUTE sockets). A Windows implementation would need to differ considerably (the IP Helper library would probably be of some help).
Here is the list of currently implemented callbacks:

  int (*mac_change)(const struct rte_ifpx_mac_change *event);
  int (*mtu_change)(const struct rte_ifpx_mtu_change *event);
  int (*link_change)(const struct rte_ifpx_link_change *event);
  int (*addr_add)(const struct rte_ifpx_addr_change *event);
  int (*addr_del)(const struct rte_ifpx_addr_change *event);
  int (*addr6_add)(const struct rte_ifpx_addr6_change *event);
  int (*addr6_del)(const struct rte_ifpx_addr6_change *event);
  int (*route_add)(const struct rte_ifpx_route_change *event);
  int (*route_del)(const struct rte_ifpx_route_change *event);
  int (*route6_add)(const struct rte_ifpx_route6_change *event);
  int (*route6_del)(const struct rte_ifpx_route6_change *event);
  int (*neigh_add)(const struct rte_ifpx_neigh_change *event);
  int (*neigh_del)(const struct rte_ifpx_neigh_change *event);
  int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event);
  int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event);
  int (*cfg_done)(void);

They are all rather self-descriptive with the exception of the last one. When the user calls rte_ifpx_listen() the library first queries the system for its current configuration. That might require several request/reply exchanges between DPDK and the system, and once it is finished this callback is called to let the application know that all info has been gathered.

It is also worth mentioning that while the typical case would be a 1-to-1 mapping between port and proxy, the 1-to-many mapping is also supported. In that case the related callbacks will be called for each port bound to the given proxy interface - it is the application's responsibility to define the semantics of such a mapping (e.g. all changes apply to all ports, or link changes apply to all but others are accepted in "round robin" fashion, or some other logic).

As mentioned above the Linux implementation is based on a netlink socket. This socket is registered as a file descriptor in EAL interrupts (similarly to how EAL alarms are implemented).
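As an illustration of the 1-to-many case, the sketch below fans one proxy event out to every port bound to that proxy, using a port-to-proxy table similar in spirit to the one kept by the library. It is self-contained and hypothetical: the demo_* names, the small table size and the "apply to all ports" policy are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PORTS 8              /* stand-in for RTE_MAX_ETHPORTS */
#define DEMO_UNBOUND   DEMO_MAX_PORTS /* marks "no proxy" */

/* demo_ports[port] holds the proxy the port is bound to. */
static uint16_t demo_ports[DEMO_MAX_PORTS];
static uint16_t demo_mtu[DEMO_MAX_PORTS];

typedef int (*demo_port_cb)(uint16_t port_id, uint16_t mtu);

void demo_reset(void)
{
	unsigned int p;

	for (p = 0; p < DEMO_MAX_PORTS; ++p) {
		demo_ports[p] = DEMO_UNBOUND;
		demo_mtu[p] = 0;
	}
}

int demo_bind(uint16_t port_id, uint16_t proxy_id)
{
	if (port_id >= DEMO_MAX_PORTS || proxy_id >= DEMO_MAX_PORTS)
		return -1;
	demo_ports[port_id] = proxy_id;
	return 0;
}

/* Invoke 'cb' once for every port bound to 'proxy_id'.
 * Returns the number of ports that were notified.
 */
unsigned int demo_mtu_fanout(uint16_t proxy_id, uint16_t mtu,
			     demo_port_cb cb)
{
	unsigned int p, cnt = 0;

	assert(cb != NULL);
	for (p = 0; p < DEMO_MAX_PORTS; ++p) {
		/* Skip unrelated ports and the proxy itself. */
		if (demo_ports[p] != proxy_id || p == proxy_id)
			continue;
		cb((uint16_t)p, mtu);
		++cnt;
	}
	return cnt;
}

/* One possible policy: every bound port takes the new MTU. */
int demo_apply_to_all(uint16_t port_id, uint16_t mtu)
{
	demo_mtu[port_id] = mtu;
	return 1;
}
```

Other policies mentioned above, such as accepting changes in "round robin" fashion, would only change the body of the per-port callback.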
With regards Andrzej Ostruszka Andrzej Ostruszka (4): lib: introduce IF Proxy library if_proxy: add library documentation if_proxy: add simple functionality test if_proxy: add example application MAINTAINERS | 6 + app/test/Makefile | 5 + app/test/meson.build | 4 + app/test/test_if_proxy.c | 707 +++++++++++ config/common_base | 5 + config/common_linux | 1 + doc/guides/prog_guide/if_proxy_lib.rst | 142 +++ doc/guides/prog_guide/index.rst | 1 + examples/Makefile | 1 + examples/l3fwd-ifpx/Makefile | 60 + examples/l3fwd-ifpx/l3fwd.c | 1128 ++++++++++++++++++ examples/l3fwd-ifpx/l3fwd.h | 98 ++ examples/l3fwd-ifpx/main.c | 740 ++++++++++++ examples/l3fwd-ifpx/meson.build | 11 + examples/meson.build | 2 +- lib/Makefile | 2 + lib/librte_eal/include/rte_eal_interrupts.h | 2 + lib/librte_eal/linux/eal_interrupts.c | 14 +- lib/librte_if_proxy/Makefile | 29 + lib/librte_if_proxy/if_proxy_common.c | 564 +++++++++ lib/librte_if_proxy/if_proxy_priv.h | 97 ++ lib/librte_if_proxy/linux/Makefile | 4 + lib/librte_if_proxy/linux/if_proxy.c | 563 +++++++++ lib/librte_if_proxy/meson.build | 19 + lib/librte_if_proxy/rte_if_proxy.h | 585 +++++++++ lib/librte_if_proxy/rte_if_proxy_version.map | 20 + lib/meson.build | 2 +- 27 files changed, 4806 insertions(+), 6 deletions(-) create mode 100644 app/test/test_if_proxy.c create mode 100644 doc/guides/prog_guide/if_proxy_lib.rst create mode 100644 examples/l3fwd-ifpx/Makefile create mode 100644 examples/l3fwd-ifpx/l3fwd.c create mode 100644 examples/l3fwd-ifpx/l3fwd.h create mode 100644 examples/l3fwd-ifpx/main.c create mode 100644 examples/l3fwd-ifpx/meson.build create mode 100644 lib/librte_if_proxy/Makefile create mode 100644 lib/librte_if_proxy/if_proxy_common.c create mode 100644 lib/librte_if_proxy/if_proxy_priv.h create mode 100644 lib/librte_if_proxy/linux/Makefile create mode 100644 lib/librte_if_proxy/linux/if_proxy.c create mode 100644 lib/librte_if_proxy/meson.build create mode 100644 lib/librte_if_proxy/rte_if_proxy.h create mode 
100644 lib/librte_if_proxy/rte_if_proxy_version.map -- 2.17.1 ^ permalink raw reply [flat|nested] 64+ messages in thread
* [dpdk-dev] [PATCH v3 1/4] lib: introduce IF Proxy library 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 " Andrzej Ostruszka @ 2020-05-04 8:53 ` Andrzej Ostruszka 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 2/4] if_proxy: add library documentation Andrzej Ostruszka ` (2 subsequent siblings) 3 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-05-04 8:53 UTC (permalink / raw) To: dev, Thomas Monjalon This library allows to designate ports visible to the system (such as Tun/Tap or KNI) as port representors serving as proxies for other DPDK ports. When such a proxy is configured this library initially queries network configuration from the system and later monitors its changes. The information gathered is passed to the application either via a set of user registered callbacks or as an event added to the configured notification queue (or a combination of these two mechanisms). This way user can use normal network utilities (like those from the iproute2 suite) to configure DPDK ports. 
Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 3 + config/common_base | 5 + config/common_linux | 1 + lib/Makefile | 2 + lib/librte_eal/include/rte_eal_interrupts.h | 2 + lib/librte_eal/linux/eal_interrupts.c | 14 +- lib/librte_if_proxy/Makefile | 29 + lib/librte_if_proxy/if_proxy_common.c | 564 ++++++++++++++++++ lib/librte_if_proxy/if_proxy_priv.h | 97 +++ lib/librte_if_proxy/linux/Makefile | 4 + lib/librte_if_proxy/linux/if_proxy.c | 563 ++++++++++++++++++ lib/librte_if_proxy/meson.build | 19 + lib/librte_if_proxy/rte_if_proxy.h | 585 +++++++++++++++++++ lib/librte_if_proxy/rte_if_proxy_version.map | 20 + lib/meson.build | 2 +- 15 files changed, 1905 insertions(+), 5 deletions(-) create mode 100644 lib/librte_if_proxy/Makefile create mode 100644 lib/librte_if_proxy/if_proxy_common.c create mode 100644 lib/librte_if_proxy/if_proxy_priv.h create mode 100644 lib/librte_if_proxy/linux/Makefile create mode 100644 lib/librte_if_proxy/linux/if_proxy.c create mode 100644 lib/librte_if_proxy/meson.build create mode 100644 lib/librte_if_proxy/rte_if_proxy.h create mode 100644 lib/librte_if_proxy/rte_if_proxy_version.map diff --git a/MAINTAINERS b/MAINTAINERS index e05c80504..1013745ce 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1472,6 +1472,9 @@ F: examples/bpf/ F: app/test/test_bpf.c F: doc/guides/prog_guide/bpf_lib.rst +IF Proxy - EXPERIMENTAL +M: Andrzej Ostruszka <aostruszka@marvell.com> +F: lib/librte_if_proxy/ Test Applications ----------------- diff --git a/config/common_base b/config/common_base index 14000ba07..95ca8dbf6 100644 --- a/config/common_base +++ b/config/common_base @@ -1087,6 +1087,11 @@ CONFIG_RTE_LIBRTE_BPF_ELF=n # CONFIG_RTE_LIBRTE_IPSEC=y +# +# Compile librte_if_proxy +# +CONFIG_RTE_LIBRTE_IF_PROXY=n + # # Compile the test application # diff --git a/config/common_linux b/config/common_linux index 816810671..1244eb0ae 100644 --- a/config/common_linux +++ b/config/common_linux @@ -16,6 +16,7 @@ 
CONFIG_RTE_LIBRTE_VHOST_NUMA=y CONFIG_RTE_LIBRTE_VHOST_POSTCOPY=n CONFIG_RTE_LIBRTE_PMD_VHOST=y CONFIG_RTE_LIBRTE_IFC_PMD=y +CONFIG_RTE_LIBRTE_IF_PROXY=y CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y CONFIG_RTE_LIBRTE_PMD_MEMIF=y CONFIG_RTE_LIBRTE_PMD_SOFTNIC=y diff --git a/lib/Makefile b/lib/Makefile index d0ec3919b..1e7d78183 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -118,6 +118,8 @@ DIRS-$(CONFIG_RTE_LIBRTE_TELEMETRY) += librte_telemetry DEPDIRS-librte_telemetry := librte_eal librte_metrics librte_ethdev DIRS-$(CONFIG_RTE_LIBRTE_RCU) += librte_rcu DEPDIRS-librte_rcu := librte_eal librte_ring +DIRS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += librte_if_proxy +DEPDIRS-librte_if_proxy := librte_eal librte_ethdev ifeq ($(CONFIG_RTE_EXEC_ENV_LINUX),y) DIRS-$(CONFIG_RTE_LIBRTE_KNI) += librte_kni diff --git a/lib/librte_eal/include/rte_eal_interrupts.h b/lib/librte_eal/include/rte_eal_interrupts.h index 773a34a42..296a3853d 100644 --- a/lib/librte_eal/include/rte_eal_interrupts.h +++ b/lib/librte_eal/include/rte_eal_interrupts.h @@ -36,6 +36,8 @@ enum rte_intr_handle_type { RTE_INTR_HANDLE_VDEV, /**< virtual device */ RTE_INTR_HANDLE_DEV_EVENT, /**< device event handle */ RTE_INTR_HANDLE_VFIO_REQ, /**< VFIO request handle */ + RTE_INTR_HANDLE_NETLINK, /**< netlink notification handle */ + RTE_INTR_HANDLE_MAX /**< count of elements */ }; diff --git a/lib/librte_eal/linux/eal_interrupts.c b/lib/librte_eal/linux/eal_interrupts.c index 16e7a7d51..91ddafc59 100644 --- a/lib/librte_eal/linux/eal_interrupts.c +++ b/lib/librte_eal/linux/eal_interrupts.c @@ -691,6 +691,9 @@ rte_intr_enable(const struct rte_intr_handle *intr_handle) break; /* not used at this moment */ case RTE_INTR_HANDLE_ALARM: +#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: +#endif rc = -1; break; #ifdef VFIO_PRESENT @@ -818,6 +821,9 @@ rte_intr_disable(const struct rte_intr_handle *intr_handle) break; /* not used at this moment */ case RTE_INTR_HANDLE_ALARM: +#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: 
+#endif rc = -1; break; #ifdef VFIO_PRESENT @@ -915,12 +921,12 @@ eal_intr_process_interrupts(struct epoll_event *events, int nfds) break; #endif #endif - case RTE_INTR_HANDLE_VDEV: case RTE_INTR_HANDLE_EXT: - bytes_read = 0; - call = true; - break; + case RTE_INTR_HANDLE_VDEV: case RTE_INTR_HANDLE_DEV_EVENT: +#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: +#endif bytes_read = 0; call = true; break; diff --git a/lib/librte_if_proxy/Makefile b/lib/librte_if_proxy/Makefile new file mode 100644 index 000000000..43cb702a2 --- /dev/null +++ b/lib/librte_if_proxy/Makefile @@ -0,0 +1,29 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(C) 2020 Marvell International Ltd. + +include $(RTE_SDK)/mk/rte.vars.mk + +# library name +LIB = librte_if_proxy.a + +CFLAGS += -DALLOW_EXPERIMENTAL_API +CFLAGS += -O3 +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) +LDLIBS += -lrte_eal -lrte_ethdev + +EXPORT_MAP := rte_if_proxy_version.map + +LIBABIVER := 1 + +# all source are stored in SRCS-y +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) := if_proxy_common.c + +SYSDIR := $(patsubst "%app",%,$(CONFIG_RTE_EXEC_ENV)) +include $(SRCDIR)/$(SYSDIR)/Makefile + +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += $(addprefix $(SYSDIR)/,$(SRCS)) + +# install this header file +SYMLINK-$(CONFIG_RTE_LIBRTE_IF_PROXY)-include := rte_if_proxy.h + +include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_if_proxy/if_proxy_common.c b/lib/librte_if_proxy/if_proxy_common.c new file mode 100644 index 000000000..6f72511f4 --- /dev/null +++ b/lib/librte_if_proxy/if_proxy_common.c @@ -0,0 +1,564 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ + +#include "if_proxy_priv.h" +#include <rte_string_fns.h> + +/* Definitions of data mentioned in if_proxy_priv.h and local ones. */ +int ifpx_log_type; + +/* Table keeping mapping between port and their proxies. 
*/ +static +uint16_t ifpx_ports[RTE_MAX_ETHPORTS]; + +rte_spinlock_t ifpx_lock = RTE_SPINLOCK_INITIALIZER; + +struct ifpx_proxies_head ifpx_proxies = TAILQ_HEAD_INITIALIZER(ifpx_proxies); + +struct ifpx_queue_node { + TAILQ_ENTRY(ifpx_queue_node) elem; + uint16_t state; + struct rte_ring *r; +}; +static +TAILQ_HEAD(ifpx_queues_head, ifpx_queue_node) ifpx_queues = + TAILQ_HEAD_INITIALIZER(ifpx_queues); + +/* All callbacks have similar signature (taking pointer to some event) so we'll + * use this f_ptr to typecast and invoke them in a generic way. There is one + * exception though - notification about completed initial configuration - and + * it is handled separately. + */ +union ifpx_cb_ptr { + int (*f_ptr)(void *ev); /* type for normal event notification */ + union rte_ifpx_cb_ptr cb; +} ifpx_callbacks[RTE_IFPX_NUM_EVENTS]; + +uint64_t rte_ifpx_events_available(void) +{ + if (ifpx_platform.events) + return ifpx_platform.events(); + + /* If callback is not provided then all events are supported. 
*/ + return (1ULL << RTE_IFPX_NUM_EVENTS) - 1; +} + +uint16_t rte_ifpx_proxy_create(enum rte_ifpx_proxy_type type) +{ + char devargs[16] = { '\0' }; + int dev_cnt = 0, nlen; + uint16_t port_id; + + switch (type) { + case RTE_IFPX_DEFAULT: + case RTE_IFPX_TAP: + nlen = strlcpy(devargs, "net_tap", sizeof(devargs)); + break; + case RTE_IFPX_KNI: + nlen = strlcpy(devargs, "net_kni", sizeof(devargs)); + break; + default: + IFPX_LOG(ERR, "Unknown proxy type: %d", type); + return RTE_MAX_ETHPORTS; + } + + RTE_ETH_FOREACH_DEV(port_id) { + if (strcmp(rte_eth_devices[port_id].device->driver->name, + devargs) == 0) + ++dev_cnt; + } + snprintf(devargs+nlen, sizeof(devargs)-nlen, "%d", dev_cnt); + + return rte_ifpx_proxy_create_by_devarg(devargs); +} + +uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg) +{ + uint16_t port_id = RTE_MAX_ETHPORTS; + struct rte_dev_iterator iter; + + if (rte_dev_probe(devarg) < 0) { + IFPX_LOG(ERR, "Failed to create proxy port %s\n", devarg); + return RTE_MAX_ETHPORTS; + } + + if (rte_eth_iterator_init(&iter, devarg) == 0) { + port_id = rte_eth_iterator_next(&iter); + if (port_id != RTE_MAX_ETHPORTS) + rte_eth_iterator_cleanup(&iter); + } + + return port_id; +} + +int ifpx_proxy_destroy(struct ifpx_proxy_node *px) +{ + unsigned int i; + uint16_t proxy_id = px->proxy_id; + + /* This function is expected to be called with a lock held. */ + RTE_ASSERT(rte_spinlock_trylock(&ifpx_lock) == 0); + + if (px->state & IN_USE) { + px->state |= DEL_PENDING; + return 0; + } + + TAILQ_REMOVE(&ifpx_proxies, px, elem); + free(px); + + /* Clear any bindings for this proxy. 
*/ + for (i = 0; i < RTE_DIM(ifpx_ports); ++i) { + if (ifpx_ports[i] == proxy_id) + ifpx_ports[i] = RTE_MAX_ETHPORTS; + } + + return rte_dev_remove(rte_eth_devices[proxy_id].device); +} + +int rte_ifpx_proxy_destroy(uint16_t proxy_id) +{ + struct ifpx_proxy_node *px; + int ec; + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == proxy_id) + break; + } + if (!px) { + ec = -EINVAL; + goto exit; + } + + ec = ifpx_proxy_destroy(px); +exit: + rte_spinlock_unlock(&ifpx_lock); + return ec; +} + +int rte_ifpx_queue_add(struct rte_ring *r) +{ + struct ifpx_queue_node *node; + int ec = 0; + + if (!r) + return -EINVAL; + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(node, &ifpx_queues, elem) { + if (node->r == r) { + ec = -EEXIST; + goto exit; + } + } + + node = malloc(sizeof(*node)); + if (!node) { + ec = -ENOMEM; + goto exit; + } + + node->r = r; + TAILQ_INSERT_TAIL(&ifpx_queues, node, elem); +exit: + rte_spinlock_unlock(&ifpx_lock); + + return ec; +} + +int rte_ifpx_queue_remove(struct rte_ring *r) +{ + struct ifpx_queue_node *node, *next; + int ec = -EINVAL; + + if (!r) + return ec; + + rte_spinlock_lock(&ifpx_lock); + for (node = TAILQ_FIRST(&ifpx_queues); node; node = next) { + next = TAILQ_NEXT(node, elem); + if (node->r != r) + continue; + TAILQ_REMOVE(&ifpx_queues, node, elem); + free(node); + ec = 0; + break; + } + rte_spinlock_unlock(&ifpx_lock); + + return ec; +} + +int rte_ifpx_port_bind(uint16_t port_id, uint16_t proxy_id) +{ + struct rte_eth_dev_info proxy_eth_info; + struct ifpx_proxy_node *px; + int ec; + + rte_spinlock_lock(&ifpx_lock); + + if (port_id >= RTE_MAX_ETHPORTS || proxy_id >= RTE_MAX_ETHPORTS || + /* port is a proxy */ + ifpx_ports[port_id] == port_id) { + IFPX_LOG(ERR, "Invalid port_id: %d", port_id); + ec = -EINVAL; + goto error; + } + + /* Do automatic rebinding but issue a warning since this is not + * considered to be a valid behaviour. 
+ */ + if (ifpx_ports[port_id] != RTE_MAX_ETHPORTS) { + IFPX_LOG(WARNING, "Port already bound: %d -> %d", port_id, + ifpx_ports[port_id]); + } + + /* Search for existing proxy - if not found add one to the list. */ + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == proxy_id) + break; + } + if (!px) { + ec = rte_eth_dev_info_get(proxy_id, &proxy_eth_info); + if (ec < 0 || proxy_eth_info.if_index == 0) { + IFPX_LOG(ERR, "Invalid proxy: %d", proxy_id); + if (ec >= 0) + ec = -EINVAL; + goto error; + } + px = calloc(1, sizeof(*px)); + if (!px) { + ec = -ENOMEM; + goto error; + } + px->proxy_id = proxy_id; + px->info.if_index = proxy_eth_info.if_index; + rte_eth_dev_get_mtu(proxy_id, &px->info.mtu); + rte_eth_macaddr_get(proxy_id, &px->info.mac); + memset(px->info.if_name, 0, sizeof(px->info.if_name)); + TAILQ_INSERT_TAIL(&ifpx_proxies, px, elem); + } + ifpx_ports[port_id] = proxy_id; + rte_spinlock_unlock(&ifpx_lock); + + /* Add proxy MAC to the port - since port will often just forward + * packets from the proxy/system they will be sent with proxy MAC as + * src. In order to pass communication in other direction we should be + * accepting packets with proxy MAC as dst. + */ + rte_eth_dev_mac_addr_add(port_id, &px->info.mac, 0); + + if (ifpx_platform.get_info) + ifpx_platform.get_info(px->info.if_index); + + return 0; + +error: + rte_spinlock_unlock(&ifpx_lock); + return ec; +} + +int rte_ifpx_port_unbind(uint16_t port_id) +{ + unsigned int i, cnt; + uint16_t proxy_id; + struct ifpx_proxy_node *px; + int ec = 0; + + rte_spinlock_lock(&ifpx_lock); + if (port_id >= RTE_MAX_ETHPORTS || + ifpx_ports[port_id] == RTE_MAX_ETHPORTS || + /* port is a proxy */ + ifpx_ports[port_id] == port_id) { + ec = -EINVAL; + goto exit; + } + + proxy_id = ifpx_ports[port_id]; + ifpx_ports[port_id] = RTE_MAX_ETHPORTS; + + for (i = 0, cnt = 0; i < RTE_DIM(ifpx_ports); ++i) { + if (ifpx_ports[i] == proxy_id) + ++cnt; + } + + /* If there is no port bound to this proxy then remove it.
*/ + if (cnt == 0) { + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == proxy_id) + break; + } + RTE_ASSERT(px); + ec = ifpx_proxy_destroy(px); + } +exit: + rte_spinlock_unlock(&ifpx_lock); + return ec; +} + +int rte_ifpx_callbacks_register(unsigned int len, + const struct rte_ifpx_callback cbs[]) +{ + unsigned int i; + + if (!cbs || len == 0) + return -EINVAL; + + rte_spinlock_lock(&ifpx_lock); + + for (i = 0; i < len; ++i) { + if (cbs[i].type < 0 || cbs[i].type > RTE_IFPX_LAST_EVENT) { + IFPX_LOG(WARNING, "Invalid event type: %d", + cbs[i].type); + continue; + } + ifpx_callbacks[cbs[i].type].cb = cbs[i].callback; + } + + rte_spinlock_unlock(&ifpx_lock); + + return 0; +} + +void rte_ifpx_callbacks_unregister_all(void) +{ + rte_spinlock_lock(&ifpx_lock); + memset(&ifpx_callbacks, 0, sizeof(ifpx_callbacks)); + rte_spinlock_unlock(&ifpx_lock); +} + +int rte_ifpx_callbacks_unregister(enum rte_ifpx_event_type ev) +{ + if (ev < 0 || ev > RTE_IFPX_CFG_DONE) + return -EINVAL; + + rte_spinlock_lock(&ifpx_lock); + ifpx_callbacks[ev].f_ptr = NULL; + rte_spinlock_unlock(&ifpx_lock); + + return 0; +} + +uint16_t rte_ifpx_proxy_get(uint16_t port_id) +{ + uint16_t p = RTE_MAX_ETHPORTS; + + if (port_id < RTE_MAX_ETHPORTS) { + rte_spinlock_lock(&ifpx_lock); + p = ifpx_ports[port_id]; + rte_spinlock_unlock(&ifpx_lock); + } + + return p; +} + +unsigned int rte_ifpx_port_get(uint16_t proxy_id, + uint16_t *ports, unsigned int num) +{ + unsigned int p, cnt = 0; + + rte_spinlock_lock(&ifpx_lock); + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] == proxy_id && ifpx_ports[p] != p) { + ++cnt; + if (ports && num > 0) { + *ports++ = p; + --num; + } + } + } + rte_spinlock_unlock(&ifpx_lock); + + return cnt; +} + +const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id) +{ + struct ifpx_proxy_node *px; + + rte_spinlock_lock(&ifpx_lock); + + if (port_id >= RTE_MAX_ETHPORTS || + ifpx_ports[port_id] == RTE_MAX_ETHPORTS) { + rte_spinlock_unlock(&ifpx_lock); + return
NULL; + } + + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == ifpx_ports[port_id]) + break; + } + rte_spinlock_unlock(&ifpx_lock); + RTE_ASSERT(px && "Internal IF Proxy library error"); + + return &px->info; +} + +static +void queue_event(const struct rte_ifpx_event *ev, struct rte_ring *r) +{ + struct rte_ifpx_event *e = malloc(sizeof(*ev)); + + if (!e) { + IFPX_LOG(ERR, "Failed to allocate event!"); + return; + } + RTE_ASSERT(r); + + *e = *ev; + rte_ring_sp_enqueue(r, e); +} + +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px) +{ + struct ifpx_queue_node *q; + int done = 0; + uint16_t p, proxy_id; + + if (px) { + if (px->state & DEL_PENDING) + return; + proxy_id = px->proxy_id; + RTE_ASSERT(proxy_id != RTE_MAX_ETHPORTS); + px->state |= IN_USE; + } else + proxy_id = RTE_MAX_ETHPORTS; + + RTE_ASSERT(ev && ev->type >= 0 && ev->type <= RTE_IFPX_LAST_EVENT); + /* This function is expected to be called with a lock held. */ + RTE_ASSERT(rte_spinlock_trylock(&ifpx_lock) == 0); + + if (ifpx_callbacks[ev->type].f_ptr) { + union ifpx_cb_ptr fun = ifpx_callbacks[ev->type]; + + /* Below we drop the lock for the time of callback call to allow + * for calling of IF Proxy API. + */ + if (px) { + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] != proxy_id || + ifpx_ports[p] == p) + continue; + ev->data.port_id = p; + rte_spinlock_unlock(&ifpx_lock); + done = fun.f_ptr(&ev->data) || done; + rte_spinlock_lock(&ifpx_lock); + } + } else { + RTE_ASSERT(ev->type == RTE_IFPX_CFG_DONE); + rte_spinlock_unlock(&ifpx_lock); + done = fun.cb.cfg_done(); + rte_spinlock_lock(&ifpx_lock); + } + } + if (done) + goto exit; + + /* Event not "consumed" yet so try to notify via queues. 
*/ + TAILQ_FOREACH(q, &ifpx_queues, elem) { + if (px) { + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] != proxy_id || + ifpx_ports[p] == p) + continue; + /* Set the port_id - the remaining params should + * be filled before calling this function. + */ + ev->data.port_id = p; + queue_event(ev, q->r); + } + } else + queue_event(ev, q->r); + } +exit: + if (px) + px->state &= ~IN_USE; +} + +void ifpx_cleanup_proxies(void) +{ + struct ifpx_proxy_node *px, *next; + for (px = TAILQ_FIRST(&ifpx_proxies); px; px = next) { + next = TAILQ_NEXT(px, elem); + if (px->state & DEL_PENDING) + ifpx_proxy_destroy(px); + } +} + +int rte_ifpx_listen(void) +{ + int ec; + + if (!ifpx_platform.listen) + return -ENOTSUP; + + ec = ifpx_platform.listen(); + if (ec == 0 && ifpx_platform.get_info) + ifpx_platform.get_info(0); + + return ec; +} + +int rte_ifpx_close(void) +{ + struct ifpx_proxy_node *px; + struct ifpx_queue_node *q; + unsigned int p; + int ec = 0; + + rte_spinlock_lock(&ifpx_lock); + + if (ifpx_platform.close) { + ec = ifpx_platform.close(); + if (ec != 0) + IFPX_LOG(ERR, "Platform 'close' callback failed."); + } + + /* Remove queues. */ + while (!TAILQ_EMPTY(&ifpx_queues)) { + q = TAILQ_FIRST(&ifpx_queues); + TAILQ_REMOVE(&ifpx_queues, q, elem); + free(q); + } + + /* Clear callbacks. */ + memset(&ifpx_callbacks, 0, sizeof(ifpx_callbacks)); + + /* Unbind ports. */ + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] == RTE_MAX_ETHPORTS) + continue; + /* We don't need to call rte_ifpx_port_unbind() here since we + * clear proxies below anyway, just clearing the mapping is + * enough (and besides it would deadlock :)). + */ + ifpx_ports[p] = RTE_MAX_ETHPORTS; + } + + /* Clear proxies.
*/ + while (!TAILQ_EMPTY(&ifpx_proxies)) { + px = TAILQ_FIRST(&ifpx_proxies); + TAILQ_REMOVE(&ifpx_proxies, px, elem); + free(px); + } + + rte_spinlock_unlock(&ifpx_lock); + + return ec; +} + +RTE_INIT(if_proxy_init) +{ + unsigned int i; + for (i = 0; i < RTE_DIM(ifpx_ports); ++i) + ifpx_ports[i] = RTE_MAX_ETHPORTS; + + ifpx_log_type = rte_log_register("lib.if_proxy"); + if (ifpx_log_type >= 0) + rte_log_set_level(ifpx_log_type, RTE_LOG_WARNING); + + if (ifpx_platform.init) + ifpx_platform.init(); +} diff --git a/lib/librte_if_proxy/if_proxy_priv.h b/lib/librte_if_proxy/if_proxy_priv.h new file mode 100644 index 000000000..7691494be --- /dev/null +++ b/lib/librte_if_proxy/if_proxy_priv.h @@ -0,0 +1,97 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ +#ifndef _IF_PROXY_PRIV_H_ +#define _IF_PROXY_PRIV_H_ + +#include "rte_if_proxy.h" +#include <rte_spinlock.h> + +#define RTE_IFPX_LAST_EVENT RTE_IFPX_CFG_DONE +#define RTE_IFPX_NUM_EVENTS (RTE_IFPX_LAST_EVENT+1) +#define RTE_IFPX_EVENT_INVALID RTE_IFPX_NUM_EVENTS + +extern int ifpx_log_type; +#define IFPX_LOG(level, fmt, args...) \ + rte_log(RTE_LOG_ ## level, ifpx_log_type, "%s(): " fmt "\n", \ + __func__, ##args) + +/* Since this library is really a slow/config path we guard all internal data + * with a lock - and only one for all of them should be enough. + */ +extern rte_spinlock_t ifpx_lock; + +enum ifpx_node_status { + IN_USE = 1U << 0, + DEL_PENDING = 1U << 1, +}; + +/* List of configured proxies */ +struct ifpx_proxy_node { + TAILQ_ENTRY(ifpx_proxy_node) elem; + uint16_t proxy_id; + uint16_t state; + struct rte_ifpx_info info; +}; +extern +TAILQ_HEAD(ifpx_proxies_head, ifpx_proxy_node) ifpx_proxies; + +/* This function should be called by the implementation whenever it notices + * change in the network configuration. 
The arguments are: + * - ev : pointer to filled event data structure (all fields are expected to be + * filled, with the exception of 'port_id' for all proxy/port related + * events: this function clones the event notification for each bound port + * and fills 'port_id' appropriately). + * - px : proxy node when given event is proxy/port related, otherwise pass NULL + */ +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px); + +/* This function should be called by the implementation whenever it is done with + * notification about network configuration change. It is only really needed + * for the case of callback based API since from the callback user might attempt + * to remove proxies. Only the implementation really knows when notification + * for a given proxy is finished, so it is its duty to call this function to + * clean up all proxies that have been marked for deletion. + */ +void ifpx_cleanup_proxies(void); + +/* This is the internal function removing the proxy from the list. It is + * related to the notification function above and intended to be used by the + * platform implementation for the case of callback based API. + * During notification via callback the internal lock is released so that + * operation would not deadlock on an attempt to take a lock. However + * modification (destruction) is not really performed - instead the + * callbacks/proxies are marked as "to be deleted". + * Handling of callbacks that are "to be deleted" is done by the + * ifpx_notify_event() function itself however it cannot delete the proxies (in + * particular the proxy passed as an argument) since they might still be + * referred by the calling function. So it is a responsibility of the platform + * implementation to check after calling notification function if there are any + * proxies to be removed and use ifpx_proxy_destroy() to actually release them.
+ */ +int ifpx_proxy_destroy(struct ifpx_proxy_node *px); + +/* Every implementation should provide definition of this structure: + * - init : called during library initialization (NULL when not needed) + * - events : this should return bitmask of supported events (can be NULL if all + * defined events are supported by the implementation) + * - listen : this function should start service listening to the network + * configuration events/changes, + * - close : this function should close the service started by listen() + * - get_info : this function should query system for current configuration of + * interface with index 'if_index'. After successful initialization of + * listening service this function is called with 0 as an argument. In that + * case configuration of all ports should be obtained - and when this + * procedure completes a RTE_IFPX_CFG_DONE event should be signaled via + * ifpx_notify_event(). + */ +extern +struct ifpx_platform_callbacks { + void (*init)(void); + uint64_t (*events)(void); + int (*listen)(void); + int (*close)(void); + void (*get_info)(int if_index); +} ifpx_platform; + +#endif /* _IF_PROXY_PRIV_H_ */ diff --git a/lib/librte_if_proxy/linux/Makefile b/lib/librte_if_proxy/linux/Makefile new file mode 100644 index 000000000..275b7e1e3 --- /dev/null +++ b/lib/librte_if_proxy/linux/Makefile @@ -0,0 +1,4 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(C) 2020 Marvell International Ltd. + +SRCS += if_proxy.c diff --git a/lib/librte_if_proxy/linux/if_proxy.c b/lib/librte_if_proxy/linux/if_proxy.c new file mode 100644 index 000000000..618631b01 --- /dev/null +++ b/lib/librte_if_proxy/linux/if_proxy.c @@ -0,0 +1,563 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. 
+ */ +#include <if_proxy_priv.h> +#include <rte_interrupts.h> +#include <rte_string_fns.h> + +#include <stdbool.h> +#include <unistd.h> +#include <errno.h> +#include <sys/socket.h> +#include <linux/rtnetlink.h> +#include <linux/if.h> + +static +struct rte_intr_handle ifpx_irq = { + .type = RTE_INTR_HANDLE_NETLINK, + .fd = -1, +}; + +static +unsigned int ifpx_pid; + +static +int request_info(int type, int index) +{ + static rte_spinlock_t send_lock = RTE_SPINLOCK_INITIALIZER; + struct info_get { + struct nlmsghdr h; + union { + struct ifinfomsg ifm; + struct ifaddrmsg ifa; + struct rtmsg rtm; + struct ndmsg ndm; + } __rte_aligned(NLMSG_ALIGNTO); + } info_req; + int ret; + + memset(&info_req, 0, sizeof(info_req)); + /* First byte of these messages is family, so just make sure that this + * memset is enough to get all families. + */ + RTE_ASSERT(AF_UNSPEC == 0); + + info_req.h.nlmsg_pid = ifpx_pid; + info_req.h.nlmsg_type = type; + info_req.h.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP; + info_req.h.nlmsg_len = offsetof(struct info_get, ifm); + + switch (type) { + case RTM_GETLINK: + info_req.h.nlmsg_len += sizeof(info_req.ifm); + info_req.ifm.ifi_index = index; + break; + case RTM_GETADDR: + info_req.h.nlmsg_len += sizeof(info_req.ifa); + info_req.ifa.ifa_index = index; + break; + case RTM_GETROUTE: + info_req.h.nlmsg_len += sizeof(info_req.rtm); + break; + case RTM_GETNEIGH: + info_req.h.nlmsg_len += sizeof(info_req.ndm); + break; + default: + IFPX_LOG(WARNING, "Unhandled message type: %d", type); + return -EINVAL; + } + /* Store request type (and if it is global or link specific) in 'seq'. + * Later it is used during handling of reply to continue requesting of + * information dump from system - if needed. 
+ */ + info_req.h.nlmsg_seq = index << 8 | type; + + IFPX_LOG(DEBUG, "\tRequesting msg %d for: %u", type, index); + + rte_spinlock_lock(&send_lock); +retry: + ret = send(ifpx_irq.fd, &info_req, info_req.h.nlmsg_len, 0); + if (ret < 0) { + if (errno == EINTR) { + IFPX_LOG(DEBUG, "send() interrupted"); + goto retry; + } + IFPX_LOG(ERR, "Failed to send netlink msg: %d", errno); + rte_errno = errno; + } + rte_spinlock_unlock(&send_lock); + + return ret; +} + +static +void handle_link(const struct nlmsghdr *h) +{ + const struct ifinfomsg *ifi = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi)); + const struct rtattr *attrs[IFLA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct ifpx_proxy_node *px; + struct rte_ifpx_event ev; + + IFPX_LOG(DEBUG, "\tLink action (%u): %u, 0x%x/0x%x (flags/changed)", + ifi->ifi_index, h->nlmsg_type, ifi->ifi_flags, + ifi->ifi_change); + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == (unsigned int)ifi->ifi_index) + break; + } + + /* Drop messages that are not associated with any proxy */ + if (!px) + goto exit; + /* When message is a reply to request for specific interface then keep + * it only when it contains info for this interface. 
+ */ + if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 && + (h->nlmsg_seq >> 8) != (unsigned int)ifi->ifi_index) + goto exit; + + for (attr = IFLA_RTA(ifi); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > IFLA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + if (ifi->ifi_change & IFF_UP) { + ev.type = RTE_IFPX_LINK_CHANGE; + ev.link_change.is_up = ifi->ifi_flags & IFF_UP; + ifpx_notify_event(&ev, px); + } + if (attrs[IFLA_MTU]) { + uint16_t mtu = *(const int *)RTA_DATA(attrs[IFLA_MTU]); + if (mtu != px->info.mtu) { + px->info.mtu = mtu; + ev.type = RTE_IFPX_MTU_CHANGE; + ev.mtu_change.mtu = mtu; + ifpx_notify_event(&ev, px); + } + } + if (attrs[IFLA_ADDRESS]) { + const struct rte_ether_addr *mac = + RTA_DATA(attrs[IFLA_ADDRESS]); + + RTE_ASSERT(RTA_PAYLOAD(attrs[IFLA_ADDRESS]) == + RTE_ETHER_ADDR_LEN); + if (memcmp(mac, &px->info.mac, RTE_ETHER_ADDR_LEN) != 0) { + rte_ether_addr_copy(mac, &px->info.mac); + ev.type = RTE_IFPX_MAC_CHANGE; + rte_ether_addr_copy(mac, &ev.mac_change.mac); + ifpx_notify_event(&ev, px); + } + } + if (h->nlmsg_pid == ifpx_pid) { + RTE_ASSERT((h->nlmsg_seq & 0xFF) == RTM_GETLINK); + /* If this is reply for specific link request (not initial + * global dump) then follow up with address request, otherwise + * just store the interface name. 
+ */ + if (h->nlmsg_seq >> 8) + request_info(RTM_GETADDR, ifi->ifi_index); + else if (!px->info.if_name[0] && attrs[IFLA_IFNAME]) + strlcpy(px->info.if_name, RTA_DATA(attrs[IFLA_IFNAME]), + sizeof(px->info.if_name)); + } + + ifpx_cleanup_proxies(); +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +static +void handle_addr(const struct nlmsghdr *h, bool needs_del) +{ + const struct ifaddrmsg *ifa = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifa)); + const struct rtattr *attrs[IFA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct ifpx_proxy_node *px; + struct rte_ifpx_event ev; + const uint8_t *ip; + + IFPX_LOG(DEBUG, "\tAddr action (%u): %u, family: %u", + ifa->ifa_index, h->nlmsg_type, ifa->ifa_family); + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == ifa->ifa_index) + break; + } + + /* Drop messages that are not associated with any proxy */ + if (!px) + goto exit; + /* When message is a reply to request for specific interface then keep + * it only when it contains info for this interface. + */ + if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 && + (h->nlmsg_seq >> 8) != ifa->ifa_index) + goto exit; + + for (attr = IFA_RTA(ifa); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > IFA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + if (attrs[IFA_ADDRESS]) { + ip = RTA_DATA(attrs[IFA_ADDRESS]); + if (ifa->ifa_family == AF_INET) { + ev.type = needs_del ? RTE_IFPX_ADDR_DEL + : RTE_IFPX_ADDR_ADD; + ev.addr_change.ip = + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + } else { + ev.type = needs_del ? 
RTE_IFPX_ADDR6_DEL + : RTE_IFPX_ADDR6_ADD; + memcpy(ev.addr6_change.ip, ip, 16); + } + ifpx_notify_event(&ev, px); + ifpx_cleanup_proxies(); + } +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +static +void handle_route(const struct nlmsghdr *h, bool needs_del) +{ + const struct rtmsg *r = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*r)); + const struct rtattr *attrs[RTA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct rte_ifpx_event ev; + struct ifpx_proxy_node *px = NULL; + const uint8_t *ip; + + IFPX_LOG(DEBUG, "\tRoute action: %u, family: %u", + h->nlmsg_type, r->rtm_family); + + for (attr = RTM_RTA(r); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > RTA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + memset(&ev, 0, sizeof(ev)); + ev.type = RTE_IFPX_EVENT_INVALID; + + rte_spinlock_lock(&ifpx_lock); + if (attrs[RTA_OIF]) { + int if_index = *((int32_t *)RTA_DATA(attrs[RTA_OIF])); + + if (if_index > 0) { + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == (uint32_t)if_index) + break; + } + } + } + /* We are only interested in routes related to the proxy interfaces and + * we need to have dst - otherwise skip the message. + */ + if (!px || !attrs[RTA_DST]) + goto exit; + + ip = RTA_DATA(attrs[RTA_DST]); + /* This is common to both IPv4/6. */ + ev.route_change.depth = r->rtm_dst_len; + if (r->rtm_family == AF_INET) { + ev.type = needs_del ? RTE_IFPX_ROUTE_DEL + : RTE_IFPX_ROUTE_ADD; + ev.route_change.ip = RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + } else { + ev.type = needs_del ? 
RTE_IFPX_ROUTE6_DEL + : RTE_IFPX_ROUTE6_ADD; + memcpy(ev.route6_change.ip, ip, 16); + } + if (attrs[RTA_GATEWAY]) { + ip = RTA_DATA(attrs[RTA_GATEWAY]); + if (r->rtm_family == AF_INET) + ev.route_change.gateway = + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + else + memcpy(ev.route6_change.gateway, ip, 16); + } + + ifpx_notify_event(&ev, px); + /* Let's check for proxies to remove here too - just in case somebody + * removed the non-proxy related callback. + */ + ifpx_cleanup_proxies(); +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +/* Link, addr and route related messages seem to have this macro defined but not + * neighbour one. Define one if it is missing - const qualifiers added just to + * silence compiler - for some reason it is not needed in equivalent macros for + * other messages and here compiler is complaining about (char*) cast on pointer + * to const. + */ +#ifndef NDA_RTA +#define NDA_RTA(r) ((const struct rtattr *)(((const char *)(r)) + \ + NLMSG_ALIGN(sizeof(struct ndmsg)))) +#endif + +static +void handle_neigh(const struct nlmsghdr *h, bool needs_del) +{ + const struct ndmsg *n = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*n)); + const struct rtattr *attrs[NDA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct ifpx_proxy_node *px; + struct rte_ifpx_event ev; + const uint8_t *ip; + + IFPX_LOG(DEBUG, "\tNeighbour action: %u, family: %u, state: %u, if: %d", + h->nlmsg_type, n->ndm_family, n->ndm_state, n->ndm_ifindex); + + for (attr = NDA_RTA(n); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > NDA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + memset(&ev, 0, sizeof(ev)); + ev.type = RTE_IFPX_EVENT_INVALID; + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == (unsigned int)n->ndm_ifindex) + break; + } + /* We need only subset of neighbourhood related to proxy interfaces. 
+ * lladdr seems to be needed only for adding new entry - modifications + * (also reported via RTM_NEWNEIGH) and deletion include only dst. + */ + if (!px || !attrs[NDA_DST] || (!needs_del && !attrs[NDA_LLADDR])) + goto exit; + + ip = RTA_DATA(attrs[NDA_DST]); + if (n->ndm_family == AF_INET) { + ev.type = needs_del ? RTE_IFPX_NEIGH_DEL + : RTE_IFPX_NEIGH_ADD; + ev.neigh_change.ip = RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + } else { + ev.type = needs_del ? RTE_IFPX_NEIGH6_DEL + : RTE_IFPX_NEIGH6_ADD; + memcpy(ev.neigh6_change.ip, ip, 16); + } + if (attrs[NDA_LLADDR]) + rte_ether_addr_copy(RTA_DATA(attrs[NDA_LLADDR]), + &ev.neigh_change.mac); + + ifpx_notify_event(&ev, px); + /* Let's check for proxies to remove here too - just in case somebody + * removed the non-proxy related callback. + */ + ifpx_cleanup_proxies(); +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +static +void if_proxy_intr_callback(void *arg __rte_unused) +{ + struct nlmsghdr *h; + struct sockaddr_nl addr; + socklen_t addr_len = sizeof(addr); + char buf[8192]; + ssize_t len; + +restart: + len = recvfrom(ifpx_irq.fd, buf, sizeof(buf), 0, + (struct sockaddr *)&addr, &addr_len); + if (len < 0) { + if (errno == EINTR) { + IFPX_LOG(DEBUG, "recvfrom() interrupted"); + goto restart; + } + IFPX_LOG(ERR, "Failed to read netlink msg: %ld (errno %d)", + len, errno); + return; + } + if (addr_len != sizeof(addr)) { + IFPX_LOG(ERR, "Invalid netlink addr size: %d", addr_len); + return; + } + IFPX_LOG(DEBUG, "Read %lu bytes (buf %lu) from %u/%u", len, + sizeof(buf), addr.nl_pid, addr.nl_groups); + + for (h = (struct nlmsghdr *)buf; NLMSG_OK(h, len); + h = NLMSG_NEXT(h, len)) { + IFPX_LOG(DEBUG, "Recv msg: %u (%u/%u/%u seq/flags/pid)", + h->nlmsg_type, h->nlmsg_seq, h->nlmsg_flags, + h->nlmsg_pid); + + switch (h->nlmsg_type) { + case RTM_NEWLINK: + case RTM_DELLINK: + handle_link(h); + break; + case RTM_NEWADDR: + case RTM_DELADDR: + handle_addr(h, h->nlmsg_type == RTM_DELADDR); + break; + case RTM_NEWROUTE: + case RTM_DELROUTE: +
handle_route(h, h->nlmsg_type == RTM_DELROUTE); + break; + case RTM_NEWNEIGH: + case RTM_DELNEIGH: + handle_neigh(h, h->nlmsg_type == RTM_DELNEIGH); + break; + } + + /* If this is a reply for global request then follow up with + * additional requests and notify about finish. + */ + if (h->nlmsg_pid == ifpx_pid && (h->nlmsg_seq >> 8) == 0 && + h->nlmsg_type == NLMSG_DONE) { + if ((h->nlmsg_seq & 0xFF) == RTM_GETLINK) + request_info(RTM_GETADDR, 0); + else if ((h->nlmsg_seq & 0xFF) == RTM_GETADDR) + request_info(RTM_GETROUTE, 0); + else if ((h->nlmsg_seq & 0xFF) == RTM_GETROUTE) + request_info(RTM_GETNEIGH, 0); + else { + struct rte_ifpx_event ev = { + .type = RTE_IFPX_CFG_DONE + }; + + RTE_ASSERT((h->nlmsg_seq & 0xFF) == + RTM_GETNEIGH); + rte_spinlock_lock(&ifpx_lock); + ifpx_notify_event(&ev, NULL); + rte_spinlock_unlock(&ifpx_lock); + } + } + } + IFPX_LOG(DEBUG, "Finished msg loop: %ld bytes left", len); +} + +static +int nlink_listen(void) +{ + struct sockaddr_nl addr = { + .nl_family = AF_NETLINK, + .nl_pid = 0, + }; + socklen_t addr_len = sizeof(addr); + int ret; + + if (ifpx_irq.fd != -1) { + rte_errno = EBUSY; + return -1; + } + + addr.nl_groups = 1 << (RTNLGRP_LINK-1) + | 1 << (RTNLGRP_NEIGH-1) + | 1 << (RTNLGRP_IPV4_IFADDR-1) + | 1 << (RTNLGRP_IPV6_IFADDR-1) + | 1 << (RTNLGRP_IPV4_ROUTE-1) + | 1 << (RTNLGRP_IPV6_ROUTE-1); + + ifpx_irq.fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, + NETLINK_ROUTE); + if (ifpx_irq.fd == -1) { + IFPX_LOG(ERR, "Failed to create netlink socket: %d", errno); + goto error; + } + /* Starting with kernel 4.19 you can request dump for a specific + * interface and kernel will filter out and send only relevant info. + * Otherwise NLM_F_DUMP will generate info for all interfaces and you + * need to filter them yourself. 
+ */ +#ifdef NETLINK_GET_STRICT_CHK + ret = 1; /* use this var also as an input param */ + ret = setsockopt(ifpx_irq.fd, SOL_NETLINK, NETLINK_GET_STRICT_CHK, + &ret, sizeof(ret)); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to set socket option: %d", errno); + goto error; + } +#endif + + ret = bind(ifpx_irq.fd, (struct sockaddr *)&addr, addr_len); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to bind socket: %d", errno); + goto error; + } + ret = getsockname(ifpx_irq.fd, (struct sockaddr *)&addr, &addr_len); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to get socket addr: %d", errno); + goto error; + } else { + ifpx_pid = addr.nl_pid; + IFPX_LOG(DEBUG, "Assigned port ID: %u", addr.nl_pid); + } + + ret = rte_intr_callback_register(&ifpx_irq, if_proxy_intr_callback, + NULL); + if (ret == 0) + return 0; + +error: + rte_errno = errno; + if (ifpx_irq.fd != -1) { + close(ifpx_irq.fd); + ifpx_irq.fd = -1; + } + return -1; +} + +static +int nlink_close(void) +{ + int ec; + + if (ifpx_irq.fd < 0) + return -EBADFD; + + /* Drop the lock for the time of unregistering - otherwise we might + * deadlock, e.g. we take a lock here and try to unregister and wait for the + * interrupt lock but it is taken already because notification comes + * and executes proxy callback which will try to take a lock.
+ */ + rte_spinlock_unlock(&ifpx_lock); + do + ec = rte_intr_callback_unregister(&ifpx_irq, + if_proxy_intr_callback, NULL); + while (ec == -EAGAIN); /* unlikely but possible - at least I think so */ + rte_spinlock_lock(&ifpx_lock); + + close(ifpx_irq.fd); + ifpx_irq.fd = -1; + ifpx_pid = 0; + + return 0; +} + +static +void nlink_get_info(int if_index) +{ + if (ifpx_irq.fd != -1) + request_info(RTM_GETLINK, if_index); +} + +struct ifpx_platform_callbacks ifpx_platform = { + .init = NULL, + .events = NULL, + .listen = nlink_listen, + .close = nlink_close, + .get_info = nlink_get_info, +}; diff --git a/lib/librte_if_proxy/meson.build b/lib/librte_if_proxy/meson.build new file mode 100644 index 000000000..f0c1a6e15 --- /dev/null +++ b/lib/librte_if_proxy/meson.build @@ -0,0 +1,19 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(C) 2020 Marvell International Ltd. + +# Currently only implemented on Linux +if not is_linux + build = false + reason = 'only supported on linux' +endif + +version = 1 +allow_experimental_apis = true + +deps += ['ethdev'] +sources = files('if_proxy_common.c') +headers = files('rte_if_proxy.h') + +if is_linux + sources += files('linux/if_proxy.c') +endif diff --git a/lib/librte_if_proxy/rte_if_proxy.h b/lib/librte_if_proxy/rte_if_proxy.h new file mode 100644 index 000000000..2378b4424 --- /dev/null +++ b/lib/librte_if_proxy/rte_if_proxy.h @@ -0,0 +1,585 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ + +#ifndef _RTE_IF_PROXY_H_ +#define _RTE_IF_PROXY_H_ + +/** + * @file + * RTE IF Proxy library + * + * The IF Proxy library allows for monitoring of system network configuration + * and configuration of DPDK ports by using usual system utilities (like the + * ones from iproute2 package). 
+ * + * It is based on the notion of "proxy interface" which actually can be any DPDK + * port which is also visible to the system - that is it has non-zero 'if_index' + * field in 'rte_eth_dev_info' structure. + * + * If application doesn't have any such port (or doesn't want to use it for + * proxy) it can create one by calling: + * + * proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT); + * + * This function is just a wrapper that constructs valid 'devargs' string based + * on the proxy type chosen (currently Tap or KNI) and creates the interface by + * calling rte_ifpx_proxy_create_by_devarg(). + * + * Once one has DPDK port capable of being proxy one can bind target DPDK port + * to it by calling: + * + * rte_ifpx_port_bind(port_id, proxy_id); + * + * This binding is a logical one - there is no automatic packet forwarding + * between port and its proxy since the library doesn't know the structure of + * application's packet processing. It is the application's responsibility to + * forward the packets from/to proxy port (by calling the usual DPDK RX/TX burst + * API). However when the library notes some change to the proxy interface it + * will simply call appropriate callback with 'port_id' of the DPDK port that is + * bound to this proxy interface. The binding can be 1 to many - that is many + * ports can point to one proxy - in that case registered callbacks will be + * called for every bound port. + * + * The callbacks that are used for notifications are described by the + * 'rte_ifpx_callback' structure and they are registered by calling: + * + * rte_ifpx_callbacks_register(len, cbs); + * + * where cbs is an array of 'len' rte_ifpx_callback structures. + * @see rte_ifpx_callbacks_register() + * + * Finally the application should call: + * + * rte_ifpx_listen(); + * + * which will query system for present network configuration and start listening + * to its changes.
+ */ + +#include <rte_eal.h> +#include <rte_ethdev.h> + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * Enum naming the type of proxy to create. + * + * @see rte_ifpx_proxy_create() + */ +enum rte_ifpx_proxy_type { + RTE_IFPX_DEFAULT, /**< Use default proxy type for given arch. */ + RTE_IFPX_TAP, /**< Use Tap based port for proxy. */ + RTE_IFPX_KNI /**< Use KNI based port for proxy. */ +}; + +/** + * Create DPDK port that can serve as an interface proxy. + * + * This function is just a wrapper around rte_ifpx_proxy_create_by_devarg() + * that constructs its 'devarg' argument based on type of proxy requested. + * + * @param type + * A type of proxy to create. + * + * @return + * DPDK port id on success, RTE_MAX_ETHPORTS otherwise. + * + * @see enum rte_ifpx_proxy_type + * @see rte_ifpx_proxy_create_by_devarg() + */ +__rte_experimental +uint16_t rte_ifpx_proxy_create(enum rte_ifpx_proxy_type type); + +/** + * Create DPDK port that can serve as an interface proxy. + * + * @param devarg + * A string passed to rte_dev_probe() to create proxy port. + * + * @return + * DPDK port id on success, RTE_MAX_ETHPORTS otherwise. + */ +__rte_experimental +uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg); + +/** + * Remove DPDK proxy port. + * + * In addition to removing the proxy port the bindings (if any) are cleared. + * + * @param proxy_id + * Port id of the proxy that should be removed. + * + * @return + * 0 on success, negative on error. + */ +__rte_experimental +int rte_ifpx_proxy_destroy(uint16_t proxy_id); + +/** + * The rte_ifpx_event_type enum lists all possible event types that can be + * signaled by this library. To learn what events are supported on your + * platform call rte_ifpx_events_available(). + * + * NOTE - in order to keep ABI stable do not reorder these enums freely.
+ */ +enum rte_ifpx_event_type { + RTE_IFPX_MAC_CHANGE, /**< @see struct rte_ifpx_mac_change */ + RTE_IFPX_MTU_CHANGE, /**< @see struct rte_ifpx_mtu_change */ + RTE_IFPX_LINK_CHANGE, /**< @see struct rte_ifpx_link_change */ + RTE_IFPX_ADDR_ADD, /**< @see struct rte_ifpx_addr_change */ + RTE_IFPX_ADDR_DEL, /**< @see struct rte_ifpx_addr_change */ + RTE_IFPX_ADDR6_ADD, /**< @see struct rte_ifpx_addr6_change */ + RTE_IFPX_ADDR6_DEL, /**< @see struct rte_ifpx_addr6_change */ + RTE_IFPX_ROUTE_ADD, /**< @see struct rte_ifpx_route_change */ + RTE_IFPX_ROUTE_DEL, /**< @see struct rte_ifpx_route_change */ + RTE_IFPX_ROUTE6_ADD, /**< @see struct rte_ifpx_route6_change */ + RTE_IFPX_ROUTE6_DEL, /**< @see struct rte_ifpx_route6_change */ + RTE_IFPX_NEIGH_ADD, /**< @see struct rte_ifpx_neigh_change */ + RTE_IFPX_NEIGH_DEL, /**< @see struct rte_ifpx_neigh_change */ + RTE_IFPX_NEIGH6_ADD, /**< @see struct rte_ifpx_neigh6_change */ + RTE_IFPX_NEIGH6_DEL, /**< @see struct rte_ifpx_neigh6_change */ + RTE_IFPX_CFG_DONE, /**< This event is a lib specific event - it is + * signaled when initial network configuration + * query is finished and has no event data. + */ +}; + +/** + * Get the bit mask of implemented events/callbacks for this platform. + * + * @return + * Bit mask of events/callbacks implemented: each event type can be tested by + * checking bit (1 << ev) where 'ev' is one of the rte_ifpx_event_type enum + * values. + * @see enum rte_ifpx_event_type + */ +__rte_experimental +uint64_t rte_ifpx_events_available(void); + +/** + * The rte_ifpx_event defines structure used to pass notification event to + * application. Each event type has its own dedicated inner structure - these + * structures are also used when using callbacks notifications. + */ +struct rte_ifpx_event { + enum rte_ifpx_event_type type; + union { + /** Structure used to pass notification about MAC change of the + * proxy interface. 
+ * @see RTE_IFPX_MAC_CHANGE + */ + struct rte_ifpx_mac_change { + uint16_t port_id; + struct rte_ether_addr mac; + } mac_change; + /** Structure used to pass notification about MTU change. + * @see RTE_IFPX_MTU_CHANGE + */ + struct rte_ifpx_mtu_change { + uint16_t port_id; + uint16_t mtu; + } mtu_change; + /** Structure used to pass notification about link going + * up/down. + * @see RTE_IFPX_LINK_CHANGE + */ + struct rte_ifpx_link_change { + uint16_t port_id; + int is_up; + } link_change; + /** Structure used to pass notification about IPv4 address being + * added/removed. All IPv4 addresses reported by this library + * are in host order. + * @see RTE_IFPX_ADDR_ADD + * @see RTE_IFPX_ADDR_DEL + */ + struct rte_ifpx_addr_change { + uint16_t port_id; + uint32_t ip; + } addr_change; + /** Structure used to pass notification about IPv6 address being + * added/removed. + * @see RTE_IFPX_ADDR6_ADD + * @see RTE_IFPX_ADDR6_DEL + */ + struct rte_ifpx_addr6_change { + uint16_t port_id; + uint8_t ip[16]; + } addr6_change; + /** Structure used to pass notification about IPv4 route being + * added/removed. + * @see RTE_IFPX_ROUTE_ADD + * @see RTE_IFPX_ROUTE_DEL + */ + struct rte_ifpx_route_change { + uint16_t port_id; + uint8_t depth; + uint32_t ip; + uint32_t gateway; + } route_change; + /** Structure used to pass notification about IPv6 route being + * added/removed. + * @see RTE_IFPX_ROUTE6_ADD + * @see RTE_IFPX_ROUTE6_DEL + */ + struct rte_ifpx_route6_change { + uint16_t port_id; + uint8_t depth; + uint8_t ip[16]; + uint8_t gateway[16]; + } route6_change; + /** Structure used to pass notification about IPv4 neighbour + * info changes. + * @see RTE_IFPX_NEIGH_ADD + * @see RTE_IFPX_NEIGH_DEL + */ + struct rte_ifpx_neigh_change { + uint16_t port_id; + struct rte_ether_addr mac; + uint32_t ip; + } neigh_change; + /** Structure used to pass notification about IPv6 neighbour + * info changes. 
+ * @see RTE_IFPX_NEIGH6_ADD + * @see RTE_IFPX_NEIGH6_DEL + */ + struct rte_ifpx_neigh6_change { + uint16_t port_id; + struct rte_ether_addr mac; + uint8_t ip[16]; + } neigh6_change; + /* This structure is used internally - to abstract common parts + * of proxy/port related events and to be able to refer to this + * union without giving it a name. + */ + struct { + uint16_t port_id; + } data; + }; +}; + +/** + * This library can deliver notification about network configuration changes + * either by the use of registered callbacks and/or by queueing change events to + * configured notification queues. The logic used is: + * 1. If there is callback registered for given event type it is called. In + * case of many ports to one proxy binding, this callback is called for every + * port bound. + * 2. If this callback returns non-zero value (for any of ports in case of + * many-1 bindings) the handling of an event is considered as complete. + * 3. Otherwise the event is added to each configured event queue. The event is + * allocated with malloc() so after dequeueing and handling the application + * should deallocate it with free(). + * + * This dual notification mechanism is meant to provide some flexibility to + * application writer. For example, if you store your data in a single writer/ + * many readers coherent data structure you could just update this structure + * from the callback. If you keep separate copy per lcore/port you could make + * some common preparations (if applicable) in the callback, return 0 and use + * notification queues to pick up the change and update data structures. Or you + * could skip the callbacks altogether and just use notification queues - and + * configure them at the level appropriate for your application design (one + * global / one per lcore / one per port ...). + */ + +/** + * Add notification queue to the list of queues. 
+ * + * @param r + * Ring used for queueing of notification events - application can assume that + * there is only one producer. + * @return + * 0 on success, negative otherwise. + */ +int rte_ifpx_queue_add(struct rte_ring *r); + +/** + * Remove notification queue from the list of queues. + * + * @param r + * Notification ring used for queueing of notification events (previously + * added via rte_ifpx_queue_add()). + * @return + * 0 on success, negative otherwise. + */ +int rte_ifpx_queue_remove(struct rte_ring *r); + +/** + * This union groups the callback types that might be called as a notification + * events for changing network configuration. Not every platform might + * implement all of them and you can query the availability with + * rte_ifpx_events_available() function. + * @see rte_ifpx_events_available() + * @see rte_ifpx_callbacks_register() + */ +union rte_ifpx_cb_ptr { + int (*mac_change)(const struct rte_ifpx_mac_change *event); + /**< Callback for notification about MAC change of the proxy interface. + * This callback (as all other port related callbacks) is called for + * each port (with its port_id as a first argument) bound to the proxy + * interface for which change has been observed. + * @see struct rte_ifpx_mac_change + * @return non-zero if event handling is finished + */ + int (*mtu_change)(const struct rte_ifpx_mtu_change *event); + /**< Callback for notification about MTU change. + * @see struct rte_ifpx_mtu_change + * @return non-zero if event handling is finished + */ + int (*link_change)(const struct rte_ifpx_link_change *event); + /**< Callback for notification about link going up/down. + * @see struct rte_ifpx_link_change + * @return non-zero if event handling is finished + */ + int (*addr_add)(const struct rte_ifpx_addr_change *event); + /**< Callback for notification about IPv4 address being added. 
+ * @see struct rte_ifpx_addr_change + * @return non-zero if event handling is finished + */ + int (*addr_del)(const struct rte_ifpx_addr_change *event); + /**< Callback for notification about IPv4 address removal. + * @see struct rte_ifpx_addr_change + * @return non-zero if event handling is finished + */ + int (*addr6_add)(const struct rte_ifpx_addr6_change *event); + /**< Callback for notification about IPv6 address being added. + * @see struct rte_ifpx_addr6_change + */ + int (*addr6_del)(const struct rte_ifpx_addr6_change *event); + /**< Callback for notification about IPv6 address removal. + * @see struct rte_ifpx_addr6_change + * @return non-zero if event handling is finished + */ + /* Please note that "route" callbacks might also be called when the user + * adds an address to the interface (that is in addition to address related + * callbacks). + */ + int (*route_add)(const struct rte_ifpx_route_change *event); + /**< Callback for notification about IPv4 route being added. + * @see struct rte_ifpx_route_change + * @return non-zero if event handling is finished + */ + int (*route_del)(const struct rte_ifpx_route_change *event); + /**< Callback for notification about IPv4 route removal. + * @see struct rte_ifpx_route_change + * @return non-zero if event handling is finished + */ + int (*route6_add)(const struct rte_ifpx_route6_change *event); + /**< Callback for notification about IPv6 route being added. + * @see struct rte_ifpx_route6_change + * @return non-zero if event handling is finished + */ + int (*route6_del)(const struct rte_ifpx_route6_change *event); + /**< Callback for notification about IPv6 route removal. + * @see struct rte_ifpx_route6_change + * @return non-zero if event handling is finished + */ + int (*neigh_add)(const struct rte_ifpx_neigh_change *event); + /**< Callback for notification about IPv4 neighbour being added. 
+ * @see struct rte_ifpx_neigh_change + * @return non-zero if event handling is finished + */ + int (*neigh_del)(const struct rte_ifpx_neigh_change *event); + /**< Callback for notification about IPv4 neighbour removal. + * @see struct rte_ifpx_neigh_change + * @return non-zero if event handling is finished + */ + int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event); + /**< Callback for notification about IPv6 neighbour being added. + * @see struct rte_ifpx_neigh6_change + */ + int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event); + /**< Callback for notification about IPv6 neighbour removal. + * @see struct rte_ifpx_neigh6_change + * @return non-zero if event handling is finished + */ + int (*cfg_done)(void); + /**< Lib specific callback - called when initial network configuration + * query is finished. + * @return non-zero if event handling is finished + */ +}; + +/** + * This structure is a "tagged union" used to pass the callback for + * registration. + * + * @see union rte_ifpx_cb_ptr + * @see rte_ifpx_events_available() + * @see rte_ifpx_callbacks_register() + */ +struct rte_ifpx_callback { + enum rte_ifpx_event_type type; + union rte_ifpx_cb_ptr callback; +}; + +/** + * Register proxy callbacks. + * + * This function registers callbacks to be called upon appropriate network + * event notification. + * + * @param cbs + * Set of callbacks that will be called. The library does not take any + * ownership of the pointer passed - the callbacks are stored internally. + * + * @return + * 0 on success, negative otherwise. + */ +__rte_experimental +int rte_ifpx_callbacks_register(unsigned int len, + const struct rte_ifpx_callback cbs[]); + +/** + * Unregister proxy callbacks. + * + * This function unregisters all callbacks previously registered with + * rte_ifpx_callbacks_register(). + */ +__rte_experimental +void rte_ifpx_callbacks_unregister_all(void); + +/** + * Unregister proxy callback. 
+ * + * This function unregisters one callback previously registered with + * rte_ifpx_callbacks_register(). + * + * @param ev + * Type of event for which callback should be removed. + * + * @return + * 0 on success, negative otherwise. + */ +__rte_experimental +int rte_ifpx_callbacks_unregister(enum rte_ifpx_event_type ev); + +/** + * Bind the port to its proxy. + * + * After calling this function all network configuration of the proxy (and its + * changes) will be passed to given port by calling registered callbacks with + * 'port_id' as an argument. + * + * Note: both arguments have the same type, so to avoid mixing them up the + * first one (the port id) is kept the same for bind and unbind. + * + * @param port_id + * Id of the port to be bound. + * @param proxy_id + * Id of the proxy the port needs to be bound to. + * @return + * 0 on success, negative on error. + */ +__rte_experimental +int rte_ifpx_port_bind(uint16_t port_id, uint16_t proxy_id); + +/** + * Unbind the port from its proxy. + * + * After calling this function registered callbacks will no longer be called for + * this port (but they might be called for other ports in one to many binding + * scenario). + * + * @param port_id + * Id of the port to unbind. + * @return + * 0 on success, negative on error. + */ +__rte_experimental +int rte_ifpx_port_unbind(uint16_t port_id); + +/** + * Get the system network configuration and start listening to its changes. + * + * @return + * 0 on success, negative otherwise. + */ +__rte_experimental +int rte_ifpx_listen(void); + +/** + * Remove all bindings/callbacks and stop listening to network configuration. + * + * @return + * 0 on success, negative otherwise. + */ +__rte_experimental +int rte_ifpx_close(void); + +/** + * Get the id of the proxy the port is bound to. + * + * @param port_id + * Id of the port for which to get proxy. + * @return + * Port id of the proxy on success, RTE_MAX_ETHPORTS on error. 
+ */ +__rte_experimental +uint16_t rte_ifpx_proxy_get(uint16_t port_id); + +/** + * Test for port acting as a proxy. + * + * @param port_id + * Id of the port. + * @return + * 1 if port acts as a proxy, 0 otherwise. + */ +static inline +int rte_ifpx_is_proxy(uint16_t port_id) +{ + return rte_ifpx_proxy_get(port_id) == port_id; +} + +/** + * Get the ids of the ports bound to the proxy. + * + * @param proxy_id + * Id of the proxy for which to get ports. + * @param ports + * Array where to store the port ids. + * @param num + * Size of the 'ports' array. + * @return + * The number of ports bound to given proxy. Note that bound ports are filled + * in 'ports' array up to its size but the return value is always the total + * number of ports bound - so you can make call first with NULL/0 to query for + * the size of the buffer to create or call it with the buffer you have and + * later check if it was large enough. + */ +__rte_experimental +unsigned int rte_ifpx_port_get(uint16_t proxy_id, + uint16_t *ports, unsigned int num); + +/** + * The structure containing some properties of the proxy interface. + */ +struct rte_ifpx_info { + unsigned int if_index; /* entry valid iff if_index != 0 */ + uint16_t mtu; + struct rte_ether_addr mac; + char if_name[RTE_ETH_NAME_MAX_LEN]; +}; + +/** + * Get the properties of the proxy interface. Argument can be either id of the + * proxy or an id of a port that is bound to it. + * + * @param port_id + * Id of the port (or proxy) for which to get proxy properties. + * @return + * Pointer to the proxy information structure. 
+ */ +__rte_experimental +const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id); + +#ifdef __cplusplus +} +#endif + +#endif /* _RTE_IF_PROXY_H_ */ diff --git a/lib/librte_if_proxy/rte_if_proxy_version.map b/lib/librte_if_proxy/rte_if_proxy_version.map new file mode 100644 index 000000000..6da35d096 --- /dev/null +++ b/lib/librte_if_proxy/rte_if_proxy_version.map @@ -0,0 +1,20 @@ +EXPERIMENTAL { + global: + + rte_ifpx_callbacks_register; + rte_ifpx_callbacks_unregister; + rte_ifpx_callbacks_unregister_all; + rte_ifpx_close; + rte_ifpx_events_available; + rte_ifpx_info_get; + rte_ifpx_listen; + rte_ifpx_port_bind; + rte_ifpx_port_get; + rte_ifpx_port_unbind; + rte_ifpx_proxy_create; + rte_ifpx_proxy_create_by_devarg; + rte_ifpx_proxy_destroy; + rte_ifpx_proxy_get; + + local: *; +}; diff --git a/lib/meson.build b/lib/meson.build index 07a65a625..caa54f7b5 100644 --- a/lib/meson.build +++ b/lib/meson.build @@ -21,7 +21,7 @@ libraries = [ 'acl', 'bbdev', 'bitratestats', 'cfgfile', 'compressdev', 'cryptodev', 'distributor', 'efd', 'eventdev', - 'gro', 'gso', 'ip_frag', 'jobstats', + 'gro', 'gso', 'if_proxy', 'ip_frag', 'jobstats', 'kni', 'latencystats', 'lpm', 'member', 'power', 'pdump', 'rawdev', 'rib', 'reorder', 'sched', 'security', 'stack', 'vhost', -- 2.17.1 ^ permalink raw reply [flat|nested] 64+ messages in thread
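Two small idioms documented in the header above may be easier to see in isolation: testing a single event bit in the mask returned by rte_ifpx_events_available(), and the "call once to learn the size, call again to fill" contract of rte_ifpx_port_get(). The following is only a minimal standalone sketch of those two contracts. Everything prefixed demo_/DEMO_ is a hypothetical stand-in and not part of the patch, so that the code builds without DPDK. Note that the sketch shifts a 64-bit constant; this is an editorial choice to match the uint64_t mask, whereas the header comment writes the test as (1 << ev).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a few values of enum rte_ifpx_event_type. */
enum demo_event_type {
	DEMO_MAC_CHANGE,
	DEMO_MTU_CHANGE,
	DEMO_LINK_CHANGE,
	DEMO_CFG_DONE
};

/* Test one event bit in a mask shaped like the one the header says
 * rte_ifpx_events_available() returns. The constant is 64 bits wide so
 * that the shift is done in the width of the mask.
 */
static inline int
demo_event_available(uint64_t mask, enum demo_event_type ev)
{
	return (mask & (UINT64_C(1) << ev)) != 0;
}

/* Model of the documented rte_ifpx_port_get() contract: copy at most
 * 'num' bound port ids into 'ports', but always return the total number
 * of bound ports. 'bound'/'total' stand in for library internal state.
 */
static unsigned int
demo_port_get(const uint16_t *bound, unsigned int total,
	      uint16_t *ports, unsigned int num)
{
	unsigned int i;

	/* With ports == NULL and num == 0 the loop body never runs. */
	for (i = 0; i < total && i < num; i++)
		ports[i] = bound[i];
	return total;
}
```

A caller would first pass NULL/0 to learn how many ids to expect, allocate a buffer, call again, and compare the second return value with the buffer size in case a port was bound in between.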
* [dpdk-dev] [PATCH v3 2/4] if_proxy: add library documentation 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 " Andrzej Ostruszka 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 1/4] lib: introduce IF Proxy library Andrzej Ostruszka @ 2020-05-04 8:53 ` Andrzej Ostruszka 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 3/4] if_proxy: add simple functionality test Andrzej Ostruszka 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 4/4] if_proxy: add example application Andrzej Ostruszka 3 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-05-04 8:53 UTC (permalink / raw) To: dev, Thomas Monjalon, John McNamara, Marko Kovacevic This commit adds documentation of IF Proxy library. Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 1 + doc/guides/prog_guide/if_proxy_lib.rst | 142 +++++++++++++++++++++++++ doc/guides/prog_guide/index.rst | 1 + 3 files changed, 144 insertions(+) create mode 100644 doc/guides/prog_guide/if_proxy_lib.rst diff --git a/MAINTAINERS b/MAINTAINERS index 1013745ce..1216366ab 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1475,6 +1475,7 @@ F: doc/guides/prog_guide/bpf_lib.rst IF Proxy - EXPERIMENTAL M: Andrzej Ostruszka <aostruszka@marvell.com> F: lib/librte_if_proxy/ +F: doc/guides/prog_guide/if_proxy_lib.rst Test Applications ----------------- diff --git a/doc/guides/prog_guide/if_proxy_lib.rst b/doc/guides/prog_guide/if_proxy_lib.rst new file mode 100644 index 000000000..4ec7e65a5 --- /dev/null +++ b/doc/guides/prog_guide/if_proxy_lib.rst @@ -0,0 +1,142 @@ +.. SPDX-License-Identifier: BSD-3-Clause + Copyright(C) 2020 Marvell International Ltd. + +.. _IF_Proxy_Library: + +IF Proxy Library +================ + +When a network interface is assigned to DPDK it usually disappears from +the system and the user loses the ability to configure it via typical +configuration tools. 
+There are basically two options to deal with this situation: + +- configure it via command line arguments and/or load configuration + from some file, +- add support for live configuration via some IPC mechanism. + +The first option is static and the second one requires some work to add +a communication loop (e.g. a separate thread listening/communicating on +a socket). + +This library makes it possible to configure DPDK ports with the normal +configuration utilities (e.g. from the iproute2 suite). +It requires the user to configure additional DPDK ports that are visible to +the system (such as Tap or KNI - actually any port that has a valid +`if_index` in ``struct rte_eth_dev_info`` will do) and designate them as +port representors (proxies) in the system. + +Let's look at the typical intended usage with an example. +Suppose that you have an application that handles traffic on two ports (the +whitelisted ones below):: + + ./app -w 00:14.0 -w 00:16.0 --vdev=net_tap0 --vdev=net_tap1 + +So in addition to the "regular" ports you need to configure proxy ports. +These proxy ports can be created via the command line (as above) or from +within the application (e.g. by using the `rte_ifpx_proxy_create()` +function). + +Once you have proxy ports you need to bind the "regular" ports to them:: + + rte_ifpx_port_bind(port0, proxy0); + rte_ifpx_port_bind(port1, proxy1); + +This binding is a logical one - there is no automatic packet forwarding +configured. +This is because the library cannot tell upfront what portion of the traffic +received on ports 0/1 should be redirected to the system via proxies, and +it does not know how the application is structured (what packet +processing engines it uses). +Therefore it is the application writer's responsibility to include proxy ports +in the packet processing and forward appropriate packets between +proxies and ports. +What the library actually does is get the network configuration +from the system and listen to its changes. 
+This information is then matched against `if_index` of the configured +proxies and passed to the application. + +There are two mechanisms via which library passes notifications to the +application. +First is the set of global callbacks that user has +to register via:: + + rte_ifpx_callbacks_register(len, cbs); + +Here `cbs` is an array of ``struct rte_ifpx_callback`` which is a tagged +union with following members:: + + int (*mac_change)(const struct rte_ifpx_mac_change *event); + int (*mtu_change)(const struct rte_ifpx_mtu_change *event); + int (*link_change)(const struct rte_ifpx_link_change *event); + int (*addr_add)(const struct rte_ifpx_addr_change *event); + int (*addr_del)(const struct rte_ifpx_addr_change *event); + int (*addr6_add)(const struct rte_ifpx_addr6_change *event); + int (*addr6_del)(const struct rte_ifpx_addr6_change *event); + int (*route_add)(const struct rte_ifpx_route_change *event); + int (*route_del)(const struct rte_ifpx_route_change *event); + int (*route6_add)(const struct rte_ifpx_route6_change *event); + int (*route6_del)(const struct rte_ifpx_route6_change *event); + int (*neigh_add)(const struct rte_ifpx_neigh_change *event); + int (*neigh_del)(const struct rte_ifpx_neigh_change *event); + int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event); + int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event); + int (*cfg_done)(void); + +All of them should be self explanatory apart from the last one which is +library specific callback - called when initial network configuration +query is finished. + +So for example when the user issues command:: + + ip link set dev dtap0 mtu 1600 + +then library will call `mtu_change()` callback with MTU change event +having port_id equal to `port0` (id of the port bound to this proxy) and +`mtu` equal to 1600 (``dtap0`` is the default interface name for +``net_tap0``). +Application can simply use `rte_eth_dev_set_mtu()` in this callback. 
+In the same way `rte_eth_dev_default_mac_addr_set()` can be used in +`mac_change()` and `rte_eth_dev_set_link_up/down()` inside the +`link_change()` callback, which dispatches based on the `is_up` member of +its `event` argument. + +Please note however that the context in which these callbacks are called +is most probably different from the one in which packets are handled and +it is the application writer's responsibility to use proper synchronization +mechanisms - if they are needed. + +The second notification mechanism relies on queueing of event notifications +to the configured notification rings. +The application can add a queue via:: + + int rte_ifpx_queue_add(struct rte_ring *r); + +This type of notification is used when there is no callback registered +for the given type of event or when one is registered but it returns 0. +This gives the application the following choices: + +- if the data structure that needs to be updated due to a notification + is safe to be modified by a single writer (while being used by other + readers) then it can simply do that inside the callback and return + a non-zero value to signal the end of the event handling + +- otherwise, when there are some common preparation steps that need + to be done only once, the application can register a callback that will + perform these steps and return 0 - the library will then add an event to + each registered notification queue + +- if the data structures are replicated and there are no common steps + then the application can simply skip registering the callbacks and + configure notification queues (e.g. one per lcore) + +Once we have bindings in place and notifications configured, the only +essential part that remains is to get the current network configuration +and start listening to its changes. 
+This is accomplished via a call to:: + + int rte_ifpx_listen(void); + +From that moment you should see notifications coming to your +application: the first ones result from the query of the current system +configuration and the subsequent ones from configuration changes. diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst index 1d0cd49cd..349829bcd 100644 --- a/doc/guides/prog_guide/index.rst +++ b/doc/guides/prog_guide/index.rst @@ -58,6 +58,7 @@ Programmer's Guide metrics_lib bpf_lib ipsec_lib + if_proxy_lib source_org dev_kit_build_system dev_kit_root_make_help -- 2.17.1 ^ permalink raw reply [flat|nested] 64+ messages in thread
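The guide in this patch states the delivery rule in prose: an event goes to the notification queues when no callback is registered for its type, or when the registered callback returns 0. Below is a minimal standalone model of that rule, offered as a sketch only. All demo_/DEMO_ names are hypothetical and do not come from the library; the real rings are replaced by per-queue counters so that the code builds without DPDK, and the model handles a single callback rather than one per event type.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_QUEUES 4

/* Hypothetical, simplified event: just enough fields for the example. */
struct demo_event {
	int type;
	unsigned int port_id;
	unsigned int mtu;
};

typedef int (*demo_cb)(const struct demo_event *ev);

struct demo_dispatcher {
	demo_cb cb;                           /* NULL: no callback registered */
	unsigned int nb_queues;               /* number of configured queues */
	unsigned int queued[DEMO_MAX_QUEUES]; /* events delivered per queue */
};

/* Apply the documented rule. Returns the number of queues the event was
 * added to, which is 0 when the callback reported the event as handled.
 */
static unsigned int
demo_dispatch(struct demo_dispatcher *d, const struct demo_event *ev)
{
	unsigned int i;

	assert(d->nb_queues <= DEMO_MAX_QUEUES);
	/* A non-zero return from the callback ends the handling. */
	if (d->cb != NULL && d->cb(ev) != 0)
		return 0;
	/* No callback, or it returned 0: deliver to every queue. */
	for (i = 0; i < d->nb_queues; i++)
		d->queued[i]++;
	return d->nb_queues;
}

/* Callback that fully handles the event (the single-writer case). */
static int
demo_consume(const struct demo_event *ev)
{
	(void)ev;
	return 1;
}

/* Callback that only does common preparation and lets queues follow. */
static int
demo_prepare(const struct demo_event *ev)
{
	(void)ev;
	return 0;
}
```

The three choices listed in the guide map to demo_consume (update in the callback, return non-zero), demo_prepare (common steps, return 0) and a NULL callback (queues only).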
* [dpdk-dev] [PATCH v3 3/4] if_proxy: add simple functionality test 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 " Andrzej Ostruszka 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 1/4] lib: introduce IF Proxy library Andrzej Ostruszka 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 2/4] if_proxy: add library documentation Andrzej Ostruszka @ 2020-05-04 8:53 ` Andrzej Ostruszka 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 4/4] if_proxy: add example application Andrzej Ostruszka 3 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-05-04 8:53 UTC (permalink / raw) To: dev, Thomas Monjalon This commit adds simple test of the library notifications. Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 1 + app/test/Makefile | 5 + app/test/meson.build | 4 + app/test/test_if_proxy.c | 707 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 717 insertions(+) create mode 100644 app/test/test_if_proxy.c diff --git a/MAINTAINERS b/MAINTAINERS index 1216366ab..d42cfb566 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1475,6 +1475,7 @@ F: doc/guides/prog_guide/bpf_lib.rst IF Proxy - EXPERIMENTAL M: Andrzej Ostruszka <aostruszka@marvell.com> F: lib/librte_if_proxy/ +F: app/test/test_if_proxy.c F: doc/guides/prog_guide/if_proxy_lib.rst Test Applications diff --git a/app/test/Makefile b/app/test/Makefile index 4582eca6c..a13595042 100644 --- a/app/test/Makefile +++ b/app/test/Makefile @@ -240,6 +240,11 @@ SRCS-$(CONFIG_RTE_LIBRTE_RCU) += test_rcu_qsbr.c test_rcu_qsbr_perf.c SRCS-$(CONFIG_RTE_LIBRTE_SECURITY) += test_security.c +ifeq ($(CONFIG_RTE_LIBRTE_IF_PROXY),y) +SRCS-y += test_if_proxy.c +LDLIBS += -lrte_if_proxy +endif + SRCS-$(CONFIG_RTE_LIBRTE_IPSEC) += test_ipsec.c SRCS-$(CONFIG_RTE_LIBRTE_IPSEC) += test_ipsec_sad.c ifeq ($(CONFIG_RTE_LIBRTE_IPSEC),y) diff --git a/app/test/meson.build b/app/test/meson.build index fc60acbe7..678f7ef62 100644 --- a/app/test/meson.build +++ b/app/test/meson.build @@ -369,6 +369,10 @@ endif if 
dpdk_conf.has('RTE_LIBRTE_PDUMP') test_deps += 'pdump' endif +if dpdk_conf.has('RTE_LIBRTE_IF_PROXY') + test_deps += 'if_proxy' + test_sources += 'test_if_proxy.c' +endif if cc.has_argument('-Wno-format-truncation') cflags += '-Wno-format-truncation' diff --git a/app/test/test_if_proxy.c b/app/test/test_if_proxy.c new file mode 100644 index 000000000..4eca049c9 --- /dev/null +++ b/app/test/test_if_proxy.c @@ -0,0 +1,707 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ + +#include "test.h" + +#include <rte_ethdev.h> +#include <rte_if_proxy.h> +#include <rte_cycles.h> + +#include <string.h> +#include <unistd.h> +#include <signal.h> +#include <net/if.h> +#include <arpa/inet.h> +#include <pthread.h> +#include <time.h> + +/* There are two types of event notifications - one using callbacks and one + * using event queues (rings). We'll test them both and this "bool" will govern + * the type of API to use. + */ +static int use_callbacks = 1; +static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; +static pthread_cond_t cond = PTHREAD_COND_INITIALIZER; + +static struct rte_ring *ev_queue; + +enum net_event_mask { + INITIALIZED = 1U << RTE_IFPX_CFG_DONE, + LINK_CHANGED = 1U << RTE_IFPX_LINK_CHANGE, + MAC_CHANGED = 1U << RTE_IFPX_MAC_CHANGE, + MTU_CHANGED = 1U << RTE_IFPX_MTU_CHANGE, + ADDR_ADD = 1U << RTE_IFPX_ADDR_ADD, + ADDR_DEL = 1U << RTE_IFPX_ADDR_DEL, + ROUTE_ADD = 1U << RTE_IFPX_ROUTE_ADD, + ROUTE_DEL = 1U << RTE_IFPX_ROUTE_DEL, + ADDR6_ADD = 1U << RTE_IFPX_ADDR6_ADD, + ADDR6_DEL = 1U << RTE_IFPX_ADDR6_DEL, + ROUTE6_ADD = 1U << RTE_IFPX_ROUTE6_ADD, + ROUTE6_DEL = 1U << RTE_IFPX_ROUTE6_DEL, + NEIGH_ADD = 1U << RTE_IFPX_NEIGH_ADD, + NEIGH_DEL = 1U << RTE_IFPX_NEIGH_DEL, + NEIGH6_ADD = 1U << RTE_IFPX_NEIGH6_ADD, + NEIGH6_DEL = 1U << RTE_IFPX_NEIGH6_DEL, +}; + +static unsigned int state; + +static struct { + struct rte_ether_addr mac_addr; + uint16_t port_id, mtu; + struct in_addr ipv4, route4; + struct in6_addr ipv6, 
route6; + uint16_t depth4, depth6; + int is_up; +} net_cfg; + +static +int unlock_notify(unsigned int op) +{ + /* the mutex is expected to be locked on entry */ + RTE_VERIFY(pthread_mutex_trylock(&mutex) == EBUSY); + state |= op; + + pthread_mutex_unlock(&mutex); + return pthread_cond_signal(&cond); +} + +static +void handle_event(struct rte_ifpx_event *ev); + +static +int wait_for(unsigned int op_mask, unsigned int sec) +{ + int ec; + + if (use_callbacks) { + struct timespec time; + + ec = pthread_mutex_trylock(&mutex); + /* the mutex is expected to be locked on entry */ + RTE_VERIFY(ec == EBUSY); + + ec = 0; + clock_gettime(CLOCK_REALTIME, &time); + time.tv_sec += sec; + + while ((state & op_mask) != op_mask && ec == 0) + ec = pthread_cond_timedwait(&cond, &mutex, &time); + } else { + uint64_t deadline; + struct rte_ifpx_event *ev; + + ec = 0; + deadline = rte_get_timer_cycles() + sec * rte_get_timer_hz(); + + while ((state & op_mask) != op_mask) { + if (rte_get_timer_cycles() >= deadline) { + ec = ETIMEDOUT; + break; + } + if (rte_ring_dequeue(ev_queue, (void **)&ev) == 0) + handle_event(ev); + } + } + + return ec; +} + +static +int expect(unsigned int op_mask, const char *fmt, ...) +#if __GNUC__ + __attribute__((format(printf, 2, 3))); +#endif + +static +int expect(unsigned int op_mask, const char *fmt, ...) +{ + char cmd[128]; + va_list args; + int ret; + + state &= ~op_mask; + va_start(args, fmt); + vsnprintf(cmd, sizeof(cmd), fmt, args); + va_end(args); + ret = system(cmd); + if (ret == 0) + /* IPv6 address notifications seem to need that long delay. 
*/ + return wait_for(op_mask, 2); + return ret; +} + +static +int mac_change(const struct rte_ifpx_mac_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(MAC_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int mtu_change(const struct rte_ifpx_mtu_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->mtu == net_cfg.mtu) { + unlock_notify(MTU_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int link_change(const struct rte_ifpx_link_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->is_up == net_cfg.is_up) { + /* Special case for testing of callbacks modification from + * inside of callback: we catch putting link down (the last + * operation in test) and remove callbacks registered. 
+ */ + if (!ev->is_up) + rte_ifpx_callbacks_unregister_all(); + unlock_notify(LINK_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr_add(const struct rte_ifpx_addr_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->ip == net_cfg.ipv4.s_addr) { + unlock_notify(ADDR_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr_del(const struct rte_ifpx_addr_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->ip == net_cfg.ipv4.s_addr) { + unlock_notify(ADDR_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr6_add(const struct rte_ifpx_addr6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(ADDR6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr6_del(const struct rte_ifpx_addr6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(ADDR6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route_add(const struct rte_ifpx_route_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth4 == ev->depth && net_cfg.route4.s_addr == ev->ip) { + unlock_notify(ROUTE_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route_del(const struct rte_ifpx_route_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth4 == ev->depth && net_cfg.route4.s_addr == ev->ip) { + unlock_notify(ROUTE_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route6_add(const struct rte_ifpx_route6_change *ev) +{ + pthread_mutex_lock(&mutex); + 
RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth6 == ev->depth && + /* don't check for trailing zeros */ + memcmp(ev->ip, net_cfg.route6.s6_addr, ev->depth/8) == 0) { + unlock_notify(ROUTE6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route6_del(const struct rte_ifpx_route6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth6 == ev->depth && + /* don't check for trailing zeros */ + memcmp(ev->ip, net_cfg.route6.s6_addr, ev->depth/8) == 0) { + unlock_notify(ROUTE6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh_add(const struct rte_ifpx_neigh_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.ipv4.s_addr == ev->ip && + memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(NEIGH_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh_del(const struct rte_ifpx_neigh_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.ipv4.s_addr == ev->ip) { + unlock_notify(NEIGH_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh6_add(const struct rte_ifpx_neigh6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0 && + memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(NEIGH6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh6_del(const struct rte_ifpx_neigh6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(NEIGH6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int cfg_done(void) +{ + 
pthread_mutex_lock(&mutex); + unlock_notify(INITIALIZED); + return 1; +} + +static +void handle_event(struct rte_ifpx_event *ev) +{ + if (ev->type != RTE_IFPX_CFG_DONE) + RTE_VERIFY(ev->data.port_id == net_cfg.port_id); + + /* If params do not match what we expect just free the event. */ + switch (ev->type) { + case RTE_IFPX_MAC_CHANGE: + if (memcmp(ev->mac_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_MTU_CHANGE: + if (ev->mtu_change.mtu != net_cfg.mtu) + goto exit; + break; + case RTE_IFPX_LINK_CHANGE: + if (ev->link_change.is_up != net_cfg.is_up) + goto exit; + break; + case RTE_IFPX_ADDR_ADD: + if (ev->addr_change.ip != net_cfg.ipv4.s_addr) + goto exit; + break; + case RTE_IFPX_ADDR_DEL: + if (ev->addr_change.ip != net_cfg.ipv4.s_addr) + goto exit; + break; + case RTE_IFPX_ADDR6_ADD: + if (memcmp(ev->addr6_change.ip, net_cfg.ipv6.s6_addr, + 16) != 0) + goto exit; + break; + case RTE_IFPX_ADDR6_DEL: + if (memcmp(ev->addr6_change.ip, net_cfg.ipv6.s6_addr, + 16) != 0) + goto exit; + break; + case RTE_IFPX_ROUTE_ADD: + if (net_cfg.depth4 != ev->route_change.depth || + net_cfg.route4.s_addr != ev->route_change.ip) + goto exit; + break; + case RTE_IFPX_ROUTE_DEL: + if (net_cfg.depth4 != ev->route_change.depth || + net_cfg.route4.s_addr != ev->route_change.ip) + goto exit; + break; + case RTE_IFPX_ROUTE6_ADD: + if (net_cfg.depth6 != ev->route6_change.depth || + /* don't check for trailing zeros */ + memcmp(ev->route6_change.ip, net_cfg.route6.s6_addr, + ev->route6_change.depth/8) != 0) + goto exit; + break; + case RTE_IFPX_ROUTE6_DEL: + if (net_cfg.depth6 != ev->route6_change.depth || + /* don't check for trailing zeros */ + memcmp(ev->route6_change.ip, net_cfg.route6.s6_addr, + ev->route6_change.depth/8) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH_ADD: + if (net_cfg.ipv4.s_addr != ev->neigh_change.ip || + memcmp(ev->neigh_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, + 
RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH_DEL: + if (net_cfg.ipv4.s_addr != ev->neigh_change.ip) + goto exit; + break; + case RTE_IFPX_NEIGH6_ADD: + if (memcmp(ev->neigh6_change.ip, + net_cfg.ipv6.s6_addr, 16) != 0 || + memcmp(ev->neigh6_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH6_DEL: + if (memcmp(ev->neigh6_change.ip, net_cfg.ipv6.s6_addr, 16) != 0) + goto exit; + break; + case RTE_IFPX_CFG_DONE: + break; + default: + RTE_VERIFY(0 && "Unhandled event type"); + } + + state |= 1U << ev->type; +exit: + free(ev); +} + +static +struct rte_ifpx_callback cbs[] = { + { RTE_IFPX_MAC_CHANGE, {.mac_change = mac_change} }, + { RTE_IFPX_MTU_CHANGE, {.mtu_change = mtu_change} }, + { RTE_IFPX_LINK_CHANGE, {.link_change = link_change} }, + { RTE_IFPX_ADDR_ADD, {.addr_add = addr_add} }, + { RTE_IFPX_ADDR_DEL, {.addr_del = addr_del} }, + { RTE_IFPX_ADDR6_ADD, {.addr6_add = addr6_add} }, + { RTE_IFPX_ADDR6_DEL, {.addr6_del = addr6_del} }, + { RTE_IFPX_ROUTE_ADD, {.route_add = route_add} }, + { RTE_IFPX_ROUTE_DEL, {.route_del = route_del} }, + { RTE_IFPX_ROUTE6_ADD, {.route6_add = route6_add} }, + { RTE_IFPX_ROUTE6_DEL, {.route6_del = route6_del} }, + { RTE_IFPX_NEIGH_ADD, {.neigh_add = neigh_add} }, + { RTE_IFPX_NEIGH_DEL, {.neigh_del = neigh_del} }, + { RTE_IFPX_NEIGH6_ADD, {.neigh6_add = neigh6_add} }, + { RTE_IFPX_NEIGH6_DEL, {.neigh6_del = neigh6_del} }, + /* lib specific callback */ + { RTE_IFPX_CFG_DONE, {.cfg_done = cfg_done} }, +}; + +static +int test_notifications(const struct rte_ifpx_info *pinfo) +{ + char mac_buf[RTE_ETHER_ADDR_FMT_SIZE]; + int ec; + + /* Test link up notification. */ + net_cfg.is_up = 1; + ec = expect(LINK_CHANGED, "ip link set dev %s up", pinfo->if_name); + if (ec != 0) { + printf("Failed to notify about link going up\n"); + return ec; + } + + /* Test for MAC changes notification. 
*/ + rte_eth_random_addr(net_cfg.mac_addr.addr_bytes); + rte_ether_format_addr(mac_buf, sizeof(mac_buf), &net_cfg.mac_addr); + ec = expect(MAC_CHANGED, "ip link set dev %s address %s", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notification about mac change\n"); + return ec; + } + + /* Test for MTU changes notification. */ + net_cfg.mtu = pinfo->mtu + 100; + ec = expect(MTU_CHANGED, "ip link set dev %s mtu %d", + pinfo->if_name, net_cfg.mtu); + if (ec != 0) { + printf("Missing/wrong notification about mtu change\n"); + return ec; + } + + /* Test for adding of IPv4 address - using address from TEST-2 pool. + * This test is specific to linux netlink behaviour - after adding + * address we get both notification about address being added and new + * route. So I check both. + */ + net_cfg.ipv4.s_addr = RTE_IPV4(198, 51, 100, 14); + net_cfg.route4.s_addr = net_cfg.ipv4.s_addr; + net_cfg.depth4 = 32; + ec = expect(ADDR_ADD | ROUTE_ADD, "ip addr add 198.51.100.14 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 address add\n"); + return ec; + } + + /* Test for IPv4 address removal. See comment above for 'addr add'. */ + ec = expect(ADDR_DEL | ROUTE_DEL, "ip addr del 198.51.100.14/32 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 address del\n"); + return ec; + } + + /* Test for adding IPv4 route. */ + net_cfg.route4.s_addr = RTE_IPV4(198, 51, 100, 0); + net_cfg.depth4 = 24; + ec = expect(ROUTE_ADD, "ip route add 198.51.100.0/24 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 route add\n"); + return ec; + } + + /* Test for IPv4 route removal. */ + ec = expect(ROUTE_DEL, "ip route del 198.51.100.0/24 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 route del\n"); + return ec; + } + + /* Test for neighbour addresses notifications. 
*/ + rte_eth_random_addr(net_cfg.mac_addr.addr_bytes); + rte_ether_format_addr(mac_buf, sizeof(mac_buf), &net_cfg.mac_addr); + + ec = expect(NEIGH_ADD, + "ip neigh add 198.51.100.14 dev %s lladdr %s nud noarp", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 neighbour add\n"); + return ec; + } + + ec = expect(NEIGH_DEL, "ip neigh del 198.51.100.14 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 neighbour del\n"); + return ec; + } + + /* Now the same for IPv6 - with address from "documentation pool". */ + inet_pton(AF_INET6, "2001:db8::dead:beef", net_cfg.ipv6.s6_addr); + /* This is specific to linux netlink behaviour - after adding address + * we get both notification about address being added and new route. + * So I wait for both. + */ + memcpy(net_cfg.route6.s6_addr, net_cfg.ipv6.s6_addr, 16); + net_cfg.depth6 = 128; + ec = expect(ADDR6_ADD | ROUTE6_ADD, + "ip addr add 2001:db8::dead:beef dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 address add\n"); + return ec; + } + + /* See comment above for 'addr6 add'. 
*/ + ec = expect(ADDR6_DEL | ROUTE6_DEL, + "ip addr del 2001:db8::dead:beef/128 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 address del\n"); + return ec; + } + + net_cfg.depth6 = 96; + ec = expect(ROUTE6_ADD, "ip route add 2001:db8::dead:0/96 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 route add\n"); + return ec; + } + + ec = expect(ROUTE6_DEL, "ip route del 2001:db8::dead:0/96 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 route del\n"); + return ec; + } + + ec = expect(NEIGH6_ADD, + "ip neigh add 2001:db8::dead:beef dev %s lladdr %s nud noarp", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 neighbour add\n"); + return ec; + } + + ec = expect(NEIGH6_DEL, "ip neigh del 2001:db8::dead:beef dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 neighbour del\n"); + return ec; + } + + /* Finally put link down and test for notification. */ + net_cfg.is_up = 0; + ec = expect(LINK_CHANGED, "ip link set dev %s down", pinfo->if_name); + if (ec != 0) { + printf("Failed to notify about link going down\n"); + return ec; + } + + return 0; +} + +static +int test_if_proxy(void) +{ + int ec; + const struct rte_ifpx_info *pinfo; + uint16_t proxy_id; + + state = 0; + memset(&net_cfg, 0, sizeof(net_cfg)); + + if (rte_eth_dev_count_avail() == 0) { + printf("Run this test with at least one port configured\n"); + return 1; + } + /* Use the first port available. */ + RTE_ETH_FOREACH_DEV(net_cfg.port_id) + break; + proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT); + RTE_VERIFY(proxy_id != RTE_MAX_ETHPORTS); + rte_ifpx_port_bind(net_cfg.port_id, proxy_id); + rte_ifpx_callbacks_register(RTE_DIM(cbs), cbs); + rte_ifpx_listen(); + + /* Let's start with callback based API. 
*/ + use_callbacks = 1; + pthread_mutex_lock(&mutex); + ec = wait_for(INITIALIZED, 2); + if (ec != 0) { + printf("Failed to obtain network configuration\n"); + goto exit; + } + pinfo = rte_ifpx_info_get(net_cfg.port_id); + RTE_VERIFY(pinfo); + + /* Make sure the link is down. */ + net_cfg.is_up = 0; + ec = expect(LINK_CHANGED, "ip link set dev %s down", pinfo->if_name); + RTE_VERIFY(ec == ETIMEDOUT || ec == 0); + + ec = test_notifications(pinfo); + if (ec != 0) { + printf("Failed test with callback based API\n"); + goto exit; + } + /* Switch to event queue based API and repeat tests. */ + use_callbacks = 0; + ev_queue = rte_ring_create("IFPX-events", 16, SOCKET_ID_ANY, + RING_F_SP_ENQ | RING_F_SC_DEQ); + ec = rte_ifpx_queue_add(ev_queue); + if (ec != 0) { + printf("Failed to add a notification queue\n"); + goto exit; + } + ec = test_notifications(pinfo); + if (ec != 0) { + printf("Failed test with event queue based API\n"); + goto exit; + } + +exit: + pthread_mutex_unlock(&mutex); + /* Proxy ports are not owned by the lib. Internal references to them + * are cleared on close, but the ports are not destroyed so we need to + * do that explicitly. + */ + rte_ifpx_proxy_destroy(proxy_id); + rte_ifpx_close(); + /* Queue is removed from the lib by rte_ifpx_close() - here we just + * free it. + */ + rte_ring_free(ev_queue); + ev_queue = NULL; + + return ec; +} + +REGISTER_TEST_COMMAND(if_proxy_autotest, test_if_proxy) -- 2.17.1 ^ permalink raw reply [flat|nested] 64+ messages in thread
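A note on the route6_add()/route6_del() checks in the test above: they compare only the first depth/8 bytes of the prefix, which is enough for the /96 and /128 cases exercised there. For a prefix length that is not a multiple of 8 the remaining bits would have to be masked as well. A minimal sketch of such a comparison follows; prefix6_equal() is a hypothetical helper written for illustration, it is not part of the patch or of DPDK.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the patch): return 1 when the first
 * 'depth' bits of the two 16-byte IPv6 addresses are equal, 0 otherwise.
 */
int prefix6_equal(const uint8_t a[16], const uint8_t b[16],
		  unsigned int depth)
{
	unsigned int bytes = depth / 8; /* whole bytes covered by prefix */
	unsigned int bits = depth % 8;  /* leftover bits in the next byte */
	uint8_t mask;

	if (depth > 128)
		return 0;
	if (memcmp(a, b, bytes) != 0)
		return 0;
	if (bits == 0)
		return 1;
	/* Keep only the top 'bits' bits of the partially covered byte. */
	mask = (uint8_t)(0xff << (8 - bits));
	return ((a[bytes] ^ b[bytes]) & mask) == 0;
}
```

With depth equal to 128 the function never reads past the end of the arrays, because the leftover bit count is zero and it returns right after the memcmp().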
* [dpdk-dev] [PATCH v3 4/4] if_proxy: add example application 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 " Andrzej Ostruszka ` (2 preceding siblings ...) 2020-05-04 8:53 ` [dpdk-dev] [PATCH v3 3/4] if_proxy: add simple functionality test Andrzej Ostruszka @ 2020-05-04 8:53 ` Andrzej Ostruszka 3 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-05-04 8:53 UTC (permalink / raw) To: dev, Thomas Monjalon

Add an example application showing possible library usage. This is a simplified version of l3fwd where:
- many performance improvements have been removed in order to simplify the logic and put the focus on the proxy library usage,
- the configuration of forwarding has to be done by the user (using typical system tools on proxy ports) - these changes are passed to the application via library notifications.

It is meant to show how you can update some data from callbacks (routing - see note below) and how data that is replicated (e.g. kept per lcore) can be updated via event queueing (here neighbouring info).

Note: This example assumes that LPM tables can be updated by a single writer while being used by others. To the best of the author's knowledge this is the case (by preliminary code inspection) but DPDK does not make such a promise. Obviously, upon the change, there will be a transient period (when some IPs will still be directed to the old destination) but that is expected.

Note also that in some cases you might need to tweak your system configuration to see effects. For example, you send a Gratuitous ARP to a DPDK port and expect the neighbour tables in the application to be updated, but that does not happen. The packet is passed to the kernel, but the kernel might drop it; please check /proc/sys/net/ipv4/conf/dtap0/arp_accept and related configuration options ('dtap0' here is just the name of your proxy port).
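To illustrate the "replicated data updated via event queueing" idea described above, here is a rough, self-contained sketch. It is not the code of the example application: the real one uses rte_ring queues and rte_hash tables, while every name below (neigh_event, ev_queue, worker, notify_all, worker_drain, worker_lookup) is hypothetical, and the toy queue is exercised from a single thread only, with no memory barriers.

```c
/* Sketch only: all names are hypothetical, nothing here is DPDK API. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_WORKERS 2
#define QUEUE_LEN 16 /* power of two, like the rings in the example */
#define MAX_NEIGH 8

struct neigh_event {
	uint32_t ip;    /* IPv4 address, host byte order */
	uint8_t mac[6];
	int is_add;     /* 1 = add/update, 0 = delete */
};

/* Toy single-producer/single-consumer queue, one per worker. */
struct ev_queue {
	struct neigh_event slots[QUEUE_LEN];
	unsigned int head; /* next slot to write (producer side) */
	unsigned int tail; /* next slot to read (consumer side) */
};

struct worker {
	struct ev_queue q;
	struct neigh_event table[MAX_NEIGH]; /* private neighbour copy */
	unsigned int n_neigh;
};

static struct worker workers[N_WORKERS];

/* Producer side: replicate one notification to every worker queue.
 * Returns the number of queues that accepted the event.
 */
unsigned int notify_all(const struct neigh_event *ev)
{
	unsigned int i, ok = 0;

	for (i = 0; i < N_WORKERS; ++i) {
		struct ev_queue *q = &workers[i].q;

		if (q->head - q->tail == QUEUE_LEN)
			continue; /* queue full, event dropped */
		q->slots[q->head % QUEUE_LEN] = *ev;
		q->head++;
		++ok;
	}
	return ok;
}

/* Worker side: drain own queue and apply events to the private table.
 * Returns the number of events handled.
 */
unsigned int worker_drain(unsigned int id)
{
	struct worker *w;
	unsigned int i, handled = 0;

	assert(id < N_WORKERS);
	w = &workers[id];
	while (w->q.tail != w->q.head) {
		const struct neigh_event *ev =
			&w->q.slots[w->q.tail % QUEUE_LEN];

		for (i = 0; i < w->n_neigh; ++i)
			if (w->table[i].ip == ev->ip)
				break;
		if (ev->is_add) {
			if (i < MAX_NEIGH) { /* update or append */
				w->table[i] = *ev;
				if (i == w->n_neigh)
					w->n_neigh++;
			}
		} else if (i < w->n_neigh) {
			/* delete: move the last entry into the hole */
			w->table[i] = w->table[--w->n_neigh];
		}
		w->q.tail++;
		handled++;
	}
	return handled;
}

/* Look up a MAC in the worker's private table, NULL when absent. */
const uint8_t *worker_lookup(unsigned int id, uint32_t ip)
{
	unsigned int i;

	assert(id < N_WORKERS);
	for (i = 0; i < workers[id].n_neigh; ++i)
		if (workers[id].table[i].ip == ip)
			return workers[id].table[i].mac;
	return NULL;
}
```

The point of the design is that each worker touches only its own table, so no locking is needed on the lookup path; the cost is that a worker sees a change only after it next drains its queue.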
Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> Depends-on: series-8862 --- MAINTAINERS | 1 + examples/Makefile | 1 + examples/l3fwd-ifpx/Makefile | 60 ++ examples/l3fwd-ifpx/l3fwd.c | 1128 +++++++++++++++++++++++++++++++ examples/l3fwd-ifpx/l3fwd.h | 98 +++ examples/l3fwd-ifpx/main.c | 740 ++++++++++++++++++++ examples/l3fwd-ifpx/meson.build | 11 + examples/meson.build | 2 +- 8 files changed, 2040 insertions(+), 1 deletion(-) create mode 100644 examples/l3fwd-ifpx/Makefile create mode 100644 examples/l3fwd-ifpx/l3fwd.c create mode 100644 examples/l3fwd-ifpx/l3fwd.h create mode 100644 examples/l3fwd-ifpx/main.c create mode 100644 examples/l3fwd-ifpx/meson.build diff --git a/MAINTAINERS b/MAINTAINERS index d42cfb566..96f1b4075 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1475,6 +1475,7 @@ F: doc/guides/prog_guide/bpf_lib.rst IF Proxy - EXPERIMENTAL M: Andrzej Ostruszka <aostruszka@marvell.com> F: lib/librte_if_proxy/ +F: examples/l3fwd-ifpx/ F: app/test/test_if_proxy.c F: doc/guides/prog_guide/if_proxy_lib.rst diff --git a/examples/Makefile b/examples/Makefile index feff79784..a8cb02a6c 100644 --- a/examples/Makefile +++ b/examples/Makefile @@ -81,6 +81,7 @@ else $(info vm_power_manager requires libvirt >= 0.9.3) endif endif +DIRS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += l3fwd-ifpx DIRS-y += eventdev_pipeline diff --git a/examples/l3fwd-ifpx/Makefile b/examples/l3fwd-ifpx/Makefile new file mode 100644 index 000000000..68eefeb75 --- /dev/null +++ b/examples/l3fwd-ifpx/Makefile @@ -0,0 +1,60 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(c) 2020 Marvell International Ltd. 
+ +# binary name +APP = l3fwd + +# all source are stored in SRCS-y +SRCS-y := main.c l3fwd.c + +# Build using pkg-config variables if possible +ifeq ($(shell pkg-config --exists libdpdk && echo 0),0) + +all: shared +.PHONY: shared static +shared: build/$(APP)-shared + ln -sf $(APP)-shared build/$(APP) +static: build/$(APP)-static + ln -sf $(APP)-static build/$(APP) + +PKGCONF ?= pkg-config + +PC_FILE := $(shell $(PKGCONF) --path libdpdk 2>/dev/null) +CFLAGS += -DALLOW_EXPERIMENTAL_API -O3 $(shell $(PKGCONF) --cflags libdpdk) +LDFLAGS_SHARED = $(shell $(PKGCONF) --libs libdpdk) +LDFLAGS_STATIC = -Wl,-Bstatic $(shell $(PKGCONF) --static --libs libdpdk) + +build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build + $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED) + +build/$(APP)-static: $(SRCS-y) Makefile $(PC_FILE) | build + $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_STATIC) + +build: + @mkdir -p $@ + +.PHONY: clean +clean: + rm -f build/$(APP) build/$(APP)-static build/$(APP)-shared + test -d build && rmdir -p build || true + +else # Build using legacy build system + +ifeq ($(RTE_SDK),) +$(error "Please define RTE_SDK environment variable") +endif + +# Default target, detect a build directory, by looking for a path with a .config +RTE_TARGET ?= $(notdir $(abspath $(dir $(firstword $(wildcard $(RTE_SDK)/*/.config))))) + +include $(RTE_SDK)/mk/rte.vars.mk + +CFLAGS += -DALLOW_EXPERIMENTAL_API + +CFLAGS += -I$(SRCDIR) +CFLAGS += -O3 $(USER_FLAGS) +CFLAGS += $(WERROR_FLAGS) +LDLIBS += -lrte_if_proxy -lrte_ethdev -lrte_eal + +include $(RTE_SDK)/mk/rte.extapp.mk +endif diff --git a/examples/l3fwd-ifpx/l3fwd.c b/examples/l3fwd-ifpx/l3fwd.c new file mode 100644 index 000000000..8811aec01 --- /dev/null +++ b/examples/l3fwd-ifpx/l3fwd.c @@ -0,0 +1,1128 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. 
+ */ + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <inttypes.h> +#include <sys/types.h> +#include <string.h> +#include <sys/queue.h> +#include <stdarg.h> +#include <errno.h> +#include <getopt.h> +#include <sys/socket.h> +#include <arpa/inet.h> + +#include <rte_debug.h> +#include <rte_ether.h> +#include <rte_cycles.h> +#include <rte_malloc.h> +#include <rte_mbuf.h> +#include <rte_ip.h> + +#ifndef USE_HASH_CRC +#include <rte_jhash.h> +#else +#include <rte_hash_crc.h> +#endif + +#include <rte_tcp.h> +#include <rte_udp.h> +#include <rte_lpm.h> +#include <rte_lpm6.h> +#include <rte_if_proxy.h> + +#include "l3fwd.h" + +#define DO_RFC_1812_CHECKS + +#define IPV4_L3FWD_LPM_MAX_RULES 1024 +#define IPV4_L3FWD_LPM_NUMBER_TBL8S (1 << 8) +#define IPV6_L3FWD_LPM_MAX_RULES 1024 +#define IPV6_L3FWD_LPM_NUMBER_TBL8S (1 << 16) + +static volatile bool ifpx_ready; + +/* ethernet addresses of ports */ +static +union lladdr_t port_mac[RTE_MAX_ETHPORTS]; + +static struct rte_lpm *ipv4_routes; +static struct rte_lpm6 *ipv6_routes; + +static +struct ipv4_gateway { + uint16_t port; + union lladdr_t lladdr; + uint32_t ip; +} ipv4_gateways[128]; + +static +struct ipv6_gateway { + uint16_t port; + union lladdr_t lladdr; + uint8_t ip[16]; +} ipv6_gateways[128]; + +/* The lowest 2 bits of next hop (which is 24/21 bit for IPv4/6) are reserved to + * encode: + * 00 -> host route: higher bits of next hop are port id and dst MAC should be + * based on dst IP + * 01 -> gateway route: higher bits of next hop are index into gateway array and + * use port and MAC cached there (if no MAC cached yet then search for it + * based on gateway IP) + * 10 -> proxy entry: packet directed to us, just take higher bits as port id of + * proxy and send packet there (without any modification) + * The port id (16 bits) will always fit however this will not work if you + * need more than 2^20 gateways. 
+ */ +enum route_type { + HOST_ROUTE = 0x00, + GW_ROUTE = 0x01, + PROXY_ADDR = 0x02, +}; + +RTE_STD_C11 +_Static_assert(RTE_DIM(ipv4_gateways) <= (1 << 22) && + RTE_DIM(ipv6_gateways) <= (1 << 19), + "Gateway array index has to fit within next_hop with 2 bits reserved"); + +static +uint32_t find_add_gateway(uint16_t port, uint32_t ip) +{ + uint32_t i, idx = -1U; + + for (i = 0; i < RTE_DIM(ipv4_gateways); ++i) { + /* Remember first free slot in case GW is not present. */ + if (idx == -1U && ipv4_gateways[i].ip == 0) + idx = i; + else if (ipv4_gateways[i].ip == ip) + /* For now assume that given GW will be always at the + * same port, so no checking for that + */ + return i; + } + if (idx != -1U) { + ipv4_gateways[idx].port = port; + ipv4_gateways[idx].ip = ip; + /* Since ARP tables are kept per lcore MAC will be updated + * during first lookup. + */ + } + return idx; +} + +static +void clear_gateway(uint32_t ip) +{ + uint32_t i; + + for (i = 0; i < RTE_DIM(ipv4_gateways); ++i) { + if (ipv4_gateways[i].ip == ip) { + ipv4_gateways[i].ip = 0; + ipv4_gateways[i].lladdr.val = 0; + ipv4_gateways[i].port = RTE_MAX_ETHPORTS; + break; + } + } +} + +static +uint32_t find_add_gateway6(uint16_t port, const uint8_t *ip) +{ + uint32_t i, idx = -1U; + + for (i = 0; i < RTE_DIM(ipv6_gateways); ++i) { + /* Remember first free slot in case GW is not present. */ + if (idx == -1U && ipv6_gateways[i].ip[0] == 0) + idx = i; + else if (memcmp(ipv6_gateways[i].ip, ip, 16) == 0) + /* For now assume that given GW will be always at the + * same port, so no checking for that + */ + return i; + } + if (idx != -1U) { + ipv6_gateways[idx].port = port; + memcpy(ipv6_gateways[idx].ip, ip, 16); + /* Since ARP tables are kept per lcore MAC will be updated + * during first lookup. 
+ */ + } + return idx; +} + +static +void clear_gateway6(const uint8_t *ip) +{ + uint32_t i; + + for (i = 0; i < RTE_DIM(ipv6_gateways); ++i) { + if (memcmp(ipv6_gateways[i].ip, ip, 16) == 0) { + memset(&ipv6_gateways[i].ip, 0, 16); + ipv6_gateways[i].lladdr.val = 0; + ipv6_gateways[i].port = RTE_MAX_ETHPORTS; + break; + } + } +} + +/* Assumptions: + * - Link related changes (MAC/MTU/...) need to be executed once, and it's OK + * to run them from the callback - if this is not the case (e.g. -EBUSY for + * MTU change), then event notifications need to be used, with more sophisticated + * coordination with lcore loops and stopping/starting of the ports: for + * example lcores not receiving on this port just mark it as inactive and stop + * transmitting to it, and the one with RX stops the port, sets the MAC, starts + * it and notifies other lcores that it is back. + * - LPM is safe to be modified by one writer, and read by many without any + * locks (it looks to me like this is the case), however upon routing change + * there might be a transient period during which packets are not directed + * according to new rule. + * - Hash is unsafe to be used that way (and I don't want to turn on relevant + * flags just to exercise queued notifications) so every lcore keeps its + * copy of relevant data. + * Therefore there are callbacks defined for the routing info/address changes + * and remaining ones are handled via events on per lcore basis. 
+ */ +static +int mac_change(const struct rte_ifpx_mac_change *ev) +{ + int i; + struct rte_ether_addr mac_addr; + char buf[RTE_ETHER_ADDR_FMT_SIZE]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(buf, sizeof(buf), &ev->mac); + RTE_LOG(DEBUG, L3FWD, "MAC change for port %d: %s\n", + ev->port_id, buf); + } + /* NOTE - use copy because RTE functions don't take const args */ + rte_ether_addr_copy(&ev->mac, &mac_addr); + i = rte_eth_dev_default_mac_addr_set(ev->port_id, &mac_addr); + if (i == -EOPNOTSUPP) + i = rte_eth_dev_mac_addr_add(ev->port_id, &mac_addr, 0); + if (i < 0) + RTE_LOG(WARNING, L3FWD, "Failed to set MAC address\n"); + else { + port_mac[ev->port_id].mac.addr = ev->mac; + port_mac[ev->port_id].mac.valid = 1; + } + return 1; +} + +static +int link_change(const struct rte_ifpx_link_change *ev) +{ + uint16_t proxy_id = rte_ifpx_proxy_get(ev->port_id); + uint32_t mask; + + /* Mark the proxy too since we get only port notifications. */ + mask = 1U << ev->port_id | 1U << proxy_id; + + RTE_LOG(DEBUG, L3FWD, "Link change for port %d: %d\n", + ev->port_id, ev->is_up); + if (ev->is_up) { + rte_eth_dev_set_link_up(ev->port_id); + active_port_mask |= mask; + } else { + rte_eth_dev_set_link_down(ev->port_id); + active_port_mask &= ~mask; + } + active_port_mask &= enabled_port_mask; + return 1; +} + +static +int addr_add(const struct rte_ifpx_addr_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 address for port %d: %s\n", + ev->port_id, buf); + } + rte_lpm_add(ipv4_routes, ev->ip, 32, + ev->port_id << 2 | PROXY_ADDR); + return 1; +} + +static +int route_add(const struct rte_ifpx_route_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t nh, ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = 
rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 route for port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + + /* On Linux upon changing of the IP we get notification for both addr + * and route, so just check if we already have addr entry and if so + * then ignore this notification. + */ + if (ev->depth == 32 && + rte_lpm_lookup(ipv4_routes, ev->ip, &nh) == 0 && nh & PROXY_ADDR) + return 1; + + if (ev->gateway) { + nh = find_add_gateway(ev->port_id, ev->gateway); + if (nh != -1U) + rte_lpm_add(ipv4_routes, ev->ip, ev->depth, + nh << 2 | GW_ROUTE); + else + RTE_LOG(WARNING, L3FWD, "No free slot in GW array\n"); + } else + rte_lpm_add(ipv4_routes, ev->ip, ev->depth, + ev->port_id << 2 | HOST_ROUTE); + return 1; +} + +static +int addr_del(const struct rte_ifpx_addr_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 address removed from port %d: %s\n", + ev->port_id, buf); + } + rte_lpm_delete(ipv4_routes, ev->ip, 32); + return 1; +} + +static +int route_del(const struct rte_ifpx_route_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 route removed from port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + if (ev->gateway) + clear_gateway(ev->gateway); + rte_lpm_delete(ipv4_routes, ev->ip, ev->depth); + return 1; +} + +static +int addr6_add(const struct rte_ifpx_addr6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 address for port %d: %s\n", + ev->port_id, buf); + } + rte_lpm6_add(ipv6_routes, 
ev->ip, 128, + ev->port_id << 2 | PROXY_ADDR); + return 1; +} + +static +int route6_add(const struct rte_ifpx_route6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + /* See comment in route_add(). */ + uint32_t nh; + if (ev->depth == 128 && + rte_lpm6_lookup(ipv6_routes, ev->ip, &nh) == 0 && nh & PROXY_ADDR) + return 1; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 route for port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + /* no valid IPv6 address starts with 0x00 */ + if (ev->gateway[0]) { + nh = find_add_gateway6(ev->port_id, ev->gateway); + if (nh != -1U) + rte_lpm6_add(ipv6_routes, ev->ip, ev->depth, + nh << 2 | GW_ROUTE); + else + RTE_LOG(WARNING, L3FWD, "No free slot in GW6 array\n"); + } else + rte_lpm6_add(ipv6_routes, ev->ip, ev->depth, + ev->port_id << 2 | HOST_ROUTE); + return 1; +} + +static +int addr6_del(const struct rte_ifpx_addr6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 address removed from port %d: %s\n", + ev->port_id, buf); + } + rte_lpm6_delete(ipv6_routes, ev->ip, 128); + return 1; +} + +static +int route6_del(const struct rte_ifpx_route6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 route removed from port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + if (ev->gateway[0]) + clear_gateway6(ev->gateway); + rte_lpm6_delete(ipv6_routes, ev->ip, ev->depth); + return 1; +} + +static +int cfg_done(void) +{ + uint16_t port_id, px; + const struct rte_ifpx_info *pinfo; + + RTE_LOG(DEBUG, L3FWD, "Proxy config finished\n"); + + /* Copy MAC addresses of the proxies - to be used as src MAC during + * forwarding. 
+ */ + RTE_ETH_FOREACH_DEV(port_id) { + px = rte_ifpx_proxy_get(port_id); + if (px != RTE_MAX_ETHPORTS && px != port_id) { + pinfo = rte_ifpx_info_get(px); + rte_ether_addr_copy(&pinfo->mac, + &port_mac[port_id].mac.addr); + port_mac[port_id].mac.valid = 1; + } + } + + ifpx_ready = 1; + return 1; +} + +static +struct rte_ifpx_callback ifpx_callbacks[] = { + { RTE_IFPX_MAC_CHANGE, {.mac_change = mac_change} }, + { RTE_IFPX_LINK_CHANGE, {.link_change = link_change} }, + { RTE_IFPX_ADDR_ADD, {.addr_add = addr_add} }, + { RTE_IFPX_ADDR_DEL, {.addr_del = addr_del} }, + { RTE_IFPX_ADDR6_ADD, {.addr6_add = addr6_add} }, + { RTE_IFPX_ADDR6_DEL, {.addr6_del = addr6_del} }, + { RTE_IFPX_ROUTE_ADD, {.route_add = route_add} }, + { RTE_IFPX_ROUTE_DEL, {.route_del = route_del} }, + { RTE_IFPX_ROUTE6_ADD, {.route6_add = route6_add} }, + { RTE_IFPX_ROUTE6_DEL, {.route6_del = route6_del} }, + { RTE_IFPX_CFG_DONE, {.cfg_done = cfg_done} }, +}; + +int init_if_proxy(void) +{ + char buf[16]; + unsigned int i; + + rte_ifpx_callbacks_register(RTE_DIM(ifpx_callbacks), ifpx_callbacks); + + RTE_LCORE_FOREACH(i) { + if (lcore_conf[i].n_rx_queue == 0) + continue; + snprintf(buf, sizeof(buf), "IFPX-events_%d", i); + lcore_conf[i].ev_queue = rte_ring_create(buf, 16, SOCKET_ID_ANY, + RING_F_SP_ENQ | RING_F_SC_DEQ); + if (!lcore_conf[i].ev_queue) { + RTE_LOG(ERR, L3FWD, + "Failed to create event queue for lcore %d\n", + i); + return -1; + } + rte_ifpx_queue_add(lcore_conf[i].ev_queue); + } + + return rte_ifpx_listen(); +} + +void close_if_proxy(void) +{ + unsigned int i; + + RTE_LCORE_FOREACH(i) { + if (lcore_conf[i].n_rx_queue == 0) + continue; + rte_ring_free(lcore_conf[i].ev_queue); + } + rte_ifpx_close(); +} + +void wait_for_config_done(void) +{ + while (!ifpx_ready) + rte_delay_ms(100); +} + +#ifdef DO_RFC_1812_CHECKS +static inline +int is_valid_ipv4_pkt(struct rte_ipv4_hdr *pkt, uint32_t link_len) +{ + /* From http://www.rfc-editor.org/rfc/rfc1812.txt section 5.2.2 */ + /* + * 1. 
The packet length reported by the Link Layer must be large + * enough to hold the minimum length legal IP datagram (20 bytes). + */ + if (link_len < sizeof(struct rte_ipv4_hdr)) + return -1; + + /* 2. The IP checksum must be correct. */ + /* this is checked in H/W */ + + /* + * 3. The IP version number must be 4. If the version number is not 4 + * then the packet may be another version of IP, such as IPng or + * ST-II. + */ + if (((pkt->version_ihl) >> 4) != 4) + return -3; + /* + * 4. The IP header length field must be large enough to hold the + * minimum length legal IP datagram (20 bytes = 5 words). + */ + if ((pkt->version_ihl & 0xf) < 5) + return -4; + + /* + * 5. The IP total length field must be large enough to hold the IP + * datagram header, whose length is specified in the IP header length + * field. + */ + if (rte_cpu_to_be_16(pkt->total_length) < sizeof(struct rte_ipv4_hdr)) + return -5; + + return 0; +} +#endif + +/* Send burst of packets on an output interface */ +static inline +int send_burst(struct lcore_conf *lconf, uint16_t n, uint16_t port) +{ + struct rte_mbuf **m_table; + int ret; + uint16_t queueid; + + queueid = lconf->tx_queue_id[port]; + m_table = (struct rte_mbuf **)lconf->tx_mbufs[port].m_table; + + ret = rte_eth_tx_burst(port, queueid, m_table, n); + if (unlikely(ret < n)) { + do { + rte_pktmbuf_free(m_table[ret]); + } while (++ret < n); + } + + return 0; +} + +/* Enqueue a single packet, and send burst if queue is filled */ +static inline +int send_single_packet(struct lcore_conf *lconf, + struct rte_mbuf *m, uint16_t port) +{ + uint16_t len; + + len = lconf->tx_mbufs[port].len; + lconf->tx_mbufs[port].m_table[len] = m; + len++; + + /* enough pkts to be sent */ + if (unlikely(len == MAX_PKT_BURST)) { + send_burst(lconf, MAX_PKT_BURST, port); + len = 0; + } + + lconf->tx_mbufs[port].len = len; + return 0; +} + +static inline +int ipv4_get_destination(const struct rte_ipv4_hdr *ipv4_hdr, + struct rte_lpm *lpm, uint32_t *next_hop) +{ + 
return rte_lpm_lookup(lpm, + rte_be_to_cpu_32(ipv4_hdr->dst_addr), + next_hop); +} + +static inline +int ipv6_get_destination(const struct rte_ipv6_hdr *ipv6_hdr, + struct rte_lpm6 *lpm, uint32_t *next_hop) +{ + return rte_lpm6_lookup(lpm, ipv6_hdr->dst_addr, next_hop); +} + +static +uint16_t ipv4_process_pkt(struct lcore_conf *lconf, + struct rte_ether_hdr *eth_hdr, + struct rte_ipv4_hdr *ipv4_hdr, uint16_t portid) +{ + union lladdr_t lladdr = { 0 }; + int i; + uint32_t ip, nh; + + /* Here we know that packet is not from proxy - this case is handled + * in the main loop - so if we fail to find destination we will direct + * it to the proxy. + */ + if (ipv4_get_destination(ipv4_hdr, ipv4_routes, &nh) < 0) + return rte_ifpx_proxy_get(portid); + + if (nh & PROXY_ADDR) + return nh >> 2; + + /* Packet not to us so update src/dst MAC. */ + if (nh & GW_ROUTE) { + i = nh >> 2; + if (ipv4_gateways[i].lladdr.mac.valid) + lladdr = ipv4_gateways[i].lladdr; + else { + i = rte_hash_lookup(lconf->neigh_hash, + &ipv4_gateways[i].ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh_map[i]; + ipv4_gateways[i].lladdr = lladdr; + } + nh = ipv4_gateways[i].port; + } else { + nh >>= 2; + ip = rte_be_to_cpu_32(ipv4_hdr->dst_addr); + i = rte_hash_lookup(lconf->neigh_hash, &ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh_map[i]; + } + + RTE_ASSERT(lladdr.mac.valid); + RTE_ASSERT(port_mac[nh].mac.valid); + /* dst addr */ + *(uint64_t *)&eth_hdr->d_addr = lladdr.val; + /* src addr */ + rte_ether_addr_copy(&port_mac[nh].mac.addr, &eth_hdr->s_addr); + + return nh; +} + +static +uint16_t ipv6_process_pkt(struct lcore_conf *lconf, + struct rte_ether_hdr *eth_hdr, + struct rte_ipv6_hdr *ipv6_hdr, uint16_t portid) +{ + union lladdr_t lladdr = { 0 }; + int i; + uint32_t nh; + + /* Here we know that packet is not from proxy - this case is handled + * in the main loop - so if we fail to find destination we will direct + * it to the proxy.
+ */ + if (ipv6_get_destination(ipv6_hdr, ipv6_routes, &nh) < 0) + return rte_ifpx_proxy_get(portid); + + if (nh & PROXY_ADDR) + return nh >> 2; + + /* Packet not to us so update src/dst MAC. */ + if (nh & GW_ROUTE) { + i = nh >> 2; + if (ipv6_gateways[i].lladdr.mac.valid) + lladdr = ipv6_gateways[i].lladdr; + else { + i = rte_hash_lookup(lconf->neigh6_hash, + ipv6_gateways[i].ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh6_map[i]; + ipv6_gateways[i].lladdr = lladdr; + } + nh = ipv6_gateways[i].port; + } else { + nh >>= 2; + i = rte_hash_lookup(lconf->neigh6_hash, ipv6_hdr->dst_addr); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh6_map[i]; + } + + RTE_ASSERT(lladdr.mac.valid); + /* dst addr */ + *(uint64_t *)&eth_hdr->d_addr = lladdr.val; + /* src addr */ + rte_ether_addr_copy(&port_mac[nh].mac.addr, &eth_hdr->s_addr); + + return nh; +} + +static __rte_always_inline +void l3fwd_lpm_simple_forward(struct rte_mbuf *m, uint16_t portid, + struct lcore_conf *lconf) +{ + struct rte_ether_hdr *eth_hdr; + uint32_t nh; + + eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *); + + if (RTE_ETH_IS_IPV4_HDR(m->packet_type)) { + /* Handle IPv4 headers.*/ + struct rte_ipv4_hdr *ipv4_hdr; + + ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *, + sizeof(*eth_hdr)); + +#ifdef DO_RFC_1812_CHECKS + /* Check to make sure the packet is valid (RFC1812) */ + if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt_len) < 0) { + rte_pktmbuf_free(m); + return; + } +#endif + nh = ipv4_process_pkt(lconf, eth_hdr, ipv4_hdr, portid); + +#ifdef DO_RFC_1812_CHECKS + /* Update time to live and header checksum */ + --(ipv4_hdr->time_to_live); + ++(ipv4_hdr->hdr_checksum); +#endif + } else if (RTE_ETH_IS_IPV6_HDR(m->packet_type)) { + /* Handle IPv6 headers.*/ + struct rte_ipv6_hdr *ipv6_hdr; + + ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *, + sizeof(*eth_hdr)); + + nh = ipv6_process_pkt(lconf, eth_hdr, ipv6_hdr, portid); + } else + /*
Unhandled protocol */ + nh = rte_ifpx_proxy_get(portid); + + if (nh >= RTE_MAX_ETHPORTS || (active_port_mask & 1 << nh) == 0) + rte_pktmbuf_free(m); + else + send_single_packet(lconf, m, nh); +} + +static inline +void l3fwd_send_packets(int nb_rx, struct rte_mbuf **pkts_burst, + uint16_t portid, struct lcore_conf *lconf) +{ + int32_t j; + + /* Prefetch first packets */ + for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *)); + + /* Prefetch and forward already prefetched packets. */ + for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) { + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[ + j + PREFETCH_OFFSET], void *)); + l3fwd_lpm_simple_forward(pkts_burst[j], portid, lconf); + } + + /* Forward remaining prefetched packets */ + for (; j < nb_rx; j++) + l3fwd_lpm_simple_forward(pkts_burst[j], portid, lconf); +} + +static +void handle_neigh_add(struct lcore_conf *lconf, + const struct rte_ifpx_neigh_change *ev) +{ + char mac[RTE_ETHER_ADDR_FMT_SIZE]; + char ip[INET_ADDRSTRLEN]; + int32_t i, a; + + i = rte_hash_add_key(lconf->neigh_hash, &ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to add IPv4 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(mac, sizeof(mac), &ev->mac); + a = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &a, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour update for port %d: %s -> %s@%d\n", + ev->port_id, ip, mac, i); + } + lconf->neigh_map[i].mac.addr = ev->mac; + lconf->neigh_map[i].mac.valid = 1; +} + +static +void handle_neigh_del(struct lcore_conf *lconf, + const struct rte_ifpx_neigh_change *ev) +{ + char ip[INET_ADDRSTRLEN]; + int32_t i, a; + + i = rte_hash_del_key(lconf->neigh_hash, &ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, + "Failed to remove IPv4 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + a = rte_cpu_to_be_32(ev->ip); + 
inet_ntop(AF_INET, &a, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour removal for port %d: %s\n", + ev->port_id, ip); + } + lconf->neigh_map[i].val = 0; +} + +static +void handle_neigh6_add(struct lcore_conf *lconf, + const struct rte_ifpx_neigh6_change *ev) +{ + char mac[RTE_ETHER_ADDR_FMT_SIZE]; + char ip[INET6_ADDRSTRLEN]; + int32_t i; + + i = rte_hash_add_key(lconf->neigh6_hash, ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to add IPv6 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(mac, sizeof(mac), &ev->mac); + inet_ntop(AF_INET6, ev->ip, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour update for port %d: %s -> %s@%d\n", + ev->port_id, ip, mac, i); + } + lconf->neigh6_map[i].mac.addr = ev->mac; + lconf->neigh6_map[i].mac.valid = 1; +} + +static +void handle_neigh6_del(struct lcore_conf *lconf, + const struct rte_ifpx_neigh6_change *ev) +{ + char ip[INET6_ADDRSTRLEN]; + int32_t i; + + i = rte_hash_del_key(lconf->neigh6_hash, ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to remove IPv6 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour removal for port %d: %s\n", + ev->port_id, ip); + } + lconf->neigh6_map[i].val = 0; +} + +static +void handle_events(struct lcore_conf *lconf) +{ + struct rte_ifpx_event *ev; + + while (rte_ring_dequeue(lconf->ev_queue, (void **)&ev) == 0) { + switch (ev->type) { + case RTE_IFPX_NEIGH_ADD: + handle_neigh_add(lconf, &ev->neigh_change); + break; + case RTE_IFPX_NEIGH_DEL: + handle_neigh_del(lconf, &ev->neigh_change); + break; + case RTE_IFPX_NEIGH6_ADD: + handle_neigh6_add(lconf, &ev->neigh6_change); + break; + case RTE_IFPX_NEIGH6_DEL: + handle_neigh6_del(lconf, &ev->neigh6_change); + break; + default: + RTE_LOG(WARNING, L3FWD, + "Unexpected event: %d\n", ev->type); + } + free(ev); + } +}
+ +void setup_lpm(void) +{ + struct rte_lpm6_config cfg6; + struct rte_lpm_config cfg4; + + /* create the LPM table */ + cfg4.max_rules = IPV4_L3FWD_LPM_MAX_RULES; + cfg4.number_tbl8s = IPV4_L3FWD_LPM_NUMBER_TBL8S; + cfg4.flags = 0; + ipv4_routes = rte_lpm_create("IPV4_L3FWD_LPM", SOCKET_ID_ANY, &cfg4); + if (ipv4_routes == NULL) + rte_exit(EXIT_FAILURE, "Unable to create the l3fwd LPM table\n"); + + /* create the LPM6 table */ + cfg6.max_rules = IPV6_L3FWD_LPM_MAX_RULES; + cfg6.number_tbl8s = IPV6_L3FWD_LPM_NUMBER_TBL8S; + cfg6.flags = 0; + ipv6_routes = rte_lpm6_create("IPV6_L3FWD_LPM", SOCKET_ID_ANY, &cfg6); + if (ipv6_routes == NULL) + rte_exit(EXIT_FAILURE, "Unable to create the l3fwd LPM table\n"); +} + +static +uint32_t hash_ipv4(const void *key, uint32_t key_len __rte_unused, + uint32_t init_val) +{ +#ifndef USE_HASH_CRC + return rte_jhash_1word(*(const uint32_t *)key, init_val); +#else + return rte_hash_crc_4byte(*(const uint32_t *)key, init_val); +#endif +} + +static +uint32_t hash_ipv6(const void *key, uint32_t key_len __rte_unused, + uint32_t init_val) +{ +#ifndef USE_HASH_CRC + return rte_jhash_32b(key, 4, init_val); +#else + const uint64_t *pk = key; + init_val = rte_hash_crc_8byte(*pk, init_val); + return rte_hash_crc_8byte(*(pk+1), init_val); +#endif +} + +static +int setup_neigh(struct lcore_conf *lconf) +{ + char buf[16]; + struct rte_hash_parameters ipv4_hparams = { + .name = buf, + .entries = L3FWD_NEIGH_ENTRIES, + .key_len = 4, + .hash_func = hash_ipv4, + .hash_func_init_val = 0, + }; + struct rte_hash_parameters ipv6_hparams = { + .name = buf, + .entries = L3FWD_NEIGH_ENTRIES, + .key_len = 16, + .hash_func = hash_ipv6, + .hash_func_init_val = 0, + }; + + snprintf(buf, sizeof(buf), "neigh_hash-%d", rte_lcore_id()); + lconf->neigh_hash = rte_hash_create(&ipv4_hparams); + snprintf(buf, sizeof(buf), "neigh_map-%d", rte_lcore_id()); + lconf->neigh_map = rte_zmalloc(buf, + L3FWD_NEIGH_ENTRIES*sizeof(*lconf->neigh_map), + 8); + if (lconf->neigh_hash 
== NULL || lconf->neigh_map == NULL) { + RTE_LOG(ERR, L3FWD, + "Unable to create the l3fwd ARP/IPv4 table (lcore %d)\n", + rte_lcore_id()); + return -1; + } + + snprintf(buf, sizeof(buf), "neigh6_hash-%d", rte_lcore_id()); + lconf->neigh6_hash = rte_hash_create(&ipv6_hparams); + snprintf(buf, sizeof(buf), "neigh6_map-%d", rte_lcore_id()); + lconf->neigh6_map = rte_zmalloc(buf, + L3FWD_NEIGH_ENTRIES*sizeof(*lconf->neigh6_map), + 8); + if (lconf->neigh6_hash == NULL || lconf->neigh6_map == NULL) { + RTE_LOG(ERR, L3FWD, + "Unable to create the l3fwd ARP/IPv6 table (lcore %d)\n", + rte_lcore_id()); + return -1; + } + return 0; +} + +int lpm_check_ptype(int portid) +{ + int i, ret; + int ptype_l3_ipv4 = 0, ptype_l3_ipv6 = 0; + uint32_t ptype_mask = RTE_PTYPE_L3_MASK; + + ret = rte_eth_dev_get_supported_ptypes(portid, ptype_mask, NULL, 0); + if (ret <= 0) + return 0; + + uint32_t ptypes[ret]; + + ret = rte_eth_dev_get_supported_ptypes(portid, ptype_mask, ptypes, ret); + for (i = 0; i < ret; ++i) { + if (ptypes[i] & RTE_PTYPE_L3_IPV4) + ptype_l3_ipv4 = 1; + if (ptypes[i] & RTE_PTYPE_L3_IPV6) + ptype_l3_ipv6 = 1; + } + + if (ptype_l3_ipv4 == 0) + RTE_LOG(WARNING, L3FWD, + "port %d cannot parse RTE_PTYPE_L3_IPV4\n", portid); + + if (ptype_l3_ipv6 == 0) + RTE_LOG(WARNING, L3FWD, + "port %d cannot parse RTE_PTYPE_L3_IPV6\n", portid); + + if (ptype_l3_ipv4 && ptype_l3_ipv6) + return 1; + + return 0; + +} + +static inline +void lpm_parse_ptype(struct rte_mbuf *m) +{ + struct rte_ether_hdr *eth_hdr; + uint32_t packet_type = RTE_PTYPE_UNKNOWN; + uint16_t ether_type; + + eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *); + ether_type = eth_hdr->ether_type; + if (ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) + packet_type |= RTE_PTYPE_L3_IPV4_EXT_UNKNOWN; + else if (ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV6)) + packet_type |= RTE_PTYPE_L3_IPV6_EXT_UNKNOWN; + + m->packet_type = packet_type; +} + +uint16_t lpm_cb_parse_ptype(uint16_t port __rte_unused, + uint16_t 
queue __rte_unused, + struct rte_mbuf *pkts[], uint16_t nb_pkts, + uint16_t max_pkts __rte_unused, + void *user_param __rte_unused) +{ + unsigned int i; + + if (unlikely(nb_pkts == 0)) + return nb_pkts; + rte_prefetch0(rte_pktmbuf_mtod(pkts[0], struct ether_hdr *)); + for (i = 0; i < (unsigned int) (nb_pkts - 1); ++i) { + rte_prefetch0(rte_pktmbuf_mtod(pkts[i+1], + struct ether_hdr *)); + lpm_parse_ptype(pkts[i]); + } + lpm_parse_ptype(pkts[i]); + + return nb_pkts; +} + +/* main processing loop */ +int lpm_main_loop(void *dummy __rte_unused) +{ + struct rte_mbuf *pkts_burst[MAX_PKT_BURST]; + unsigned int lcore_id; + uint64_t prev_tsc, diff_tsc, cur_tsc; + int i, j, nb_rx; + uint16_t portid; + uint8_t queueid; + struct lcore_conf *lconf; + struct lcore_rx_queue *rxq; + const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) / + US_PER_S * BURST_TX_DRAIN_US; + + prev_tsc = 0; + + lcore_id = rte_lcore_id(); + lconf = &lcore_conf[lcore_id]; + + if (setup_neigh(lconf) < 0) { + RTE_LOG(ERR, L3FWD, "lcore %u failed to setup its ARP tables\n", + lcore_id); + return 0; + } + + if (lconf->n_rx_queue == 0) { + RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id); + return 0; + } + + RTE_LOG(INFO, L3FWD, "entering main loop on lcore %u\n", lcore_id); + + for (i = 0; i < lconf->n_rx_queue; i++) { + + portid = lconf->rx_queue_list[i].port_id; + queueid = lconf->rx_queue_list[i].queue_id; + RTE_LOG(INFO, L3FWD, + " -- lcoreid=%u portid=%u rxqueueid=%hhu\n", + lcore_id, portid, queueid); + } + + while (!force_quit) { + + cur_tsc = rte_rdtsc(); + /* + * TX burst and event queue drain + */ + diff_tsc = cur_tsc - prev_tsc; + if (unlikely(diff_tsc % drain_tsc == 0)) { + + for (i = 0; i < lconf->n_tx_port; ++i) { + portid = lconf->tx_port_id[i]; + if (lconf->tx_mbufs[portid].len == 0) + continue; + send_burst(lconf, + lconf->tx_mbufs[portid].len, + portid); + lconf->tx_mbufs[portid].len = 0; + } + + if (diff_tsc > EV_QUEUE_DRAIN * drain_tsc) { + if (lconf->ev_queue && + 
!rte_ring_empty(lconf->ev_queue)) + handle_events(lconf); + prev_tsc = cur_tsc; + } + } + + /* + * Read packet from RX queues + */ + for (i = 0; i < lconf->n_rx_queue; ++i) { + rxq = &lconf->rx_queue_list[i]; + portid = rxq->port_id; + queueid = rxq->queue_id; + nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst, + MAX_PKT_BURST); + if (nb_rx == 0) + continue; + /* If current queue is from proxy interface then there + * is no need to figure out destination port - just + * forward it to the bound port. + */ + if (unlikely(rxq->dst_port != RTE_MAX_ETHPORTS)) { + for (j = 0; j < nb_rx; ++j) + send_single_packet(lconf, pkts_burst[j], + rxq->dst_port); + } else + l3fwd_send_packets(nb_rx, pkts_burst, portid, + lconf); + } + } + + return 0; +} diff --git a/examples/l3fwd-ifpx/l3fwd.h b/examples/l3fwd-ifpx/l3fwd.h new file mode 100644 index 000000000..fc60078c5 --- /dev/null +++ b/examples/l3fwd-ifpx/l3fwd.h @@ -0,0 +1,98 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. + */ + +#ifndef __L3_FWD_H__ +#define __L3_FWD_H__ + +#include <stdbool.h> + +#include <rte_ethdev.h> +#include <rte_log.h> +#include <rte_hash.h> + +#define RTE_LOGTYPE_L3FWD RTE_LOGTYPE_USER1 + +#define MAX_PKT_BURST 32 +#define BURST_TX_DRAIN_US 100 /* TX drain every ~100us */ +#define EV_QUEUE_DRAIN 5 /* Check event queue every 5 TX drains */ + +#define MAX_RX_QUEUE_PER_LCORE 16 + +/* + * Try to avoid TX buffering if we have at least MAX_TX_BURST packets to send. + */ +#define MAX_TX_BURST (MAX_PKT_BURST / 2) + +/* Configure how many packets ahead to prefetch, when reading packets */ +#define PREFETCH_OFFSET 3 + +/* Hash parameters. 
*/ +#ifdef RTE_ARCH_64 +/* default to 4 million hash entries (approx) */ +#define L3FWD_HASH_ENTRIES (1024*1024*4) +#else +/* 32-bit has less address-space for hugepage memory, limit to 1M entries */ +#define L3FWD_HASH_ENTRIES (1024*1024*1) +#endif +#define HASH_ENTRY_NUMBER_DEFAULT 4 +/* Default ARP table size */ +#define L3FWD_NEIGH_ENTRIES 1024 + +union lladdr_t { + uint64_t val; + struct { + struct rte_ether_addr addr; + uint16_t valid; + } mac; +}; + +struct mbuf_table { + uint16_t len; + struct rte_mbuf *m_table[MAX_PKT_BURST]; +}; + +struct lcore_rx_queue { + uint16_t port_id; + uint16_t dst_port; + uint8_t queue_id; +} __rte_cache_aligned; + +struct lcore_conf { + uint16_t n_rx_queue; + struct lcore_rx_queue rx_queue_list[MAX_RX_QUEUE_PER_LCORE]; + uint16_t n_tx_port; + uint16_t tx_port_id[RTE_MAX_ETHPORTS]; + uint16_t tx_queue_id[RTE_MAX_ETHPORTS]; + struct mbuf_table tx_mbufs[RTE_MAX_ETHPORTS]; + struct rte_ring *ev_queue; + union lladdr_t *neigh_map; + struct rte_hash *neigh_hash; + union lladdr_t *neigh6_map; + struct rte_hash *neigh6_hash; +} __rte_cache_aligned; + +extern volatile bool force_quit; + +/* mask of enabled/active ports */ +extern uint32_t enabled_port_mask; +extern uint32_t active_port_mask; + +extern struct lcore_conf lcore_conf[RTE_MAX_LCORE]; + +int init_if_proxy(void); +void close_if_proxy(void); + +void wait_for_config_done(void); + +void setup_lpm(void); + +int lpm_check_ptype(int portid); + +uint16_t +lpm_cb_parse_ptype(uint16_t port, uint16_t queue, struct rte_mbuf *pkts[], + uint16_t nb_pkts, uint16_t max_pkts, void *user_param); + +int lpm_main_loop(__attribute__((unused)) void *dummy); + +#endif /* __L3_FWD_H__ */ diff --git a/examples/l3fwd-ifpx/main.c b/examples/l3fwd-ifpx/main.c new file mode 100644 index 000000000..7f1da5ec2 --- /dev/null +++ b/examples/l3fwd-ifpx/main.c @@ -0,0 +1,740 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. 
+ */ + +#include <stdlib.h> +#include <stdint.h> +#include <inttypes.h> +#include <sys/types.h> +#include <string.h> +#include <sys/queue.h> +#include <stdarg.h> +#include <errno.h> +#include <getopt.h> +#include <signal.h> +#include <stdbool.h> + +#include <rte_byteorder.h> +#include <rte_memory.h> +#include <rte_memcpy.h> +#include <rte_eal.h> +#include <rte_launch.h> +#include <rte_atomic.h> +#include <rte_cycles.h> +#include <rte_prefetch.h> +#include <rte_lcore.h> +#include <rte_per_lcore.h> +#include <rte_branch_prediction.h> +#include <rte_interrupts.h> +#include <rte_random.h> +#include <rte_debug.h> +#include <rte_ether.h> +#include <rte_ethdev.h> +#include <rte_mempool.h> +#include <rte_mbuf.h> +#include <rte_ip.h> +#include <rte_tcp.h> +#include <rte_udp.h> +#include <rte_string_fns.h> +#include <rte_cpuflags.h> +#include <rte_if_proxy.h> + +#include <cmdline_parse.h> +#include <cmdline_parse_etheraddr.h> + +#include "l3fwd.h" + +/* + * Configurable number of RX/TX ring descriptors + */ +#define RTE_TEST_RX_DESC_DEFAULT 1024 +#define RTE_TEST_TX_DESC_DEFAULT 1024 + +#define MAX_TX_QUEUE_PER_PORT RTE_MAX_ETHPORTS +#define MAX_RX_QUEUE_PER_PORT 128 + +#define MAX_LCORE_PARAMS 1024 + +/* Static global variables used within this file. */ +static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT; +static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT; + +/**< Ports set in promiscuous mode off by default. */ +static int promiscuous_on; + +/* Global variables. 
*/ + +static int parse_ptype; /**< Parse packet type using rx callback, and */ + /**< disabled by default */ + +volatile bool force_quit; + +/* mask of enabled/active ports */ +uint32_t enabled_port_mask; +uint32_t active_port_mask; + +struct lcore_conf lcore_conf[RTE_MAX_LCORE]; + +struct lcore_params { + uint16_t port_id; + uint8_t queue_id; + uint8_t lcore_id; +} __rte_cache_aligned; + +static struct lcore_params lcore_params[MAX_LCORE_PARAMS]; +static struct lcore_params lcore_params_default[] = { + {0, 0, 2}, + {0, 1, 2}, + {0, 2, 2}, + {1, 0, 2}, + {1, 1, 2}, + {1, 2, 2}, + {2, 0, 2}, + {3, 0, 3}, + {3, 1, 3}, +}; + +static uint16_t nb_lcore_params; + +static struct rte_eth_conf port_conf = { + .rxmode = { + .mq_mode = ETH_MQ_RX_RSS, + .max_rx_pkt_len = RTE_ETHER_MAX_LEN, + .split_hdr_size = 0, + .offloads = DEV_RX_OFFLOAD_CHECKSUM, + }, + .rx_adv_conf = { + .rss_conf = { + .rss_key = NULL, + .rss_hf = ETH_RSS_IP, + }, + }, + .txmode = { + .mq_mode = ETH_MQ_TX_NONE, + }, +}; + +static struct rte_mempool *pktmbuf_pool; + +static int +check_lcore_params(void) +{ + uint8_t queue, lcore; + uint16_t i, port_id; + int socketid; + + for (i = 0; i < nb_lcore_params; ++i) { + queue = lcore_params[i].queue_id; + if (queue >= MAX_RX_QUEUE_PER_PORT) { + RTE_LOG(ERR, L3FWD, "Invalid queue number: %hhu\n", + queue); + return -1; + } + lcore = lcore_params[i].lcore_id; + if (!rte_lcore_is_enabled(lcore)) { + RTE_LOG(ERR, L3FWD, "lcore %hhu is not enabled " + "in lcore mask\n", lcore); + return -1; + } + port_id = lcore_params[i].port_id; + if ((enabled_port_mask & (1 << port_id)) == 0) { + RTE_LOG(ERR, L3FWD, "port %u is not enabled " + "in port mask\n", port_id); + return -1; + } + if (!rte_eth_dev_is_valid_port(port_id)) { + RTE_LOG(ERR, L3FWD, "port %u is not present " + "on the board\n", port_id); + return -1; + } + socketid = rte_lcore_to_socket_id(lcore); + if (socketid != 0) { + RTE_LOG(WARNING, L3FWD, + "lcore %hhu is on socket %d with numa off\n", + lcore, 
socketid); + } + } + return 0; +} + +static int +add_proxies(void) +{ + uint16_t i, p, port_id, proxy_id; + + for (i = 0, p = nb_lcore_params; i < nb_lcore_params; ++i) { + if (p >= RTE_DIM(lcore_params)) { + RTE_LOG(ERR, L3FWD, "Not enough room in lcore_params " + "to add proxy\n"); + return -1; + } + port_id = lcore_params[i].port_id; + if (rte_ifpx_proxy_get(port_id) != RTE_MAX_ETHPORTS) + continue; + + proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT); + if (proxy_id == RTE_MAX_ETHPORTS) { + RTE_LOG(ERR, L3FWD, "Failed to create proxy\n"); + return -1; + } + rte_ifpx_port_bind(port_id, proxy_id); + /* mark proxy as enabled - the corresponding port is, since we + * are after checking of lcore_params + */ + enabled_port_mask |= 1 << proxy_id; + lcore_params[p].port_id = proxy_id; + lcore_params[p].lcore_id = lcore_params[i].lcore_id; + lcore_params[p].queue_id = lcore_params[i].queue_id; + ++p; + } + + nb_lcore_params = p; + return 0; +} + +static uint8_t +get_port_n_rx_queues(const uint16_t port) +{ + int queue = -1; + uint16_t i; + + for (i = 0; i < nb_lcore_params; ++i) { + if (lcore_params[i].port_id == port) { + if (lcore_params[i].queue_id == queue+1) + queue = lcore_params[i].queue_id; + else + rte_exit(EXIT_FAILURE, "queue ids of the port %d must be" + " in sequence and must start with 0\n", + lcore_params[i].port_id); + } + } + return (uint8_t)(++queue); +} + +static int +init_lcore_rx_queues(void) +{ + uint16_t i, p, nb_rx_queue; + uint8_t lcore; + struct lcore_rx_queue *rq; + + for (i = 0; i < nb_lcore_params; ++i) { + lcore = lcore_params[i].lcore_id; + nb_rx_queue = lcore_conf[lcore].n_rx_queue; + if (nb_rx_queue >= MAX_RX_QUEUE_PER_LCORE) { + RTE_LOG(ERR, L3FWD, + "too many queues (%u) for lcore: %u\n", + (unsigned int)nb_rx_queue + 1, + (unsigned int)lcore); + return -1; + } + rq = &lcore_conf[lcore].rx_queue_list[nb_rx_queue]; + rq->port_id = lcore_params[i].port_id; + rq->queue_id = lcore_params[i].queue_id; + if (rte_ifpx_is_proxy(rq->port_id))
{ + if (rte_ifpx_port_get(rq->port_id, &p, 1) > 0) + rq->dst_port = p; + else + RTE_LOG(WARNING, L3FWD, + "Found proxy that has no port bound\n"); + } else + rq->dst_port = RTE_MAX_ETHPORTS; + lcore_conf[lcore].n_rx_queue++; + } + return 0; +} + +/* display usage */ +static void +print_usage(const char *prgname) +{ + fprintf(stderr, "%s [EAL options] --" + " -p PORTMASK" + " [-P]" + " --config (port,queue,lcore)[,(port,queue,lcore)]" + " [--ipv6]" + " [--parse-ptype]" + + " -p PORTMASK: Hexadecimal bitmask of ports to configure\n" + " -P : Enable promiscuous mode\n" + " --config (port,queue,lcore): Rx queue configuration\n" + " --ipv6: Set if running ipv6 packets\n" + " --parse-ptype: Set to use software to analyze packet type\n", + prgname); +} + +static int +parse_portmask(const char *portmask) +{ + char *end = NULL; + unsigned long pm; + + /* parse hexadecimal string */ + pm = strtoul(portmask, &end, 16); + if ((portmask[0] == '\0') || (end == NULL) || (*end != '\0')) + return -1; + + if (pm == 0) + return -1; + + return pm; +} + +static int +parse_config(const char *q_arg) +{ + char s[256]; + const char *p, *p0 = q_arg; + char *end; + enum fieldnames { + FLD_PORT = 0, + FLD_QUEUE, + FLD_LCORE, + _NUM_FLD + }; + unsigned long int_fld[_NUM_FLD]; + char *str_fld[_NUM_FLD]; + int i; + unsigned int size; + + nb_lcore_params = 0; + + while ((p = strchr(p0, '(')) != NULL) { + ++p; + p0 = strchr(p, ')'); + if (p0 == NULL) + return -1; + + size = p0 - p; + if (size >= sizeof(s)) + return -1; + + snprintf(s, sizeof(s), "%.*s", size, p); + if (rte_strsplit(s, sizeof(s), str_fld, _NUM_FLD, ',') != + _NUM_FLD) + return -1; + for (i = 0; i < _NUM_FLD; i++) { + errno = 0; + int_fld[i] = strtoul(str_fld[i], &end, 0); + if (errno != 0 || end == str_fld[i] || int_fld[i] > 255) + return -1; + } + if (nb_lcore_params >= MAX_LCORE_PARAMS) { + RTE_LOG(ERR, L3FWD, "exceeded max number of lcore " + "params: %hu\n", nb_lcore_params); + return -1; + } + 
lcore_params[nb_lcore_params].port_id = + (uint8_t)int_fld[FLD_PORT]; + lcore_params[nb_lcore_params].queue_id = + (uint8_t)int_fld[FLD_QUEUE]; + lcore_params[nb_lcore_params].lcore_id = + (uint8_t)int_fld[FLD_LCORE]; + ++nb_lcore_params; + } + return 0; +} + +#define MAX_JUMBO_PKT_LEN 9600 +#define MEMPOOL_CACHE_SIZE 256 + +static const char short_options[] = + "p:" /* portmask */ + "P" /* promiscuous */ + "L" /* enable long prefix match */ + "E" /* enable exact match */ + ; + +#define CMD_LINE_OPT_CONFIG "config" +#define CMD_LINE_OPT_IPV6 "ipv6" +#define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype" +enum { + /* long options mapped to a short option */ + + /* first long only option value must be >= 256, so that we won't + * conflict with short options + */ + CMD_LINE_OPT_MIN_NUM = 256, + CMD_LINE_OPT_CONFIG_NUM, + CMD_LINE_OPT_PARSE_PTYPE_NUM, +}; + +static const struct option lgopts[] = { + {CMD_LINE_OPT_CONFIG, 1, 0, CMD_LINE_OPT_CONFIG_NUM}, + {CMD_LINE_OPT_PARSE_PTYPE, 0, 0, CMD_LINE_OPT_PARSE_PTYPE_NUM}, + {NULL, 0, 0, 0} +}; + +/* + * This expression is used to calculate the number of mbufs needed + * depending on user input, taking into account memory for rx and + * tx hardware rings, cache per lcore and mtable per port per lcore. + * RTE_MAX is used to ensure that NB_MBUF never goes below a minimum + * value of 8192 + */ +#define NB_MBUF(nports) RTE_MAX( \ + (nports*nb_rx_queue*nb_rxd + \ + nports*nb_lcores*MAX_PKT_BURST + \ + nports*n_tx_queue*nb_txd + \ + nb_lcores*MEMPOOL_CACHE_SIZE), \ + 8192U) + +/* Parse the argument given in the command line of the application */ +static int +parse_args(int argc, char **argv) +{ + int opt, ret; + char **argvopt; + int option_index; + char *prgname = argv[0]; + + argvopt = argv; + + /* Error or normal output strings. 
*/ + while ((opt = getopt_long(argc, argvopt, short_options, + lgopts, &option_index)) != EOF) { + + switch (opt) { + /* portmask */ + case 'p': + enabled_port_mask = parse_portmask(optarg); + if (enabled_port_mask == 0) { + RTE_LOG(ERR, L3FWD, "Invalid portmask\n"); + print_usage(prgname); + return -1; + } + break; + + case 'P': + promiscuous_on = 1; + break; + + /* long options */ + case CMD_LINE_OPT_CONFIG_NUM: + ret = parse_config(optarg); + if (ret) { + RTE_LOG(ERR, L3FWD, "Invalid config\n"); + print_usage(prgname); + return -1; + } + break; + + case CMD_LINE_OPT_PARSE_PTYPE_NUM: + RTE_LOG(INFO, L3FWD, "soft parse-ptype is enabled\n"); + parse_ptype = 1; + break; + + default: + print_usage(prgname); + return -1; + } + } + + if (nb_lcore_params == 0) { + memcpy(lcore_params, lcore_params_default, + sizeof(lcore_params_default)); + nb_lcore_params = RTE_DIM(lcore_params_default); + } + + if (optind >= 0) + argv[optind-1] = prgname; + + ret = optind-1; + optind = 1; /* reset getopt lib */ + return ret; +} + +static void +signal_handler(int signum) +{ + if (signum == SIGINT || signum == SIGTERM) { + RTE_LOG(NOTICE, L3FWD, + "\n\nSignal %d received, preparing to exit...\n", + signum); + force_quit = true; + } +} + +static int +prepare_ptype_parser(uint16_t portid, uint16_t queueid) +{ + if (parse_ptype) { + RTE_LOG(INFO, L3FWD, "Port %d: softly parse packet type info\n", + portid); + if (rte_eth_add_rx_callback(portid, queueid, + lpm_cb_parse_ptype, + NULL)) + return 1; + + RTE_LOG(ERR, L3FWD, "Failed to add rx callback: port=%d\n", + portid); + return 0; + } + + if (lpm_check_ptype(portid)) + return 1; + + RTE_LOG(ERR, L3FWD, + "port %d cannot parse packet type, please add --%s\n", + portid, CMD_LINE_OPT_PARSE_PTYPE); + return 0; +} + +int +main(int argc, char **argv) +{ + struct lcore_conf *lconf; + struct rte_eth_dev_info dev_info; + struct rte_eth_txconf *txconf; + int ret; + unsigned int nb_ports; + uint32_t nb_mbufs; + uint16_t queueid, portid; + unsigned 
int lcore_id; + uint32_t nb_tx_queue, nb_lcores; + uint8_t nb_rx_queue, queue; + + /* init EAL */ + ret = rte_eal_init(argc, argv); + if (ret < 0) + rte_exit(EXIT_FAILURE, "Invalid EAL parameters\n"); + argc -= ret; + argv += ret; + + force_quit = false; + signal(SIGINT, signal_handler); + signal(SIGTERM, signal_handler); + + /* parse application arguments (after the EAL ones) */ + ret = parse_args(argc, argv); + if (ret < 0) + rte_exit(EXIT_FAILURE, "Invalid L3FWD parameters\n"); + + if (check_lcore_params() < 0) + rte_exit(EXIT_FAILURE, "check_lcore_params failed\n"); + + if (add_proxies() < 0) + rte_exit(EXIT_FAILURE, "add_proxies failed\n"); + + ret = init_lcore_rx_queues(); + if (ret < 0) + rte_exit(EXIT_FAILURE, "init_lcore_rx_queues failed\n"); + + nb_ports = rte_eth_dev_count_avail(); + + nb_lcores = rte_lcore_count(); + + /* Initial number of mbufs in pool - the amount required for hardware + * rx/tx rings will be added during configuration of ports. + */ + nb_mbufs = nb_ports * nb_lcores * MAX_PKT_BURST + /* mbuf tables */ + nb_lcores * MEMPOOL_CACHE_SIZE; /* per lcore cache */ + + /* Init the lookup structures. 
*/ + setup_lpm(); + + /* initialize all ports (including proxies) */ + RTE_ETH_FOREACH_DEV(portid) { + struct rte_eth_conf local_port_conf = port_conf; + + /* skip ports that are not enabled */ + if ((enabled_port_mask & (1 << portid)) == 0) { + RTE_LOG(INFO, L3FWD, "Skipping disabled port %d\n", + portid); + continue; + } + + /* init port */ + RTE_LOG(INFO, L3FWD, "Initializing port %d ...\n", portid); + + nb_rx_queue = get_port_n_rx_queues(portid); + nb_tx_queue = nb_lcores; + + ret = rte_eth_dev_info_get(portid, &dev_info); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "Error during getting device (port %u) info: %s\n", + portid, strerror(-ret)); + if (nb_rx_queue > dev_info.max_rx_queues || + nb_tx_queue > dev_info.max_tx_queues) + rte_exit(EXIT_FAILURE, + "Port %d cannot configure enough queues\n", + portid); + + RTE_LOG(INFO, L3FWD, "Creating queues: nb_rxq=%d nb_txq=%u...\n", + nb_rx_queue, nb_tx_queue); + + if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE) + local_port_conf.txmode.offloads |= + DEV_TX_OFFLOAD_MBUF_FAST_FREE; + + local_port_conf.rx_adv_conf.rss_conf.rss_hf &= + dev_info.flow_type_rss_offloads; + if (local_port_conf.rx_adv_conf.rss_conf.rss_hf != + port_conf.rx_adv_conf.rss_conf.rss_hf) { + RTE_LOG(INFO, L3FWD, + "Port %u modified RSS hash function based on hardware support," + "requested:%#"PRIx64" configured:%#"PRIx64"\n", + portid, port_conf.rx_adv_conf.rss_conf.rss_hf, + local_port_conf.rx_adv_conf.rss_conf.rss_hf); + } + + ret = rte_eth_dev_configure(portid, nb_rx_queue, + (uint16_t)nb_tx_queue, + &local_port_conf); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "Cannot configure device: err=%d, port=%d\n", + ret, portid); + + ret = rte_eth_dev_adjust_nb_rx_tx_desc(portid, &nb_rxd, + &nb_txd); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "Cannot adjust number of descriptors: err=%d, " + "port=%d\n", ret, portid); + + nb_mbufs += nb_rx_queue * nb_rxd + nb_tx_queue * nb_txd; + /* init one TX queue per couple (lcore,port) */ + queueid = 0; 
+ for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + + RTE_LOG(INFO, L3FWD, "\ttxq=%u,%d\n", lcore_id, + queueid); + + txconf = &dev_info.default_txconf; + txconf->offloads = local_port_conf.txmode.offloads; + ret = rte_eth_tx_queue_setup(portid, queueid, nb_txd, + SOCKET_ID_ANY, txconf); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_tx_queue_setup: err=%d, " + "port=%d\n", ret, portid); + + lconf = &lcore_conf[lcore_id]; + lconf->tx_queue_id[portid] = queueid; + queueid++; + + lconf->tx_port_id[lconf->n_tx_port] = portid; + lconf->n_tx_port++; + } + RTE_LOG(INFO, L3FWD, "\n"); + } + + /* Init pkt pool. */ + pktmbuf_pool = rte_pktmbuf_pool_create("mbuf_pool", + rte_align32prevpow2(nb_mbufs), MEMPOOL_CACHE_SIZE, + 0, RTE_MBUF_DEFAULT_BUF_SIZE, SOCKET_ID_ANY); + if (pktmbuf_pool == NULL) + rte_exit(EXIT_FAILURE, "Cannot init mbuf pool\n"); + + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + lconf = &lcore_conf[lcore_id]; + RTE_LOG(INFO, L3FWD, "Initializing rx queues on lcore %u ...\n", + lcore_id); + /* init RX queues */ + for (queue = 0; queue < lconf->n_rx_queue; ++queue) { + struct rte_eth_rxconf rxq_conf; + + portid = lconf->rx_queue_list[queue].port_id; + queueid = lconf->rx_queue_list[queue].queue_id; + + RTE_LOG(INFO, L3FWD, "\trxq=%d,%d\n", portid, queueid); + + ret = rte_eth_dev_info_get(portid, &dev_info); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "Error during getting device (port %u) info: %s\n", + portid, strerror(-ret)); + + rxq_conf = dev_info.default_rxconf; + rxq_conf.offloads = port_conf.rxmode.offloads; + ret = rte_eth_rx_queue_setup(portid, queueid, + nb_rxd, SOCKET_ID_ANY, + &rxq_conf, + pktmbuf_pool); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_rx_queue_setup: err=%d, port=%d\n", + ret, portid); + } + } + + RTE_LOG(INFO, L3FWD, "\n"); + + /* start ports */ + RTE_ETH_FOREACH_DEV(portid) { + if 
((enabled_port_mask & (1 << portid)) == 0) + continue; + + /* Start device */ + ret = rte_eth_dev_start(portid); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_dev_start: err=%d, port=%d\n", + ret, portid); + + /* + * If enabled, put device in promiscuous mode. + * This allows IO forwarding mode to forward packets + * to itself through 2 cross-connected ports of the + * target machine. + */ + if (promiscuous_on) { + ret = rte_eth_promiscuous_enable(portid); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "rte_eth_promiscuous_enable: err=%s, port=%u\n", + rte_strerror(-ret), portid); + } + } + /* we've managed to start all enabled ports so active == enabled */ + active_port_mask = enabled_port_mask; + + RTE_LOG(INFO, L3FWD, "\n"); + + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + lconf = &lcore_conf[lcore_id]; + for (queue = 0; queue < lconf->n_rx_queue; ++queue) { + portid = lconf->rx_queue_list[queue].port_id; + queueid = lconf->rx_queue_list[queue].queue_id; + if (prepare_ptype_parser(portid, queueid) == 0) + rte_exit(EXIT_FAILURE, "ptype check fails\n"); + } + } + + if (init_if_proxy() < 0) + rte_exit(EXIT_FAILURE, "Failed to configure proxy lib\n"); + wait_for_config_done(); + + ret = 0; + /* launch per-lcore init on every lcore */ + rte_eal_mp_remote_launch(lpm_main_loop, NULL, CALL_MASTER); + RTE_LCORE_FOREACH_SLAVE(lcore_id) { + if (rte_eal_wait_lcore(lcore_id) < 0) { + ret = -1; + break; + } + } + + /* stop ports */ + RTE_ETH_FOREACH_DEV(portid) { + if ((enabled_port_mask & (1 << portid)) == 0) + continue; + RTE_LOG(INFO, L3FWD, "Closing port %d...", portid); + rte_eth_dev_stop(portid); + rte_eth_dev_close(portid); + rte_log(RTE_LOG_INFO, RTE_LOGTYPE_L3FWD, " Done\n"); + } + + close_if_proxy(); + RTE_LOG(INFO, L3FWD, "Bye...\n"); + + return ret; +} diff --git a/examples/l3fwd-ifpx/meson.build b/examples/l3fwd-ifpx/meson.build new file mode 100644 index 000000000..f0c0920b8 --- /dev/null 
+++ b/examples/l3fwd-ifpx/meson.build @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(c) 2020 Marvell International Ltd. + +# meson file, for building this example as part of a main DPDK build. +# +# To build this example as a standalone application with an already-installed +# DPDK instance, use 'make' + +allow_experimental_apis = true +deps += ['hash', 'lpm', 'if_proxy'] +sources = files('l3fwd.c', 'main.c') diff --git a/examples/meson.build b/examples/meson.build index 1f2b6f516..319d765eb 100644 --- a/examples/meson.build +++ b/examples/meson.build @@ -23,7 +23,7 @@ all_examples = [ 'l2fwd', 'l2fwd-cat', 'l2fwd-event', 'l2fwd-crypto', 'l2fwd-jobstats', 'l2fwd-keepalive', 'l3fwd', - 'l3fwd-acl', 'l3fwd-power', + 'l3fwd-acl', 'l3fwd-ifpx', 'l3fwd-power', 'link_status_interrupt', 'multi_process/client_server_mp/mp_client', 'multi_process/client_server_mp/mp_server', -- 2.17.1
* [dpdk-dev] [PATCH v4 0/4] Introduce IF proxy library
From: Andrzej Ostruszka @ 2020-06-22  9:21 UTC
To: dev

All

Please find in this patch set an updated version of the IF Proxy library. This is just a rebase of version 3 on top of master and, as announced in the Marvell roadmap, it is meant to be merged into 20.08.

As before, note that this version does not change the notification scheme yet, since the discussion about a general DPDK messaging/notification scheme has not started. Once this framework crystallizes we are willing to rebase on top of it, and depending on the outcome of this rebase we might incorporate some performance improvements into the example application or merge the example with the regular l3fwd application.

Changes in V4
=============
- Rebased on master

Changes in V3
=============
- Changed callback registration scheme to make the ABI more robust
- Added new platform callback to provide mask with events available
- All library data access is guarded with a lock
- When port is unbound and proxy has no more ports then it is automatically released

Changes in V2
=============
- Cleaned up checkpatch warnings
- Removed dead/unused code and added gateway clearing in l3fwd-ifpx

What is this useful for
=======================

Usually, when an ethernet port is assigned to DPDK it vanishes from the system and the user loses the ability to control it via normal configuration utilities (e.g. those from the iproute2 package). Moreover, by default a DPDK application is not aware of the network configuration of the system.
To address both of these issues the application needs to:
- add some command line interface (or other mechanism) allowing for control of the port and its configuration
- query the status of the network configuration and monitor its changes

The purpose of this library is to help with both of these tasks (as long as they remain in the domain of configuration available to the system). In other words, if a DPDK application has some special needs that cannot be addressed by the normal system configuration utilities, then they need to be solved by the application itself.

The connection between DPDK and the system is based on the existence of ports that are visible to both DPDK and the system (like Tap, KNI and possibly some other drivers). These ports serve as interface proxies.

Let's visualize the action of the library with the following example:

              Linux             |               DPDK
==============================================================
                                |
                                |   +-------+       +-------+
                                |   | Port1 |       | Port2 |
"ip link set dev tap1 mtu 1600" |   +-------+       +-------+
  |                             |       ^              ^ ^
  |                      +------+       | mtu_change   | |
  `--------------------->| Tap1 |-------' callback     | |
                         +------+                      | |
"ip addr add 198.51.100.14 \    |                      | |
                  dev tap2"     |                      | |
  |                      +------+                      | |
  +--------------------->| Tap2 |----------------------' |
  |                      +------+  addr_add callback     |
"ip route add 198.0.2.0/24 \    |                        |
  |               dev tap2"     |  route_add callback    |
  |                             |                        |
  `------------------------------------------------------'

So we have two ports, Port1 and Port2, that are not visible to the system. We create two proxy interfaces (here based on the Tap driver) and bind the ports to their proxies. When the user issues a command changing the MTU of the Tap1 interface, the library notes this and calls the "mtu_change" callback for Port1. Similarly, when the user adds an IPv4 address to the Tap2 interface, the "addr_add" callback is called for Port2, and the same happens for configuration of a routing rule pointing to Tap2. Apart from callbacks, this library can notify about changes by adding events to notification queues. See below for more information about that and a complete list of available callbacks.
Please note that nothing has been mentioned about forwarding of packets between the system and DPDK. Since the proxies are normal DPDK ports you can receive/send on them via the usual RX/TX burst API. However, since the library is not aware of the structure of packet processing used by the application, it cannot automatically forward the packets - it is the responsibility of the application to include proxy ports in its packet processing engine.

As mentioned above the intention of the library is to:
- provide information about network configuration that would allow the application to decide what to do with the packets received on DPDK ports,
- allow for control of the ports via standard configuration utilities

Although the library only helps you to identify the proxy for a given port (and vice versa) and calls the appropriate callbacks, it does open some interesting possibilities. For example you can use the proxy ports to forward packets for protocols that you do not wish to handle in the DPDK application to the system protocol stack and just listen to the configuration changes - that way you can "offload" handling of those protocols to the system.

How to use it
=============

Usage of this library is rather simple. You have to:
1. Create a proxy (if you don't have a port suitable for being a proxy, or you have one but do not wish to use it as a proxy).
2. Bind port to proxy.
3. Register callbacks and/or event queues.
4. Start listening to the network configuration.

The only mandatory requirement for a DPDK port to be able to act as a proxy is that it is visible in the system - this is checked during port to proxy binding by calling rte_eth_dev_info_get() on the proxy port and inspecting the 'if_index' field (it has to be non-zero). One can create such a port in the application by calling:

    proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT);

Upon success this returns the id of the DPDK proxy port created (RTE_MAX_ETHPORTS on failure). The argument selects the type of proxy port to create (currently Tap/KNI only).
This function is actually just a wrapper around:

    uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg);

creating a valid 'devarg' string for the chosen type of proxy. If you have another driver capable of acting as a proxy you can call rte_ifpx_proxy_create_by_devarg() directly, passing an appropriate argument.

Once you have the ids of both port and proxy you can bind the two via:

    rte_ifpx_port_bind(port_id, proxy_id);

This creates a logical binding - as mentioned above there is no automatic packet forwarding. With this binding, whenever the user changes the state of the proxy interface in the system (link up/down, change mac/mtu, add/remove IPv4/IPv6) you get the appropriate notification for the bound port.

So far we've mentioned several times that the library calls callbacks. They are grouped in 'struct rte_ifpx_callbacks' and the user provides them to the library via:

    rte_ifpx_callbacks_register(&cbs);

It is worth mentioning that the context (lcore/thread) in which these callbacks are called is implementation defined. It might differ between platforms, so the application needs to assume that some kind of inter lcore/thread synchronization/communication is required.

Apart from notification via callbacks this library also supports notifying about changes by adding events to the configured notification queues. The queues are registered via:

    int rte_ifpx_queue_add(struct rte_ring *r);

and the actual logic used is: if there is a callback registered then it is called; if it returns non-zero then the event is considered completed, otherwise the event is added to each configured notification queue. That way the application can update data structures that are safe to be modified by a single writer from within the callback, or do the common preprocessing steps (if any are needed) in the callback, while data that is replicated can be updated during handling of the queued events.
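
For illustration only, the "callback first, queues otherwise" rule described above can be sketched in plain C without any DPDK dependency. Everything named sketch_* is hypothetical and is not part of the library: the event type, the fixed-size array standing in for a struct rte_ring, and the helper callbacks are assumptions made to keep the example self-contained. It also ignores the 1-to-many case, where the library calls the callback once per bound port.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the library's types. */
struct sketch_event { int type; int port_id; };
typedef int (*sketch_cb)(const struct sketch_event *ev);

#define SKETCH_QUEUE_CAP 8
struct sketch_queue {                     /* stands in for a struct rte_ring */
	struct sketch_event items[SKETCH_QUEUE_CAP];
	size_t count;
};

/*
 * Deliver one event: a registered callback runs first and a non-zero
 * return marks the event as completed; otherwise a copy is appended to
 * every configured queue.  Returns the number of queues that got a copy.
 */
static size_t
sketch_dispatch(const struct sketch_event *ev, sketch_cb cb,
		struct sketch_queue *queues, size_t nb_queues)
{
	size_t i, queued = 0;

	assert(ev != NULL);

	if (cb != NULL && cb(ev) != 0)
		return 0;                 /* consumed by the callback */

	for (i = 0; i < nb_queues; i++) {
		if (queues[i].count == SKETCH_QUEUE_CAP)
			continue;         /* full: this sketch just skips it */
		queues[i].items[queues[i].count++] = *ev;
		queued++;
	}
	return queued;
}

/* Example callbacks: one consumes the event, one passes it on. */
static int sketch_consume(const struct sketch_event *ev) { (void)ev; return 1; }
static int sketch_pass(const struct sketch_event *ev) { (void)ev; return 0; }
```

With two queues configured, dispatching through sketch_consume leaves both queues empty, while sketch_pass (or no callback at all) places one copy of the event in each queue.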
Once we have bindings in place and notification configured, the only essential part that remains is to get the current network configuration and start listening to its changes. This is accomplished via a call to:

    rte_ifpx_listen();

And basically this is all one needs to understand how to use this library. Other less essential parts include:
- ability to query what events are available for a given platform
- getting the mapping between proxy and port
- unbinding the ports from a proxy
- destroying a proxy port
- closing the listening service
- getting basic information about a proxy

Currently available features and implementation
===============================================

The library's API is system independent but it obviously needs some system dependent parts. We provide an example Linux implementation (based on netlink sockets). A very similar implementation is possible for FreeBSD (using PF_ROUTE sockets). A Windows implementation would need to differ considerably (the IP Helper library would probably be of some help).
Here is the list of currently implemented callbacks:

    int (*mac_change)(const struct rte_ifpx_mac_change *event);
    int (*mtu_change)(const struct rte_ifpx_mtu_change *event);
    int (*link_change)(const struct rte_ifpx_link_change *event);
    int (*addr_add)(const struct rte_ifpx_addr_change *event);
    int (*addr_del)(const struct rte_ifpx_addr_change *event);
    int (*addr6_add)(const struct rte_ifpx_addr6_change *event);
    int (*addr6_del)(const struct rte_ifpx_addr6_change *event);
    int (*route_add)(const struct rte_ifpx_route_change *event);
    int (*route_del)(const struct rte_ifpx_route_change *event);
    int (*route6_add)(const struct rte_ifpx_route6_change *event);
    int (*route6_del)(const struct rte_ifpx_route6_change *event);
    int (*neigh_add)(const struct rte_ifpx_neigh_change *event);
    int (*neigh_del)(const struct rte_ifpx_neigh_change *event);
    int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event);
    int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event);
    int (*cfg_done)(void);

They are all rather self-descriptive with the exception of the last one. When the user calls rte_ifpx_listen() the library first queries the system for its current configuration. That might require several request/reply exchanges between DPDK and the system, and once it is finished this callback is called to let the application know that all info has been gathered.

It is also worth mentioning that while the typical case would be a 1-to-1 mapping between port and proxy, a 1-to-many mapping (several ports bound to one proxy) is also supported. In that case related callbacks will be called for each port bound to the given proxy interface - it is the application's responsibility to define the semantics of such a mapping (e.g. all changes apply to all ports, or link changes apply to all but others are accepted in "round robin" fashion, or some other logic).

As mentioned above the Linux implementation is based on a netlink socket. This socket is registered as a file descriptor in EAL interrupts (similarly to how EAL alarms are implemented).
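
As a rough illustration of the 1-to-many mapping, the sketch below models a port-to-proxy table in plain C and lists the ports bound to one proxy. It follows the convention visible in the patch (a proxy's own entry maps to itself, unbound entries hold a sentinel), but the names sketch_*, the table size and the helper itself are assumptions made for this example and are not library API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PORTS 8                /* hypothetical; stands in for RTE_MAX_ETHPORTS */
#define SKETCH_UNBOUND   SKETCH_MAX_PORTS /* sentinel for "not bound to any proxy" */

/*
 * table[p] holds the proxy that port p is bound to; the proxy's own entry
 * maps to itself.  Stores up to out_len bound ports in 'out' (may be NULL)
 * and returns how many ports are bound to proxy_id in total.
 */
static unsigned int
sketch_ports_of_proxy(const uint16_t *table, uint16_t proxy_id,
		      uint16_t *out, unsigned int out_len)
{
	unsigned int cnt = 0;
	uint16_t p;

	assert(table != NULL);

	for (p = 0; p < SKETCH_MAX_PORTS; p++) {
		/* skip ports bound elsewhere and the proxy's own entry */
		if (table[p] != proxy_id || p == proxy_id)
			continue;
		if (out != NULL && cnt < out_len)
			out[cnt] = p;
		cnt++;
	}
	return cnt;
}
```

An event seen on the proxy would then be reported once for each port returned here; how the application interprets that (apply to all, round robin, ...) is up to the application, as noted above.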
With regards
Andrzej Ostruszka

Andrzej Ostruszka (4):
  lib: introduce IF Proxy library
  if_proxy: add library documentation
  if_proxy: add simple functionality test
  if_proxy: add example application

 MAINTAINERS                                  |    6 +
 app/test/Makefile                            |    5 +
 app/test/meson.build                         |    4 +
 app/test/test_if_proxy.c                     |  707 +++++++++++
 config/common_base                           |    5 +
 config/common_linux                          |    1 +
 doc/guides/prog_guide/if_proxy_lib.rst       |  142 +++
 doc/guides/prog_guide/index.rst              |    1 +
 examples/Makefile                            |    1 +
 examples/l3fwd-ifpx/Makefile                 |   60 +
 examples/l3fwd-ifpx/l3fwd.c                  | 1131 ++++++++++++++++++
 examples/l3fwd-ifpx/l3fwd.h                  |   98 ++
 examples/l3fwd-ifpx/main.c                   |  740 ++++++++++++
 examples/l3fwd-ifpx/meson.build              |   11 +
 examples/meson.build                         |    4 +-
 lib/Makefile                                 |    3 +
 lib/librte_eal/include/rte_eal_interrupts.h  |    2 +
 lib/librte_eal/linux/eal_interrupts.c        |   14 +-
 lib/librte_if_proxy/Makefile                 |   29 +
 lib/librte_if_proxy/if_proxy_common.c        |  494 ++++++++
 lib/librte_if_proxy/if_proxy_priv.h          |   97 ++
 lib/librte_if_proxy/linux/Makefile           |    4 +
 lib/librte_if_proxy/linux/if_proxy.c         |  550 +++++++++
 lib/librte_if_proxy/meson.build              |   19 +
 lib/librte_if_proxy/rte_if_proxy.h           |  561 +++++++++
 lib/librte_if_proxy/rte_if_proxy_version.map |   19 +
 lib/meson.build                              |    2 +-
 27 files changed, 4703 insertions(+), 7 deletions(-)
 create mode 100644 app/test/test_if_proxy.c
 create mode 100644 doc/guides/prog_guide/if_proxy_lib.rst
 create mode 100644 examples/l3fwd-ifpx/Makefile
 create mode 100644 examples/l3fwd-ifpx/l3fwd.c
 create mode 100644 examples/l3fwd-ifpx/l3fwd.h
 create mode 100644 examples/l3fwd-ifpx/main.c
 create mode 100644 examples/l3fwd-ifpx/meson.build
 create mode 100644 lib/librte_if_proxy/Makefile
 create mode 100644 lib/librte_if_proxy/if_proxy_common.c
 create mode 100644 lib/librte_if_proxy/if_proxy_priv.h
 create mode 100644 lib/librte_if_proxy/linux/Makefile
 create mode 100644 lib/librte_if_proxy/linux/if_proxy.c
 create mode 100644 lib/librte_if_proxy/meson.build
 create mode 100644 lib/librte_if_proxy/rte_if_proxy.h
 create mode 100644 lib/librte_if_proxy/rte_if_proxy_version.map

--
2.17.1
* [dpdk-dev] [PATCH v4 1/4] lib: introduce IF Proxy library
From: Andrzej Ostruszka @ 2020-06-22  9:21 UTC
To: dev, Thomas Monjalon, Ray Kinsella, Neil Horman

This library allows designating ports visible to the system (such as Tun/Tap or KNI) as port representors serving as proxies for other DPDK ports. When such a proxy is configured, this library initially queries the network configuration from the system and later monitors its changes.

The information gathered is passed to the application either via a set of user-registered callbacks or as an event added to the configured notification queue (or a combination of these two mechanisms). This way the user can use normal network utilities (like those from the iproute2 suite) to configure DPDK ports.

Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 3 + config/common_base | 5 + config/common_linux | 1 + lib/Makefile | 3 + lib/librte_eal/include/rte_eal_interrupts.h | 2 + lib/librte_eal/linux/eal_interrupts.c | 14 +- lib/librte_if_proxy/Makefile | 29 + lib/librte_if_proxy/if_proxy_common.c | 494 ++++++++++++++++ lib/librte_if_proxy/if_proxy_priv.h | 97 ++++ lib/librte_if_proxy/linux/Makefile | 4 + lib/librte_if_proxy/linux/if_proxy.c | 550 ++++++++++++++++++ lib/librte_if_proxy/meson.build | 19 + lib/librte_if_proxy/rte_if_proxy.h | 561 +++++++++++++++++++ lib/librte_if_proxy/rte_if_proxy_version.map | 19 + lib/meson.build | 2 +- 15 files changed, 1798 insertions(+), 5 deletions(-) create mode 100644 lib/librte_if_proxy/Makefile create mode 100644 lib/librte_if_proxy/if_proxy_common.c create mode 100644 lib/librte_if_proxy/if_proxy_priv.h create mode 100644 lib/librte_if_proxy/linux/Makefile create mode 100644 lib/librte_if_proxy/linux/if_proxy.c create mode 100644 lib/librte_if_proxy/meson.build create mode 100644 lib/librte_if_proxy/rte_if_proxy.h create mode 100644 lib/librte_if_proxy/rte_if_proxy_version.map diff --git a/MAINTAINERS b/MAINTAINERS index 816696caf2..65c5a18723 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1499,6 +1499,9 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com> M: Pavan Nikhilesh <pbhagavatula@marvell.com> F: lib/librte_node/ +IF Proxy - EXPERIMENTAL +M: Andrzej Ostruszka <aostruszka@marvell.com> +F: lib/librte_if_proxy/ Test Applications ----------------- diff --git a/config/common_base b/config/common_base index c7d5c73215..ddb36f3293 100644 --- a/config/common_base +++ b/config/common_base @@ -1099,6 +1099,11 @@ CONFIG_RTE_LIBRTE_GRAPH_STATS=y # CONFIG_RTE_LIBRTE_NODE=y +# +# Compile librte_if_proxy +# +CONFIG_RTE_LIBRTE_IF_PROXY=n + # # Compile the test application # diff --git a/config/common_linux b/config/common_linux index 816810671a..1244eb0ae9 100644 --- a/config/common_linux +++ 
b/config/common_linux @@ -16,6 +16,7 @@ CONFIG_RTE_LIBRTE_VHOST_NUMA=y CONFIG_RTE_LIBRTE_VHOST_POSTCOPY=n CONFIG_RTE_LIBRTE_PMD_VHOST=y CONFIG_RTE_LIBRTE_IFC_PMD=y +CONFIG_RTE_LIBRTE_IF_PROXY=y CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y CONFIG_RTE_LIBRTE_PMD_MEMIF=y CONFIG_RTE_LIBRTE_PMD_SOFTNIC=y diff --git a/lib/Makefile b/lib/Makefile index e0e5eb4d8d..4018cb5b2b 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -127,6 +127,9 @@ DEPDIRS-librte_graph := librte_eal DIRS-$(CONFIG_RTE_LIBRTE_NODE) += librte_node DEPDIRS-librte_node := librte_graph librte_lpm librte_ethdev librte_mbuf +DIRS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += librte_if_proxy +DEPDIRS-librte_if_proxy := librte_eal librte_ethdev + ifeq ($(CONFIG_RTE_EXEC_ENV_LINUX),y) DIRS-$(CONFIG_RTE_LIBRTE_KNI) += librte_kni endif diff --git a/lib/librte_eal/include/rte_eal_interrupts.h b/lib/librte_eal/include/rte_eal_interrupts.h index 773a34a42b..296a3853d2 100644 --- a/lib/librte_eal/include/rte_eal_interrupts.h +++ b/lib/librte_eal/include/rte_eal_interrupts.h @@ -36,6 +36,8 @@ enum rte_intr_handle_type { RTE_INTR_HANDLE_VDEV, /**< virtual device */ RTE_INTR_HANDLE_DEV_EVENT, /**< device event handle */ RTE_INTR_HANDLE_VFIO_REQ, /**< VFIO request handle */ + RTE_INTR_HANDLE_NETLINK, /**< netlink notification handle */ + RTE_INTR_HANDLE_MAX /**< count of elements */ }; diff --git a/lib/librte_eal/linux/eal_interrupts.c b/lib/librte_eal/linux/eal_interrupts.c index 16e7a7d512..91ddafc59f 100644 --- a/lib/librte_eal/linux/eal_interrupts.c +++ b/lib/librte_eal/linux/eal_interrupts.c @@ -691,6 +691,9 @@ rte_intr_enable(const struct rte_intr_handle *intr_handle) break; /* not used at this moment */ case RTE_INTR_HANDLE_ALARM: +#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: +#endif rc = -1; break; #ifdef VFIO_PRESENT @@ -818,6 +821,9 @@ rte_intr_disable(const struct rte_intr_handle *intr_handle) break; /* not used at this moment */ case RTE_INTR_HANDLE_ALARM: +#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: +#endif 
rc = -1; break; #ifdef VFIO_PRESENT @@ -915,12 +921,12 @@ eal_intr_process_interrupts(struct epoll_event *events, int nfds) break; #endif #endif - case RTE_INTR_HANDLE_VDEV: case RTE_INTR_HANDLE_EXT: - bytes_read = 0; - call = true; - break; + case RTE_INTR_HANDLE_VDEV: case RTE_INTR_HANDLE_DEV_EVENT: +#if RTE_LIBRTE_IF_PROXY + case RTE_INTR_HANDLE_NETLINK: +#endif bytes_read = 0; call = true; break; diff --git a/lib/librte_if_proxy/Makefile b/lib/librte_if_proxy/Makefile new file mode 100644 index 0000000000..43cb702a2e --- /dev/null +++ b/lib/librte_if_proxy/Makefile @@ -0,0 +1,29 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(C) 2020 Marvell International Ltd. + +include $(RTE_SDK)/mk/rte.vars.mk + +# library name +LIB = librte_if_proxy.a + +CFLAGS += -DALLOW_EXPERIMENTAL_API +CFLAGS += -O3 +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) +LDLIBS += -lrte_eal -lrte_ethdev + +EXPORT_MAP := rte_if_proxy_version.map + +LIBABIVER := 1 + +# all source are stored in SRCS-y +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) := if_proxy_common.c + +SYSDIR := $(patsubst "%app",%,$(CONFIG_RTE_EXEC_ENV)) +include $(SRCDIR)/$(SYSDIR)/Makefile + +SRCS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += $(addprefix $(SYSDIR)/,$(SRCS)) + +# install this header file +SYMLINK-$(CONFIG_RTE_LIBRTE_IF_PROXY)-include := rte_if_proxy.h + +include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_if_proxy/if_proxy_common.c b/lib/librte_if_proxy/if_proxy_common.c new file mode 100644 index 0000000000..546dc78105 --- /dev/null +++ b/lib/librte_if_proxy/if_proxy_common.c @@ -0,0 +1,494 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ + +#include <if_proxy_priv.h> +#include <rte_string_fns.h> + + +/* Definitions of data mentioned in if_proxy_priv.h and local ones. 
*/ +int ifpx_log_type; + +uint16_t ifpx_ports[RTE_MAX_ETHPORTS]; + +rte_spinlock_t ifpx_lock = RTE_SPINLOCK_INITIALIZER; + +struct ifpx_proxies_head ifpx_proxies = TAILQ_HEAD_INITIALIZER(ifpx_proxies); + +struct ifpx_queue_node { + TAILQ_ENTRY(ifpx_queue_node) elem; + uint16_t state; + struct rte_ring *r; +}; +static +TAILQ_HEAD(ifpx_queues_head, ifpx_queue_node) ifpx_queues = + TAILQ_HEAD_INITIALIZER(ifpx_queues); + +/* All function pointers have the same size - so use this one to typecast + * different callbacks in rte_ifpx_callbacks and test their presence in a + * generic way. + */ +union cb_ptr_t { + int (*f_ptr)(void *ev); /* type for normal event notification */ + int (*cfg_done)(void); /* lib notification for finished config */ +}; +union { + struct rte_ifpx_callbacks cbs; + union cb_ptr_t funcs[RTE_IFPX_NUM_EVENTS]; +} ifpx_callbacks; + +uint64_t rte_ifpx_events_available(void) +{ + /* All events are supported on Linux. */ + return (1ULL << RTE_IFPX_NUM_EVENTS) - 1; +} + +uint16_t rte_ifpx_proxy_create(enum rte_ifpx_proxy_type type) +{ + char devargs[16] = { '\0' }; + int dev_cnt = 0, nlen; + uint16_t port_id; + + switch (type) { + case RTE_IFPX_DEFAULT: + case RTE_IFPX_TAP: + nlen = strlcpy(devargs, "net_tap", sizeof(devargs)); + break; + case RTE_IFPX_KNI: + nlen = strlcpy(devargs, "net_kni", sizeof(devargs)); + break; + default: + IFPX_LOG(ERR, "Unknown proxy type: %d", type); + return RTE_MAX_ETHPORTS; + } + + RTE_ETH_FOREACH_DEV(port_id) { + if (strcmp(rte_eth_devices[port_id].device->driver->name, + devargs) == 0) + ++dev_cnt; + } + snprintf(devargs+nlen, sizeof(devargs)-nlen, "%d", dev_cnt); + + return rte_ifpx_proxy_create_by_devarg(devargs); +} + +uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg) +{ + uint16_t port_id = RTE_MAX_ETHPORTS; + struct rte_dev_iterator iter; + + if (rte_dev_probe(devarg) < 0) { + IFPX_LOG(ERR, "Failed to create proxy port %s\n", devarg); + return RTE_MAX_ETHPORTS; + } + + if (rte_eth_iterator_init(&iter, 
devarg) == 0) { + port_id = rte_eth_iterator_next(&iter); + if (port_id != RTE_MAX_ETHPORTS) + rte_eth_iterator_cleanup(&iter); + } + + return port_id; +} + +int ifpx_proxy_destroy(struct ifpx_proxy_node *px) +{ + unsigned int i; + uint16_t proxy_id = px->proxy_id; + + TAILQ_REMOVE(&ifpx_proxies, px, elem); + free(px); + + /* Clear any bindings for this proxy. */ + for (i = 0; i < RTE_DIM(ifpx_ports); ++i) { + if (ifpx_ports[i] == proxy_id) { + if (i == proxy_id) /* this entry is for proxy itself */ + ifpx_ports[i] = RTE_MAX_ETHPORTS; + else + rte_ifpx_port_unbind(i); + } + } + + return rte_dev_remove(rte_eth_devices[proxy_id].device); +} + +int rte_ifpx_proxy_destroy(uint16_t proxy_id) +{ + struct ifpx_proxy_node *px; + int ec = 0; + + rte_spinlock_lock(&ifpx_lock); + /* Stop at the matching node; px is NULL when no proxy matches. */ + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == proxy_id) + break; + } + if (!px) { + ec = -EINVAL; + goto exit; + } + if (px->state & IN_USE) + px->state |= DEL_PENDING; + else + ec = ifpx_proxy_destroy(px); +exit: + rte_spinlock_unlock(&ifpx_lock); + return ec; +} + +int rte_ifpx_queue_add(struct rte_ring *r) +{ + struct ifpx_queue_node *node; + int ec = 0; + + if (!r) + return -EINVAL; + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(node, &ifpx_queues, elem) { + if (node->r == r) { + ec = -EEXIST; + goto exit; + } + } + + node = malloc(sizeof(*node)); + if (!node) { + ec = -ENOMEM; + goto exit; + } + + node->r = r; + TAILQ_INSERT_TAIL(&ifpx_queues, node, elem); +exit: + rte_spinlock_unlock(&ifpx_lock); + + return ec; +} + +int rte_ifpx_queue_remove(struct rte_ring *r) +{ + struct ifpx_queue_node *node, *next; + int ec = -EINVAL; + + if (!r) + return ec; + + rte_spinlock_lock(&ifpx_lock); + for (node = TAILQ_FIRST(&ifpx_queues); node; node = next) { + next = TAILQ_NEXT(node, elem); + if (node->r != r) + continue; + TAILQ_REMOVE(&ifpx_queues, node, elem); + free(node); + ec = 0; + break; + } + rte_spinlock_unlock(&ifpx_lock); + + return ec; +} + +int rte_ifpx_port_bind(uint16_t 
port_id, uint16_t proxy_id) +{ + struct rte_eth_dev_info proxy_eth_info; + struct ifpx_proxy_node *px; + int ec; + + if (port_id >= RTE_MAX_ETHPORTS || proxy_id >= RTE_MAX_ETHPORTS || + /* port is a proxy */ + ifpx_ports[port_id] == port_id) { + IFPX_LOG(ERR, "Invalid port_id: %d", port_id); + return -EINVAL; + } + + /* Do automatic rebinding but issue a warning since this is not + * considered to be a valid behaviour. + */ + if (ifpx_ports[port_id] != RTE_MAX_ETHPORTS) { + IFPX_LOG(WARNING, "Port already bound: %d -> %d", port_id, + ifpx_ports[port_id]); + } + + /* Search for existing proxy - if not found add one to the list. */ + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == proxy_id) + break; + } + if (!px) { + ec = rte_eth_dev_info_get(proxy_id, &proxy_eth_info); + if (ec < 0 || proxy_eth_info.if_index == 0) { + IFPX_LOG(ERR, "Invalid proxy: %d", proxy_id); + rte_spinlock_unlock(&ifpx_lock); + return ec < 0 ? ec : -EINVAL; + } + px = malloc(sizeof(*px)); + if (!px) { + rte_spinlock_unlock(&ifpx_lock); + return -ENOMEM; + } + px->proxy_id = proxy_id; + px->info.if_index = proxy_eth_info.if_index; + rte_eth_dev_get_mtu(proxy_id, &px->info.mtu); + rte_eth_macaddr_get(proxy_id, &px->info.mac); + memset(px->info.if_name, 0, sizeof(px->info.if_name)); + TAILQ_INSERT_TAIL(&ifpx_proxies, px, elem); + ifpx_ports[proxy_id] = proxy_id; + } + rte_spinlock_unlock(&ifpx_lock); + ifpx_ports[port_id] = proxy_id; + + /* Add proxy MAC to the port - since port will often just forward + * packets from the proxy/system they will be sent with proxy MAC as + * src. In order to pass communication in other direction we should be + * accepting packets with proxy MAC as dst. 
+ */ + rte_eth_dev_mac_addr_add(port_id, &px->info.mac, 0); + + if (ifpx_platform.get_info) + ifpx_platform.get_info(px->info.if_index); + + return 0; +} + +int rte_ifpx_port_unbind(uint16_t port_id) +{ + if (port_id >= RTE_MAX_ETHPORTS || + ifpx_ports[port_id] == RTE_MAX_ETHPORTS || + /* port is a proxy */ + ifpx_ports[port_id] == port_id) + return -EINVAL; + + ifpx_ports[port_id] = RTE_MAX_ETHPORTS; + /* Proxy without any port bound is OK - that is the state of the proxy + * that has just been created, and it can still report routing + * information. So we do not even check if this is the case. + */ + + return 0; +} + +int rte_ifpx_callbacks_register(const struct rte_ifpx_callbacks *cbs) +{ + if (!cbs) + return -EINVAL; + + rte_spinlock_lock(&ifpx_lock); + ifpx_callbacks.cbs = *cbs; + rte_spinlock_unlock(&ifpx_lock); + + return 0; +} + +void rte_ifpx_callbacks_unregister(void) +{ + rte_spinlock_lock(&ifpx_lock); + memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs)); + rte_spinlock_unlock(&ifpx_lock); +} + +uint16_t rte_ifpx_proxy_get(uint16_t port_id) +{ + if (port_id >= RTE_MAX_ETHPORTS) + return RTE_MAX_ETHPORTS; + + return ifpx_ports[port_id]; +} + +unsigned int rte_ifpx_port_get(uint16_t proxy_id, + uint16_t *ports, unsigned int num) +{ + unsigned int p, cnt = 0; + + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] == proxy_id && ifpx_ports[p] != p) { + ++cnt; + if (ports && num > 0) { + *ports++ = p; + --num; + } + } + } + return cnt; +} + +const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id) +{ + struct ifpx_proxy_node *px; + + if (port_id >= RTE_MAX_ETHPORTS || + ifpx_ports[port_id] == RTE_MAX_ETHPORTS) + return NULL; + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->proxy_id == ifpx_ports[port_id]) + break; + } + rte_spinlock_unlock(&ifpx_lock); + RTE_ASSERT(px && "Internal IF Proxy library error"); + + return &px->info; +} + +static +void queue_event(const struct rte_ifpx_event *ev, 
struct rte_ring *r) +{ + struct rte_ifpx_event *e = malloc(sizeof(*ev)); + + if (!e) { + IFPX_LOG(ERR, "Failed to allocate event!"); + return; + } + RTE_ASSERT(r); + + *e = *ev; + rte_ring_sp_enqueue(r, e); +} + +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px) +{ + struct ifpx_queue_node *q; + int done = 0; + uint16_t p, proxy_id; + + if (px) { + if (px->state & DEL_PENDING) + return; + proxy_id = px->proxy_id; + RTE_ASSERT(proxy_id != RTE_MAX_ETHPORTS); + px->state |= IN_USE; + } else + proxy_id = RTE_MAX_ETHPORTS; + + RTE_ASSERT(ev); + /* This function is expected to be called with a lock held. */ + RTE_ASSERT(rte_spinlock_trylock(&ifpx_lock) == 0); + + if (ifpx_callbacks.funcs[ev->type].f_ptr) { + union cb_ptr_t cb = ifpx_callbacks.funcs[ev->type]; + + /* Drop the lock for the time of callback call. */ + rte_spinlock_unlock(&ifpx_lock); + if (px) { + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] != proxy_id || + ifpx_ports[p] == p) + continue; + ev->data.port_id = p; + done = cb.f_ptr(&ev->data) || done; + } + } else { + RTE_ASSERT(ev->type == RTE_IFPX_CFG_DONE); + done = cb.cfg_done(); + } + rte_spinlock_lock(&ifpx_lock); + } + if (done) + goto exit; + + /* Event not "consumed" yet so try to notify via queues. */ + TAILQ_FOREACH(q, &ifpx_queues, elem) { + if (px) { + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] != proxy_id || + ifpx_ports[p] == p) + continue; + /* Set the port_id - the remaining params should + * be filled before calling this function. 
+ */ + ev->data.port_id = p; + queue_event(ev, q->r); + } + } else + queue_event(ev, q->r); + } +exit: + if (px) + px->state &= ~IN_USE; +} + +void ifpx_cleanup_proxies(void) +{ + struct ifpx_proxy_node *px, *next; + for (px = TAILQ_FIRST(&ifpx_proxies); px; px = next) { + next = TAILQ_NEXT(px, elem); + if (px->state & DEL_PENDING) + ifpx_proxy_destroy(px); + } +} + +int rte_ifpx_listen(void) +{ + int ec; + + if (!ifpx_platform.listen) + return -ENOTSUP; + + ec = ifpx_platform.listen(); + if (ec == 0 && ifpx_platform.get_info) + ifpx_platform.get_info(0); + + return ec; +} + +int rte_ifpx_close(void) +{ + struct ifpx_proxy_node *px; + struct ifpx_queue_node *q; + unsigned int p; + int ec = 0; + + if (ifpx_platform.close) { + ec = ifpx_platform.close(); + if (ec != 0) + IFPX_LOG(ERR, "Platform 'close' callback failed."); + } + + rte_spinlock_lock(&ifpx_lock); + /* Remove queues. */ + while (!TAILQ_EMPTY(&ifpx_queues)) { + q = TAILQ_FIRST(&ifpx_queues); + TAILQ_REMOVE(&ifpx_queues, q, elem); + free(q); + } + + /* Clear callbacks. */ + memset(&ifpx_callbacks.cbs, 0, sizeof(ifpx_callbacks.cbs)); + + /* Unbind ports. */ + for (p = 0; p < RTE_DIM(ifpx_ports); ++p) { + if (ifpx_ports[p] == RTE_MAX_ETHPORTS) + continue; + if (ifpx_ports[p] == p) + /* port is a proxy - just clear entry */ + ifpx_ports[p] = RTE_MAX_ETHPORTS; + else + rte_ifpx_port_unbind(p); + } + + /* Clear proxies. 
*/ + while (!TAILQ_EMPTY(&ifpx_proxies)) { + px = TAILQ_FIRST(&ifpx_proxies); + TAILQ_REMOVE(&ifpx_proxies, px, elem); + free(px); + } + + rte_spinlock_unlock(&ifpx_lock); + + return ec; +} + +RTE_INIT(if_proxy_init) +{ + unsigned int i; + for (i = 0; i < RTE_DIM(ifpx_ports); ++i) + ifpx_ports[i] = RTE_MAX_ETHPORTS; + + ifpx_log_type = rte_log_register("lib.if_proxy"); + if (ifpx_log_type >= 0) + rte_log_set_level(ifpx_log_type, RTE_LOG_WARNING); + + if (ifpx_platform.init) + ifpx_platform.init(); +} diff --git a/lib/librte_if_proxy/if_proxy_priv.h b/lib/librte_if_proxy/if_proxy_priv.h new file mode 100644 index 0000000000..dd7468891c --- /dev/null +++ b/lib/librte_if_proxy/if_proxy_priv.h @@ -0,0 +1,97 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ +#ifndef _IF_PROXY_PRIV_H_ +#define _IF_PROXY_PRIV_H_ + +#include <rte_if_proxy.h> +#include <rte_spinlock.h> + +extern int ifpx_log_type; +#define IFPX_LOG(level, fmt, args...) \ + rte_log(RTE_LOG_ ## level, ifpx_log_type, "%s(): " fmt "\n", \ + __func__, ##args) + +/* Table keeping mapping between ports and their proxies. */ +extern +uint16_t ifpx_ports[RTE_MAX_ETHPORTS]; + +/* Callbacks and proxies are kept in linked lists. Since this library is really + * a slow/config path we guard them with a lock - and only one for all of them + * should be enough. We don't expect a need to protect other data structures - + * e.g. data for given port is expected to be accessed/modified from a single + * thread. + */ +extern rte_spinlock_t ifpx_lock; + +enum ifpx_node_status { + IN_USE = 1U << 0, + DEL_PENDING = 1U << 1, +}; + +/* List of configured proxies */ +struct ifpx_proxy_node { + TAILQ_ENTRY(ifpx_proxy_node) elem; + uint16_t proxy_id; + uint16_t state; + struct rte_ifpx_info info; +}; +extern +TAILQ_HEAD(ifpx_proxies_head, ifpx_proxy_node) ifpx_proxies; + +/* This function should be called by the implementation whenever it notices + * change in the network configuration.
The arguments are: + * - ev : pointer to filled event data structure (all fields are expected to be + * filled, with the exception of 'port_id' for all proxy/port related + * events: this function clones the event notification for each bound port + * and fills 'port_id' appropriately). + * - px : proxy node when given event is proxy/port related, otherwise pass NULL + */ +void ifpx_notify_event(struct rte_ifpx_event *ev, struct ifpx_proxy_node *px); + +/* This function should be called by the implementation whenever it is done with + * notification about network configuration change. It is only really needed + * for the case of callback based API - from the callback user might attempt + * to remove callbacks/proxies. Removing of callbacks is handled by the + * ifpx_notify_event() function above, however only implementation really knows + * when notification for given proxy is finished so it is its duty to call + * this function to clean up all proxies that have been marked for deletion. + */ +void ifpx_cleanup_proxies(void); + +/* This is the internal function removing the proxy from the list. It is + * related to the notification function above and intended to be used by the + * platform implementation for the case of callback based API. + * During notification via callback the internal lock is released so that + * operation would not deadlock on an attempt to take a lock. However + * modification (destruction) is not really performed - instead the + * callbacks/proxies are marked as "to be deleted". + * Handling of callbacks that are "to be deleted" is done by the + * ifpx_notify_event() function itself however it cannot delete the proxies (in + * particular the proxy passed as an argument) since they might still be + * referenced by the calling function. So it is a responsibility of the platform + * implementation to check after calling notification function if there are any + * proxies to be removed and use ifpx_proxy_destroy() to actually release them.
+ */ +int ifpx_proxy_destroy(struct ifpx_proxy_node *px); + +/* Every implementation should provide definition of this structure: + * - init : called during library initialization (NULL when not needed) + * - listen : this function should start service listening to the network + * configuration events/changes, + * - close : this function should close the service started by listen() + * - get_info : this function should query system for current configuration of + * interface with index 'if_index'. After successful initialization of + * listening service this function is called with 0 as an argument. In that + * case configuration of all ports should be obtained - and when this + * procedure completes a RTE_IFPX_CFG_DONE event should be signaled via + * ifpx_notify_event(). + */ +extern +struct ifpx_platform_callbacks { + void (*init)(void); + int (*listen)(void); + int (*close)(void); + void (*get_info)(int if_index); +} ifpx_platform; + +#endif /* _IF_PROXY_PRIV_H_ */ diff --git a/lib/librte_if_proxy/linux/Makefile b/lib/librte_if_proxy/linux/Makefile new file mode 100644 index 0000000000..275b7e1e33 --- /dev/null +++ b/lib/librte_if_proxy/linux/Makefile @@ -0,0 +1,4 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(C) 2020 Marvell International Ltd. + +SRCS += if_proxy.c diff --git a/lib/librte_if_proxy/linux/if_proxy.c b/lib/librte_if_proxy/linux/if_proxy.c new file mode 100644 index 0000000000..0204505e31 --- /dev/null +++ b/lib/librte_if_proxy/linux/if_proxy.c @@ -0,0 +1,550 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. 
+ */ +#include <if_proxy_priv.h> +#include <rte_interrupts.h> +#include <rte_string_fns.h> + +#include <stdbool.h> +#include <unistd.h> +#include <errno.h> +#include <sys/socket.h> +#include <linux/rtnetlink.h> +#include <linux/if.h> + +static +struct rte_intr_handle ifpx_irq = { + .type = RTE_INTR_HANDLE_NETLINK, + .fd = -1, +}; + +static +unsigned int ifpx_pid; + +static +int request_info(int type, int index) +{ + static rte_spinlock_t send_lock = RTE_SPINLOCK_INITIALIZER; + struct info_get { + struct nlmsghdr h; + union { + struct ifinfomsg ifm; + struct ifaddrmsg ifa; + struct rtmsg rtm; + struct ndmsg ndm; + } __rte_aligned(NLMSG_ALIGNTO); + } info_req; + int ret; + + memset(&info_req, 0, sizeof(info_req)); + /* First byte of these messages is family, so just make sure that this + * memset is enough to get all families. + */ + RTE_ASSERT(AF_UNSPEC == 0); + + info_req.h.nlmsg_pid = ifpx_pid; + info_req.h.nlmsg_type = type; + info_req.h.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP; + info_req.h.nlmsg_len = offsetof(struct info_get, ifm); + + switch (type) { + case RTM_GETLINK: + info_req.h.nlmsg_len += sizeof(info_req.ifm); + info_req.ifm.ifi_index = index; + break; + case RTM_GETADDR: + info_req.h.nlmsg_len += sizeof(info_req.ifa); + info_req.ifa.ifa_index = index; + break; + case RTM_GETROUTE: + info_req.h.nlmsg_len += sizeof(info_req.rtm); + break; + case RTM_GETNEIGH: + info_req.h.nlmsg_len += sizeof(info_req.ndm); + break; + default: + IFPX_LOG(WARNING, "Unhandled message type: %d", type); + return -EINVAL; + } + /* Store request type (and if it is global or link specific) in 'seq'. + * Later it is used during handling of reply to continue requesting of + * information dump from system - if needed. 
+ */ + info_req.h.nlmsg_seq = index << 8 | type; + + IFPX_LOG(DEBUG, "\tRequesting msg %d for: %u", type, index); + + rte_spinlock_lock(&send_lock); + ret = send(ifpx_irq.fd, &info_req, info_req.h.nlmsg_len, 0); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to send netlink msg: %d", errno); + rte_errno = errno; + } + rte_spinlock_unlock(&send_lock); + + return ret; +} + +static +void handle_link(const struct nlmsghdr *h) +{ + const struct ifinfomsg *ifi = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi)); + const struct rtattr *attrs[IFLA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct ifpx_proxy_node *px; + struct rte_ifpx_event ev; + + IFPX_LOG(DEBUG, "\tLink action (%u): %u, 0x%x/0x%x (flags/changed)", + ifi->ifi_index, h->nlmsg_type, ifi->ifi_flags, + ifi->ifi_change); + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == (unsigned int)ifi->ifi_index) + break; + } + + /* Drop messages that are not associated with any proxy */ + if (!px) + goto exit; + /* When message is a reply to request for specific interface then keep + * it only when it contains info for this interface. 
+ */ + if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 && + (h->nlmsg_seq >> 8) != (unsigned int)ifi->ifi_index) + goto exit; + + for (attr = IFLA_RTA(ifi); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > IFLA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + if (ifi->ifi_change & IFF_UP) { + ev.type = RTE_IFPX_LINK_CHANGE; + ev.link_change.is_up = ifi->ifi_flags & IFF_UP; + ifpx_notify_event(&ev, px); + } + if (attrs[IFLA_MTU]) { + uint16_t mtu = *(const int *)RTA_DATA(attrs[IFLA_MTU]); + if (mtu != px->info.mtu) { + px->info.mtu = mtu; + ev.type = RTE_IFPX_MTU_CHANGE; + ev.mtu_change.mtu = mtu; + ifpx_notify_event(&ev, px); + } + } + if (attrs[IFLA_ADDRESS]) { + const struct rte_ether_addr *mac = + RTA_DATA(attrs[IFLA_ADDRESS]); + + RTE_ASSERT(RTA_PAYLOAD(attrs[IFLA_ADDRESS]) == + RTE_ETHER_ADDR_LEN); + if (memcmp(mac, &px->info.mac, RTE_ETHER_ADDR_LEN) != 0) { + rte_ether_addr_copy(mac, &px->info.mac); + ev.type = RTE_IFPX_MAC_CHANGE; + rte_ether_addr_copy(mac, &ev.mac_change.mac); + ifpx_notify_event(&ev, px); + } + } + if (h->nlmsg_pid == ifpx_pid) { + RTE_ASSERT((h->nlmsg_seq & 0xFF) == RTM_GETLINK); + /* If this is reply for specific link request (not initial + * global dump) then follow up with address request, otherwise + * just store the interface name. 
+ */ + if (h->nlmsg_seq >> 8) + request_info(RTM_GETADDR, ifi->ifi_index); + else if (!px->info.if_name[0] && attrs[IFLA_IFNAME]) + strlcpy(px->info.if_name, RTA_DATA(attrs[IFLA_IFNAME]), + sizeof(px->info.if_name)); + } + + ifpx_cleanup_proxies(); +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +static +void handle_addr(const struct nlmsghdr *h, bool needs_del) +{ + const struct ifaddrmsg *ifa = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*ifa)); + const struct rtattr *attrs[IFA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct ifpx_proxy_node *px; + struct rte_ifpx_event ev; + const uint8_t *ip; + + IFPX_LOG(DEBUG, "\tAddr action (%u): %u, family: %u", + ifa->ifa_index, h->nlmsg_type, ifa->ifa_family); + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == ifa->ifa_index) + break; + } + + /* Drop messages that are not associated with any proxy */ + if (!px) + goto exit; + /* When message is a reply to request for specific interface then keep + * it only when it contains info for this interface. + */ + if (h->nlmsg_pid == ifpx_pid && h->nlmsg_seq >> 8 && + (h->nlmsg_seq >> 8) != ifa->ifa_index) + goto exit; + + for (attr = IFA_RTA(ifa); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > IFA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + if (attrs[IFA_ADDRESS]) { + ip = RTA_DATA(attrs[IFA_ADDRESS]); + if (ifa->ifa_family == AF_INET) { + ev.type = needs_del ? RTE_IFPX_ADDR_DEL + : RTE_IFPX_ADDR_ADD; + ev.addr_change.ip = + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + } else { + ev.type = needs_del ? 
RTE_IFPX_ADDR6_DEL + : RTE_IFPX_ADDR6_ADD; + memcpy(ev.addr6_change.ip, ip, 16); + } + ifpx_notify_event(&ev, px); + ifpx_cleanup_proxies(); + } +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +static +void handle_route(const struct nlmsghdr *h, bool needs_del) +{ + const struct rtmsg *r = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*r)); + const struct rtattr *attrs[RTA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct rte_ifpx_event ev; + struct ifpx_proxy_node *px = NULL; + const uint8_t *ip; + + IFPX_LOG(DEBUG, "\tRoute action: %u, family: %u", + h->nlmsg_type, r->rtm_family); + + for (attr = RTM_RTA(r); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > RTA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + memset(&ev, 0, sizeof(ev)); + ev.type = RTE_IFPX_NUM_EVENTS; + + rte_spinlock_lock(&ifpx_lock); + if (attrs[RTA_OIF]) { + int if_index = *((int32_t *)RTA_DATA(attrs[RTA_OIF])); + + if (if_index > 0) { + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == (uint32_t)if_index) + break; + } + } + } + /* We are only interested in routes related to the proxy interfaces and + * we need to have dst - otherwise skip the message. + */ + if (!px || !attrs[RTA_DST]) + goto exit; + + ip = RTA_DATA(attrs[RTA_DST]); + /* This is common to both IPv4/6. */ + ev.route_change.depth = r->rtm_dst_len; + if (r->rtm_family == AF_INET) { + ev.type = needs_del ? RTE_IFPX_ROUTE_DEL + : RTE_IFPX_ROUTE_ADD; + ev.route_change.ip = RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + } else { + ev.type = needs_del ? 
RTE_IFPX_ROUTE6_DEL + : RTE_IFPX_ROUTE6_ADD; + memcpy(ev.route6_change.ip, ip, 16); + } + if (attrs[RTA_GATEWAY]) { + ip = RTA_DATA(attrs[RTA_GATEWAY]); + if (r->rtm_family == AF_INET) + ev.route_change.gateway = + RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + else + memcpy(ev.route6_change.gateway, ip, 16); + } + + ifpx_notify_event(&ev, px); + /* Let's check for proxies to remove here too - just in case somebody + * removed the non-proxy related callback. + */ + ifpx_cleanup_proxies(); +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +/* Link, addr and route related messages seem to have this macro defined but not + * neighbour one. Define one if it is missing - const qualifiers added just to + * silence compiler - for some reason it is not needed in equivalent macros for + * other messages and here compiler is complaining about (char*) cast on pointer + * to const. + */ +#ifndef NDA_RTA +#define NDA_RTA(r) ((const struct rtattr *)(((const char *)(r)) + \ + NLMSG_ALIGN(sizeof(struct ndmsg)))) +#endif + +static +void handle_neigh(const struct nlmsghdr *h, bool needs_del) +{ + const struct ndmsg *n = NLMSG_DATA(h); + int alen = h->nlmsg_len - NLMSG_LENGTH(sizeof(*n)); + const struct rtattr *attrs[NDA_MAX+1] = { NULL }; + const struct rtattr *attr; + struct ifpx_proxy_node *px; + struct rte_ifpx_event ev; + const uint8_t *ip; + + IFPX_LOG(DEBUG, "\tNeighbour action: %u, family: %u, state: %u, if: %d", + h->nlmsg_type, n->ndm_family, n->ndm_state, n->ndm_ifindex); + + for (attr = NDA_RTA(n); RTA_OK(attr, alen); + attr = RTA_NEXT(attr, alen)) { + if (attr->rta_type > NDA_MAX) + continue; + attrs[attr->rta_type] = attr; + } + + memset(&ev, 0, sizeof(ev)); + ev.type = RTE_IFPX_NUM_EVENTS; + + rte_spinlock_lock(&ifpx_lock); + TAILQ_FOREACH(px, &ifpx_proxies, elem) { + if (px->info.if_index == (unsigned int)n->ndm_ifindex) + break; + } + /* We need only subset of neighbourhood related to proxy interfaces. 
+ * lladdr seems to be needed only for adding new entry - modifications + * (also reported via RTM_NEWNEIGH) and deletion include only dst. + */ + if (!px || !attrs[NDA_DST] || (!needs_del && !attrs[NDA_LLADDR])) + goto exit; + + ip = RTA_DATA(attrs[NDA_DST]); + if (n->ndm_family == AF_INET) { + ev.type = needs_del ? RTE_IFPX_NEIGH_DEL + : RTE_IFPX_NEIGH_ADD; + ev.neigh_change.ip = RTE_IPV4(ip[0], ip[1], ip[2], ip[3]); + } else { + ev.type = needs_del ? RTE_IFPX_NEIGH6_DEL + : RTE_IFPX_NEIGH6_ADD; + memcpy(ev.neigh6_change.ip, ip, 16); + } + if (attrs[NDA_LLADDR]) + rte_ether_addr_copy(RTA_DATA(attrs[NDA_LLADDR]), + &ev.neigh_change.mac); + + ifpx_notify_event(&ev, px); + /* Let's check for proxies to remove here too - just in case somebody + * removed the non-proxy related callback. + */ + ifpx_cleanup_proxies(); +exit: + rte_spinlock_unlock(&ifpx_lock); +} + +static +void if_proxy_intr_callback(void *arg __rte_unused) +{ + struct nlmsghdr *h; + struct sockaddr_nl addr; + socklen_t addr_len; + char buf[8192]; + ssize_t len; + +restart: + /* recvfrom() takes addr_len as an in/out parameter, so it has to hold + * the size of the address buffer before each call. + */ + addr_len = sizeof(addr); + len = recvfrom(ifpx_irq.fd, buf, sizeof(buf), 0, + (struct sockaddr *)&addr, &addr_len); + if (len < 0) { + if (errno == EINTR) { + IFPX_LOG(DEBUG, "recvfrom() interrupted"); + goto restart; + } + IFPX_LOG(ERR, "Failed to read netlink msg: %ld (errno %d)", + len, errno); + return; + } + if (addr_len != sizeof(addr)) { + IFPX_LOG(ERR, "Invalid netlink addr size: %d", addr_len); + return; + } + IFPX_LOG(DEBUG, "Read %lu bytes (buf %lu) from %u/%u", len, + sizeof(buf), addr.nl_pid, addr.nl_groups); + + for (h = (struct nlmsghdr *)buf; NLMSG_OK(h, len); + h = NLMSG_NEXT(h, len)) { + IFPX_LOG(DEBUG, "Recv msg: %u (%u/%u/%u seq/flags/pid)", + h->nlmsg_type, h->nlmsg_seq, h->nlmsg_flags, + h->nlmsg_pid); + + switch (h->nlmsg_type) { + case RTM_NEWLINK: + case RTM_DELLINK: + handle_link(h); + break; + case RTM_NEWADDR: + case RTM_DELADDR: + handle_addr(h, h->nlmsg_type == RTM_DELADDR); + break; + case RTM_NEWROUTE: + case RTM_DELROUTE: +
handle_route(h, h->nlmsg_type == RTM_DELROUTE); + break; + case RTM_NEWNEIGH: + case RTM_DELNEIGH: + handle_neigh(h, h->nlmsg_type == RTM_DELNEIGH); + break; + } + + /* If this is a reply for global request then follow up with + * additional requests and notify about finish. + */ + if (h->nlmsg_pid == ifpx_pid && (h->nlmsg_seq >> 8) == 0 && + h->nlmsg_type == NLMSG_DONE) { + if ((h->nlmsg_seq & 0xFF) == RTM_GETLINK) + request_info(RTM_GETADDR, 0); + else if ((h->nlmsg_seq & 0xFF) == RTM_GETADDR) + request_info(RTM_GETROUTE, 0); + else if ((h->nlmsg_seq & 0xFF) == RTM_GETROUTE) + request_info(RTM_GETNEIGH, 0); + else { + struct rte_ifpx_event ev = { + .type = RTE_IFPX_CFG_DONE + }; + + RTE_ASSERT((h->nlmsg_seq & 0xFF) == + RTM_GETNEIGH); + rte_spinlock_lock(&ifpx_lock); + ifpx_notify_event(&ev, NULL); + rte_spinlock_unlock(&ifpx_lock); + } + } + } + IFPX_LOG(DEBUG, "Finished msg loop: %ld bytes left", len); +} + +static +int nlink_listen(void) +{ + struct sockaddr_nl addr = { + .nl_family = AF_NETLINK, + .nl_pid = 0, + }; + socklen_t addr_len = sizeof(addr); + int ret; + + if (ifpx_irq.fd != -1) { + rte_errno = EBUSY; + return -1; + } + + addr.nl_groups = 1 << (RTNLGRP_LINK-1) + | 1 << (RTNLGRP_NEIGH-1) + | 1 << (RTNLGRP_IPV4_IFADDR-1) + | 1 << (RTNLGRP_IPV6_IFADDR-1) + | 1 << (RTNLGRP_IPV4_ROUTE-1) + | 1 << (RTNLGRP_IPV6_ROUTE-1); + + ifpx_irq.fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, + NETLINK_ROUTE); + if (ifpx_irq.fd == -1) { + IFPX_LOG(ERR, "Failed to create netlink socket: %d", errno); + goto error; + } + /* Starting with kernel 4.19 you can request dump for a specific + * interface and kernel will filter out and send only relevant info. + * Otherwise NLM_F_DUMP will generate info for all interfaces and you + * need to filter them yourself. 
+ */ +#ifdef NETLINK_DUMP_STRICT_CHK + ret = 1; /* use this var also as an input param */ + ret = setsockopt(ifpx_irq.fd, SOL_NETLINK, NETLINK_DUMP_STRICT_CHK, + &ret, sizeof(ret)); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to set socket option: %d", errno); + goto error; + } +#endif + + ret = bind(ifpx_irq.fd, (struct sockaddr *)&addr, addr_len); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to bind socket: %d", errno); + goto error; + } + ret = getsockname(ifpx_irq.fd, (struct sockaddr *)&addr, &addr_len); + if (ret < 0) { + IFPX_LOG(ERR, "Failed to get socket addr: %d", errno); + goto error; + } else { + ifpx_pid = addr.nl_pid; + IFPX_LOG(DEBUG, "Assigned port ID: %u", addr.nl_pid); + } + + ret = rte_intr_callback_register(&ifpx_irq, if_proxy_intr_callback, + NULL); + if (ret == 0) + return 0; + +error: + rte_errno = errno; + if (ifpx_irq.fd != -1) { + close(ifpx_irq.fd); + ifpx_irq.fd = -1; + } + return -1; +} + +static +int nlink_close(void) +{ + int ec; + + if (ifpx_irq.fd < 0) + return -EBADFD; + + do + ec = rte_intr_callback_unregister(&ifpx_irq, + if_proxy_intr_callback, NULL); + while (ec == -EAGAIN); /* unlikely but possible - at least I think so */ + + close(ifpx_irq.fd); + ifpx_irq.fd = -1; + ifpx_pid = 0; + + return 0; +} + +static +void nlink_get_info(int if_index) +{ + if (ifpx_irq.fd != -1) + request_info(RTM_GETLINK, if_index); +} + +struct ifpx_platform_callbacks ifpx_platform = { + .init = NULL, + .listen = nlink_listen, + .close = nlink_close, + .get_info = nlink_get_info, +}; diff --git a/lib/librte_if_proxy/meson.build b/lib/librte_if_proxy/meson.build new file mode 100644 index 0000000000..f0c1a6e15e --- /dev/null +++ b/lib/librte_if_proxy/meson.build @@ -0,0 +1,19 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(C) 2020 Marvell International Ltd.
+ +# Currently only implemented on Linux +if not is_linux + build = false + reason = 'only supported on linux' +endif + +version = 1 +allow_experimental_apis = true + +deps += ['ethdev'] +sources = files('if_proxy_common.c') +headers = files('rte_if_proxy.h') + +if is_linux + sources += files('linux/if_proxy.c') +endif diff --git a/lib/librte_if_proxy/rte_if_proxy.h b/lib/librte_if_proxy/rte_if_proxy.h new file mode 100644 index 0000000000..70f7017198 --- /dev/null +++ b/lib/librte_if_proxy/rte_if_proxy.h @@ -0,0 +1,561 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ + +#ifndef _RTE_IF_PROXY_H_ +#define _RTE_IF_PROXY_H_ + +/** + * @file + * RTE IF Proxy library + * + * The IF Proxy library allows for monitoring of system network configuration + * and configuration of DPDK ports by using usual system utilities (like the + * ones from iproute2 package). + * + * It is based on the notion of "proxy interface" which actually can be any DPDK + * port which is also visible to the system - that is it has non-zero 'if_index' + * field in 'rte_eth_dev_info' structure. + * + * If application doesn't have any such port (or doesn't want to use it for + * proxy) it can create one by calling: + * + * proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT); + * + * This function is just a wrapper that constructs valid 'devargs' string based + * on the proxy type chosen (currently Tap or KNI) and creates the interface by + * calling rte_ifpx_proxy_create_by_devarg(). + * + * Once one has DPDK port capable of being proxy one can bind target DPDK port + * to it by calling: + * + * rte_ifpx_port_bind(port_id, proxy_id); + * + * This binding is a logical one - there is no automatic packet forwarding + * between port and its proxy since the library doesn't know the structure of + * application's packet processing. It remains application responsibility to + * forward the packets from/to proxy port (by calling the usual DPDK RX/TX burst + * API).
However when the library notes some change to the proxy interface it + * will simply call appropriate callback with 'port_id' of the DPDK port that is + * bound to this proxy interface. The binding can be 1 to many - that is many + * ports can point to one proxy - in that case registered callbacks will be + * called for every bound port. + * + * The callbacks that are used for notifications are described by the + * 'rte_ifpx_callbacks' structure and they are registered by calling: + * + * rte_ifpx_callbacks_register(&cbs); + * + * Finally the application should call: + * + * rte_ifpx_listen(); + * + * which will query system for present network configuration and start listening + * to its changes. + */ + +#include <rte_eal.h> +#include <rte_ethdev.h> + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * Enum naming the type of proxy to create. + * + * @see rte_ifpx_proxy_create() + */ +enum rte_ifpx_proxy_type { + RTE_IFPX_DEFAULT, /**< Use default proxy type for given arch. */ + RTE_IFPX_TAP, /**< Use Tap based port for proxy. */ + RTE_IFPX_KNI /**< Use KNI based port for proxy. */ +}; + +/** + * Create DPDK port that can serve as an interface proxy. + * + * This function is just a wrapper around rte_ifpx_proxy_create_by_devarg() that + * constructs its 'devarg' argument based on type of proxy requested. + * + * @param type + * A type of proxy to create. + * + * @return + * DPDK port id on success, RTE_MAX_ETHPORTS otherwise. + * + * @see enum rte_ifpx_proxy_type + * @see rte_ifpx_proxy_create_by_devarg() + */ +__rte_experimental +uint16_t rte_ifpx_proxy_create(enum rte_ifpx_proxy_type type); + +/** + * Create DPDK port that can serve as an interface proxy. + * + * @param devarg + * A string passed to rte_dev_probe() to create proxy port. + * + * @return + * DPDK port id on success, RTE_MAX_ETHPORTS otherwise. + */ +__rte_experimental +uint16_t rte_ifpx_proxy_create_by_devarg(const char *devarg); + +/** + * Remove DPDK proxy port.
+ * + * In addition to removing the proxy port the bindings (if any) are cleared. + * + * @param proxy_id + * Port id of the proxy that should be removed. + * + * @return + * 0 on success, negative on error. + */ +__rte_experimental +int rte_ifpx_proxy_destroy(uint16_t proxy_id); + +/** + * The rte_ifpx_event_type enum lists all possible event types that can be + * signaled by this library. To learn what events are supported on your + * platform call rte_ifpx_events_available(). + * + * NOTE - do not reorder these enums freely, their values need to correspond to + * the order of the callbacks in struct rte_ifpx_callbacks. + */ +enum rte_ifpx_event_type { + RTE_IFPX_MAC_CHANGE, /**< @see struct rte_ifpx_mac_change */ + RTE_IFPX_MTU_CHANGE, /**< @see struct rte_ifpx_mtu_change */ + RTE_IFPX_LINK_CHANGE, /**< @see struct rte_ifpx_link_change */ + RTE_IFPX_ADDR_ADD, /**< @see struct rte_ifpx_addr_change */ + RTE_IFPX_ADDR_DEL, /**< @see struct rte_ifpx_addr_change */ + RTE_IFPX_ADDR6_ADD, /**< @see struct rte_ifpx_addr6_change */ + RTE_IFPX_ADDR6_DEL, /**< @see struct rte_ifpx_addr6_change */ + RTE_IFPX_ROUTE_ADD, /**< @see struct rte_ifpx_route_change */ + RTE_IFPX_ROUTE_DEL, /**< @see struct rte_ifpx_route_change */ + RTE_IFPX_ROUTE6_ADD, /**< @see struct rte_ifpx_route6_change */ + RTE_IFPX_ROUTE6_DEL, /**< @see struct rte_ifpx_route6_change */ + RTE_IFPX_NEIGH_ADD, /**< @see struct rte_ifpx_neigh_change */ + RTE_IFPX_NEIGH_DEL, /**< @see struct rte_ifpx_neigh_change */ + RTE_IFPX_NEIGH6_ADD, /**< @see struct rte_ifpx_neigh6_change */ + RTE_IFPX_NEIGH6_DEL, /**< @see struct rte_ifpx_neigh6_change */ + RTE_IFPX_CFG_DONE, /**< This event is a lib specific event - it is + * signaled when initial network configuration + * query is finished and has no event data. + */ + RTE_IFPX_NUM_EVENTS, +}; + +/** + * Get the bit mask of implemented events/callbacks for this platform. 
+ * + * @return + * Bit mask of events/callbacks implemented: each event type can be tested by + * checking bit (1 << ev) where 'ev' is one of the rte_ifpx_event_type enum + * values. + * @see enum rte_ifpx_event_type + */ +__rte_experimental +uint64_t rte_ifpx_events_available(void); + +/** + * The rte_ifpx_event defines structure used to pass notification event to + * application. Each event type has its own dedicated inner structure - these + * structures are also used when using callbacks notifications. + */ +struct rte_ifpx_event { + enum rte_ifpx_event_type type; + union { + /** Structure used to pass notification about MAC change of the + * proxy interface. + * @see RTE_IFPX_MAC_CHANGE + */ + struct rte_ifpx_mac_change { + uint16_t port_id; + struct rte_ether_addr mac; + } mac_change; + /** Structure used to pass notification about MTU change. + * @see RTE_IFPX_MTU_CHANGE + */ + struct rte_ifpx_mtu_change { + uint16_t port_id; + uint16_t mtu; + } mtu_change; + /** Structure used to pass notification about link going + * up/down. + * @see RTE_IFPX_LINK_CHANGE + */ + struct rte_ifpx_link_change { + uint16_t port_id; + int is_up; + } link_change; + /** Structure used to pass notification about IPv4 address being + * added/removed. All IPv4 addresses reported by this library + * are in host order. + * @see RTE_IFPX_ADDR_ADD + * @see RTE_IFPX_ADDR_DEL + */ + struct rte_ifpx_addr_change { + uint16_t port_id; + uint32_t ip; + } addr_change; + /** Structure used to pass notification about IPv6 address being + * added/removed. + * @see RTE_IFPX_ADDR6_ADD + * @see RTE_IFPX_ADDR6_DEL + */ + struct rte_ifpx_addr6_change { + uint16_t port_id; + uint8_t ip[16]; + } addr6_change; + /** Structure used to pass notification about IPv4 route being + * added/removed. 
+ * @see RTE_IFPX_ROUTE_ADD + * @see RTE_IFPX_ROUTE_DEL + */ + struct rte_ifpx_route_change { + uint16_t port_id; + uint8_t depth; + uint32_t ip; + uint32_t gateway; + } route_change; + /** Structure used to pass notification about IPv6 route being + * added/removed. + * @see RTE_IFPX_ROUTE6_ADD + * @see RTE_IFPX_ROUTE6_DEL + */ + struct rte_ifpx_route6_change { + uint16_t port_id; + uint8_t depth; + uint8_t ip[16]; + uint8_t gateway[16]; + } route6_change; + /** Structure used to pass notification about IPv4 neighbour + * info changes. + * @see RTE_IFPX_NEIGH_ADD + * @see RTE_IFPX_NEIGH_DEL + */ + struct rte_ifpx_neigh_change { + uint16_t port_id; + struct rte_ether_addr mac; + uint32_t ip; + } neigh_change; + /** Structure used to pass notification about IPv6 neighbour + * info changes. + * @see RTE_IFPX_NEIGH6_ADD + * @see RTE_IFPX_NEIGH6_DEL + */ + struct rte_ifpx_neigh6_change { + uint16_t port_id; + struct rte_ether_addr mac; + uint8_t ip[16]; + } neigh6_change; + /* This structure is used internally - to abstract common parts + * of proxy/port related events and to be able to refer to this + * union without giving it a name. + */ + struct { + uint16_t port_id; + } data; + }; +}; + +/** + * This library can deliver notification about network configuration changes + * either by the use of registered callbacks and/or by queueing change events to + * configured notification queues. The logic used is: + * 1. If there is callback registered for given event type it is called. In + * case of many ports to one proxy binding, this callback is called for every + * port bound. + * 2. If this callback returns non-zero value (for any of ports in case of + * many-1 bindings) the handling of an event is considered as complete. + * 3. Otherwise the event is added to each configured event queue. The event is + * allocated with malloc() so after dequeueing and handling the application + * should deallocate it with free(). 
+ * + * This dual notification mechanism is meant to provide some flexibility to + * application writer. For example, if you store your data in a single writer/ + * many readers coherent data structure you could just update this structure + * from the callback. If you keep separate copy per lcore/port you could make + * some common preparations (if applicable) in the callback, return 0 and use + * notification queues to pick up the change and update data structures. Or you + * could skip the callbacks altogether and just use notification queues - and + * configure them at the level appropriate for your application design (one + * global / one per lcore / one per port ...). + */ + +/** + * Add notification queue to the list of queues. + * + * @param r + * Ring used for queueing of notification events - application can assume that + * there is only one producer. + * @return + * 0 on success, negative otherwise. + */ +int rte_ifpx_queue_add(struct rte_ring *r); + +/** + * Remove notification queue from the list of queues. + * + * @param r + * Notification ring used for queueing of notification events (previously + * added via rte_ifpx_queue_add()). + * @return + * 0 on success, negative otherwise. + */ +int rte_ifpx_queue_remove(struct rte_ring *r); + +/** + * This structure groups the callbacks that might be called as a notification + * events for changing network configuration. Not every platform might + * implement all of them and you can query the availability with + * rte_ifpx_events_available() function. + * @see rte_ifpx_events_available() + * @see rte_ifpx_callbacks_register() + */ +struct rte_ifpx_callbacks { + int (*mac_change)(const struct rte_ifpx_mac_change *event); + /**< Callback for notification about MAC change of the proxy interface. + * This callback (as all other port related callbacks) is called for + * each port (with its port_id as a first argument) bound to the proxy + * interface for which change has been observed.
+ * @see struct rte_ifpx_mac_change + * @return non-zero if event handling is finished + */ + int (*mtu_change)(const struct rte_ifpx_mtu_change *event); + /**< Callback for notification about MTU change. + * @see struct rte_ifpx_mtu_change + * @return non-zero if event handling is finished + */ + int (*link_change)(const struct rte_ifpx_link_change *event); + /**< Callback for notification about link going up/down. + * @see struct rte_ifpx_link_change + * @return non-zero if event handling is finished + */ + int (*addr_add)(const struct rte_ifpx_addr_change *event); + /**< Callback for notification about IPv4 address being added. + * @see struct rte_ifpx_addr_change + * @return non-zero if event handling is finished + */ + int (*addr_del)(const struct rte_ifpx_addr_change *event); + /**< Callback for notification about IPv4 address removal. + * @see struct rte_ifpx_addr_change + * @return non-zero if event handling is finished + */ + int (*addr6_add)(const struct rte_ifpx_addr6_change *event); + /**< Callback for notification about IPv6 address being added. + * @see struct rte_ifpx_addr6_change + * @return non-zero if event handling is finished + */ + int (*addr6_del)(const struct rte_ifpx_addr6_change *event); + /**< Callback for notification about IPv6 address removal. + * @see struct rte_ifpx_addr6_change + * @return non-zero if event handling is finished + */ + /* Please note that "route" callbacks might be also called when user + * adds address to the interface (that is in addition to address related + * callbacks). + */ + int (*route_add)(const struct rte_ifpx_route_change *event); + /**< Callback for notification about IPv4 route being added. + * @see struct rte_ifpx_route_change + * @return non-zero if event handling is finished + */ + int (*route_del)(const struct rte_ifpx_route_change *event); + /**< Callback for notification about IPv4 route removal. 
+ * @see struct rte_ifpx_route_change + * @return non-zero if event handling is finished + */ + int (*route6_add)(const struct rte_ifpx_route6_change *event); + /**< Callback for notification about IPv6 route being added. + * @see struct rte_ifpx_route6_change + * @return non-zero if event handling is finished + */ + int (*route6_del)(const struct rte_ifpx_route6_change *event); + /**< Callback for notification about IPv6 route removal. + * @see struct rte_ifpx_route6_change + * @return non-zero if event handling is finished + */ + int (*neigh_add)(const struct rte_ifpx_neigh_change *event); + /**< Callback for notification about IPv4 neighbour being added. + * @see struct rte_ifpx_neigh_change + * @return non-zero if event handling is finished + */ + int (*neigh_del)(const struct rte_ifpx_neigh_change *event); + /**< Callback for notification about IPv4 neighbour removal. + * @see struct rte_ifpx_neigh_change + * @return non-zero if event handling is finished + */ + int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event); + /**< Callback for notification about IPv6 neighbour being added. + * @see struct rte_ifpx_neigh6_change + * @return non-zero if event handling is finished + */ + int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event); + /**< Callback for notification about IPv6 neighbour removal. + * @see struct rte_ifpx_neigh6_change + * @return non-zero if event handling is finished + */ + int (*cfg_done)(void); + /**< Lib specific callback - called when initial network configuration + * query is finished. + * @return non-zero if event handling is finished + */ +}; + +/** + * Register proxy callbacks. + * + * This function registers callbacks to be called upon appropriate network + * event notification. + * + * @param cbs + * Set of callbacks that will be called. The library does not take any + * ownership of the pointer passed - the callbacks are stored internally. + * + * @return + * 0 on success, negative otherwise. 
+ */ +__rte_experimental +int rte_ifpx_callbacks_register(const struct rte_ifpx_callbacks *cbs); + +/** + * Unregister proxy callbacks. + * + * This function unregisters callbacks previously registered with + * rte_ifpx_callbacks_register(). It takes no arguments and returns no value. + */ +__rte_experimental +void rte_ifpx_callbacks_unregister(void); + +/** + * Bind the port to its proxy. + * + * After calling this function all network configuration of the proxy (and its + * changes) will be passed to given port by calling registered callbacks with + * 'port_id' set in the event. + * + * Note: both arguments are of the same type, so to avoid mixing them up the + * first one is kept the same for bind/unbind. + * + * @param port_id + * Id of the port to be bound. + * @param proxy_id + * Id of the proxy the port needs to be bound to. + * @return + * 0 on success, negative on error. + */ +__rte_experimental +int rte_ifpx_port_bind(uint16_t port_id, uint16_t proxy_id); + +/** + * Unbind the port from its proxy. + * + * After calling this function registered callbacks will no longer be called for + * this port (but they might be called for other ports in one to many binding + * scenario). + * + * @param port_id + * Id of the port to unbind. + * @return + * 0 on success, negative on error. + */ +__rte_experimental +int rte_ifpx_port_unbind(uint16_t port_id); + +/** + * Get the system network configuration and start listening to its changes. + * + * @return + * 0 on success, negative otherwise. + */ +__rte_experimental +int rte_ifpx_listen(void); + +/** + * Remove all bindings/callbacks and stop listening to network configuration. + * + * @return + * 0 on success, negative otherwise. + */ +__rte_experimental +int rte_ifpx_close(void); + +/** + * Get the id of the proxy the port is bound to. 
+ * + * @param port_id + * Id of the port for which to get proxy. + * @return + * Port id of the proxy on success, RTE_MAX_ETHPORTS on error. + */ +__rte_experimental +uint16_t rte_ifpx_proxy_get(uint16_t port_id); + +/** + * Test for port acting as a proxy. + * + * @param port_id + * Id of the port. + * @return + * 1 if port acts as a proxy, 0 otherwise. + */ +static inline +int rte_ifpx_is_proxy(uint16_t port_id) +{ + return rte_ifpx_proxy_get(port_id) == port_id; +} + +/** + * Get the ids of the ports bound to the proxy. + * + * @param proxy_id + * Id of the proxy for which to get ports. + * @param ports + * Array where to store the port ids. + * @param num + * Size of the 'ports' array. + * @return + * The number of ports bound to given proxy. Note that bound ports are filled + * in 'ports' array up to its size but the return value is always the total + * number of ports bound - so you can make call first with NULL/0 to query for + * the size of the buffer to create or call it with the buffer you have and + * later check if it was large enough. + */ +__rte_experimental +unsigned int rte_ifpx_port_get(uint16_t proxy_id, + uint16_t *ports, unsigned int num); + +/** + * The structure containing some properties of the proxy interface. + */ +struct rte_ifpx_info { + unsigned int if_index; /* entry valid iff if_index != 0 */ + uint16_t mtu; + struct rte_ether_addr mac; + char if_name[RTE_ETH_NAME_MAX_LEN]; +}; + +/** + * Get the properties of the proxy interface. Argument can be either id of the + * proxy or an id of a port that is bound to it. + * + * @param port_id + * Id of the port (or proxy) for which to get proxy properties. + * @return + * Pointer to the proxy information structure. 
+ */ +__rte_experimental +const struct rte_ifpx_info *rte_ifpx_info_get(uint16_t port_id); + +#ifdef __cplusplus +} +#endif + +#endif /* _RTE_IF_PROXY_H_ */ diff --git a/lib/librte_if_proxy/rte_if_proxy_version.map b/lib/librte_if_proxy/rte_if_proxy_version.map new file mode 100644 index 0000000000..e2093137da --- /dev/null +++ b/lib/librte_if_proxy/rte_if_proxy_version.map @@ -0,0 +1,19 @@ +EXPERIMENTAL { + global: + + rte_ifpx_proxy_create; + rte_ifpx_proxy_create_by_devarg; + rte_ifpx_proxy_destroy; + rte_ifpx_events_available; + rte_ifpx_callbacks_register; + rte_ifpx_callbacks_unregister; + rte_ifpx_port_bind; + rte_ifpx_port_unbind; + rte_ifpx_listen; + rte_ifpx_close; + rte_ifpx_proxy_get; + rte_ifpx_port_get; + rte_ifpx_info_get; + + local: *; +}; diff --git a/lib/meson.build b/lib/meson.build index d190d84eff..cf7ab71900 100644 --- a/lib/meson.build +++ b/lib/meson.build @@ -22,7 +22,7 @@ libraries = [ 'acl', 'bbdev', 'bitratestats', 'cfgfile', 'compressdev', 'cryptodev', 'distributor', 'efd', 'eventdev', - 'gro', 'gso', 'ip_frag', 'jobstats', + 'gro', 'gso', 'if_proxy', 'ip_frag', 'jobstats', 'kni', 'latencystats', 'lpm', 'member', 'power', 'pdump', 'rawdev', 'rib', 'reorder', 'sched', 'security', 'stack', 'vhost', -- 2.17.1 ^ permalink raw reply [flat|nested] 64+ messages in thread
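Editorial sketch, not part of the patch: the return-value convention documented for rte_ifpx_port_get() (fill at most `num` entries, always return the total number bound) lends itself to a two-call pattern. The code below is a minimal illustration under stated assumptions: `mock_port_get()`, its `bound[]` table and `get_all_ports()` are hypothetical stand-ins that only mimic the documented contract, they are not the library's implementation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical data: ids of the ports bound to a single proxy. */
static const uint16_t bound[] = { 3, 5, 8 };
#define NUM_BOUND ((unsigned int)(sizeof(bound) / sizeof(bound[0])))

/* Stand-in mimicking the documented contract of rte_ifpx_port_get():
 * copy at most 'num' ids into 'ports', always return the total count.
 */
static unsigned int
mock_port_get(uint16_t proxy_id, uint16_t *ports, unsigned int num)
{
	unsigned int i;

	(void)proxy_id; /* this sketch models one proxy only */
	for (i = 0; i < NUM_BOUND && i < num; i++)
		ports[i] = bound[i];
	return NUM_BOUND;
}

/* Two-call pattern: ask for the size with NULL/0, allocate, then fetch.
 * The caller owns the returned buffer and releases it with free().
 */
static uint16_t *
get_all_ports(uint16_t proxy_id, unsigned int *count)
{
	unsigned int total, n = mock_port_get(proxy_id, NULL, 0);
	uint16_t *ports = n != 0 ? malloc(n * sizeof(*ports)) : NULL;

	if (ports == NULL) {
		*count = 0;
		return NULL;
	}
	total = mock_port_get(proxy_id, ports, n);
	/* Only the first n entries are valid even if more ports appeared
	 * between the two calls.
	 */
	*count = total < n ? total : n;
	return ports;
}
```

With a real proxy the second call could report more ports than the first one did, which is why the sketch clamps the count to the size of the buffer it allocated.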
* [dpdk-dev] [PATCH v4 2/4] if_proxy: add library documentation 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 0/4] Introduce IF proxy library Andrzej Ostruszka 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 1/4] lib: introduce IF Proxy library Andrzej Ostruszka @ 2020-06-22 9:21 ` Andrzej Ostruszka 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 3/4] if_proxy: add simple functionality test Andrzej Ostruszka 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 4/4] if_proxy: add example application Andrzej Ostruszka 3 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-06-22 9:21 UTC (permalink / raw) To: dev, Thomas Monjalon, John McNamara, Marko Kovacevic This commit adds documentation of the IF Proxy library. Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 1 + doc/guides/prog_guide/if_proxy_lib.rst | 142 +++++++++++++++++++++++++ doc/guides/prog_guide/index.rst | 1 + 3 files changed, 144 insertions(+) create mode 100644 doc/guides/prog_guide/if_proxy_lib.rst diff --git a/MAINTAINERS b/MAINTAINERS index 65c5a18723..c817d06c6b 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1502,6 +1502,7 @@ F: lib/librte_node/ IF Proxy - EXPERIMENTAL M: Andrzej Ostruszka <aostruszka@marvell.com> F: lib/librte_if_proxy/ +F: doc/guides/prog_guide/if_proxy_lib.rst Test Applications ----------------- diff --git a/doc/guides/prog_guide/if_proxy_lib.rst b/doc/guides/prog_guide/if_proxy_lib.rst new file mode 100644 index 0000000000..f0b9ed70d8 --- /dev/null +++ b/doc/guides/prog_guide/if_proxy_lib.rst @@ -0,0 +1,142 @@ +.. SPDX-License-Identifier: BSD-3-Clause + Copyright(C) 2020 Marvell International Ltd. + +.. _IF_Proxy_Library: + +IF Proxy Library +================ + +When a network interface is assigned to DPDK it usually disappears from +the system and the user loses the ability to configure it via typical +configuration tools. 
+There are basically two options to deal with this situation: + +- configure it via command line arguments and/or load configuration + from some file, +- add support for live configuration via some IPC mechanism. + +The first option is static and the second one requires some work to add +a communication loop (e.g. separate thread listening/communicating on +a socket). + +This library adds a possibility to configure DPDK ports by using normal +configuration utilities (e.g. from iproute2 suite). +It requires the user to configure additional DPDK ports that are visible to +the system (such as Tap or KNI - actually any port that has valid +`if_index` in ``struct rte_eth_dev_info`` will do) and designate them as +port representors (proxies) in the system. + +Let's see typical intended usage by an example. +Suppose that you have an application that handles traffic on two ports (in +the white list below):: + + ./app -w 00:14.0 -w 00:16.0 --vdev=net_tap0 --vdev=net_tap1 + +So in addition to the "regular" ports you need to configure proxy ports. +These proxy ports can be created via a command line (like above) or from +within the application (e.g. by using `rte_ifpx_proxy_create()` +function). + +When you have proxy ports you need to bind them to the "regular" ports:: + + rte_ifpx_port_bind(port0, proxy0); + rte_ifpx_port_bind(port1, proxy1); + +This binding is a logical one - there is no automatic packet forwarding +configured. +This is because the library cannot tell upfront what portion of the traffic +received on ports 0/1 should be redirected to the system via proxies and +also it does not know how the application is structured (what packet +processing engines it uses). +Therefore it is the application writer's responsibility to include proxy ports +into its packet processing and forward appropriate packets between +proxies and ports. +What the library actually does is that it gets network configuration +from the system and listens to its changes. 
+This information is then matched against `if_index` of the configured +proxies and passed to the application. + +There are two mechanisms via which the library passes notifications to the +application. +The first is the set of global callbacks that the user has +to register via:: + + rte_ifpx_callbacks_register(&cbs); + +Here `cbs` is a ``struct rte_ifpx_callbacks`` which has the following +members:: + + int (*mac_change)(const struct rte_ifpx_mac_change *event); + int (*mtu_change)(const struct rte_ifpx_mtu_change *event); + int (*link_change)(const struct rte_ifpx_link_change *event); + int (*addr_add)(const struct rte_ifpx_addr_change *event); + int (*addr_del)(const struct rte_ifpx_addr_change *event); + int (*addr6_add)(const struct rte_ifpx_addr6_change *event); + int (*addr6_del)(const struct rte_ifpx_addr6_change *event); + int (*route_add)(const struct rte_ifpx_route_change *event); + int (*route_del)(const struct rte_ifpx_route_change *event); + int (*route6_add)(const struct rte_ifpx_route6_change *event); + int (*route6_del)(const struct rte_ifpx_route6_change *event); + int (*neigh_add)(const struct rte_ifpx_neigh_change *event); + int (*neigh_del)(const struct rte_ifpx_neigh_change *event); + int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event); + int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event); + int (*cfg_done)(void); + +All of them should be self-explanatory apart from the last one which is +a library specific callback - called when initial network configuration +query is finished. + +So for example when the user issues the command:: + + ip link set dev dtap0 mtu 1600 + +then the library will call `mtu_change()` callback with MTU change event +having port_id equal to `port0` (id of the port bound to this proxy) and +`mtu` equal to 1600 (``dtap0`` is the default interface name for +``net_tap0``). +The application can simply use `rte_eth_dev_set_mtu()` in this callback. 
+The same way `rte_eth_dev_default_mac_addr_set()` can be used in +`mac_change()` and `rte_eth_dev_set_link_up/down()` inside the +`link_change()` callback that does dispatch based on `is_up` member of +its `event` argument. + +Please note however that the context in which these callbacks are called +is most probably different from the one in which packets are handled and +it is the application writer's responsibility to use proper synchronization +mechanisms - if they are needed. + +The second notification mechanism relies on queueing of event notifications +to the configured notification rings. +The application can add a queue via:: + + int rte_ifpx_queue_add(struct rte_ring *r); + +This type of notification is used when there is no callback registered +for given type of event or when it is registered but it returns 0. +This way the application has the following choices: + +- if the data structure that needs to be updated due to notification + is safe to be modified by a single writer (while being used by other + readers) then it can simply do that inside the callback and return + non-zero value to signal end of the event handling + +- otherwise, when there are some common preparation steps that need + to be done only once, the application can register a callback that will + perform these steps and return 0 - the library will then add an event to + each registered notification queue + +- if the data structures are replicated and there are no common steps + then the application can simply skip registering of the callbacks and + configure notification queues (e.g. 1 per each lcore) + +Once we have bindings in place and notification configured, the only +essential part that remains is to get the current network configuration +and start listening to its changes. 
+This is accomplished via a call to:: + + int rte_ifpx_listen(void); + +From that moment you should see notifications coming to your +application: the first ones result from querying the current system +configuration and the subsequent ones from configuration changes. diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst index 6f63f300af..9dfc473dc3 100644 --- a/doc/guides/prog_guide/index.rst +++ b/doc/guides/prog_guide/index.rst @@ -60,6 +60,7 @@ Programmer's Guide bpf_lib ipsec_lib graph_lib + if_proxy_lib source_org dev_kit_build_system dev_kit_root_make_help -- 2.17.1 ^ permalink raw reply [flat|nested] 64+ messages in thread
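Editorial sketch, not part of the patch: the interplay between callbacks and notification queues described in the guide above can be modelled in plain C. Everything named here is hypothetical (`struct event`, `struct queue`, `queue_push()`, `dispatch()`, `consume()`, `prepare()`); the queues are fixed-size arrays rather than rte_ring objects and one port per proxy is assumed. Only the decision rule is taken from the documentation: a non-zero callback return ends the handling, otherwise a malloc()'ed event is added to each configured queue.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_QUEUES 4
#define QUEUE_LEN 8

/* Hypothetical, simplified event (the real one is struct rte_ifpx_event). */
struct event {
	int type;
	unsigned int mtu;
};

/* Hypothetical stand-in for a notification ring. */
struct queue {
	struct event *slots[QUEUE_LEN];
	unsigned int count;
};

typedef int (*event_cb)(const struct event *ev);

static struct queue *queues[MAX_QUEUES];
static unsigned int num_queues;

/* Enqueue a heap-allocated copy; the consumer is expected to free() it. */
static int
queue_push(struct queue *q, const struct event *ev)
{
	struct event *copy;

	if (q->count == QUEUE_LEN)
		return -1;
	copy = malloc(sizeof(*copy));
	if (copy == NULL)
		return -1;
	memcpy(copy, ev, sizeof(*copy));
	q->slots[q->count++] = copy;
	return 0;
}

/* Returns the number of queues the event was added to; 0 means the
 * callback reported the event as handled (or no queue accepted it).
 */
static unsigned int
dispatch(event_cb cb, const struct event *ev)
{
	unsigned int i, queued = 0;

	if (cb != NULL && cb(ev) != 0)
		return 0;
	for (i = 0; i < num_queues; i++)
		if (queue_push(queues[i], ev) == 0)
			queued++;
	return queued;
}

/* Callback that finishes the handling itself. */
static int consume(const struct event *ev) { (void)ev; return 1; }
/* Callback that only does common preparation and defers to the queues. */
static int prepare(const struct event *ev) { (void)ev; return 0; }
```

The three choices listed in the guide map onto the three ways of calling `dispatch()`: with a callback like `consume()`, with one like `prepare()`, or with no callback at all.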
* [dpdk-dev] [PATCH v4 3/4] if_proxy: add simple functionality test 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 0/4] Introduce IF proxy library Andrzej Ostruszka 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 1/4] lib: introduce IF Proxy library Andrzej Ostruszka 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 2/4] if_proxy: add library documentation Andrzej Ostruszka @ 2020-06-22 9:21 ` Andrzej Ostruszka 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 4/4] if_proxy: add example application Andrzej Ostruszka 3 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-06-22 9:21 UTC (permalink / raw) To: dev, Thomas Monjalon This commit adds simple test of the library notifications. Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 1 + app/test/Makefile | 5 + app/test/meson.build | 4 + app/test/test_if_proxy.c | 707 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 717 insertions(+) create mode 100644 app/test/test_if_proxy.c diff --git a/MAINTAINERS b/MAINTAINERS index c817d06c6b..857754203c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1502,6 +1502,7 @@ F: lib/librte_node/ IF Proxy - EXPERIMENTAL M: Andrzej Ostruszka <aostruszka@marvell.com> F: lib/librte_if_proxy/ +F: app/test/test_if_proxy.c F: doc/guides/prog_guide/if_proxy_lib.rst Test Applications diff --git a/app/test/Makefile b/app/test/Makefile index 7b96a03a64..48b9a22856 100644 --- a/app/test/Makefile +++ b/app/test/Makefile @@ -236,6 +236,11 @@ SRCS-y += test_graph.c SRCS-y += test_graph_perf.c endif +ifeq ($(CONFIG_RTE_LIBRTE_IF_PROXY),y) +SRCS-y += test_if_proxy.c +LDLIBS += -lrte_if_proxy +endif + ifeq ($(CONFIG_RTE_LIBRTE_RAWDEV),y) SRCS-y += test_rawdev.c endif diff --git a/app/test/meson.build b/app/test/meson.build index 5233ead46e..5a8198fc12 100644 --- a/app/test/meson.build +++ b/app/test/meson.build @@ -381,6 +381,10 @@ endif if dpdk_conf.has('RTE_LIBRTE_PDUMP') test_deps += 'pdump' endif +if dpdk_conf.has('RTE_LIBRTE_IF_PROXY') + test_deps += 'if_proxy' + test_sources += 
'test_if_proxy.c' +endif if cc.has_argument('-Wno-format-truncation') cflags += '-Wno-format-truncation' diff --git a/app/test/test_if_proxy.c b/app/test/test_if_proxy.c new file mode 100644 index 0000000000..72ff782b68 --- /dev/null +++ b/app/test/test_if_proxy.c @@ -0,0 +1,707 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(C) 2020 Marvell International Ltd. + */ + +#include "test.h" + +#include <rte_ethdev.h> +#include <rte_if_proxy.h> +#include <rte_cycles.h> + +#include <string.h> +#include <unistd.h> +#include <signal.h> +#include <net/if.h> +#include <arpa/inet.h> +#include <pthread.h> +#include <time.h> + +/* There are two types of event notifications - one using callbacks and one + * using event queues (rings). We'll test them both and this "bool" will govern + * the type of API to use. + */ +static int use_callbacks = 1; +static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; +static pthread_cond_t cond = PTHREAD_COND_INITIALIZER; + +static struct rte_ring *ev_queue; + +enum net_event_mask { + INITIALIZED = 1U << RTE_IFPX_CFG_DONE, + LINK_CHANGED = 1U << RTE_IFPX_LINK_CHANGE, + MAC_CHANGED = 1U << RTE_IFPX_MAC_CHANGE, + MTU_CHANGED = 1U << RTE_IFPX_MTU_CHANGE, + ADDR_ADD = 1U << RTE_IFPX_ADDR_ADD, + ADDR_DEL = 1U << RTE_IFPX_ADDR_DEL, + ROUTE_ADD = 1U << RTE_IFPX_ROUTE_ADD, + ROUTE_DEL = 1U << RTE_IFPX_ROUTE_DEL, + ADDR6_ADD = 1U << RTE_IFPX_ADDR6_ADD, + ADDR6_DEL = 1U << RTE_IFPX_ADDR6_DEL, + ROUTE6_ADD = 1U << RTE_IFPX_ROUTE6_ADD, + ROUTE6_DEL = 1U << RTE_IFPX_ROUTE6_DEL, + NEIGH_ADD = 1U << RTE_IFPX_NEIGH_ADD, + NEIGH_DEL = 1U << RTE_IFPX_NEIGH_DEL, + NEIGH6_ADD = 1U << RTE_IFPX_NEIGH6_ADD, + NEIGH6_DEL = 1U << RTE_IFPX_NEIGH6_DEL, +}; + +static unsigned int state; + +static struct { + struct rte_ether_addr mac_addr; + uint16_t port_id, mtu; + struct in_addr ipv4, route4; + struct in6_addr ipv6, route6; + uint16_t depth4, depth6; + int is_up; +} net_cfg; + +static +int unlock_notify(unsigned int op) +{ + /* the mutex is expected to be 
locked on entry */ + RTE_VERIFY(pthread_mutex_trylock(&mutex) == EBUSY); + state |= op; + + pthread_mutex_unlock(&mutex); + return pthread_cond_signal(&cond); +} + +static +void handle_event(struct rte_ifpx_event *ev); + +static +int wait_for(unsigned int op_mask, unsigned int sec) +{ + int ec; + + if (use_callbacks) { + struct timespec time; + + ec = pthread_mutex_trylock(&mutex); + /* the mutex is expected to be locked on entry */ + RTE_VERIFY(ec == EBUSY); + + ec = 0; + clock_gettime(CLOCK_REALTIME, &time); + time.tv_sec += sec; + + while ((state & op_mask) != op_mask && ec == 0) + ec = pthread_cond_timedwait(&cond, &mutex, &time); + } else { + uint64_t deadline; + struct rte_ifpx_event *ev; + + ec = 0; + deadline = rte_get_timer_cycles() + sec * rte_get_timer_hz(); + + while ((state & op_mask) != op_mask) { + if (rte_get_timer_cycles() >= deadline) { + ec = ETIMEDOUT; + break; + } + if (rte_ring_dequeue(ev_queue, (void **)&ev) == 0) + handle_event(ev); + } + } + + return ec; +} + +static +int expect(unsigned int op_mask, const char *fmt, ...) +#if __GNUC__ + __attribute__((format(printf, 2, 3))); +#endif + +static +int expect(unsigned int op_mask, const char *fmt, ...) +{ + char cmd[128]; + va_list args; + int ret; + + state &= ~op_mask; + va_start(args, fmt); + vsnprintf(cmd, sizeof(cmd), fmt, args); + va_end(args); + ret = system(cmd); + if (ret == 0) + /* IPv6 address notifications seem to need that long delay. 
*/ + return wait_for(op_mask, 2); + return ret; +} + +static +int mac_change(const struct rte_ifpx_mac_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(MAC_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int mtu_change(const struct rte_ifpx_mtu_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->mtu == net_cfg.mtu) { + unlock_notify(MTU_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int link_change(const struct rte_ifpx_link_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->is_up == net_cfg.is_up) { + /* Special case for testing of callbacks modification from + * inside of callback: we catch putting link down (the last + * operation in test) and remove callbacks registered. 
+ */ + if (!ev->is_up) + rte_ifpx_callbacks_unregister(); + unlock_notify(LINK_CHANGED); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr_add(const struct rte_ifpx_addr_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->ip == net_cfg.ipv4.s_addr) { + unlock_notify(ADDR_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr_del(const struct rte_ifpx_addr_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (ev->ip == net_cfg.ipv4.s_addr) { + unlock_notify(ADDR_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr6_add(const struct rte_ifpx_addr6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(ADDR6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int addr6_del(const struct rte_ifpx_addr6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(ADDR6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route_add(const struct rte_ifpx_route_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth4 == ev->depth && net_cfg.route4.s_addr == ev->ip) { + unlock_notify(ROUTE_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route_del(const struct rte_ifpx_route_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth4 == ev->depth && net_cfg.route4.s_addr == ev->ip) { + unlock_notify(ROUTE_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route6_add(const struct rte_ifpx_route6_change *ev) +{ + pthread_mutex_lock(&mutex); + 
RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth6 == ev->depth && + /* don't check for trailing zeros */ + memcmp(ev->ip, net_cfg.route6.s6_addr, ev->depth/8) == 0) { + unlock_notify(ROUTE6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int route6_del(const struct rte_ifpx_route6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.depth6 == ev->depth && + /* don't check for trailing zeros */ + memcmp(ev->ip, net_cfg.route6.s6_addr, ev->depth/8) == 0) { + unlock_notify(ROUTE6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh_add(const struct rte_ifpx_neigh_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.ipv4.s_addr == ev->ip && + memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(NEIGH_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh_del(const struct rte_ifpx_neigh_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (net_cfg.ipv4.s_addr == ev->ip) { + unlock_notify(NEIGH_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh6_add(const struct rte_ifpx_neigh6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0 && + memcmp(ev->mac.addr_bytes, net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) == 0) { + unlock_notify(NEIGH6_ADD); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int neigh6_del(const struct rte_ifpx_neigh6_change *ev) +{ + pthread_mutex_lock(&mutex); + RTE_VERIFY(ev->port_id == net_cfg.port_id); + if (memcmp(ev->ip, net_cfg.ipv6.s6_addr, 16) == 0) { + unlock_notify(NEIGH6_DEL); + return 1; + } + pthread_mutex_unlock(&mutex); + return 0; +} + +static +int cfg_done(void) +{ + 
pthread_mutex_lock(&mutex); + unlock_notify(INITIALIZED); + return 1; +} + +static +void handle_event(struct rte_ifpx_event *ev) +{ + if (ev->type != RTE_IFPX_CFG_DONE) + RTE_VERIFY(ev->data.port_id == net_cfg.port_id); + + /* If params do not match what we expect just free the event. */ + switch (ev->type) { + case RTE_IFPX_MAC_CHANGE: + if (memcmp(ev->mac_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_MTU_CHANGE: + if (ev->mtu_change.mtu != net_cfg.mtu) + goto exit; + break; + case RTE_IFPX_LINK_CHANGE: + if (ev->link_change.is_up != net_cfg.is_up) + goto exit; + break; + case RTE_IFPX_ADDR_ADD: + if (ev->addr_change.ip != net_cfg.ipv4.s_addr) + goto exit; + break; + case RTE_IFPX_ADDR_DEL: + if (ev->addr_change.ip != net_cfg.ipv4.s_addr) + goto exit; + break; + case RTE_IFPX_ADDR6_ADD: + if (memcmp(ev->addr6_change.ip, net_cfg.ipv6.s6_addr, + 16) != 0) + goto exit; + break; + case RTE_IFPX_ADDR6_DEL: + if (memcmp(ev->addr6_change.ip, net_cfg.ipv6.s6_addr, + 16) != 0) + goto exit; + break; + case RTE_IFPX_ROUTE_ADD: + if (net_cfg.depth4 != ev->route_change.depth || + net_cfg.route4.s_addr != ev->route_change.ip) + goto exit; + break; + case RTE_IFPX_ROUTE_DEL: + if (net_cfg.depth4 != ev->route_change.depth || + net_cfg.route4.s_addr != ev->route_change.ip) + goto exit; + break; + case RTE_IFPX_ROUTE6_ADD: + if (net_cfg.depth6 != ev->route6_change.depth || + /* don't check for trailing zeros */ + memcmp(ev->route6_change.ip, net_cfg.route6.s6_addr, + ev->route6_change.depth/8) != 0) + goto exit; + break; + case RTE_IFPX_ROUTE6_DEL: + if (net_cfg.depth6 != ev->route6_change.depth || + /* don't check for trailing zeros */ + memcmp(ev->route6_change.ip, net_cfg.route6.s6_addr, + ev->route6_change.depth/8) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH_ADD: + if (net_cfg.ipv4.s_addr != ev->neigh_change.ip || + memcmp(ev->neigh_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, + 
RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH_DEL: + if (net_cfg.ipv4.s_addr != ev->neigh_change.ip) + goto exit; + break; + case RTE_IFPX_NEIGH6_ADD: + if (memcmp(ev->neigh6_change.ip, + net_cfg.ipv6.s6_addr, 16) != 0 || + memcmp(ev->neigh6_change.mac.addr_bytes, + net_cfg.mac_addr.addr_bytes, + RTE_ETHER_ADDR_LEN) != 0) + goto exit; + break; + case RTE_IFPX_NEIGH6_DEL: + if (memcmp(ev->neigh6_change.ip, net_cfg.ipv6.s6_addr, 16) != 0) + goto exit; + break; + case RTE_IFPX_CFG_DONE: + break; + default: + RTE_VERIFY(0 && "Unhandled event type"); + } + + state |= 1U << ev->type; +exit: + free(ev); +} + +static +struct rte_ifpx_callbacks cbs = { + .mac_change = mac_change, + .mtu_change = mtu_change, + .link_change = link_change, + .addr_add = addr_add, + .addr_del = addr_del, + .addr6_add = addr6_add, + .addr6_del = addr6_del, + .route_add = route_add, + .route_del = route_del, + .route6_add = route6_add, + .route6_del = route6_del, + .neigh_add = neigh_add, + .neigh_del = neigh_del, + .neigh6_add = neigh6_add, + .neigh6_del = neigh6_del, + /* lib specific callback */ + .cfg_done = cfg_done, +}; + +static +int test_notifications(const struct rte_ifpx_info *pinfo) +{ + char mac_buf[RTE_ETHER_ADDR_FMT_SIZE]; + int ec; + + /* Test link up notification. */ + net_cfg.is_up = 1; + ec = expect(LINK_CHANGED, "ip link set dev %s up", pinfo->if_name); + if (ec != 0) { + printf("Failed to notify about link going up\n"); + return ec; + } + + /* Test for MAC changes notification. */ + rte_eth_random_addr(net_cfg.mac_addr.addr_bytes); + rte_ether_format_addr(mac_buf, sizeof(mac_buf), &net_cfg.mac_addr); + ec = expect(MAC_CHANGED, "ip link set dev %s address %s", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notification about mac change\n"); + return ec; + } + + /* Test for MTU changes notification. 
*/ + net_cfg.mtu = pinfo->mtu + 100; + ec = expect(MTU_CHANGED, "ip link set dev %s mtu %d", + pinfo->if_name, net_cfg.mtu); + if (ec != 0) { + printf("Missing/wrong notification about mtu change\n"); + return ec; + } + + /* Test for adding of IPv4 address - using address from TEST-2 pool. + * This test is specific to linux netlink behaviour - after adding + * address we get both notification about address being added and new + * route. So I check both. + */ + net_cfg.ipv4.s_addr = RTE_IPV4(198, 51, 100, 14); + net_cfg.route4.s_addr = net_cfg.ipv4.s_addr; + net_cfg.depth4 = 32; + ec = expect(ADDR_ADD | ROUTE_ADD, "ip addr add 198.51.100.14 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 address add\n"); + return ec; + } + + /* Test for IPv4 address removal. See comment above for 'addr add'. */ + ec = expect(ADDR_DEL | ROUTE_DEL, "ip addr del 198.51.100.14/32 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 address del\n"); + return ec; + } + + /* Test for adding IPv4 route. */ + net_cfg.route4.s_addr = RTE_IPV4(198, 51, 100, 0); + net_cfg.depth4 = 24; + ec = expect(ROUTE_ADD, "ip route add 198.51.100.0/24 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 route add\n"); + return ec; + } + + /* Test for IPv4 route removal. */ + ec = expect(ROUTE_DEL, "ip route del 198.51.100.0/24 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 route del\n"); + return ec; + } + + /* Test for neighbour addresses notifications. 
*/ + rte_eth_random_addr(net_cfg.mac_addr.addr_bytes); + rte_ether_format_addr(mac_buf, sizeof(mac_buf), &net_cfg.mac_addr); + + ec = expect(NEIGH_ADD, + "ip neigh add 198.51.100.14 dev %s lladdr %s nud noarp", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 neighbour add\n"); + return ec; + } + + ec = expect(NEIGH_DEL, "ip neigh del 198.51.100.14 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv4 neighbour del\n"); + return ec; + } + + /* Now the same for IPv6 - with address from "documentation pool". */ + inet_pton(AF_INET6, "2001:db8::dead:beef", net_cfg.ipv6.s6_addr); + /* This is specific to linux netlink behaviour - after adding address + * we get both notification about address being added and new route. + * So I wait for both. + */ + memcpy(net_cfg.route6.s6_addr, net_cfg.ipv6.s6_addr, 16); + net_cfg.depth6 = 128; + ec = expect(ADDR6_ADD | ROUTE6_ADD, + "ip addr add 2001:db8::dead:beef dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 address add\n"); + return ec; + } + + /* See comment above for 'addr6 add'. 
*/ + ec = expect(ADDR6_DEL | ROUTE6_DEL, + "ip addr del 2001:db8::dead:beef/128 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 address del\n"); + return ec; + } + + net_cfg.depth6 = 96; + ec = expect(ROUTE6_ADD, "ip route add 2001:db8::dead:0/96 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 route add\n"); + return ec; + } + + ec = expect(ROUTE6_DEL, "ip route del 2001:db8::dead:0/96 dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 route del\n"); + return ec; + } + + ec = expect(NEIGH6_ADD, + "ip neigh add 2001:db8::dead:beef dev %s lladdr %s nud noarp", + pinfo->if_name, mac_buf); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 neighbour add\n"); + return ec; + } + + ec = expect(NEIGH6_DEL, "ip neigh del 2001:db8::dead:beef dev %s", + pinfo->if_name); + if (ec != 0) { + printf("Missing/wrong notifications about IPv6 neighbour del\n"); + return ec; + } + + /* Finally put link down and test for notification. */ + net_cfg.is_up = 0; + ec = expect(LINK_CHANGED, "ip link set dev %s down", pinfo->if_name); + if (ec != 0) { + printf("Failed to notify about link going down\n"); + return ec; + } + + return 0; +} + +static +int test_if_proxy(void) +{ + int ec; + const struct rte_ifpx_info *pinfo; + uint16_t proxy_id; + + state = 0; + memset(&net_cfg, 0, sizeof(net_cfg)); + + if (rte_eth_dev_count_avail() == 0) { + printf("Run this test with at least one port configured\n"); + return 1; + } + /* Use the first port available. */ + RTE_ETH_FOREACH_DEV(net_cfg.port_id) + break; + proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT); + RTE_VERIFY(proxy_id != RTE_MAX_ETHPORTS); + rte_ifpx_port_bind(net_cfg.port_id, proxy_id); + rte_ifpx_callbacks_register(&cbs); + rte_ifpx_listen(); + + /* Let's start with callback based API. 
*/ + use_callbacks = 1; + pthread_mutex_lock(&mutex); + ec = wait_for(INITIALIZED, 2); + if (ec != 0) { + printf("Failed to obtain network configuration\n"); + goto exit; + } + pinfo = rte_ifpx_info_get(net_cfg.port_id); + RTE_VERIFY(pinfo); + + /* Make sure the link is down. */ + net_cfg.is_up = 0; + ec = expect(LINK_CHANGED, "ip link set dev %s down", pinfo->if_name); + RTE_VERIFY(ec == ETIMEDOUT || ec == 0); + + ec = test_notifications(pinfo); + if (ec != 0) { + printf("Failed test with callback based API\n"); + goto exit; + } + /* Switch to event queue based API and repeat tests. */ + use_callbacks = 0; + ev_queue = rte_ring_create("IFPX-events", 16, SOCKET_ID_ANY, + RING_F_SP_ENQ | RING_F_SC_DEQ); + ec = rte_ifpx_queue_add(ev_queue); + if (ec != 0) { + printf("Failed to add a notification queue\n"); + goto exit; + } + ec = test_notifications(pinfo); + if (ec != 0) { + printf("Failed test with event queue based API\n"); + goto exit; + } + +exit: + pthread_mutex_unlock(&mutex); + /* Proxy ports are not owned by the lib. Internal references to them + * are cleared on close, but the ports are not destroyed so we need to + * do that explicitly. + */ + rte_ifpx_proxy_destroy(proxy_id); + rte_ifpx_close(); + /* Queue is removed from the lib by rte_ifpx_close() - here we just + * free it. + */ + rte_ring_free(ev_queue); + ev_queue = NULL; + + return ec; +} + +REGISTER_TEST_COMMAND(if_proxy_autotest, test_if_proxy) -- 2.17.1 ^ permalink raw reply [flat|nested] 64+ messages in thread
* [dpdk-dev] [PATCH v4 4/4] if_proxy: add example application 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 0/4] Introduce IF proxy library Andrzej Ostruszka ` (2 preceding siblings ...) 2020-06-22 9:21 ` [dpdk-dev] [PATCH v4 3/4] if_proxy: add simple functionality test Andrzej Ostruszka @ 2020-06-22 9:21 ` Andrzej Ostruszka 3 siblings, 0 replies; 64+ messages in thread From: Andrzej Ostruszka @ 2020-06-22 9:21 UTC (permalink / raw) To: dev, Thomas Monjalon Add an example application showing possible library usage. This is a simplified version of l3fwd where: - many performance improvements have been removed in order to simplify the logic and put the focus on the proxy library usage, - the forwarding configuration has to be done by the user (using typical system tools on the proxy ports); these changes are passed to the application via library notifications. It is meant to show how you can update some data directly from callbacks (routing - see note below) and how data that is replicated (e.g. kept per lcore) can be updated via event queueing (here neighbouring info). Note: This example assumes that LPM tables can be updated by a single writer while being used by others. To the best of the author's knowledge this is the case (by preliminary code inspection) but DPDK does not make such a promise. Obviously, upon a change, there will be a transient period (when some IPs will still be directed to the old destination) but that is expected. Note also that in some cases you might need to tweak your system configuration to see the effects. For example, you may send a Gratuitous ARP to a DPDK port and expect the neighbour tables in the application to be updated, yet nothing happens. The packet is passed to the kernel, but the kernel might drop it; please check /proc/sys/net/ipv4/conf/dtap0/arp_accept and related configuration options ('dtap0' here is just the name of your proxy port). 
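For the routing part, the callbacks in l3fwd.c store a tagged value as the LPM next hop: the two lowest bits select the route type and the remaining bits carry a port id or a gateway index. A minimal, self-contained sketch of that encoding follows. The helper names nh_encode(), nh_type() and nh_value() are hypothetical and not part of the patch, and the 19-bit limit is an assumption derived from the 21-bit IPv6 next hop mentioned in the code comment.

```c
#include <assert.h>
#include <stdint.h>

/* Route types as defined in the patch; they occupy the two lowest bits. */
enum route_type {
	HOST_ROUTE = 0x00, /* upper bits: port id, dst MAC looked up by dst IP */
	GW_ROUTE   = 0x01, /* upper bits: index into the gateway array */
	PROXY_ADDR = 0x02, /* upper bits: port id of the proxy */
};

#define NH_TYPE_BITS 2u
#define NH_TYPE_MASK ((1u << NH_TYPE_BITS) - 1u)
/* Assumption: 21-bit next hop (IPv6 case) minus the 2 type bits. */
#define NH_VALUE_BITS 19u

/* Hypothetical helper: pack a port id or gateway index with its type. */
static inline uint32_t nh_encode(uint32_t value, enum route_type type)
{
	assert(value < (1u << NH_VALUE_BITS));
	return value << NH_TYPE_BITS | (uint32_t)type;
}

/* Hypothetical helper: recover the route type from a next hop. */
static inline enum route_type nh_type(uint32_t nh)
{
	return (enum route_type)(nh & NH_TYPE_MASK);
}

/* Hypothetical helper: recover the port id or gateway index. */
static inline uint32_t nh_value(uint32_t nh)
{
	return nh >> NH_TYPE_BITS;
}
```

For instance, nh_encode(5, GW_ROUTE) yields 21, from which nh_type() recovers GW_ROUTE and nh_value() recovers index 5. The patch itself open-codes the same shifts and masks rather than using helpers.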
Signed-off-by: Andrzej Ostruszka <aostruszka@marvell.com> --- MAINTAINERS | 1 + examples/Makefile | 1 + examples/l3fwd-ifpx/Makefile | 60 ++ examples/l3fwd-ifpx/l3fwd.c | 1131 +++++++++++++++++++++++++++++++ examples/l3fwd-ifpx/l3fwd.h | 98 +++ examples/l3fwd-ifpx/main.c | 740 ++++++++++++++++++++ examples/l3fwd-ifpx/meson.build | 11 + examples/meson.build | 4 +- 8 files changed, 2044 insertions(+), 2 deletions(-) create mode 100644 examples/l3fwd-ifpx/Makefile create mode 100644 examples/l3fwd-ifpx/l3fwd.c create mode 100644 examples/l3fwd-ifpx/l3fwd.h create mode 100644 examples/l3fwd-ifpx/main.c create mode 100644 examples/l3fwd-ifpx/meson.build diff --git a/MAINTAINERS b/MAINTAINERS index 857754203c..ad5c084611 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1502,6 +1502,7 @@ F: lib/librte_node/ IF Proxy - EXPERIMENTAL M: Andrzej Ostruszka <aostruszka@marvell.com> F: lib/librte_if_proxy/ +F: examples/l3fwd-ifpx/ F: app/test/test_if_proxy.c F: doc/guides/prog_guide/if_proxy_lib.rst diff --git a/examples/Makefile b/examples/Makefile index b7e99a2f78..6212d16f07 100644 --- a/examples/Makefile +++ b/examples/Makefile @@ -84,6 +84,7 @@ else $(info vm_power_manager requires libvirt >= 0.9.3) endif endif +DIRS-$(CONFIG_RTE_LIBRTE_IF_PROXY) += l3fwd-ifpx DIRS-y += eventdev_pipeline diff --git a/examples/l3fwd-ifpx/Makefile b/examples/l3fwd-ifpx/Makefile new file mode 100644 index 0000000000..68eefeb75d --- /dev/null +++ b/examples/l3fwd-ifpx/Makefile @@ -0,0 +1,60 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(c) 2020 Marvell International Ltd. 
+ +# binary name +APP = l3fwd + +# all source are stored in SRCS-y +SRCS-y := main.c l3fwd.c + +# Build using pkg-config variables if possible +ifeq ($(shell pkg-config --exists libdpdk && echo 0),0) + +all: shared +.PHONY: shared static +shared: build/$(APP)-shared + ln -sf $(APP)-shared build/$(APP) +static: build/$(APP)-static + ln -sf $(APP)-static build/$(APP) + +PKGCONF ?= pkg-config + +PC_FILE := $(shell $(PKGCONF) --path libdpdk 2>/dev/null) +CFLAGS += -DALLOW_EXPERIMENTAL_API -O3 $(shell $(PKGCONF) --cflags libdpdk) +LDFLAGS_SHARED = $(shell $(PKGCONF) --libs libdpdk) +LDFLAGS_STATIC = -Wl,-Bstatic $(shell $(PKGCONF) --static --libs libdpdk) + +build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build + $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED) + +build/$(APP)-static: $(SRCS-y) Makefile $(PC_FILE) | build + $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_STATIC) + +build: + @mkdir -p $@ + +.PHONY: clean +clean: + rm -f build/$(APP) build/$(APP)-static build/$(APP)-shared + test -d build && rmdir -p build || true + +else # Build using legacy build system + +ifeq ($(RTE_SDK),) +$(error "Please define RTE_SDK environment variable") +endif + +# Default target, detect a build directory, by looking for a path with a .config +RTE_TARGET ?= $(notdir $(abspath $(dir $(firstword $(wildcard $(RTE_SDK)/*/.config))))) + +include $(RTE_SDK)/mk/rte.vars.mk + +CFLAGS += -DALLOW_EXPERIMENTAL_API + +CFLAGS += -I$(SRCDIR) +CFLAGS += -O3 $(USER_FLAGS) +CFLAGS += $(WERROR_FLAGS) +LDLIBS += -lrte_if_proxy -lrte_ethdev -lrte_eal + +include $(RTE_SDK)/mk/rte.extapp.mk +endif diff --git a/examples/l3fwd-ifpx/l3fwd.c b/examples/l3fwd-ifpx/l3fwd.c new file mode 100644 index 0000000000..4b457dfad1 --- /dev/null +++ b/examples/l3fwd-ifpx/l3fwd.c @@ -0,0 +1,1131 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. 
+ */ + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <inttypes.h> +#include <sys/types.h> +#include <string.h> +#include <sys/queue.h> +#include <stdarg.h> +#include <errno.h> +#include <getopt.h> +#include <sys/socket.h> +#include <arpa/inet.h> + +#include <rte_debug.h> +#include <rte_ether.h> +#include <rte_cycles.h> +#include <rte_malloc.h> +#include <rte_mbuf.h> +#include <rte_ip.h> + +#ifndef USE_HASH_CRC +#include <rte_jhash.h> +#else +#include <rte_hash_crc.h> +#endif + +#include <rte_tcp.h> +#include <rte_udp.h> +#include <rte_lpm.h> +#include <rte_lpm6.h> +#include <rte_if_proxy.h> + +#include "l3fwd.h" + +#define DO_RFC_1812_CHECKS + +#define IPV4_L3FWD_LPM_MAX_RULES 1024 +#define IPV4_L3FWD_LPM_NUMBER_TBL8S (1 << 8) +#define IPV6_L3FWD_LPM_MAX_RULES 1024 +#define IPV6_L3FWD_LPM_NUMBER_TBL8S (1 << 16) + +static volatile bool ifpx_ready; + +/* ethernet addresses of ports */ +static +union lladdr_t port_mac[RTE_MAX_ETHPORTS]; + +static struct rte_lpm *ipv4_routes; +static struct rte_lpm6 *ipv6_routes; + +static +struct ipv4_gateway { + uint16_t port; + union lladdr_t lladdr; + uint32_t ip; +} ipv4_gateways[128]; + +static +struct ipv6_gateway { + uint16_t port; + union lladdr_t lladdr; + uint8_t ip[16]; +} ipv6_gateways[128]; + +/* The lowest 2 bits of next hop (which is 24/21 bit for IPv4/6) are reserved to + * encode: + * 00 -> host route: higher bits of next hop are port id and dst MAC should be + * based on dst IP + * 01 -> gateway route: higher bits of next hop are index into gateway array and + * use port and MAC cached there (if no MAC cached yet then search for it + * based on gateway IP) + * 10 -> proxy entry: packet directed to us, just take higher bits as port id of + * proxy and send packet there (without any modification) + * The port id (16 bits) will always fit however this will not work if you + * need more than 2^20 gateways. 
+ */ +enum route_type { + HOST_ROUTE = 0x00, + GW_ROUTE = 0x01, + PROXY_ADDR = 0x02, +}; + +RTE_STD_C11 +_Static_assert(RTE_DIM(ipv4_gateways) <= (1 << 22) && + RTE_DIM(ipv6_gateways) <= (1 << 19), + "Gateway array index has to fit within next_hop with 2 bits reserved"); + +static +uint32_t find_add_gateway(uint16_t port, uint32_t ip) +{ + uint32_t i, idx = -1U; + + for (i = 0; i < RTE_DIM(ipv4_gateways); ++i) { + /* Remember first free slot in case GW is not present. */ + if (idx == -1U && ipv4_gateways[i].ip == 0) + idx = i; + else if (ipv4_gateways[i].ip == ip) + /* For now assume that given GW will be always at the + * same port, so no checking for that + */ + return i; + } + if (idx != -1U) { + ipv4_gateways[idx].port = port; + ipv4_gateways[idx].ip = ip; + /* Since ARP tables are kept per lcore MAC will be updated + * during first lookup. + */ + } + return idx; +} + +static +void clear_gateway(uint32_t ip) +{ + uint32_t i; + + for (i = 0; i < RTE_DIM(ipv4_gateways); ++i) { + if (ipv4_gateways[i].ip == ip) { + ipv4_gateways[i].ip = 0; + ipv4_gateways[i].lladdr.val = 0; + ipv4_gateways[i].port = RTE_MAX_ETHPORTS; + break; + } + } +} + +static +uint32_t find_add_gateway6(uint16_t port, const uint8_t *ip) +{ + uint32_t i, idx = -1U; + + for (i = 0; i < RTE_DIM(ipv6_gateways); ++i) { + /* Remember first free slot in case GW is not present. */ + if (idx == -1U && ipv6_gateways[i].ip[0] == 0) + idx = i; + else if (memcmp(ipv6_gateways[i].ip, ip, 16) == 0) + /* For now assume that given GW will be always at the + * same port, so no checking for that + */ + return i; + } + if (idx != -1U) { + ipv6_gateways[idx].port = port; + memcpy(ipv6_gateways[idx].ip, ip, 16); + /* Since ARP tables are kept per lcore MAC will be updated + * during first lookup. 
+ */ + } + return idx; +} + +static +void clear_gateway6(const uint8_t *ip) +{ + uint32_t i; + + for (i = 0; i < RTE_DIM(ipv6_gateways); ++i) { + if (memcmp(ipv6_gateways[i].ip, ip, 16) == 0) { + memset(&ipv6_gateways[i].ip, 0, 16); + ipv6_gateways[i].lladdr.val = 0; + ipv6_gateways[i].port = RTE_MAX_ETHPORTS; + break; + } + } +} + +/* Assumptions: + * - Link related changes (MAC/MTU/...) need to be executed once, and it's OK + * to run them from the callback - if this is not the case (e.g. -EBUSY for + * MTU change), then event notifications need to be used and more sophisticated + * coordination with lcore loops and stopping/starting of the ports: for + * example lcores not receiving on this port just mark it as inactive and stop + * transmitting to it and the one with RX stops the port, sets the MAC, starts + * it and notifies other lcores that it is back. + * - LPM is safe to be modified by one writer, and read by many without any + * locks (it looks to me like this is the case), however upon routing change + * there might be a transient period during which packets are not directed + * according to new rule. + * - Hash is unsafe to be used that way (and I don't want to turn on relevant + * flags just to exercise queued notifications) so every lcore keeps its + * copy of relevant data. + * Therefore there are callbacks defined for the routing info/address changes + * and remaining ones are handled via events on per lcore basis. 
+ */ +static +int mac_change(const struct rte_ifpx_mac_change *ev) +{ + int i; + struct rte_ether_addr mac_addr; + char buf[RTE_ETHER_ADDR_FMT_SIZE]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(buf, sizeof(buf), &ev->mac); + RTE_LOG(DEBUG, L3FWD, "MAC change for port %d: %s\n", + ev->port_id, buf); + } + /* NOTE - use copy because RTE functions don't take const args */ + rte_ether_addr_copy(&ev->mac, &mac_addr); + i = rte_eth_dev_default_mac_addr_set(ev->port_id, &mac_addr); + if (i == -EOPNOTSUPP) + i = rte_eth_dev_mac_addr_add(ev->port_id, &mac_addr, 0); + if (i < 0) + RTE_LOG(WARNING, L3FWD, "Failed to set MAC address\n"); + else { + port_mac[ev->port_id].mac.addr = ev->mac; + port_mac[ev->port_id].mac.valid = 1; + } + return 1; +} + +static +int link_change(const struct rte_ifpx_link_change *ev) +{ + uint16_t proxy_id = rte_ifpx_proxy_get(ev->port_id); + uint32_t mask; + + /* Mark the proxy too since we get only port notifications. */ + mask = 1U << ev->port_id | 1U << proxy_id; + + RTE_LOG(DEBUG, L3FWD, "Link change for port %d: %d\n", + ev->port_id, ev->is_up); + if (ev->is_up) { + rte_eth_dev_set_link_up(ev->port_id); + active_port_mask |= mask; + } else { + rte_eth_dev_set_link_down(ev->port_id); + active_port_mask &= ~mask; + } + active_port_mask &= enabled_port_mask; + return 1; +} + +static +int addr_add(const struct rte_ifpx_addr_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 address for port %d: %s\n", + ev->port_id, buf); + } + rte_lpm_add(ipv4_routes, ev->ip, 32, + ev->port_id << 2 | PROXY_ADDR); + return 1; +} + +static +int route_add(const struct rte_ifpx_route_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t nh, ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = 
rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 route for port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + + /* On Linux upon changing of the IP we get notification for both addr + * and route, so just check if we already have addr entry and if so + * then ignore this notification. + */ + if (ev->depth == 32 && + rte_lpm_lookup(ipv4_routes, ev->ip, &nh) == 0 && nh & PROXY_ADDR) + return 1; + + if (ev->gateway) { + nh = find_add_gateway(ev->port_id, ev->gateway); + if (nh != -1U) + rte_lpm_add(ipv4_routes, ev->ip, ev->depth, + nh << 2 | GW_ROUTE); + else + RTE_LOG(WARNING, L3FWD, "No free slot in GW array\n"); + } else + rte_lpm_add(ipv4_routes, ev->ip, ev->depth, + ev->port_id << 2 | HOST_ROUTE); + return 1; +} + +static +int addr_del(const struct rte_ifpx_addr_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 address removed from port %d: %s\n", + ev->port_id, buf); + } + rte_lpm_delete(ipv4_routes, ev->ip, 32); + return 1; +} + +static +int route_del(const struct rte_ifpx_route_change *ev) +{ + char buf[INET_ADDRSTRLEN]; + uint32_t ip; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + ip = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv4 route removed from port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + if (ev->gateway) + clear_gateway(ev->gateway); + rte_lpm_delete(ipv4_routes, ev->ip, ev->depth); + return 1; +} + +static +int addr6_add(const struct rte_ifpx_addr6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 address for port %d: %s\n", + ev->port_id, buf); + } + rte_lpm6_add(ipv6_routes, 
ev->ip, 128, + ev->port_id << 2 | PROXY_ADDR); + return 1; +} + +static +int route6_add(const struct rte_ifpx_route6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + /* See comment in route_add(). */ + uint32_t nh; + if (ev->depth == 128 && + rte_lpm6_lookup(ipv6_routes, ev->ip, &nh) == 0 && nh & PROXY_ADDR) + return 1; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 route for port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + /* no valid IPv6 address starts with 0x00 */ + if (ev->gateway[0]) { + nh = find_add_gateway6(ev->port_id, ev->gateway); + if (nh != -1U) + rte_lpm6_add(ipv6_routes, ev->ip, ev->depth, + nh << 2 | GW_ROUTE); + else + RTE_LOG(WARNING, L3FWD, "No free slot in GW6 array\n"); + } else + rte_lpm6_add(ipv6_routes, ev->ip, ev->depth, + ev->port_id << 2 | HOST_ROUTE); + return 1; +} + +static +int addr6_del(const struct rte_ifpx_addr6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 address removed from port %d: %s\n", + ev->port_id, buf); + } + rte_lpm6_delete(ipv6_routes, ev->ip, 128); + return 1; +} + +static +int route6_del(const struct rte_ifpx_route6_change *ev) +{ + char buf[INET6_ADDRSTRLEN]; + + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, buf, sizeof(buf)); + RTE_LOG(DEBUG, L3FWD, "IPv6 route removed from port %d: %s/%d\n", + ev->port_id, buf, ev->depth); + } + if (ev->gateway[0]) + clear_gateway6(ev->gateway); + rte_lpm6_delete(ipv6_routes, ev->ip, ev->depth); + return 1; +} + +static +int cfg_done(void) +{ + uint16_t port_id, px; + const struct rte_ifpx_info *pinfo; + + RTE_LOG(DEBUG, L3FWD, "Proxy config finished\n"); + + /* Copy MAC addresses of the proxies - to be used as src MAC during + * forwarding. 
+ */ + RTE_ETH_FOREACH_DEV(port_id) { + px = rte_ifpx_proxy_get(port_id); + if (px != RTE_MAX_ETHPORTS && px != port_id) { + pinfo = rte_ifpx_info_get(px); + rte_ether_addr_copy(&pinfo->mac, + &port_mac[port_id].mac.addr); + port_mac[port_id].mac.valid = 1; + } + } + + ifpx_ready = 1; + return 1; +} + +static +struct rte_ifpx_callbacks ifpx_callbacks = { + .mac_change = mac_change, +#if 0 + .mtu_change = mtu_change, +#endif + .link_change = link_change, + .addr_add = addr_add, + .addr_del = addr_del, + .addr6_add = addr6_add, + .addr6_del = addr6_del, + .route_add = route_add, + .route_del = route_del, + .route6_add = route6_add, + .route6_del = route6_del, + .cfg_done = cfg_done, +}; + +int init_if_proxy(void) +{ + char buf[16]; + unsigned int i; + + rte_ifpx_callbacks_register(&ifpx_callbacks); + + RTE_LCORE_FOREACH(i) { + if (lcore_conf[i].n_rx_queue == 0) + continue; + snprintf(buf, sizeof(buf), "IFPX-events_%d", i); + lcore_conf[i].ev_queue = rte_ring_create(buf, 16, SOCKET_ID_ANY, + RING_F_SP_ENQ | RING_F_SC_DEQ); + if (!lcore_conf[i].ev_queue) { + RTE_LOG(ERR, L3FWD, + "Failed to create event queue for lcore %d\n", + i); + return -1; + } + rte_ifpx_queue_add(lcore_conf[i].ev_queue); + } + + return rte_ifpx_listen(); +} + +void close_if_proxy(void) +{ + unsigned int i; + + RTE_LCORE_FOREACH(i) { + if (lcore_conf[i].n_rx_queue == 0) + continue; + rte_ring_free(lcore_conf[i].ev_queue); + } + rte_ifpx_close(); +} + +void wait_for_config_done(void) +{ + while (!ifpx_ready) + rte_delay_ms(100); +} + +#ifdef DO_RFC_1812_CHECKS +static inline +int is_valid_ipv4_pkt(struct rte_ipv4_hdr *pkt, uint32_t link_len) +{ + /* From http://www.rfc-editor.org/rfc/rfc1812.txt section 5.2.2 */ + /* + * 1. The packet length reported by the Link Layer must be large + * enough to hold the minimum length legal IP datagram (20 bytes). + */ + if (link_len < sizeof(struct rte_ipv4_hdr)) + return -1; + + /* 2. The IP checksum must be correct. 
*/ + /* this is checked in H/W */ + + /* + * 3. The IP version number must be 4. If the version number is not 4 + * then the packet may be another version of IP, such as IPng or + * ST-II. + */ + if (((pkt->version_ihl) >> 4) != 4) + return -3; + /* + * 4. The IP header length field must be large enough to hold the + * minimum length legal IP datagram (20 bytes = 5 words). + */ + if ((pkt->version_ihl & 0xf) < 5) + return -4; + + /* + * 5. The IP total length field must be large enough to hold the IP + * datagram header, whose length is specified in the IP header length + * field. + */ + if (rte_cpu_to_be_16(pkt->total_length) < sizeof(struct rte_ipv4_hdr)) + return -5; + + return 0; +} +#endif + +/* Send burst of packets on an output interface */ +static inline +int send_burst(struct lcore_conf *lconf, uint16_t n, uint16_t port) +{ + struct rte_mbuf **m_table; + int ret; + uint16_t queueid; + + queueid = lconf->tx_queue_id[port]; + m_table = (struct rte_mbuf **)lconf->tx_mbufs[port].m_table; + + ret = rte_eth_tx_burst(port, queueid, m_table, n); + if (unlikely(ret < n)) { + do { + rte_pktmbuf_free(m_table[ret]); + } while (++ret < n); + } + + return 0; +} + +/* Enqueue a single packet, and send burst if queue is filled */ +static inline +int send_single_packet(struct lcore_conf *lconf, + struct rte_mbuf *m, uint16_t port) +{ + uint16_t len; + + len = lconf->tx_mbufs[port].len; + lconf->tx_mbufs[port].m_table[len] = m; + len++; + + /* enough pkts to be sent */ + if (unlikely(len == MAX_PKT_BURST)) { + send_burst(lconf, MAX_PKT_BURST, port); + len = 0; + } + + lconf->tx_mbufs[port].len = len; + return 0; +} + +static inline +int ipv4_get_destination(const struct rte_ipv4_hdr *ipv4_hdr, + struct rte_lpm *lpm, uint32_t *next_hop) +{ + return rte_lpm_lookup(lpm, + rte_be_to_cpu_32(ipv4_hdr->dst_addr), + next_hop); +} + +static inline +int ipv6_get_destination(const struct rte_ipv6_hdr *ipv6_hdr, + struct rte_lpm6 *lpm, uint32_t *next_hop) +{ + return 
rte_lpm6_lookup(lpm, ipv6_hdr->dst_addr, next_hop); +} + +static +uint16_t ipv4_process_pkt(struct lcore_conf *lconf, + struct rte_ether_hdr *eth_hdr, + struct rte_ipv4_hdr *ipv4_hdr, uint16_t portid) +{ + union lladdr_t lladdr = { 0 }; + int i; + uint32_t ip, nh; + + /* Here we know that packet is not from proxy - this case is handled + * in the main loop - so if we fail to find destination we will direct + * it to the proxy. + */ + if (ipv4_get_destination(ipv4_hdr, ipv4_routes, &nh) < 0) + return rte_ifpx_proxy_get(portid); + + if (nh & PROXY_ADDR) + return nh >> 2; + + /* Packet not to us so update src/dst MAC. */ + if (nh & GW_ROUTE) { + i = nh >> 2; + if (ipv4_gateways[i].lladdr.mac.valid) + lladdr = ipv4_gateways[i].lladdr; + else { + i = rte_hash_lookup(lconf->neigh_hash, + &ipv4_gateways[i].ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh_map[i]; + ipv4_gateways[i].lladdr = lladdr; + } + nh = ipv4_gateways[i].port; + } else { + nh >>= 2; + ip = rte_be_to_cpu_32(ipv4_hdr->dst_addr); + i = rte_hash_lookup(lconf->neigh_hash, &ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh_map[i]; + } + + RTE_ASSERT(lladdr.mac.valid); + RTE_ASSERT(port_mac[nh].mac.valid); + /* dst addr */ + *(uint64_t *)&eth_hdr->d_addr = lladdr.val; + /* src addr */ + rte_ether_addr_copy(&port_mac[nh].mac.addr, &eth_hdr->s_addr); + + return nh; +} + +static +uint16_t ipv6_process_pkt(struct lcore_conf *lconf, + struct rte_ether_hdr *eth_hdr, + struct rte_ipv6_hdr *ipv6_hdr, uint16_t portid) +{ + union lladdr_t lladdr = { 0 }; + int i; + uint32_t nh; + + /* Here we know that packet is not from proxy - this case is handled + * in the main loop - so if we fail to find destination we will direct + * it to the proxy. + */ + if (ipv6_get_destination(ipv6_hdr, ipv6_routes, &nh) < 0) + return rte_ifpx_proxy_get(portid); + + if (nh & PROXY_ADDR) + return nh >> 2; + + /* Packet not to us so update src/dst MAC. 
*/ + if (nh & GW_ROUTE) { + i = nh >> 2; + if (ipv6_gateways[i].lladdr.mac.valid) + lladdr = ipv6_gateways[i].lladdr; + else { + i = rte_hash_lookup(lconf->neigh6_hash, + ipv6_gateways[i].ip); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh6_map[i]; + ipv6_gateways[i].lladdr = lladdr; + } + nh = ipv6_gateways[i].port; + } else { + nh >>= 2; + i = rte_hash_lookup(lconf->neigh6_hash, ipv6_hdr->dst_addr); + if (i < 0) + return rte_ifpx_proxy_get(portid); + lladdr = lconf->neigh6_map[i]; + } + + RTE_ASSERT(lladdr.mac.valid); + /* dst addr */ + *(uint64_t *)&eth_hdr->d_addr = lladdr.val; + /* src addr */ + rte_ether_addr_copy(&port_mac[nh].mac.addr, &eth_hdr->s_addr); + + return nh; +} + +static __rte_always_inline +void l3fwd_lpm_simple_forward(struct rte_mbuf *m, uint16_t portid, + struct lcore_conf *lconf) +{ + struct rte_ether_hdr *eth_hdr; + uint32_t nh; + + eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *); + + if (RTE_ETH_IS_IPV4_HDR(m->packet_type)) { + /* Handle IPv4 headers.*/ + struct rte_ipv4_hdr *ipv4_hdr; + + ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *, + sizeof(*eth_hdr)); + +#ifdef DO_RFC_1812_CHECKS + /* Check to make sure the packet is valid (RFC1812) */ + if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt_len) < 0) { + rte_pktmbuf_free(m); + return; + } +#endif + nh = ipv4_process_pkt(lconf, eth_hdr, ipv4_hdr, portid); + +#ifdef DO_RFC_1812_CHECKS + /* Update time to live and header checksum */ + --(ipv4_hdr->time_to_live); + ++(ipv4_hdr->hdr_checksum); +#endif + } else if (RTE_ETH_IS_IPV6_HDR(m->packet_type)) { + /* Handle IPv6 headers.*/ + struct rte_ipv6_hdr *ipv6_hdr; + + ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *, + sizeof(*eth_hdr)); + + nh = ipv6_process_pkt(lconf, eth_hdr, ipv6_hdr, portid); + } else + /* Unhandled protocol */ + nh = rte_ifpx_proxy_get(portid); + + if (nh >= RTE_MAX_ETHPORTS || (active_port_mask & 1 << nh) == 0) + rte_pktmbuf_free(m); + else + send_single_packet(lconf, m, nh); +} + 
+static inline +void l3fwd_send_packets(int nb_rx, struct rte_mbuf **pkts_burst, + uint16_t portid, struct lcore_conf *lconf) +{ + int32_t j; + + /* Prefetch first packets */ + for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *)); + + /* Prefetch and forward already prefetched packets. */ + for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) { + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[ + j + PREFETCH_OFFSET], void *)); + l3fwd_lpm_simple_forward(pkts_burst[j], portid, lconf); + } + + /* Forward remaining prefetched packets */ + for (; j < nb_rx; j++) + l3fwd_lpm_simple_forward(pkts_burst[j], portid, lconf); +} + +static +void handle_neigh_add(struct lcore_conf *lconf, + const struct rte_ifpx_neigh_change *ev) +{ + char mac[RTE_ETHER_ADDR_FMT_SIZE]; + char ip[INET_ADDRSTRLEN]; + int32_t i, a; + + i = rte_hash_add_key(lconf->neigh_hash, &ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to add IPv4 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(mac, sizeof(mac), &ev->mac); + a = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &a, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour update for port %d: %s -> %s@%d\n", + ev->port_id, ip, mac, i); + } + lconf->neigh_map[i].mac.addr = ev->mac; + lconf->neigh_map[i].mac.valid = 1; +} + +static +void handle_neigh_del(struct lcore_conf *lconf, + const struct rte_ifpx_neigh_change *ev) +{ + char ip[INET_ADDRSTRLEN]; + int32_t i, a; + + i = rte_hash_del_key(lconf->neigh_hash, &ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, + "Failed to remove IPv4 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + a = rte_cpu_to_be_32(ev->ip); + inet_ntop(AF_INET, &a, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour removal for port %d: %s\n", + ev->port_id, ip); + } + lconf->neigh_map[i].val = 0; +} + +static +void handle_neigh6_add(struct 
lcore_conf *lconf, + const struct rte_ifpx_neigh6_change *ev) +{ + char mac[RTE_ETHER_ADDR_FMT_SIZE]; + char ip[INET6_ADDRSTRLEN]; + int32_t i; + + i = rte_hash_add_key(lconf->neigh6_hash, ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to add IPv6 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + rte_ether_format_addr(mac, sizeof(mac), &ev->mac); + inet_ntop(AF_INET6, ev->ip, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour update for port %d: %s -> %s@%d\n", + ev->port_id, ip, mac, i); + } + lconf->neigh6_map[i].mac.addr = ev->mac; + lconf->neigh6_map[i].mac.valid = 1; +} + +static +void handle_neigh6_del(struct lcore_conf *lconf, + const struct rte_ifpx_neigh6_change *ev) +{ + char ip[INET6_ADDRSTRLEN]; + int32_t i; + + i = rte_hash_del_key(lconf->neigh6_hash, ev->ip); + if (i < 0) { + RTE_LOG(WARNING, L3FWD, "Failed to remove IPv6 neighbour entry\n"); + return; + } + if (rte_log_get_level(RTE_LOGTYPE_L3FWD) >= (int)RTE_LOG_DEBUG) { + inet_ntop(AF_INET6, ev->ip, ip, sizeof(ip)); + RTE_LOG(DEBUG, L3FWD, "Neighbour removal for port %d: %s\n", + ev->port_id, ip); + } + lconf->neigh6_map[i].val = 0; +} + +static +void handle_events(struct lcore_conf *lconf) +{ + struct rte_ifpx_event *ev; + + while (rte_ring_dequeue(lconf->ev_queue, (void **)&ev) == 0) { + switch (ev->type) { + case RTE_IFPX_NEIGH_ADD: + handle_neigh_add(lconf, &ev->neigh_change); + break; + case RTE_IFPX_NEIGH_DEL: + handle_neigh_del(lconf, &ev->neigh_change); + break; + case RTE_IFPX_NEIGH6_ADD: + handle_neigh6_add(lconf, &ev->neigh6_change); + break; + case RTE_IFPX_NEIGH6_DEL: + handle_neigh6_del(lconf, &ev->neigh6_change); + break; + default: + RTE_LOG(WARNING, L3FWD, + "Unexpected event: %d\n", ev->type); + } + free(ev); + } +} + +void setup_lpm(void) +{ + struct rte_lpm6_config cfg6; + struct rte_lpm_config cfg4; + + /* create the LPM table */ + cfg4.max_rules = IPV4_L3FWD_LPM_MAX_RULES; + cfg4.number_tbl8s = 
IPV4_L3FWD_LPM_NUMBER_TBL8S; + cfg4.flags = 0; + ipv4_routes = rte_lpm_create("IPV4_L3FWD_LPM", SOCKET_ID_ANY, &cfg4); + if (ipv4_routes == NULL) + rte_exit(EXIT_FAILURE, "Unable to create the l3fwd LPM table\n"); + + /* create the LPM6 table */ + cfg6.max_rules = IPV6_L3FWD_LPM_MAX_RULES; + cfg6.number_tbl8s = IPV6_L3FWD_LPM_NUMBER_TBL8S; + cfg6.flags = 0; + ipv6_routes = rte_lpm6_create("IPV6_L3FWD_LPM", SOCKET_ID_ANY, &cfg6); + if (ipv6_routes == NULL) + rte_exit(EXIT_FAILURE, "Unable to create the l3fwd LPM table\n"); +} + +static +uint32_t hash_ipv4(const void *key, uint32_t key_len __rte_unused, + uint32_t init_val) +{ +#ifndef USE_HASH_CRC + return rte_jhash_1word(*(const uint32_t *)key, init_val); +#else + return rte_hash_crc_4byte(*(const uint32_t *)key, init_val); +#endif +} + +static +uint32_t hash_ipv6(const void *key, uint32_t key_len __rte_unused, + uint32_t init_val) +{ +#ifndef USE_HASH_CRC + return rte_jhash_32b(key, 4, init_val); +#else + const uint64_t *pk = key; + init_val = rte_hash_crc_8byte(*pk, init_val); + return rte_hash_crc_8byte(*(pk+1), init_val); +#endif +} + +static +int setup_neigh(struct lcore_conf *lconf) +{ + char buf[16]; + struct rte_hash_parameters ipv4_hparams = { + .name = buf, + .entries = L3FWD_NEIGH_ENTRIES, + .key_len = 4, + .hash_func = hash_ipv4, + .hash_func_init_val = 0, + }; + struct rte_hash_parameters ipv6_hparams = { + .name = buf, + .entries = L3FWD_NEIGH_ENTRIES, + .key_len = 16, + .hash_func = hash_ipv6, + .hash_func_init_val = 0, + }; + + snprintf(buf, sizeof(buf), "neigh_hash-%d", rte_lcore_id()); + lconf->neigh_hash = rte_hash_create(&ipv4_hparams); + snprintf(buf, sizeof(buf), "neigh_map-%d", rte_lcore_id()); + lconf->neigh_map = rte_zmalloc(buf, + L3FWD_NEIGH_ENTRIES*sizeof(*lconf->neigh_map), + 8); + if (lconf->neigh_hash == NULL || lconf->neigh_map == NULL) { + RTE_LOG(ERR, L3FWD, + "Unable to create the l3fwd ARP/IPv4 table (lcore %d)\n", + rte_lcore_id()); + return -1; + } + + snprintf(buf, 
sizeof(buf), "neigh6_hash-%d", rte_lcore_id()); + lconf->neigh6_hash = rte_hash_create(&ipv6_hparams); + snprintf(buf, sizeof(buf), "neigh6_map-%d", rte_lcore_id()); + lconf->neigh6_map = rte_zmalloc(buf, + L3FWD_NEIGH_ENTRIES*sizeof(*lconf->neigh6_map), + 8); + if (lconf->neigh6_hash == NULL || lconf->neigh6_map == NULL) { + RTE_LOG(ERR, L3FWD, + "Unable to create the l3fwd ARP/IPv6 table (lcore %d)\n", + rte_lcore_id()); + return -1; + } + return 0; +} + +int lpm_check_ptype(int portid) +{ + int i, ret; + int ptype_l3_ipv4 = 0, ptype_l3_ipv6 = 0; + uint32_t ptype_mask = RTE_PTYPE_L3_MASK; + + ret = rte_eth_dev_get_supported_ptypes(portid, ptype_mask, NULL, 0); + if (ret <= 0) + return 0; + + uint32_t ptypes[ret]; + + ret = rte_eth_dev_get_supported_ptypes(portid, ptype_mask, ptypes, ret); + for (i = 0; i < ret; ++i) { + if (ptypes[i] & RTE_PTYPE_L3_IPV4) + ptype_l3_ipv4 = 1; + if (ptypes[i] & RTE_PTYPE_L3_IPV6) + ptype_l3_ipv6 = 1; + } + + if (ptype_l3_ipv4 == 0) + RTE_LOG(WARNING, L3FWD, + "port %d cannot parse RTE_PTYPE_L3_IPV4\n", portid); + + if (ptype_l3_ipv6 == 0) + RTE_LOG(WARNING, L3FWD, + "port %d cannot parse RTE_PTYPE_L3_IPV6\n", portid); + + if (ptype_l3_ipv4 && ptype_l3_ipv6) + return 1; + + return 0; + +} + +static inline +void lpm_parse_ptype(struct rte_mbuf *m) +{ + struct rte_ether_hdr *eth_hdr; + uint32_t packet_type = RTE_PTYPE_UNKNOWN; + uint16_t ether_type; + + eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *); + ether_type = eth_hdr->ether_type; + if (ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) + packet_type |= RTE_PTYPE_L3_IPV4_EXT_UNKNOWN; + else if (ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV6)) + packet_type |= RTE_PTYPE_L3_IPV6_EXT_UNKNOWN; + + m->packet_type = packet_type; +} + +uint16_t lpm_cb_parse_ptype(uint16_t port __rte_unused, + uint16_t queue __rte_unused, + struct rte_mbuf *pkts[], uint16_t nb_pkts, + uint16_t max_pkts __rte_unused, + void *user_param __rte_unused) +{ + unsigned int i; + + if 
(unlikely(nb_pkts == 0)) + return nb_pkts; + rte_prefetch0(rte_pktmbuf_mtod(pkts[0], struct ether_hdr *)); + for (i = 0; i < (unsigned int) (nb_pkts - 1); ++i) { + rte_prefetch0(rte_pktmbuf_mtod(pkts[i+1], + struct ether_hdr *)); + lpm_parse_ptype(pkts[i]); + } + lpm_parse_ptype(pkts[i]); + + return nb_pkts; +} + +/* main processing loop */ +int lpm_main_loop(void *dummy __rte_unused) +{ + struct rte_mbuf *pkts_burst[MAX_PKT_BURST]; + unsigned int lcore_id; + uint64_t prev_tsc, diff_tsc, cur_tsc; + int i, j, nb_rx; + uint16_t portid; + uint8_t queueid; + struct lcore_conf *lconf; + struct lcore_rx_queue *rxq; + const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) / + US_PER_S * BURST_TX_DRAIN_US; + + prev_tsc = 0; + + lcore_id = rte_lcore_id(); + lconf = &lcore_conf[lcore_id]; + + if (setup_neigh(lconf) < 0) { + RTE_LOG(ERR, L3FWD, "lcore %u failed to setup its ARP tables\n", + lcore_id); + return 0; + } + + if (lconf->n_rx_queue == 0) { + RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id); + return 0; + } + + RTE_LOG(INFO, L3FWD, "entering main loop on lcore %u\n", lcore_id); + + for (i = 0; i < lconf->n_rx_queue; i++) { + + portid = lconf->rx_queue_list[i].port_id; + queueid = lconf->rx_queue_list[i].queue_id; + RTE_LOG(INFO, L3FWD, + " -- lcoreid=%u portid=%u rxqueueid=%hhu\n", + lcore_id, portid, queueid); + } + + while (!force_quit) { + + cur_tsc = rte_rdtsc(); + /* + * TX burst and event queue drain + */ + diff_tsc = cur_tsc - prev_tsc; + if (unlikely(diff_tsc > drain_tsc)) { + + for (i = 0; i < lconf->n_tx_port; ++i) { + portid = lconf->tx_port_id[i]; + if (lconf->tx_mbufs[portid].len == 0) + continue; + send_burst(lconf, + lconf->tx_mbufs[portid].len, + portid); + lconf->tx_mbufs[portid].len = 0; + } + + if (diff_tsc > EV_QUEUE_DRAIN * drain_tsc) { + if (lconf->ev_queue && + !rte_ring_empty(lconf->ev_queue)) + handle_events(lconf); + prev_tsc = cur_tsc; + } + } + + /* + * Read packet from RX queues + */ + for (i = 0; i < 
lconf->n_rx_queue; ++i) { + rxq = &lconf->rx_queue_list[i]; + portid = rxq->port_id; + queueid = rxq->queue_id; + nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst, + MAX_PKT_BURST); + if (nb_rx == 0) + continue; + /* If current queue is from proxy interface then there + * is no need to figure out destination port - just + * forward it to the bound port. + */ + if (unlikely(rxq->dst_port != RTE_MAX_ETHPORTS)) { + for (j = 0; j < nb_rx; ++j) + send_single_packet(lconf, pkts_burst[j], + rxq->dst_port); + } else + l3fwd_send_packets(nb_rx, pkts_burst, portid, + lconf); + } + } + + return 0; +} diff --git a/examples/l3fwd-ifpx/l3fwd.h b/examples/l3fwd-ifpx/l3fwd.h new file mode 100644 index 0000000000..fc60078c50 --- /dev/null +++ b/examples/l3fwd-ifpx/l3fwd.h @@ -0,0 +1,98 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. + */ + +#ifndef __L3_FWD_H__ +#define __L3_FWD_H__ + +#include <stdbool.h> + +#include <rte_ethdev.h> +#include <rte_log.h> +#include <rte_hash.h> + +#define RTE_LOGTYPE_L3FWD RTE_LOGTYPE_USER1 + +#define MAX_PKT_BURST 32 +#define BURST_TX_DRAIN_US 100 /* TX drain every ~100us */ +#define EV_QUEUE_DRAIN 5 /* Check event queue every 5 TX drains */ + +#define MAX_RX_QUEUE_PER_LCORE 16 + +/* + * Try to avoid TX buffering if we have at least MAX_TX_BURST packets to send. + */ +#define MAX_TX_BURST (MAX_PKT_BURST / 2) + +/* Configure how many packets ahead to prefetch, when reading packets */ +#define PREFETCH_OFFSET 3 + +/* Hash parameters. 
*/ +#ifdef RTE_ARCH_64 +/* default to 4 million hash entries (approx) */ +#define L3FWD_HASH_ENTRIES (1024*1024*4) +#else +/* 32-bit has less address-space for hugepage memory, limit to 1M entries */ +#define L3FWD_HASH_ENTRIES (1024*1024*1) +#endif +#define HASH_ENTRY_NUMBER_DEFAULT 4 +/* Default ARP table size */ +#define L3FWD_NEIGH_ENTRIES 1024 + +union lladdr_t { + uint64_t val; + struct { + struct rte_ether_addr addr; + uint16_t valid; + } mac; +}; + +struct mbuf_table { + uint16_t len; + struct rte_mbuf *m_table[MAX_PKT_BURST]; +}; + +struct lcore_rx_queue { + uint16_t port_id; + uint16_t dst_port; + uint8_t queue_id; +} __rte_cache_aligned; + +struct lcore_conf { + uint16_t n_rx_queue; + struct lcore_rx_queue rx_queue_list[MAX_RX_QUEUE_PER_LCORE]; + uint16_t n_tx_port; + uint16_t tx_port_id[RTE_MAX_ETHPORTS]; + uint16_t tx_queue_id[RTE_MAX_ETHPORTS]; + struct mbuf_table tx_mbufs[RTE_MAX_ETHPORTS]; + struct rte_ring *ev_queue; + union lladdr_t *neigh_map; + struct rte_hash *neigh_hash; + union lladdr_t *neigh6_map; + struct rte_hash *neigh6_hash; +} __rte_cache_aligned; + +extern volatile bool force_quit; + +/* mask of enabled/active ports */ +extern uint32_t enabled_port_mask; +extern uint32_t active_port_mask; + +extern struct lcore_conf lcore_conf[RTE_MAX_LCORE]; + +int init_if_proxy(void); +void close_if_proxy(void); + +void wait_for_config_done(void); + +void setup_lpm(void); + +int lpm_check_ptype(int portid); + +uint16_t +lpm_cb_parse_ptype(uint16_t port, uint16_t queue, struct rte_mbuf *pkts[], + uint16_t nb_pkts, uint16_t max_pkts, void *user_param); + +int lpm_main_loop(__attribute__((unused)) void *dummy); + +#endif /* __L3_FWD_H__ */ diff --git a/examples/l3fwd-ifpx/main.c b/examples/l3fwd-ifpx/main.c new file mode 100644 index 0000000000..7f1da5ec2e --- /dev/null +++ b/examples/l3fwd-ifpx/main.c @@ -0,0 +1,740 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Marvell International Ltd. 
+ */ + +#include <stdlib.h> +#include <stdint.h> +#include <inttypes.h> +#include <sys/types.h> +#include <string.h> +#include <sys/queue.h> +#include <stdarg.h> +#include <errno.h> +#include <getopt.h> +#include <signal.h> +#include <stdbool.h> + +#include <rte_byteorder.h> +#include <rte_memory.h> +#include <rte_memcpy.h> +#include <rte_eal.h> +#include <rte_launch.h> +#include <rte_atomic.h> +#include <rte_cycles.h> +#include <rte_prefetch.h> +#include <rte_lcore.h> +#include <rte_per_lcore.h> +#include <rte_branch_prediction.h> +#include <rte_interrupts.h> +#include <rte_random.h> +#include <rte_debug.h> +#include <rte_ether.h> +#include <rte_ethdev.h> +#include <rte_mempool.h> +#include <rte_mbuf.h> +#include <rte_ip.h> +#include <rte_tcp.h> +#include <rte_udp.h> +#include <rte_string_fns.h> +#include <rte_cpuflags.h> +#include <rte_if_proxy.h> + +#include <cmdline_parse.h> +#include <cmdline_parse_etheraddr.h> + +#include "l3fwd.h" + +/* + * Configurable number of RX/TX ring descriptors + */ +#define RTE_TEST_RX_DESC_DEFAULT 1024 +#define RTE_TEST_TX_DESC_DEFAULT 1024 + +#define MAX_TX_QUEUE_PER_PORT RTE_MAX_ETHPORTS +#define MAX_RX_QUEUE_PER_PORT 128 + +#define MAX_LCORE_PARAMS 1024 + +/* Static global variables used within this file. */ +static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT; +static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT; + +/**< Ports set in promiscuous mode off by default. */ +static int promiscuous_on; + +/* Global variables. 
*/ + +static int parse_ptype; /**< Parse packet type using rx callback, and */ + /**< disabled by default */ + +volatile bool force_quit; + +/* mask of enabled/active ports */ +uint32_t enabled_port_mask; +uint32_t active_port_mask; + +struct lcore_conf lcore_conf[RTE_MAX_LCORE]; + +struct lcore_params { + uint16_t port_id; + uint8_t queue_id; + uint8_t lcore_id; +} __rte_cache_aligned; + +static struct lcore_params lcore_params[MAX_LCORE_PARAMS]; +static struct lcore_params lcore_params_default[] = { + {0, 0, 2}, + {0, 1, 2}, + {0, 2, 2}, + {1, 0, 2}, + {1, 1, 2}, + {1, 2, 2}, + {2, 0, 2}, + {3, 0, 3}, + {3, 1, 3}, +}; + +static uint16_t nb_lcore_params; + +static struct rte_eth_conf port_conf = { + .rxmode = { + .mq_mode = ETH_MQ_RX_RSS, + .max_rx_pkt_len = RTE_ETHER_MAX_LEN, + .split_hdr_size = 0, + .offloads = DEV_RX_OFFLOAD_CHECKSUM, + }, + .rx_adv_conf = { + .rss_conf = { + .rss_key = NULL, + .rss_hf = ETH_RSS_IP, + }, + }, + .txmode = { + .mq_mode = ETH_MQ_TX_NONE, + }, +}; + +static struct rte_mempool *pktmbuf_pool; + +static int +check_lcore_params(void) +{ + uint8_t queue, lcore; + uint16_t i, port_id; + int socketid; + + for (i = 0; i < nb_lcore_params; ++i) { + queue = lcore_params[i].queue_id; + if (queue >= MAX_RX_QUEUE_PER_PORT) { + RTE_LOG(ERR, L3FWD, "Invalid queue number: %hhu\n", + queue); + return -1; + } + lcore = lcore_params[i].lcore_id; + if (!rte_lcore_is_enabled(lcore)) { + RTE_LOG(ERR, L3FWD, "lcore %hhu is not enabled " + "in lcore mask\n", lcore); + return -1; + } + port_id = lcore_params[i].port_id; + if ((enabled_port_mask & (1 << port_id)) == 0) { + RTE_LOG(ERR, L3FWD, "port %u is not enabled " + "in port mask\n", port_id); + return -1; + } + if (!rte_eth_dev_is_valid_port(port_id)) { + RTE_LOG(ERR, L3FWD, "port %u is not present " + "on the board\n", port_id); + return -1; + } + socketid = rte_lcore_to_socket_id(lcore); + if (socketid != 0) { + RTE_LOG(WARNING, L3FWD, + "lcore %hhu is on socket %d with numa off\n", + lcore, 
socketid); + } + } + return 0; +} + +static int +add_proxies(void) +{ + uint16_t i, p, port_id, proxy_id; + + for (i = 0, p = nb_lcore_params; i < nb_lcore_params; ++i) { + if (p >= RTE_DIM(lcore_params)) { + RTE_LOG(ERR, L3FWD, "Not enough room in lcore_params " + "to add proxy\n"); + return -1; + } + port_id = lcore_params[i].port_id; + if (rte_ifpx_proxy_get(port_id) != RTE_MAX_ETHPORTS) + continue; + + proxy_id = rte_ifpx_proxy_create(RTE_IFPX_DEFAULT); + if (proxy_id == RTE_MAX_ETHPORTS) { + RTE_LOG(ERR, L3FWD, "Failed to create proxy\n"); + return -1; + } + rte_ifpx_port_bind(port_id, proxy_id); + /* mark proxy as enabled - the corresponding port is, since we + * are after checking of lcore_params + */ + enabled_port_mask |= 1 << proxy_id; + lcore_params[p].port_id = proxy_id; + lcore_params[p].lcore_id = lcore_params[i].lcore_id; + lcore_params[p].queue_id = lcore_params[i].queue_id; + ++p; + } + + nb_lcore_params = p; + return 0; +} + +static uint8_t +get_port_n_rx_queues(const uint16_t port) +{ + int queue = -1; + uint16_t i; + + for (i = 0; i < nb_lcore_params; ++i) { + if (lcore_params[i].port_id == port) { + if (lcore_params[i].queue_id == queue+1) + queue = lcore_params[i].queue_id; + else + rte_exit(EXIT_FAILURE, "queue ids of the port %d must be" + " in sequence and must start with 0\n", + lcore_params[i].port_id); + } + } + return (uint8_t)(++queue); +} + +static int +init_lcore_rx_queues(void) +{ + uint16_t i, p, nb_rx_queue; + uint8_t lcore; + struct lcore_rx_queue *rq; + + for (i = 0; i < nb_lcore_params; ++i) { + lcore = lcore_params[i].lcore_id; + nb_rx_queue = lcore_conf[lcore].n_rx_queue; + if (nb_rx_queue >= MAX_RX_QUEUE_PER_LCORE) { + RTE_LOG(ERR, L3FWD, + "too many queues (%u) for lcore: %u\n", + (unsigned int)nb_rx_queue + 1, + (unsigned int)lcore); + return -1; + } + rq = &lcore_conf[lcore].rx_queue_list[nb_rx_queue]; + rq->port_id = lcore_params[i].port_id; + rq->queue_id = lcore_params[i].queue_id; + if (rte_ifpx_is_proxy(rq->port_id)) 
{ + if (rte_ifpx_port_get(rq->port_id, &p, 1) > 0) + rq->dst_port = p; + else + RTE_LOG(WARNING, L3FWD, + "Found proxy that has no port bound\n"); + } else + rq->dst_port = RTE_MAX_ETHPORTS; + lcore_conf[lcore].n_rx_queue++; + } + return 0; +} + +/* display usage */ +static void +print_usage(const char *prgname) +{ + fprintf(stderr, "%s [EAL options] --" + " -p PORTMASK" + " [-P]" + " --config (port,queue,lcore)[,(port,queue,lcore)]" + " [--ipv6]" + " [--parse-ptype]" + + " -p PORTMASK: Hexadecimal bitmask of ports to configure\n" + " -P : Enable promiscuous mode\n" + " --config (port,queue,lcore): Rx queue configuration\n" + " --ipv6: Set if running ipv6 packets\n" + " --parse-ptype: Set to use software to analyze packet type\n", + prgname); +} + +static int +parse_portmask(const char *portmask) +{ + char *end = NULL; + unsigned long pm; + + /* parse hexadecimal string */ + pm = strtoul(portmask, &end, 16); + if ((portmask[0] == '\0') || (end == NULL) || (*end != '\0')) + return -1; + + if (pm == 0) + return -1; + + return pm; +} + +static int +parse_config(const char *q_arg) +{ + char s[256]; + const char *p, *p0 = q_arg; + char *end; + enum fieldnames { + FLD_PORT = 0, + FLD_QUEUE, + FLD_LCORE, + _NUM_FLD + }; + unsigned long int_fld[_NUM_FLD]; + char *str_fld[_NUM_FLD]; + int i; + unsigned int size; + + nb_lcore_params = 0; + + while ((p = strchr(p0, '(')) != NULL) { + ++p; + p0 = strchr(p, ')'); + if (p0 == NULL) + return -1; + + size = p0 - p; + if (size >= sizeof(s)) + return -1; + + snprintf(s, sizeof(s), "%.*s", size, p); + if (rte_strsplit(s, sizeof(s), str_fld, _NUM_FLD, ',') != + _NUM_FLD) + return -1; + for (i = 0; i < _NUM_FLD; i++) { + errno = 0; + int_fld[i] = strtoul(str_fld[i], &end, 0); + if (errno != 0 || end == str_fld[i] || int_fld[i] > 255) + return -1; + } + if (nb_lcore_params >= MAX_LCORE_PARAMS) { + RTE_LOG(ERR, L3FWD, "exceeded max number of lcore " + "params: %hu\n", nb_lcore_params); + return -1; + } + 
lcore_params[nb_lcore_params].port_id = + (uint8_t)int_fld[FLD_PORT]; + lcore_params[nb_lcore_params].queue_id = + (uint8_t)int_fld[FLD_QUEUE]; + lcore_params[nb_lcore_params].lcore_id = + (uint8_t)int_fld[FLD_LCORE]; + ++nb_lcore_params; + } + return 0; +} + +#define MAX_JUMBO_PKT_LEN 9600 +#define MEMPOOL_CACHE_SIZE 256 + +static const char short_options[] = + "p:" /* portmask */ + "P" /* promiscuous */ + "L" /* enable long prefix match */ + "E" /* enable exact match */ + ; + +#define CMD_LINE_OPT_CONFIG "config" +#define CMD_LINE_OPT_IPV6 "ipv6" +#define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype" +enum { + /* long options mapped to a short option */ + + /* first long only option value must be >= 256, so that we won't + * conflict with short options + */ + CMD_LINE_OPT_MIN_NUM = 256, + CMD_LINE_OPT_CONFIG_NUM, + CMD_LINE_OPT_PARSE_PTYPE_NUM, +}; + +static const struct option lgopts[] = { + {CMD_LINE_OPT_CONFIG, 1, 0, CMD_LINE_OPT_CONFIG_NUM}, + {CMD_LINE_OPT_PARSE_PTYPE, 0, 0, CMD_LINE_OPT_PARSE_PTYPE_NUM}, + {NULL, 0, 0, 0} +}; + +/* + * This expression is used to calculate the number of mbufs needed + * depending on user input, taking into account memory for rx and + * tx hardware rings, cache per lcore and mtable per port per lcore. + * RTE_MAX is used to ensure that NB_MBUF never goes below a minimum + * value of 8192 + */ +#define NB_MBUF(nports) RTE_MAX( \ + (nports*nb_rx_queue*nb_rxd + \ + nports*nb_lcores*MAX_PKT_BURST + \ + nports*n_tx_queue*nb_txd + \ + nb_lcores*MEMPOOL_CACHE_SIZE), \ + 8192U) + +/* Parse the argument given in the command line of the application */ +static int +parse_args(int argc, char **argv) +{ + int opt, ret; + char **argvopt; + int option_index; + char *prgname = argv[0]; + + argvopt = argv; + + /* Error or normal output strings. 
*/ + while ((opt = getopt_long(argc, argvopt, short_options, + lgopts, &option_index)) != EOF) { + + switch (opt) { + /* portmask */ + case 'p': + enabled_port_mask = parse_portmask(optarg); + if (enabled_port_mask == 0) { + RTE_LOG(ERR, L3FWD, "Invalid portmask\n"); + print_usage(prgname); + return -1; + } + break; + + case 'P': + promiscuous_on = 1; + break; + + /* long options */ + case CMD_LINE_OPT_CONFIG_NUM: + ret = parse_config(optarg); + if (ret) { + RTE_LOG(ERR, L3FWD, "Invalid config\n"); + print_usage(prgname); + return -1; + } + break; + + case CMD_LINE_OPT_PARSE_PTYPE_NUM: + RTE_LOG(INFO, L3FWD, "soft parse-ptype is enabled\n"); + parse_ptype = 1; + break; + + default: + print_usage(prgname); + return -1; + } + } + + if (nb_lcore_params == 0) { + memcpy(lcore_params, lcore_params_default, + sizeof(lcore_params_default)); + nb_lcore_params = RTE_DIM(lcore_params_default); + } + + if (optind >= 0) + argv[optind-1] = prgname; + + ret = optind-1; + optind = 1; /* reset getopt lib */ + return ret; +} + +static void +signal_handler(int signum) +{ + if (signum == SIGINT || signum == SIGTERM) { + RTE_LOG(NOTICE, L3FWD, + "\n\nSignal %d received, preparing to exit...\n", + signum); + force_quit = true; + } +} + +static int +prepare_ptype_parser(uint16_t portid, uint16_t queueid) +{ + if (parse_ptype) { + RTE_LOG(INFO, L3FWD, "Port %d: softly parse packet type info\n", + portid); + if (rte_eth_add_rx_callback(portid, queueid, + lpm_cb_parse_ptype, + NULL)) + return 1; + + RTE_LOG(ERR, L3FWD, "Failed to add rx callback: port=%d\n", + portid); + return 0; + } + + if (lpm_check_ptype(portid)) + return 1; + + RTE_LOG(ERR, L3FWD, + "port %d cannot parse packet type, please add --%s\n", + portid, CMD_LINE_OPT_PARSE_PTYPE); + return 0; +} + +int +main(int argc, char **argv) +{ + struct lcore_conf *lconf; + struct rte_eth_dev_info dev_info; + struct rte_eth_txconf *txconf; + int ret; + unsigned int nb_ports; + uint32_t nb_mbufs; + uint16_t queueid, portid; + unsigned 
int lcore_id; + uint32_t nb_tx_queue, nb_lcores; + uint8_t nb_rx_queue, queue; + + /* init EAL */ + ret = rte_eal_init(argc, argv); + if (ret < 0) + rte_exit(EXIT_FAILURE, "Invalid EAL parameters\n"); + argc -= ret; + argv += ret; + + force_quit = false; + signal(SIGINT, signal_handler); + signal(SIGTERM, signal_handler); + + /* parse application arguments (after the EAL ones) */ + ret = parse_args(argc, argv); + if (ret < 0) + rte_exit(EXIT_FAILURE, "Invalid L3FWD parameters\n"); + + if (check_lcore_params() < 0) + rte_exit(EXIT_FAILURE, "check_lcore_params failed\n"); + + if (add_proxies() < 0) + rte_exit(EXIT_FAILURE, "add_proxies failed\n"); + + ret = init_lcore_rx_queues(); + if (ret < 0) + rte_exit(EXIT_FAILURE, "init_lcore_rx_queues failed\n"); + + nb_ports = rte_eth_dev_count_avail(); + + nb_lcores = rte_lcore_count(); + + /* Initial number of mbufs in pool - the amount required for hardware + * rx/tx rings will be added during configuration of ports. + */ + nb_mbufs = nb_ports * nb_lcores * MAX_PKT_BURST + /* mbuf tables */ + nb_lcores * MEMPOOL_CACHE_SIZE; /* per lcore cache */ + + /* Init the lookup structures. 
*/ + setup_lpm(); + + /* initialize all ports (including proxies) */ + RTE_ETH_FOREACH_DEV(portid) { + struct rte_eth_conf local_port_conf = port_conf; + + /* skip ports that are not enabled */ + if ((enabled_port_mask & (1 << portid)) == 0) { + RTE_LOG(INFO, L3FWD, "Skipping disabled port %d\n", + portid); + continue; + } + + /* init port */ + RTE_LOG(INFO, L3FWD, "Initializing port %d ...\n", portid); + + nb_rx_queue = get_port_n_rx_queues(portid); + nb_tx_queue = nb_lcores; + + ret = rte_eth_dev_info_get(portid, &dev_info); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "Error during getting device (port %u) info: %s\n", + portid, strerror(-ret)); + if (nb_rx_queue > dev_info.max_rx_queues || + nb_tx_queue > dev_info.max_tx_queues) + rte_exit(EXIT_FAILURE, + "Port %d cannot configure enough queues\n", + portid); + + RTE_LOG(INFO, L3FWD, "Creating queues: nb_rxq=%d nb_txq=%u...\n", + nb_rx_queue, nb_tx_queue); + + if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE) + local_port_conf.txmode.offloads |= + DEV_TX_OFFLOAD_MBUF_FAST_FREE; + + local_port_conf.rx_adv_conf.rss_conf.rss_hf &= + dev_info.flow_type_rss_offloads; + if (local_port_conf.rx_adv_conf.rss_conf.rss_hf != + port_conf.rx_adv_conf.rss_conf.rss_hf) { + RTE_LOG(INFO, L3FWD, + "Port %u modified RSS hash function based on hardware support," + "requested:%#"PRIx64" configured:%#"PRIx64"\n", + portid, port_conf.rx_adv_conf.rss_conf.rss_hf, + local_port_conf.rx_adv_conf.rss_conf.rss_hf); + } + + ret = rte_eth_dev_configure(portid, nb_rx_queue, + (uint16_t)nb_tx_queue, + &local_port_conf); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "Cannot configure device: err=%d, port=%d\n", + ret, portid); + + ret = rte_eth_dev_adjust_nb_rx_tx_desc(portid, &nb_rxd, + &nb_txd); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "Cannot adjust number of descriptors: err=%d, " + "port=%d\n", ret, portid); + + nb_mbufs += nb_rx_queue * nb_rxd + nb_tx_queue * nb_txd; + /* init one TX queue per couple (lcore,port) */ + queueid = 0; 
+ for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + + RTE_LOG(INFO, L3FWD, "\ttxq=%u,%d\n", lcore_id, + queueid); + + txconf = &dev_info.default_txconf; + txconf->offloads = local_port_conf.txmode.offloads; + ret = rte_eth_tx_queue_setup(portid, queueid, nb_txd, + SOCKET_ID_ANY, txconf); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_tx_queue_setup: err=%d, " + "port=%d\n", ret, portid); + + lconf = &lcore_conf[lcore_id]; + lconf->tx_queue_id[portid] = queueid; + queueid++; + + lconf->tx_port_id[lconf->n_tx_port] = portid; + lconf->n_tx_port++; + } + RTE_LOG(INFO, L3FWD, "\n"); + } + + /* Init pkt pool. */ + pktmbuf_pool = rte_pktmbuf_pool_create("mbuf_pool", + rte_align32prevpow2(nb_mbufs), MEMPOOL_CACHE_SIZE, + 0, RTE_MBUF_DEFAULT_BUF_SIZE, SOCKET_ID_ANY); + if (pktmbuf_pool == NULL) + rte_exit(EXIT_FAILURE, "Cannot init mbuf pool\n"); + + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + lconf = &lcore_conf[lcore_id]; + RTE_LOG(INFO, L3FWD, "Initializing rx queues on lcore %u ...\n", + lcore_id); + /* init RX queues */ + for (queue = 0; queue < lconf->n_rx_queue; ++queue) { + struct rte_eth_rxconf rxq_conf; + + portid = lconf->rx_queue_list[queue].port_id; + queueid = lconf->rx_queue_list[queue].queue_id; + + RTE_LOG(INFO, L3FWD, "\trxq=%d,%d\n", portid, queueid); + + ret = rte_eth_dev_info_get(portid, &dev_info); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "Error during getting device (port %u) info: %s\n", + portid, strerror(-ret)); + + rxq_conf = dev_info.default_rxconf; + rxq_conf.offloads = port_conf.rxmode.offloads; + ret = rte_eth_rx_queue_setup(portid, queueid, + nb_rxd, SOCKET_ID_ANY, + &rxq_conf, + pktmbuf_pool); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_rx_queue_setup: err=%d, port=%d\n", + ret, portid); + } + } + + RTE_LOG(INFO, L3FWD, "\n"); + + /* start ports */ + RTE_ETH_FOREACH_DEV(portid) { + if 
((enabled_port_mask & (1 << portid)) == 0) + continue; + + /* Start device */ + ret = rte_eth_dev_start(portid); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_eth_dev_start: err=%d, port=%d\n", + ret, portid); + + /* + * If enabled, put device in promiscuous mode. + * This allows IO forwarding mode to forward packets + * to itself through 2 cross-connected ports of the + * target machine. + */ + if (promiscuous_on) { + ret = rte_eth_promiscuous_enable(portid); + if (ret != 0) + rte_exit(EXIT_FAILURE, + "rte_eth_promiscuous_enable: err=%s, port=%u\n", + rte_strerror(-ret), portid); + } + } + /* we've managed to start all enabled ports so active == enabled */ + active_port_mask = enabled_port_mask; + + RTE_LOG(INFO, L3FWD, "\n"); + + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + lconf = &lcore_conf[lcore_id]; + for (queue = 0; queue < lconf->n_rx_queue; ++queue) { + portid = lconf->rx_queue_list[queue].port_id; + queueid = lconf->rx_queue_list[queue].queue_id; + if (prepare_ptype_parser(portid, queueid) == 0) + rte_exit(EXIT_FAILURE, "ptype check fails\n"); + } + } + + if (init_if_proxy() < 0) + rte_exit(EXIT_FAILURE, "Failed to configure proxy lib\n"); + wait_for_config_done(); + + ret = 0; + /* launch per-lcore init on every lcore */ + rte_eal_mp_remote_launch(lpm_main_loop, NULL, CALL_MASTER); + RTE_LCORE_FOREACH_SLAVE(lcore_id) { + if (rte_eal_wait_lcore(lcore_id) < 0) { + ret = -1; + break; + } + } + + /* stop ports */ + RTE_ETH_FOREACH_DEV(portid) { + if ((enabled_port_mask & (1 << portid)) == 0) + continue; + RTE_LOG(INFO, L3FWD, "Closing port %d...", portid); + rte_eth_dev_stop(portid); + rte_eth_dev_close(portid); + rte_log(RTE_LOG_INFO, RTE_LOGTYPE_L3FWD, " Done\n"); + } + + close_if_proxy(); + RTE_LOG(INFO, L3FWD, "Bye...\n"); + + return ret; +} diff --git a/examples/l3fwd-ifpx/meson.build b/examples/l3fwd-ifpx/meson.build new file mode 100644 index 0000000000..f0c0920b81 --- /dev/null 
+++ b/examples/l3fwd-ifpx/meson.build @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(c) 2020 Marvell International Ltd. + +# meson file, for building this example as part of a main DPDK build. +# +# To build this example as a standalone application with an already-installed +# DPDK instance, use 'make' + +allow_experimental_apis = true +deps += ['hash', 'lpm', 'if_proxy'] +sources = files('l3fwd.c', 'main.c') diff --git a/examples/meson.build b/examples/meson.build index 3b540012f9..76b110bb61 100644 --- a/examples/meson.build +++ b/examples/meson.build @@ -25,8 +25,8 @@ all_examples = [ 'l2fwd', 'l2fwd-cat', 'l2fwd-event', 'l2fwd-crypto', 'l2fwd-jobstats', 'l2fwd-keepalive', 'l3fwd', - 'l3fwd-acl', 'l3fwd-power', 'l3fwd-graph', - 'link_status_interrupt', + 'l3fwd-acl', 'l3fwd-graph', 'l3fwd-ifpx', + 'l3fwd-power', 'link_status_interrupt', 'multi_process/client_server_mp/mp_client', 'multi_process/client_server_mp/mp_server', 'multi_process/hotplug_mp', -- 2.17.1