From: Billy McFall
Date: Tue, 23 Mar 2021 16:54:57 -0400
To: Ilya Maximets
Cc: Adrian Moreno, Stefan Hajnoczi, Maxime Coquelin, Chenbo Xia, dev@dpdk.org, Julia Suvorova, Marc-André Lureau, Daniel Berrange
Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
In-Reply-To: <53dd4b66-9e44-01c3-9f9a-b37dcadb14b7@ovn.org>
References: <20210317202530.4145673-1-i.maximets@ovn.org> <53dd4b66-9e44-01c3-9f9a-b37dcadb14b7@ovn.org>

On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets wrote:
> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> > On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>>>>> And some housekeeping is usually required for applications in case the socket server terminated abnormally and socket files were left on the file system: "failed to bind to vhu: Address already in use; remove it and try again"
> >>>>>>
> >>>>>> QEMU avoids this by unlinking before binding. The drawback is that users might accidentally hijack an existing listen socket, but that can be solved with a pidfile.
> >>>>>
> >>>>> How exactly could this be solved with a pidfile?
> >>>>
> >>>> A pidfile prevents two instances of the same service from running at the same time.
> >>>>
> >>>> The same effect can be achieved by the container orchestrator, systemd, etc. too, because it refuses to run the same service twice.
> >>>
> >>> Sure. I understand that. My point was that these could be 2 different applications and they might not know which process to look for.
> >>>
> >>>>> And what if this is a different application that tries to create a socket on the same path? E.g. QEMU creates a socket (started in server mode) and the user accidentally created a dpdkvhostuser port in Open vSwitch instead of dpdkvhostuserclient. This way the rte_vhost library will try to bind to an existing socket file and will fail. Subsequently, port creation in OVS will fail. We can't allow OVS to unlink files, because that way OVS users would have the ability to unlink random sockets that OVS has access to, and we also have no idea whether it was QEMU that created the file, or a virtio-user application, or someone else.
> >>>>
> >>>> If rte_vhost unlinks the socket then the user will find that networking doesn't work. They can either hot unplug the QEMU vhost-user-net device or restart QEMU, depending on whether they need to keep the guest running or not. This is a misconfiguration that is recoverable.
> >>>
> >>> True, it's recoverable, but at a high cost. Restart of a VM is rarely desirable. And the application inside the guest might not behave well after a hot re-plug of a device that it actively used. I'd expect a DPDK application that runs inside a guest on some virtio-net device to crash after this kind of manipulation, especially if it uses an older version of DPDK.
> >>
> >> This unlink issue is probably something we think differently about. There are many ways for users to misconfigure things when working with system tools. If it's possible to catch misconfigurations, that is preferable. In this case it's just the way pathname AF_UNIX domain sockets work, and IMO it's better not to have problems starting the service due to stale files than to insist on preventing misconfigurations. QEMU and DPDK do this differently and both seem to be successful, so ¯\_(ツ)_/¯.
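
For reference, the "unlink before binding" pattern discussed above amounts to roughly the following minimal sketch for a pathname AF_UNIX listen socket (illustrative only; error handling trimmed, and not taken from QEMU's actual code):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Create a pathname AF_UNIX listen socket, removing a stale socket
     * file first.  This is what makes restarts painless, at the cost of
     * silently taking over the path if another live process is still
     * listening on it (hence the pidfile/orchestrator discussion above). */
    static int listen_unix(const char *path)
    {
        struct sockaddr_un addr;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;

        memset(&addr, 0, sizeof addr);
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);

        unlink(path);   /* Ignore failure: the file may simply not exist. */

        if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
            listen(fd, 1) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }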
> >>>> Regarding letting OVS unlink files, I agree that it shouldn't if this creates a security issue. I don't know the security model of OVS.
> >>>
> >>> In general, the privileges of the ovs-vswitchd daemon might be completely different from the privileges required to invoke control utilities or to access the configuration database. So, yes, we should not allow that.
> >>
> >> That can be locked down by restricting the socket path to a file beneath /var/run/ovs/vhost-user/.
> >>
> >>>>> There are, probably, ways to detect if there is any live process that has this socket open, but that sounds like too much for this purpose, and I'm also not sure it's possible if the actual user is in a different container. So I don't see a good, reliable way to detect these conditions. It falls on the shoulders of higher-level management software or the user to clean these socket files up before adding ports.
> >>>>
> >>>> Does OVS always run in the same net namespace (pod) as the DPDK application? If yes, then abstract AF_UNIX sockets can be used. Abstract AF_UNIX sockets don't have a filesystem path and the socket address disappears when there is no process listening anymore.
> >>>
> >>> OVS is usually started right on the host in the main network namespace. In case it's started in a pod, it will run in a separate container but configured with host networking. Applications almost exclusively run in separate pods.
> >>
> >> Okay.
> >>
> >>>>>>> This patch-set aims to eliminate most of the inconveniences by leveraging an infrastructure service provided by a SocketPair Broker.
> >>>>>>
> >>>>>> I don't understand yet why this is useful for vhost-user, where the creation of the vhost-user device backend and its use by a VMM are closely managed by one piece of software:
> >>>>>>
> >>>>>> 1. Unlink the socket path.
> >>>>>> 2. Create, bind, and listen on the socket path.
> >>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK RPC, spawn a process, etc.) and pass in the listen fd.
> >>>>>> 4. In the meantime the VMM can open the socket path and call connect(2). As soon as the vhost-user device backend calls accept(2) the connection will proceed (there is no need for sleeping).
> >>>>>>
> >>>>>> This approach works across containers without a broker.
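
A rough sketch of steps 1-3 above, assuming the backend is spawned as a child process and told the inherited fd number via a hypothetical --listen-fd option (a real deployment might instead use an RPC or SCM_RIGHTS fd passing); it reuses listen_unix() from the earlier sketch:

    #include <stdio.h>
    #include <unistd.h>

    /* Hand an already-bound-and-listening socket to the vhost-user backend.
     * The fd is inherited across exec because it was created without
     * FD_CLOEXEC.  The "--listen-fd" flag is made up for illustration. */
    static void spawn_backend(const char *backend, int listen_fd)
    {
        if (fork() == 0) {
            char fd_arg[16];

            snprintf(fd_arg, sizeof fd_arg, "%d", listen_fd);
            execl(backend, backend, "--listen-fd", fd_arg, (char *)NULL);
            _exit(127);     /* exec failed */
        }
        /* Meanwhile the VMM can already connect(2) to the socket path; the
         * connection completes as soon as the backend calls accept(2). */
    }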
> >>>>> Not sure if I fully understood the question here, but anyway.
> >>>>>
> >>>>> This approach works fine if you know what application to run. In the case of a k8s cluster, it might be a random DPDK application with virtio-user ports running inside a container and wanting to have a network connection. Also, this application needs to run virtio-user in server mode, otherwise a restart of OVS will require a restart of the application. So, you basically need to rely on a third-party application to create a socket with the right name and in the correct location that is shared with the host, so OVS can find it and connect.
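
For illustration, "virtio-user in server mode" usually means the containerized DPDK application probes a net_virtio_user vdev with server=1, so that the application owns the listen socket and OVS (dpdkvhostuserclient) re-connects to it. A minimal sketch, with a made-up socket path and vdev name, and only the EAL setup shown:

    #include <rte_eal.h>

    int main(void)
    {
        /* Sketch: create a virtio-user port in *server* mode.  The path
         * must point into the directory shared with the host/OVS. */
        char *eal_argv[] = {
            "app",
            "--in-memory",
            "--vdev=net_virtio_user0,path=/var/run/socks/vhost.sock,server=1,queues=1",
            NULL,
        };
        int eal_argc = (int)(sizeof(eal_argv) / sizeof(eal_argv[0])) - 1;

        if (rte_eal_init(eal_argc, eal_argv) < 0)
            return 1;

        /* ... usual rte_eth_dev_configure()/start() on the probed port ... */
        return 0;
    }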
> >>>>> In the VM world everything is much simpler, since you have libvirt and QEMU that take care of all of this stuff and are also under full control of the management software and the system administrator.
> >>>>> In the case of a container with a "random" DPDK application inside, there is no such entity that can help. Of course, some solution might be implemented in the docker/podman daemon to create and manage outside-facing sockets for an application inside the container, but that is not available today AFAIK and I'm not sure it ever will be.
> >>>>
> >>>> Wait, when you say there is no entity like management software or a system administrator, then how does OVS know to instantiate the new port? I guess something still needs to invoke ovs-vsctl add-port?
> >>>
> >>> I didn't mean that there is no application that configures everything. Of course, there is. I mean that there is no such entity that abstracts all that socket machinery from the user's application that runs inside the container. QEMU hides all the details of the connection to the vhost backend and presents the device as a PCI device with a network interface wrapper from the guest kernel. So, the application inside the VM shouldn't care that there is actually a socket connected to OVS that implements the backend and forwards traffic somewhere. For the application it's just a usual network interface.
> >>> But in the container world, the application has to handle all of that itself by creating a virtio-user device that connects to some socket that has OVS on the other side.
> >>>
> >>>> Can you describe the steps used today (without the broker) for instantiating a new DPDK app container and connecting it to OVS? Although my interest is in the vhost-user protocol, I think it's necessary to understand the OVS requirements here and I know little about them.
> >>>
> >>> I might describe some things wrong since I last worked with k8s and CNI plugins ~1.5 years ago, but the basic schema will look something like this:
> >>>
> >>> 1. The user decides to start a new pod and requests k8s to do that via cmdline tools or some API calls.
> >>>
> >>> 2. The k8s scheduler looks for available resources by asking resource manager plugins, finds an appropriate physical host and asks the kubelet daemon local to that node to launch a new pod there.
> >>>
> > When the CNI is called, the pod has already been created, i.e. a PodID exists and so does an associated network namespace. Therefore, everything that has to do with the runtime spec, such as mount points or devices, cannot be modified by this time.
> >
> > That's why the Device Plugin API is used to modify the Pod's spec before the CNI chain is called.
> >
> >>> 3. kubelet asks the local CNI plugin to allocate network resources and annotate the pod with required mount points, devices that need to be passed in, and environment variables. (This is, IIRC, a gRPC connection. It might be multus-cni or kuryr-kubernetes or any other CNI plugin. The CNI plugin is usually deployed as a system DaemonSet, so it runs in a separate pod.)
> >>>
> >>> 4. Assuming that the vhost-user connection is requested in server mode, the CNI plugin will:
> >>> 4.1 create a directory for a vhost-user socket.
> >>> 4.2 add this directory to pod annotations as a mount point.
> >
> > I believe this is not possible; it would have to inspect the pod's spec or otherwise determine an existing mount point where the socket should be created.
>
> Uff. Yes, you're right. Thanks for your clarification.
> I mixed up CNI and Device Plugin here.
>
> CNI itself is not able to annotate new resources to the pod, i.e. create new mounts or something like this. And I don't recall any vhost-user device plugins. Is there any? There is an SR-IOV device plugin, but its purpose is to allocate and pass PCI devices, not to create mounts for vhost-user.
>
> So, IIUC, right now the user must create the directory and specify a mount point in the pod spec file, or pass the whole /var/run/openvswitch or something like this, right?
>
> Looking at userspace-cni-network-plugin, it actually just parses annotations to find the shared directory and fails if there is none:
>
> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
>
> And the examples suggest specifying a directory to mount:
>
> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
>
> Looks like this is done by the user's hands.

Yes, I am one of the primary authors of Userspace CNI. Currently, the directory is set up by hand. The long-term thought was to have a mutating webhook/admission controller inject a directory into the pod spec. Not sure if it has changed, but I think when I was originally doing this work, OvS only lets you choose the directory at install time, so it has to be something like /var/run/openvswitch/. You can choose the socket file name and maybe a subdirectory off the main directory, but not the full path. One of the issues I was trying to solve was making sure ContainerA couldn't see ContainerB's socket files. That's where the admission controller could create a unique subdirectory for each container under /var/run/openvswitch/. But this was more of a PoC CNI, and other work items always took precedence, so that work never completed.

Billy

> > +Billy might give more insights on this
>
> >>> 4.3 create a port in OVS by invoking 'ovs-vsctl add-port' or by connecting to ovsdb-server via JSON-RPC directly. It will set the port type as dpdkvhostuserclient and specify the socket path as a path inside the directory it created. (OVS will create the port and rte_vhost will enter the re-connection loop since the socket does not exist yet; see the sketch after this list.)
> >>> 4.4 Set up the socket file location as an environment variable in pod annotations.
> >>> 4.5 report success to kubelet.
> >>>
> > Since the CNI cannot modify the pod's mounts, it has to rely on a Device Plugin or other external entity that can inject the mount point before the pod is created.
> >
> > However, there is another use case that might be relevant: dynamic attachment of network interfaces. In this case the CNI cannot work in collaboration with a Device Plugin or "mount-point injector", and an existing mount point has to be used. Also, some form of notification mechanism has to exist to tell the workload a new socket is ready.
> >
> >>> 5. kubelet will finish all other preparations and resource allocations and will ask docker/podman to start a container with all mount points, devices and environment variables from the pod annotations.
> >>>
> >>> 6. docker/podman starts a container. It should be mentioned here that in many cases the initial process of a container is not the actual application that will use the vhost-user connection, but likely a shell that will invoke the actual application.
> >>>
> >>> 7. The application starts inside the container, checks the environment variables (actually, checking of environment variables usually happens in a shell script that invokes the application with the correct arguments) and creates a net_virtio_user port in server mode. At this point the socket file will be created. (Since we're running a third-party application inside the container, we can only assume that it will do what is written here; it's the responsibility of the application developer to do the right thing.)
> >>>
> >>> 8. OVS successfully re-connects to the newly created socket in the shared directory and the vhost-user protocol establishes network connectivity.
> >>>
> >>> As you can see, there are way too many entities and communication methods involved. So, passing a pre-opened file descriptor from the CNI all the way down to the application is not as easy as it is in the case of QEMU+libvirt.
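
For reference, the dpdkvhostuserclient/re-connection behaviour mentioned in steps 4.3 and 8 boils down to rte_vhost's client mode. A minimal sketch (not the actual OVS code; OVS wraps this in its netdev-dpdk layer):

    #include <rte_vhost.h>

    /* Sketch of a dpdkvhostuserclient-style registration: with
     * RTE_VHOST_USER_CLIENT, rte_vhost connects to the socket path and,
     * if the file does not exist yet, keeps retrying in the background
     * until the application inside the container creates it (server mode
     * on the virtio-user side). */
    static int attach_vhost_client(const char *sock_path)
    {
        if (rte_vhost_driver_register(sock_path, RTE_VHOST_USER_CLIENT) != 0)
            return -1;

        /* Normally the new_device/destroy_device callbacks are registered
         * here via rte_vhost_driver_callback_register() before starting. */

        return rte_vhost_driver_start(sock_path);
    }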
> >>
> >> File descriptor passing isn't necessary if OVS owns the listen socket and the application container is the one who connects. That's why I asked why dpdkvhostuser was deprecated in another email. The benefit of doing this would be that the application container can instantly connect to OVS without a sleep loop.
> >>
> >> I still don't get the attraction of the broker idea. The pros:
> >> + Overcomes the issue with stale UNIX domain socket files
> >> + Eliminates the re-connect sleep loop
> >>
> >> Neutral:
> >> * vhost-user UNIX domain socket directory container volume is replaced by broker UNIX domain socket bind mount
> >> * UNIX domain socket naming conflicts become broker key naming conflicts
> >>
> >> The cons:
> >> - Requires running a new service on the host with potential security issues
> >> - Requires support in third-party applications, QEMU, and DPDK/OVS
> >> - The old code must be kept for compatibility with non-broker configurations, especially since third-party applications may not support the broker. Developers and users will have to learn about both options and decide which one to use.
> >>
> >> This seems like a modest improvement for the complexity and effort involved. The same pros can be achieved by:
> >> * Adding unlink(2) to rte_vhost (or applications can add rm -f $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is it doesn't catch a misconfiguration where the user launches two processes with the same socket path.
> >> * Reversing the direction of the client/server relationship to eliminate the re-connect sleep loop at startup. I'm unsure whether this is possible.
> >>
> >> That said, the broker idea doesn't affect the vhost-user protocol itself and is more of an OVS/DPDK topic. I may just not be familiar enough with OVS/DPDK to understand the benefits of the approach.
> >>
> >> Stefan

-- 
Billy McFall
Networking Group CTO Office
Red Hat