From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id EC320A0A02;
	Wed, 24 Mar 2021 14:11:35 +0100 (CET)
Received: from [217.70.189.124] (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id A4A4E4067B;
	Wed, 24 Mar 2021 14:11:35 +0100 (CET)
Received: from relay3-d.mail.gandi.net (relay3-d.mail.gandi.net
 [217.70.183.195])
 by mails.dpdk.org (Postfix) with ESMTP id A853F4014F
 for <dev@dpdk.org>; Wed, 24 Mar 2021 14:11:34 +0100 (CET)
X-Originating-IP: 78.45.89.65
Received: from [192.168.1.23] (ip-78-45-89-65.net.upcbroadband.cz
 [78.45.89.65]) (Authenticated sender: i.maximets@ovn.org)
 by relay3-d.mail.gandi.net (Postfix) with ESMTPSA id 37BE76000E;
 Wed, 24 Mar 2021 13:11:31 +0000 (UTC)
To: Stefan Hajnoczi <stefanha@redhat.com>, Billy McFall <bmcfall@redhat.com>
Cc: Ilya Maximets <i.maximets@ovn.org>, Adrian Moreno <amorenoz@redhat.com>,
 Maxime Coquelin <maxime.coquelin@redhat.com>,
 Chenbo Xia <chenbo.xia@intel.com>, dev@dpdk.org,
 Julia Suvorova <jusual@redhat.com>,
 =?UTF-8?Q?Marc-Andr=c3=a9_Lureau?= <marcandre.lureau@redhat.com>,
 Daniel Berrange <berrange@redhat.com>
References: <20210317202530.4145673-1-i.maximets@ovn.org>
 <YFOTU0M50y5GlF25@stefanha-x1.localdomain>
 <eeea4d9f-e600-9b4d-58f3-f8ced9485854@ovn.org>
 <YFSvwk3Z4A/L8fRV@stefanha-x1.localdomain>
 <edd7eb24-a9d6-b50b-04c3-a94d392a011f@ovn.org>
 <YFTdn1wsaHXkr+Bm@stefanha-x1.localdomain>
 <def3f022-7f07-fe5c-461a-37f8443a3a97@redhat.com>
 <53dd4b66-9e44-01c3-9f9a-b37dcadb14b7@ovn.org>
 <CAKLkqD4qBZ=keMb0z4kyagbhrhLC=F2qexiSov3CVN+PsHfXdg@mail.gmail.com>
 <YFsrBXHRhZ+yudHj@stefanha-x1.localdomain>
From: Ilya Maximets <i.maximets@ovn.org>
Message-ID: <597d1ec7-d271-dc0d-522d-b900c9cb00ea@ovn.org>
Date: Wed, 24 Mar 2021 14:11:31 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.8.0
MIME-Version: 1.0
In-Reply-To: <YFsrBXHRhZ+yudHj@stefanha-x1.localdomain>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and
 virtio-user.
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
>>>>
>>>>
>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>>>>>>>> And some housekeeping usually required for applications in case the
>>>>>>>>>> socket server terminated abnormally and socket files left on a file
>>>>>>>>>> system:
>>>>>>>>>>  "failed to bind to vhu: Address already in use; remove it and try
>>> again"
>>>>>>>>>
>>>>>>>>> QEMU avoids this by unlinking before binding. The drawback is that
>>> users
>>>>>>>>> might accidentally hijack an existing listen socket, but that can be
>>>>>>>>> solved with a pidfile.
>>>>>>>>
>>>>>>>> How exactly this could be solved with a pidfile?
>>>>>>>
>>>>>>> A pidfile prevents two instances of the same service from running at
>>> the
>>>>>>> same time.
>>>>>>>
>>>>>>> The same effect can be achieved by the container orchestrator,
>>> systemd,
>>>>>>> etc too because it refuses to run the same service twice.
>>>>>>
>>>>>> Sure. I understand that.  My point was that these could be 2 different
>>>>>> applications and they might not know which process to look for.
>>>>>>
>>>>>>>
>>>>>>>> And what if this is
>>>>>>>> a different application that tries to create a socket on a same path?
>>>>>>>> e.g. QEMU creates a socket (started in a server mode) and user
>>>>>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
>>>>>>>> dpdkvhostuserclient.  This way rte_vhost library will try to bind
>>>>>>>> to an existing socket file and will fail.  Subsequently port creation
>>>>>>>> in OVS will fail.   We can't allow OVS to unlink files because this
>>>>>>>> way OVS users will have ability to unlink random sockets that OVS has
>>>>>>>> access to and we also has no idea if it's a QEMU that created a file
>>>>>>>> or it was a virtio-user application or someone else.
>>>>>>>
>>>>>>> If rte_vhost unlinks the socket then the user will find that
>>> networking
>>>>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
>>> device
>>>>>>> or restart QEMU, depending on whether they need to keep the guest
>>>>>>> running or not. This is a misconfiguration that is recoverable.
>>>>>>
>>>>>> True, it's recoverable, but with a high cost.  Restart of a VM is
>>> rarely
>>>>>> desirable.  And the application inside the guest might not feel itself
>>>>>> well after hot re-plug of a device that it actively used.  I'd expect
>>>>>> a DPDK application that runs inside a guest on some virtio-net device
>>>>>> to crash after this kind of manipulations.  Especially, if it uses some
>>>>>> older versions of DPDK.
>>>>>
>>>>> This unlink issue is probably something we think differently about.
>>>>> There are many ways for users to misconfigure things when working with
>>>>> system tools. If it's possible to catch misconfigurations that is
>>>>> preferrable. In this case it's just the way pathname AF_UNIX domain
>>>>> sockets work and IMO it's better not to have problems starting the
>>>>> service due to stale files than to insist on preventing
>>>>> misconfigurations. QEMU and DPDK do this differently and both seem to be
>>>>> successful, so ¯\_(ツ)_/¯.
>>>>>
>>>>>>>
>>>>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
>>>>>>> create a security issue. I don't know the security model of OVS.
>>>>>>
>>>>>> In general privileges of a ovs-vswitchd daemon might be completely
>>>>>> different from privileges required to invoke control utilities or
>>>>>> to access the configuration database.  SO, yes, we should not allow
>>>>>> that.
>>>>>
>>>>> That can be locked down by restricting the socket path to a file beneath
>>>>> /var/run/ovs/vhost-user/.
>>>>>
>>>>>>>
>>>>>>>> There are, probably, ways to detect if there is any alive process
>>> that
>>>>>>>> has this socket open, but that sounds like too much for this purpose,
>>>>>>>> also I'm not sure if it's possible if actual user is in a different
>>>>>>>> container.
>>>>>>>> So I don't see a good reliable way to detect these conditions.  This
>>>>>>>> falls on shoulders of a higher level management software or a user to
>>>>>>>> clean these socket files up before adding ports.
>>>>>>>
>>>>>>> Does OVS always run in the same net namespace (pod) as the DPDK
>>>>>>> application? If yes, then abstract AF_UNIX sockets can be used.
>>> Abstract
>>>>>>> AF_UNIX sockets don't have a filesystem path and the socket address
>>>>>>> disappears when there is no process listening anymore.
>>>>>>
>>>>>> OVS is usually started right on the host in a main network namespace.
>>>>>> In case it's started in a pod, it will run in a separate container but
>>>>>> configured with a host network.  Applications almost exclusively runs
>>>>>> in separate pods.
>>>>>
>>>>> Okay.
>>>>>
>>>>>>>>>> This patch-set aims to eliminate most of the inconveniences by
>>>>>>>>>> leveraging an infrastructure service provided by a SocketPair
>>> Broker.
>>>>>>>>>
>>>>>>>>> I don't understand yet why this is useful for vhost-user, where the
>>>>>>>>> creation of the vhost-user device backend and its use by a VMM are
>>>>>>>>> closely managed by one piece of software:
>>>>>>>>>
>>>>>>>>> 1. Unlink the socket path.
>>>>>>>>> 2. Create, bind, and listen on the socket path.
>>>>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>>>>>>>>    RPC, spawn a process, etc) and pass in the listen fd.
>>>>>>>>> 4. In the meantime the VMM can open the socket path and call
>>> connect(2).
>>>>>>>>>    As soon as the vhost-user device backend calls accept(2) the
>>>>>>>>>    connection will proceed (there is no need for sleeping).
>>>>>>>>>
>>>>>>>>> This approach works across containers without a broker.
>>>>>>>>
>>>>>>>> Not sure if I fully understood a question here, but anyway.
>>>>>>>>
>>>>>>>> This approach works fine if you know what application to run.
>>>>>>>> In case of a k8s cluster, it might be a random DPDK application
>>>>>>>> with virtio-user ports running inside a container and want to
>>>>>>>> have a network connection.  Also, this application needs to run
>>>>>>>> virtio-user in server mode, otherwise restart of the OVS will
>>>>>>>> require restart of the application.  So, you basically need to
>>>>>>>> rely on a third-party application to create a socket with a right
>>>>>>>> name and in a correct location that is shared with a host, so
>>>>>>>> OVS can find it and connect.
>>>>>>>>
>>>>>>>> In a VM world everything is much more simple, since you have
>>>>>>>> a libvirt and QEMU that will take care of all of these stuff
>>>>>>>> and which are also under full control of management software
>>>>>>>> and a system administrator.
>>>>>>>> In case of a container with a "random" DPDK application inside
>>>>>>>> there is no such entity that can help.  Of course, some solution
>>>>>>>> might be implemented in docker/podman daemon to create and manage
>>>>>>>> outside-looking sockets for an application inside the container,
>>>>>>>> but that is not available today AFAIK and I'm not sure if it
>>>>>>>> ever will.
>>>>>>>
>>>>>>> Wait, when you say there is no entity like management software or a
>>>>>>> system administrator, then how does OVS know to instantiate the new
>>>>>>> port? I guess something still needs to invoke ovs-ctl add-port?
>>>>>>
>>>>>> I didn't mean that there is no any application that configures
>>>>>> everything.  Of course, there is.  I mean that there is no such
>>>>>> entity that abstracts all that socket machinery from the user's
>>>>>> application that runs inside the container.  QEMU hides all the
>>>>>> details of the connection to vhost backend and presents the device
>>>>>> as a PCI device with a network interface wrapping from the guest
>>>>>> kernel.  So, the application inside VM shouldn't care what actually
>>>>>> there is a socket connected to OVS that implements backend and
>>>>>> forward traffic somewhere.  For the application it's just a usual
>>>>>> network interface.
>>>>>> But in case of a container world, application should handle all
>>>>>> that by creating a virtio-user device that will connect to some
>>>>>> socket, that has an OVS on the other side.
>>>>>>
>>>>>>>
>>>>>>> Can you describe the steps used today (without the broker) for
>>>>>>> instantiating a new DPDK app container and connecting it to OVS?
>>>>>>> Although my interest is in the vhost-user protocol I think it's
>>>>>>> necessary to understand the OVS requirements here and I know little
>>>>>>> about them.
>>>>>>>> I might describe some things wrong since I worked with k8s and CNI
>>>>>> plugins last time ~1.5 years ago, but the basic schema will look
>>>>>> something like this:
>>>>>>
>>>>>> 1. user decides to start a new pod and requests k8s to do that
>>>>>>    via cmdline tools or some API calls.
>>>>>>
>>>>>> 2. k8s scheduler looks for available resources asking resource
>>>>>>    manager plugins, finds an appropriate physical host and asks
>>>>>>    local to that node kubelet daemon to launch a new pod there.
>>>>>>
>>>>
>>>> When the CNI is called, the pod has already been created, i.e: a PodID
>>> exists
>>>> and so does an associated network namespace. Therefore, everything that
>>> has to
>>>> do with the runtime spec such as mountpoints or devices cannot be
>>> modified by
>>>> this time.
>>>>
>>>> That's why the Device Plugin API is used to modify the Pod's spec before
>>> the CNI
>>>> chain is called.
>>>>
>>>>>> 3. kubelet asks local CNI plugin to allocate network resources
>>>>>>    and annotate the pod with required mount points, devices that
>>>>>>    needs to be passed in and environment variables.
>>>>>>    (this is, IIRC, a gRPC connection.   It might be a multus-cni
>>>>>>    or kuryr-kubernetes or any other CNI plugin.  CNI plugin is
>>>>>>    usually deployed as a system DaemonSet, so it runs in a
>>>>>>    separate pod.
>>>>>>
>>>>>> 4. Assuming that vhost-user connection requested in server mode.
>>>>>>    CNI plugin will:
>>>>>>    4.1 create a directory for a vhost-user socket.
>>>>>>    4.2 add this directory to pod annotations as a mount point.
>>>>
>>>> I believe this is not possible, it would have to inspect the pod's spec
>>> or
>>>> otherwise determine an existing mount point where the socket should be
>>> created.
>>>
>>> Uff.  Yes, you're right.  Thanks for your clarification.
>>> I mixed up CNI and Device Plugin here.
>>>
>>> CNI itself is not able to annotate new resources to the pod, i.e.
>>> create new mounts or something like this.   And I don't recall any
>>> vhost-user device plugins.  Is there any?  There is an SR-IOV device
>>> plugin, but its purpose is to allocate and pass PCI devices, not create
>>> mounts for vhost-user.
>>>
>>> So, IIUC, right now user must create the directory and specify
>>> a mount point in a pod spec file or pass the whole /var/run/openvswitch
>>> or something like this, right?
>>>
>>> Looking at userspace-cni-network-plugin, it actually just parses
>>> annotations to find the shared directory and fails if there is
>>> no any:
>>>
>>> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
>>>
>>> And examples suggests to specify a directory to mount:
>>>
>>> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
>>>
>>> Looks like this is done by user's hands.
>>>
>>> Yes, I am one of the primary authors of Userspace CNI. Currently, the
>> directory is by hand. Long term thought was to have a mutating
>> webhook/admission controller inject a directory into the podspec.  Not sure
>> if it has changed, but I think when I was originally doing this work, OvS
>> only lets you choose the directory at install time, so it has to be
>> something like /var/run/openvswitch/. You can choose the socketfile name
>> and maybe a subdirectory off the main directory, but not the full path.
>>
>> One of the issues I was trying to solve was making sure ContainerA couldn't
>> see ContainerB's socketfiles. That's where the admission controller could
>> create a unique subdirectory for each container under
>> /var/run/openvswitch/. But this was more of a PoC CNI and other work items
>> always took precedence so that work never completed.
> 
> If the CNI plugin has access to the container's network namespace, could
> it create an abstract AF_UNIX listen socket?
> 
> That way the application inside the container could connect to an
> AF_UNIX socket and there is no need to manage container volumes.
> 
> I'm not familiar with the Open VSwitch, so I'm not sure if there is a
> sane way of passing the listen socket fd into ovswitchd from the CNI
> plugin?
> 
> The steps:
> 1. CNI plugin enters container's network namespace and opens an abstract
>    AF_UNIX listen socket.
> 2. CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl
>    add-port step. Instead of using type=dpdkvhostuserclient
>    options:vhost-server-path=/tmp/dpdkvhostclient0 it instead create a
>    dpdkvhostuser server with the listen fd.

For this step you will need a side channel, i.e. a separate unix socket
created by ovs-vswitchd (most likely, created by rte_vhost on
rte_vhost_driver_register() call).

The problem is that ovs-vsctl talks with ovsdb-server and adds the new
port -- just a new row in the 'interface' table of the database.
ovs-vswitchd receives update from the database and creates the actual
port.  All the communications done through JSONRPC, so passing fds is
not an option.

> 3. When the container starts, it connects to the abstract AF_UNIX
>    socket. The abstract socket name is provided to the container at
>    startup time in an environment variable. The name is unique, at least
>    to the pod, so that multiple containers in the pod can run vhost-user
>    applications.

Few more problems with this solution:

- We still want to run application inside the container in a server mode,
  because virtio-user PMD in client mode doesn't support re-connection.

- How to get this fd again after the OVS restart?  CNI will not be invoked
  at this point to pass a new fd.

- If application will close the connection for any reason (restart, some
  reconfiguration internal to the application) and OVS will be re-started
  at the same time, abstract socket will be gone.  Need a persistent daemon
  to hold it.

Best regards, Ilya Maximets.