DPDK patches and discussions
From: Ilya Maximets <i.maximets@ovn.org>
To: Adrian Moreno <amorenoz@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Ilya Maximets <i.maximets@ovn.org>
Cc: "Maxime Coquelin" <maxime.coquelin@redhat.com>,
	"Chenbo Xia" <chenbo.xia@intel.com>,
	dev@dpdk.org, "Julia Suvorova" <jusual@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Daniel Berrange" <berrange@redhat.com>,
	"Billy McFall" <bmcfall@redhat.com>
Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
Date: Tue, 23 Mar 2021 19:27:42 +0100	[thread overview]
Message-ID: <53dd4b66-9e44-01c3-9f9a-b37dcadb14b7@ovn.org> (raw)
In-Reply-To: <def3f022-7f07-fe5c-461a-37f8443a3a97@redhat.com>

On 3/23/21 6:57 PM, Adrian Moreno wrote:
> 
> 
> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>>>>> And some housekeeping is usually required for applications in case the
>>>>>>> socket server terminated abnormally and left socket files on the file
>>>>>>> system:
>>>>>>>  "failed to bind to vhu: Address already in use; remove it and try again"
>>>>>>
>>>>>> QEMU avoids this by unlinking before binding. The drawback is that users
>>>>>> might accidentally hijack an existing listen socket, but that can be
>>>>>> solved with a pidfile.
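
For reference, the unlink-before-bind pattern looks roughly like this; a
minimal sketch with a placeholder path and error handling, not QEMU's
actual code:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Minimal sketch of unlink-before-bind.  A stale file left by a
     * previous instance is removed so bind() can't fail with
     * "Address already in use" -- at the cost of silently stealing the
     * path from a still-running listener. */
    int listen_unix(const char *path)
    {
        struct sockaddr_un addr;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;

        memset(&addr, 0, sizeof addr);
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);

        unlink(path);   /* remove a stale socket file, if any */

        if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0
            || listen(fd, 1) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }
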
>>>>>
>>>>> How exactly could this be solved with a pidfile?
>>>>
>>>> A pidfile prevents two instances of the same service from running at the
>>>> same time.
>>>>
>>>> The same effect can be achieved by the container orchestrator, systemd,
>>>> etc too because it refuses to run the same service twice.
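
To make that concrete, one way a pidfile typically prevents this is by
holding an exclusive lock on the file for the lifetime of the process,
so a second instance fails fast instead of stealing the socket.  A
minimal sketch, with a placeholder path:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* Sketch of single-instance protection via a locked pidfile.  The
     * returned fd must stay open for the lifetime of the process; the
     * lock is released automatically when the process dies, so there is
     * no stale-pidfile problem. */
    int acquire_pidfile(const char *path)
    {
        char buf[32];
        int fd = open(path, O_RDWR | O_CREAT, 0644);

        if (fd < 0)
            return -1;

        if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
            close(fd);          /* another instance is already running */
            return -1;
        }

        snprintf(buf, sizeof buf, "%ld\n", (long)getpid());
        (void)ftruncate(fd, 0);
        (void)write(fd, buf, strlen(buf));   /* contents are informational */
        return fd;
    }

The caller would acquire the pidfile before unlinking and binding the
socket, so a stale socket file is only ever removed by the single live
instance.
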
>>>
>>> Sure. I understand that.  My point was that these could be 2 different
>>> applications and they might not know which process to look for.
>>>
>>>>
>>>>> And what if this is
>>>>> a different application that tries to create a socket on the same path?
>>>>> e.g. QEMU creates a socket (started in server mode) and the user
>>>>> accidentally creates a dpdkvhostuser port in Open vSwitch instead of
>>>>> dpdkvhostuserclient.  This way the rte_vhost library will try to bind
>>>>> to an existing socket file and will fail.  Subsequently, port creation
>>>>> in OVS will fail.  We can't allow OVS to unlink files, because that
>>>>> way OVS users would have the ability to unlink random sockets that OVS
>>>>> has access to, and we also have no idea whether it was QEMU that created
>>>>> the file, a virtio-user application, or someone else.
>>>>
>>>> If rte_vhost unlinks the socket then the user will find that networking
>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net device
>>>> or restart QEMU, depending on whether they need to keep the guest
>>>> running or not. This is a misconfiguration that is recoverable.
>>>
>>> True, it's recoverable, but at a high cost.  Restarting a VM is rarely
>>> desirable.  And the application inside the guest might not cope well
>>> with a hot re-plug of a device that it actively used.  I'd expect
>>> a DPDK application that runs inside a guest on some virtio-net device
>>> to crash after this kind of manipulation, especially if it uses an
>>> older version of DPDK.
>>
>> This unlink issue is probably something we think differently about.
>> There are many ways for users to misconfigure things when working with
>> system tools. If it's possible to catch misconfigurations, that is
>> preferable. In this case it's just the way pathname AF_UNIX domain
>> sockets work and IMO it's better not to have problems starting the
>> service due to stale files than to insist on preventing
>> misconfigurations. QEMU and DPDK do this differently and both seem to be
>> successful, so ¯\_(ツ)_/¯.
>>
>>>>
>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
>>>> create a security issue. I don't know the security model of OVS.
>>>
>>> In general, the privileges of the ovs-vswitchd daemon might be completely
>>> different from the privileges required to invoke control utilities or
>>> to access the configuration database.  So, yes, we should not allow
>>> that.
>>
>> That can be locked down by restricting the socket path to a file beneath
>> /var/run/ovs/vhost-user/.
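
One way such a restriction could look (a sketch; the directory and the
check are assumptions on my side, not what OVS does today) is to
canonicalize the parent directory and require it to stay under the
allowed prefix:

    #include <limits.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch: accept a vhost-user socket path only if its parent
     * directory resolves (symlinks and ".." included) to a location
     * under the allowed prefix.  The prefix is an example. */
    static const char allowed_prefix[] = "/var/run/ovs/vhost-user/";

    static int socket_path_allowed(const char *path)
    {
        char dir[PATH_MAX], resolved[PATH_MAX];
        const char *slash = strrchr(path, '/');

        if (!slash || (size_t)(slash - path) >= sizeof dir)
            return 0;

        memcpy(dir, path, slash - path);
        dir[slash - path] = '\0';

        if (!realpath(dir, resolved))
            return 0;

        strncat(resolved, "/", sizeof resolved - strlen(resolved) - 1);
        return strncmp(resolved, allowed_prefix,
                       strlen(allowed_prefix)) == 0;
    }
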
>>
>>>>
>>>>> There are probably ways to detect if there is any live process that
>>>>> has this socket open, but that sounds like too much for this purpose,
>>>>> and I'm not sure it's possible if the actual user is in a different
>>>>> container.
>>>>> So I don't see a good, reliable way to detect these conditions.  It
>>>>> falls on the shoulders of higher-level management software or the user
>>>>> to clean these socket files up before adding ports.
>>>>
>>>> Does OVS always run in the same net namespace (pod) as the DPDK
>>>> application? If yes, then abstract AF_UNIX sockets can be used. Abstract
>>>> AF_UNIX sockets don't have a filesystem path and the socket address
>>>> disappears when there is no process listening anymore.
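
For completeness, binding an abstract socket just means the first byte
of sun_path is a NUL; a minimal sketch (the name is a placeholder):

    #include <stddef.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Sketch: listen on a Linux abstract-namespace AF_UNIX socket.  No
     * file is created on disk and the address disappears as soon as no
     * process holds the socket, so there is nothing stale to clean up.
     * The address is scoped to the network namespace. */
    int listen_abstract(const char *name)
    {
        struct sockaddr_un addr;
        socklen_t len;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;

        memset(&addr, 0, sizeof addr);
        addr.sun_family = AF_UNIX;
        addr.sun_path[0] = '\0';                 /* abstract namespace */
        strncpy(addr.sun_path + 1, name, sizeof addr.sun_path - 2);
        len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);

        if (bind(fd, (struct sockaddr *)&addr, len) < 0
            || listen(fd, 1) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }
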
>>>
>>> OVS is usually started right on the host in the main network namespace.
>>> In case it's started in a pod, it will run in a separate container but
>>> configured with host networking.  Applications almost exclusively run
>>> in separate pods.
>>
>> Okay.
>>
>>>>>>> This patch-set aims to eliminate most of the inconveniences by
>>>>>>> leveraging an infrastructure service provided by a SocketPair Broker.
>>>>>>
>>>>>> I don't understand yet why this is useful for vhost-user, where the
>>>>>> creation of the vhost-user device backend and its use by a VMM are
>>>>>> closely managed by one piece of software:
>>>>>>
>>>>>> 1. Unlink the socket path.
>>>>>> 2. Create, bind, and listen on the socket path.
>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>>>>>    RPC, spawn a process, etc) and pass in the listen fd.
>>>>>> 4. In the meantime the VMM can open the socket path and call connect(2).
>>>>>>    As soon as the vhost-user device backend calls accept(2) the
>>>>>>    connection will proceed (there is no need for sleeping).
>>>>>>
>>>>>> This approach works across containers without a broker.
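
A rough sketch of steps 1-3, reusing a listen_unix() helper like the
one above; the backend name and its --listen-fd option are made up for
the illustration, real backends take a pre-opened socket in their own
way:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Sketch of steps 1-3: the management piece owns the listening
     * socket and spawns the device backend with the inherited fd.  From
     * this point the VMM can connect(2); the connection completes as
     * soon as the backend calls accept(2), with no sleep/retry loop. */
    int spawn_backend(const char *sock_path)
    {
        char fd_arg[16];
        pid_t pid;
        int fd = listen_unix(sock_path);    /* unlink + bind + listen */

        if (fd < 0)
            return -1;

        snprintf(fd_arg, sizeof fd_arg, "%d", fd);

        pid = fork();
        if (pid < 0) {
            close(fd);
            return -1;
        }
        if (pid == 0) {
            /* "example-backend" and "--listen-fd" are placeholders. */
            execlp("example-backend", "example-backend",
                   "--listen-fd", fd_arg, (char *)NULL);
            _exit(127);
        }
        close(fd);          /* only the backend keeps the listening fd */
        return 0;
    }
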
>>>>>
>>>>> Not sure if I fully understood the question here, but anyway.
>>>>>
>>>>> This approach works fine if you know what application to run.
>>>>> In case of a k8s cluster, it might be a random DPDK application
>>>>> with virtio-user ports running inside a container that wants to
>>>>> have a network connection.  Also, this application needs to run
>>>>> virtio-user in server mode, otherwise a restart of OVS will
>>>>> require a restart of the application.  So, you basically need to
>>>>> rely on a third-party application to create a socket with the right
>>>>> name in a correct location that is shared with the host, so that
>>>>> OVS can find it and connect.
>>>>>
>>>>> In the VM world everything is much simpler, since you have
>>>>> libvirt and QEMU that take care of all of this and which are
>>>>> also under full control of the management software and the
>>>>> system administrator.
>>>>> In the case of a container with a "random" DPDK application inside,
>>>>> there is no such entity that can help.  Of course, some solution
>>>>> might be implemented in the docker/podman daemon to create and manage
>>>>> outside-facing sockets for an application inside the container,
>>>>> but that is not available today AFAIK and I'm not sure it ever
>>>>> will be.
>>>>
>>>> Wait, when you say there is no entity like management software or a
>>>> system administrator, then how does OVS know to instantiate the new
>>>> port? I guess something still needs to invoke ovs-vsctl add-port?
>>>
>>> I didn't mean that there is no application at all that configures
>>> everything.  Of course, there is.  I mean that there is no such
>>> entity that abstracts all that socket machinery from the user's
>>> application that runs inside the container.  QEMU hides all the
>>> details of the connection to the vhost backend and presents the
>>> device as a PCI device that the guest kernel wraps with a network
>>> interface.  So, the application inside the VM doesn't need to care
>>> that there is actually a socket connected to OVS, which implements
>>> the backend and forwards traffic somewhere.  For the application
>>> it's just a regular network interface.
>>> But in the container world, the application has to handle all of
>>> that itself by creating a virtio-user device that connects to some
>>> socket with OVS on the other side.
>>>
>>>>
>>>> Can you describe the steps used today (without the broker) for
>>>> instantiating a new DPDK app container and connecting it to OVS?
>>>> Although my interest is in the vhost-user protocol I think it's
>>>> necessary to understand the OVS requirements here and I know little
>>>> about them.
>>>
>>> I might describe some things wrong since I last worked with k8s and
>>> CNI plugins ~1.5 years ago, but the basic scheme looks something
>>> like this:
>>>
>>> 1. user decides to start a new pod and requests k8s to do that
>>>    via cmdline tools or some API calls.
>>>
>>> 2. k8s scheduler looks for available resources by asking resource
>>>    manager plugins, finds an appropriate physical host and asks
>>>    the kubelet daemon local to that node to launch a new pod there.
>>>
> 
> When the CNI is called, the pod has already been created, i.e., a PodID exists
> and so does an associated network namespace. Therefore, everything that has to
> do with the runtime spec, such as mount points or devices, can no longer be
> modified at this point.
> 
> That's why the Device Plugin API is used to modify the Pod's spec before the CNI
> chain is called.
> 
>>> 3. kubelet asks the local CNI plugin to allocate network resources
>>>    and annotate the pod with required mount points, devices that
>>>    need to be passed in, and environment variables.
>>>    (This is, IIRC, a gRPC connection.  It might be multus-cni,
>>>    kuryr-kubernetes, or any other CNI plugin.  The CNI plugin is
>>>    usually deployed as a system DaemonSet, so it runs in a
>>>    separate pod.)
>>>
>>> 4. Assuming that the vhost-user connection is requested in server
>>>    mode, the CNI plugin will:
>>>    4.1 create a directory for a vhost-user socket.
>>>    4.2 add this directory to pod annotations as a mount point.
> 
> I believe this is not possible; it would have to inspect the pod's spec or
> otherwise determine an existing mount point where the socket should be created.

Uff.  Yes, you're right.  Thanks for your clarification.
I mixed up CNI and Device Plugin here.

CNI itself is not able to add new resources to the pod, i.e.
create new mounts or anything like that.  And I don't recall any
vhost-user device plugins.  Are there any?  There is an SR-IOV device
plugin, but its purpose is to allocate and pass PCI devices, not to
create mounts for vhost-user.

So, IIUC, right now the user must create the directory and specify
a mount point in the pod spec file, or pass the whole /var/run/openvswitch
or something like that, right?

Looking at userspace-cni-network-plugin, it actually just parses
annotations to find the shared directory and fails if there is
none:
 https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122

And the examples suggest specifying a directory to mount:
 https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41

Looks like this is left for the user to do by hand.

> 
> +Billy might give more insights on this
> 
>>>    4.3 create a port in OVS by invoking 'ovs-vsctl add-port' or
>>>        by connecting to ovsdb-server via JSON-RPC directly.
>>>        It will set the port type to dpdkvhostuserclient and specify
>>>        the socket path as a path inside the directory it created.
>>>        (OVS will create the port and rte_vhost will enter the
>>>         re-connection loop since the socket does not exist yet.)
>>>    4.4 set the socket file location as an environment variable in
>>>        the pod annotations.
>>>    4.5 report success to kubelet.
>>>
> 
> Since the CNI cannot modify the pod's mounts, it has to rely on a Device Plugin
> or other external entity that can inject the mount point before the pod is created.
> 
> However, there is another use case that might be relevant: dynamic attachment of
> network interfaces. In this case the CNI cannot work in collaboration with a
> Device Plugin or "mount-point injector" and an existing mount point has to be used.
> Also, some form of notification mechanism has to exist to tell the workload a
> new socket is ready.
> 
>>> 5. kubelet will finish all other preparations and resource
>>>    allocations and will ask docker/podman to start a container
>>>    with all mount points, devices and environment variables from
>>>    the pod annotation.
>>>
>>> 6. docker/podman starts a container.
>>>    It's worth mentioning here that in many cases the initial process
>>>    of a container is not the actual application that will use the
>>>    vhost-user connection, but more likely a shell that will invoke
>>>    the actual application.
>>>
>>> 7. The application starts inside the container, checks the environment
>>>    variables (actually, checking the environment variables usually
>>>    happens in a shell script that invokes the application with the
>>>    correct arguments) and creates a net_virtio_user port in server
>>>    mode.  At this point the socket file will be created.
>>>    (Since we're running a third-party application inside the container,
>>>     we can only assume that it will do what is written here; it's
>>>     the responsibility of the application developer to do the right
>>>     thing.)
>>>
>>> 8. OVS successfully re-connects to the newly created socket in a
>>>    shared directory and vhost-user protocol establishes the network
>>>    connectivity.
>>>
>>> As you can see, there are way too many entities and communication
>>> methods involved.  So, passing a pre-opened file descriptor from
>>> the CNI all the way down to the application is not as easy as it is
>>> in the QEMU+libvirt case.
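
Just to illustrate step 7 above: on the application side this boils
down to bringing up EAL with a virtio-user vdev in server mode.  A
rough sketch; the VHOST_SOCK variable name is something the CNI/device
plugin and the application would have to agree on, and the vdev
parameters are the net_virtio_user ones as I remember them:

    #include <stdio.h>
    #include <stdlib.h>
    #include <rte_eal.h>

    int main(int argc, char **argv)
    {
        char vdev[256];
        const char *sock = getenv("VHOST_SOCK");

        (void)argc;
        if (!sock) {
            fprintf(stderr, "VHOST_SOCK is not set\n");
            return 1;
        }

        /* server=1 makes virtio-user create the socket file, so OVS
         * (dpdkvhostuserclient) can re-connect to it from the host. */
        snprintf(vdev, sizeof vdev,
                 "--vdev=net_virtio_user0,path=%s,server=1,queues=1", sock);

        char *eal_argv[] = { argv[0], "--no-pci", vdev, NULL };

        if (rte_eal_init(3, eal_argv) < 0)
            return 1;

        /* ... usual port configuration and packet processing ... */
        return 0;
    }
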
>>
>> File descriptor passing isn't necessary if OVS owns the listen socket
>> and the application container is the one who connects. That's why I
>> asked why dpdkvhostuser was deprecated in another email. The benefit of
>> doing this would be that the application container can instantly connect
>> to OVS without a sleep loop.
>>
>> I still don't get the attraction of the broker idea. The pros:
>> + Overcomes the issue with stale UNIX domain socket files
>> + Eliminates the re-connect sleep loop
>>
>> Neutral:
>> * vhost-user UNIX domain socket directory container volume is replaced
>>   by broker UNIX domain socket bind mount
>> * UNIX domain socket naming conflicts become broker key naming conflicts
>>
>> The cons:
>> - Requires running a new service on the host with potential security
>>   issues
>> - Requires support in third-party applications, QEMU, and DPDK/OVS
>> - The old code must be kept for compatibility with non-broker
>>   configurations, especially since third-party applications may not
>>   support the broker. Developers and users will have to learn about both
>>   options and decide which one to use.
>>
>> This seems like a modest improvement for the complexity and effort
>> involved. The same pros can be achieved by:
>> * Adding unlink(2) to rte_vhost (or applications can add rm -f
>>   $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is
>>   it doesn't catch a misconfiguration where the user launches two
>>   processes with the same socket path.
>> * Reversing the direction of the client/server relationship to
>>   eliminate the re-connect sleep loop at startup. I'm unsure whether
>>   this is possible.
>>
>> That said, the broker idea doesn't affect the vhost-user protocol itself
>> and is more of an OVS/DPDK topic. I may just not be familiar enough with
>> OVS/DPDK to understand the benefits of the approach.
>>
>> Stefan
>>
> 


Thread overview: 38+ messages
2021-03-17 20:25 Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [PATCH 1/4] net/virtio: fix interrupt unregistering for listening socket Ilya Maximets
2021-03-25  8:32   ` Maxime Coquelin
2021-04-07  7:21     ` Xia, Chenbo
2021-03-17 20:25 ` [dpdk-dev] [RFC 2/4] vhost: add support for SocketPair Broker Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [RFC 3/4] net/vhost: " Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [RFC 4/4] net/virtio: " Ilya Maximets
2021-03-18 17:52 ` [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Stefan Hajnoczi
2021-03-18 19:47   ` Ilya Maximets
2021-03-18 20:14     ` Ilya Maximets
2021-03-19 14:16       ` Stefan Hajnoczi
2021-03-19 15:37         ` Ilya Maximets
2021-03-19 16:01           ` Stefan Hajnoczi
2021-03-19 16:02           ` Marc-André Lureau
2021-03-19  8:51     ` Marc-André Lureau
2021-03-19 11:25       ` Ilya Maximets
2021-03-19 14:05     ` Stefan Hajnoczi
2021-03-19 15:29       ` Ilya Maximets
2021-03-19 17:21         ` Stefan Hajnoczi
2021-03-23 17:57           ` Adrian Moreno
2021-03-23 18:27             ` Ilya Maximets [this message]
2021-03-23 20:54               ` Billy McFall
2021-03-24 12:05                 ` Stefan Hajnoczi
2021-03-24 13:11                   ` Ilya Maximets
2021-03-24 15:07                     ` Stefan Hajnoczi
2021-03-25  9:35                     ` Stefan Hajnoczi
2021-03-25 11:00                       ` Ilya Maximets
2021-03-25 16:43                         ` Stefan Hajnoczi
2021-03-25 17:58                           ` Ilya Maximets
2021-03-30 15:01                             ` Stefan Hajnoczi
2021-03-19 14:39 ` Stefan Hajnoczi
2021-03-19 16:11   ` Ilya Maximets
2021-03-19 16:45     ` Ilya Maximets
2021-03-24 20:56       ` Maxime Coquelin
2021-03-24 21:39         ` Ilya Maximets
2021-03-24 21:51           ` Maxime Coquelin
2021-03-24 22:17             ` Ilya Maximets
2023-06-30  3:45 ` Stephen Hemminger
