DPDK patches and discussions
 help / color / mirror / Atom feed
From: Billy McFall <bmcfall@redhat.com>
To: Ilya Maximets <i.maximets@ovn.org>
Cc: "Adrian Moreno" <amorenoz@redhat.com>,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	"Maxime Coquelin" <maxime.coquelin@redhat.com>,
	"Chenbo Xia" <chenbo.xia@intel.com>,
	dev@dpdk.org, "Julia Suvorova" <jusual@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Daniel Berrange" <berrange@redhat.com>
Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
Date: Tue, 23 Mar 2021 16:54:57 -0400	[thread overview]
Message-ID: <CAKLkqD4qBZ=keMb0z4kyagbhrhLC=F2qexiSov3CVN+PsHfXdg@mail.gmail.com> (raw)
In-Reply-To: <53dd4b66-9e44-01c3-9f9a-b37dcadb14b7@ovn.org>

On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:

> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >
> >
> > On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>>>>> And some housekeeping usually required for applications in case the
> >>>>>>> socket server terminated abnormally and socket files left on a file
> >>>>>>> system:
> >>>>>>>  "failed to bind to vhu: Address already in use; remove it and try
> again"
> >>>>>>
> >>>>>> QEMU avoids this by unlinking before binding. The drawback is that
> users
> >>>>>> might accidentally hijack an existing listen socket, but that can be
> >>>>>> solved with a pidfile.
> >>>>>
> >>>>> How exactly this could be solved with a pidfile?
> >>>>
> >>>> A pidfile prevents two instances of the same service from running at
> the
> >>>> same time.
> >>>>
> >>>> The same effect can be achieved by the container orchestrator,
> systemd,
> >>>> etc too because it refuses to run the same service twice.
> >>>
> >>> Sure. I understand that.  My point was that these could be 2 different
> >>> applications and they might not know which process to look for.
> >>>
> >>>>
> >>>>> And what if this is
> >>>>> a different application that tries to create a socket on a same path?
> >>>>> e.g. QEMU creates a socket (started in a server mode) and user
> >>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
> >>>>> dpdkvhostuserclient.  This way rte_vhost library will try to bind
> >>>>> to an existing socket file and will fail.  Subsequently port creation
> >>>>> in OVS will fail.   We can't allow OVS to unlink files because this
> >>>>> way OVS users will have ability to unlink random sockets that OVS has
> >>>>> access to and we also has no idea if it's a QEMU that created a file
> >>>>> or it was a virtio-user application or someone else.
> >>>>
> >>>> If rte_vhost unlinks the socket then the user will find that
> networking
> >>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
> device
> >>>> or restart QEMU, depending on whether they need to keep the guest
> >>>> running or not. This is a misconfiguration that is recoverable.
> >>>
> >>> True, it's recoverable, but with a high cost.  Restart of a VM is
> rarely
> >>> desirable.  And the application inside the guest might not feel itself
> >>> well after hot re-plug of a device that it actively used.  I'd expect
> >>> a DPDK application that runs inside a guest on some virtio-net device
> >>> to crash after this kind of manipulations.  Especially, if it uses some
> >>> older versions of DPDK.
> >>
> >> This unlink issue is probably something we think differently about.
> >> There are many ways for users to misconfigure things when working with
> >> system tools. If it's possible to catch misconfigurations that is
> >> preferrable. In this case it's just the way pathname AF_UNIX domain
> >> sockets work and IMO it's better not to have problems starting the
> >> service due to stale files than to insist on preventing
> >> misconfigurations. QEMU and DPDK do this differently and both seem to be
> >> successful, so ¯\_(ツ)_/¯.
> >>
> >>>>
> >>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
> >>>> create a security issue. I don't know the security model of OVS.
> >>>
> >>> In general privileges of a ovs-vswitchd daemon might be completely
> >>> different from privileges required to invoke control utilities or
> >>> to access the configuration database.  SO, yes, we should not allow
> >>> that.
> >>
> >> That can be locked down by restricting the socket path to a file beneath
> >> /var/run/ovs/vhost-user/.
> >>
> >>>>
> >>>>> There are, probably, ways to detect if there is any alive process
> that
> >>>>> has this socket open, but that sounds like too much for this purpose,
> >>>>> also I'm not sure if it's possible if actual user is in a different
> >>>>> container.
> >>>>> So I don't see a good reliable way to detect these conditions.  This
> >>>>> falls on shoulders of a higher level management software or a user to
> >>>>> clean these socket files up before adding ports.
> >>>>
> >>>> Does OVS always run in the same net namespace (pod) as the DPDK
> >>>> application? If yes, then abstract AF_UNIX sockets can be used.
> Abstract
> >>>> AF_UNIX sockets don't have a filesystem path and the socket address
> >>>> disappears when there is no process listening anymore.
> >>>
> >>> OVS is usually started right on the host in a main network namespace.
> >>> In case it's started in a pod, it will run in a separate container but
> >>> configured with a host network.  Applications almost exclusively runs
> >>> in separate pods.
> >>
> >> Okay.
> >>
> >>>>>>> This patch-set aims to eliminate most of the inconveniences by
> >>>>>>> leveraging an infrastructure service provided by a SocketPair
> Broker.
> >>>>>>
> >>>>>> I don't understand yet why this is useful for vhost-user, where the
> >>>>>> creation of the vhost-user device backend and its use by a VMM are
> >>>>>> closely managed by one piece of software:
> >>>>>>
> >>>>>> 1. Unlink the socket path.
> >>>>>> 2. Create, bind, and listen on the socket path.
> >>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >>>>>>    RPC, spawn a process, etc) and pass in the listen fd.
> >>>>>> 4. In the meantime the VMM can open the socket path and call
> connect(2).
> >>>>>>    As soon as the vhost-user device backend calls accept(2) the
> >>>>>>    connection will proceed (there is no need for sleeping).
> >>>>>>
> >>>>>> This approach works across containers without a broker.
> >>>>>
> >>>>> Not sure if I fully understood a question here, but anyway.
> >>>>>
> >>>>> This approach works fine if you know what application to run.
> >>>>> In case of a k8s cluster, it might be a random DPDK application
> >>>>> with virtio-user ports running inside a container and want to
> >>>>> have a network connection.  Also, this application needs to run
> >>>>> virtio-user in server mode, otherwise restart of the OVS will
> >>>>> require restart of the application.  So, you basically need to
> >>>>> rely on a third-party application to create a socket with a right
> >>>>> name and in a correct location that is shared with a host, so
> >>>>> OVS can find it and connect.
> >>>>>
> >>>>> In a VM world everything is much more simple, since you have
> >>>>> a libvirt and QEMU that will take care of all of these stuff
> >>>>> and which are also under full control of management software
> >>>>> and a system administrator.
> >>>>> In case of a container with a "random" DPDK application inside
> >>>>> there is no such entity that can help.  Of course, some solution
> >>>>> might be implemented in docker/podman daemon to create and manage
> >>>>> outside-looking sockets for an application inside the container,
> >>>>> but that is not available today AFAIK and I'm not sure if it
> >>>>> ever will.
> >>>>
> >>>> Wait, when you say there is no entity like management software or a
> >>>> system administrator, then how does OVS know to instantiate the new
> >>>> port? I guess something still needs to invoke ovs-ctl add-port?
> >>>
> >>> I didn't mean that there is no any application that configures
> >>> everything.  Of course, there is.  I mean that there is no such
> >>> entity that abstracts all that socket machinery from the user's
> >>> application that runs inside the container.  QEMU hides all the
> >>> details of the connection to vhost backend and presents the device
> >>> as a PCI device with a network interface wrapping from the guest
> >>> kernel.  So, the application inside VM shouldn't care what actually
> >>> there is a socket connected to OVS that implements backend and
> >>> forward traffic somewhere.  For the application it's just a usual
> >>> network interface.
> >>> But in case of a container world, application should handle all
> >>> that by creating a virtio-user device that will connect to some
> >>> socket, that has an OVS on the other side.
> >>>
> >>>>
> >>>> Can you describe the steps used today (without the broker) for
> >>>> instantiating a new DPDK app container and connecting it to OVS?
> >>>> Although my interest is in the vhost-user protocol I think it's
> >>>> necessary to understand the OVS requirements here and I know little
> >>>> about them.
> >>>>> I might describe some things wrong since I worked with k8s and CNI
> >>> plugins last time ~1.5 years ago, but the basic schema will look
> >>> something like this:
> >>>
> >>> 1. user decides to start a new pod and requests k8s to do that
> >>>    via cmdline tools or some API calls.
> >>>
> >>> 2. k8s scheduler looks for available resources asking resource
> >>>    manager plugins, finds an appropriate physical host and asks
> >>>    local to that node kubelet daemon to launch a new pod there.
> >>>
> >
> > When the CNI is called, the pod has already been created, i.e: a PodID
> exists
> > and so does an associated network namespace. Therefore, everything that
> has to
> > do with the runtime spec such as mountpoints or devices cannot be
> modified by
> > this time.
> >
> > That's why the Device Plugin API is used to modify the Pod's spec before
> the CNI
> > chain is called.
> >
> >>> 3. kubelet asks local CNI plugin to allocate network resources
> >>>    and annotate the pod with required mount points, devices that
> >>>    needs to be passed in and environment variables.
> >>>    (this is, IIRC, a gRPC connection.   It might be a multus-cni
> >>>    or kuryr-kubernetes or any other CNI plugin.  CNI plugin is
> >>>    usually deployed as a system DaemonSet, so it runs in a
> >>>    separate pod.
> >>>
> >>> 4. Assuming that vhost-user connection requested in server mode.
> >>>    CNI plugin will:
> >>>    4.1 create a directory for a vhost-user socket.
> >>>    4.2 add this directory to pod annotations as a mount point.
> >
> > I believe this is not possible, it would have to inspect the pod's spec
> or
> > otherwise determine an existing mount point where the socket should be
> created.
>
> Uff.  Yes, you're right.  Thanks for your clarification.
> I mixed up CNI and Device Plugin here.
>
> CNI itself is not able to annotate new resources to the pod, i.e.
> create new mounts or something like this.   And I don't recall any
> vhost-user device plugins.  Is there any?  There is an SR-IOV device
> plugin, but its purpose is to allocate and pass PCI devices, not create
> mounts for vhost-user.
>
> So, IIUC, right now user must create the directory and specify
> a mount point in a pod spec file or pass the whole /var/run/openvswitch
> or something like this, right?
>
> Looking at userspace-cni-network-plugin, it actually just parses
> annotations to find the shared directory and fails if there is
> no any:
>
> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
>
> And examples suggests to specify a directory to mount:
>
> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
>
> Looks like this is done by user's hands.
>
> Yes, I am one of the primary authors of Userspace CNI. Currently, the
directory is by hand. Long term thought was to have a mutating
webhook/admission controller inject a directory into the podspec.  Not sure
if it has changed, but I think when I was originally doing this work, OvS
only lets you choose the directory at install time, so it has to be
something like /var/run/openvswitch/. You can choose the socketfile name
and maybe a subdirectory off the main directory, but not the full path.

One of the issues I was trying to solve was making sure ContainerA couldn't
see ContainerB's socketfiles. That's where the admission controller could
create a unique subdirectory for each container under
/var/run/openvswitch/. But this was more of a PoC CNI and other work items
always took precedence so that work never completed.

Billy

>
> > +Billy might give more insights on this
> >
> >>>    4.3 create a port in OVS by invoking 'ovs-vsctl port-add' or
> >>>        by connecting to ovsdb-server by JSONRPC directly.
> >>>        It will set port type as dpdkvhostuserclient and specify
> >>>        socket-path as a path inside the directory it created.
> >>>        (OVS will create a port and rte_vhost will enter the
> >>>         re-connection loop since socket does not exist yet.)
> >>>    4.4 Set up socket file location as environment variable in
> >>>        pod annotations.
> >>>    4.5 report success to kubelet.
> >>>
> >
> > Since the CNI cannot modify the pod's mounts it has to rely on a Device
> Plugin
> > or other external entity that can inject the mount point before the pod
> is created.
> >
> > However, there is another usecase that might be relevant: dynamic
> attachment of
> > network interfaces. In this case the CNI cannot work in collaboration
> with a
> > Device Plugin or "mount-point injector" and an existing mount point has
> to be used.
> > Also, some form of notification mechanism has to exist to tell the
> workload a
> > new socket is ready.
> >
> >>> 5. kubelet will finish all other preparations and resource
> >>>    allocations and will ask docker/podman to start a container
> >>>    with all mount points, devices and environment variables from
> >>>    the pod annotation.
> >>>
> >>> 6. docker/podman starts a container.
> >>>    Need to mention here that in many cases initial process of
> >>>    a container is not the actual application that will use a
> >>>    vhost-user connection, but likely a shell that will invoke
> >>>    the actual application.
> >>>
> >>> 7. Application starts inside the container, checks the environment
> >>>    variables (actually, checking of environment variables usually
> >>>    happens in a shell script that invokes the application with
> >>>    correct arguments) and creates a net_virtio_user port in server
> >>>    mode.  At this point socket file will be created.
> >>>    (since we're running third-party application inside the container
> >>>     we can only assume that it will do what is written here, it's
> >>>     a responsibility of an application developer to do the right
> >>>     thing.)
> >>>
> >>> 8. OVS successfully re-connects to the newly created socket in a
> >>>    shared directory and vhost-user protocol establishes the network
> >>>    connectivity.
> >>>
> >>> As you can wee, there are way too many entities and communication
> >>> methods involved.  So, passing a pre-opened file descriptor from
> >>> CNI all the way down to application is not that easy as it is in
> >>> case of QEMU+LibVirt.
> >>
> >> File descriptor passing isn't necessary if OVS owns the listen socket
> >> and the application container is the one who connects. That's why I
> >> asked why dpdkvhostuser was deprecated in another email. The benefit of
> >> doing this would be that the application container can instantly connect
> >> to OVS without a sleep loop.
> >>
> >> I still don't get the attraction of the broker idea. The pros:
> >> + Overcomes the issue with stale UNIX domain socket files
> >> + Eliminates the re-connect sleep loop
> >>
> >> Neutral:
> >> * vhost-user UNIX domain socket directory container volume is replaced
> >>   by broker UNIX domain socket bind mount
> >> * UNIX domain socket naming conflicts become broker key naming conflicts
> >>
> >> The cons:
> >> - Requires running a new service on the host with potential security
> >>   issues
> >> - Requires support in third-party applications, QEMU, and DPDK/OVS
> >> - The old code must be kept for compatibility with non-broker
> >>   configurations, especially since third-party applications may not
> >>   support the broker. Developers and users will have to learn about both
> >>   options and decide which one to use.
> >>
> >> This seems like a modest improvement for the complexity and effort
> >> involved. The same pros can be achieved by:
> >> * Adding unlink(2) to rte_vhost (or applications can add rm -f
> >>   $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is
> >>   it doesn't catch a misconfiguration where the user launches two
> >>   processes with the same socket path.
> >> * Reversing the direction of the client/server relationship to
> >>   eliminate the re-connect sleep loop at startup. I'm unsure whether
> >>   this is possible.
> >>
> >> That said, the broker idea doesn't affect the vhost-user protocol itself
> >> and is more of an OVS/DPDK topic. I may just not be familiar enough with
> >> OVS/DPDK to understand the benefits of the approach.
> >>
> >> Stefan
> >>
> >
>
>

-- 
*Billy McFall*
Networking Group
CTO Office
*Red Hat*

  reply	other threads:[~2021-03-23 20:55 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-17 20:25 Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [PATCH 1/4] net/virtio: fix interrupt unregistering for listening socket Ilya Maximets
2021-03-25  8:32   ` Maxime Coquelin
2021-04-07  7:21     ` Xia, Chenbo
2021-03-17 20:25 ` [dpdk-dev] [RFC 2/4] vhost: add support for SocketPair Broker Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [RFC 3/4] net/vhost: " Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [RFC 4/4] net/virtio: " Ilya Maximets
2021-03-18 17:52 ` [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Stefan Hajnoczi
2021-03-18 19:47   ` Ilya Maximets
2021-03-18 20:14     ` Ilya Maximets
2021-03-19 14:16       ` Stefan Hajnoczi
2021-03-19 15:37         ` Ilya Maximets
2021-03-19 16:01           ` Stefan Hajnoczi
2021-03-19 16:02           ` Marc-André Lureau
2021-03-19  8:51     ` Marc-André Lureau
2021-03-19 11:25       ` Ilya Maximets
2021-03-19 14:05     ` Stefan Hajnoczi
2021-03-19 15:29       ` Ilya Maximets
2021-03-19 17:21         ` Stefan Hajnoczi
2021-03-23 17:57           ` Adrian Moreno
2021-03-23 18:27             ` Ilya Maximets
2021-03-23 20:54               ` Billy McFall [this message]
2021-03-24 12:05                 ` Stefan Hajnoczi
2021-03-24 13:11                   ` Ilya Maximets
2021-03-24 15:07                     ` Stefan Hajnoczi
2021-03-25  9:35                     ` Stefan Hajnoczi
2021-03-25 11:00                       ` Ilya Maximets
2021-03-25 16:43                         ` Stefan Hajnoczi
2021-03-25 17:58                           ` Ilya Maximets
2021-03-30 15:01                             ` Stefan Hajnoczi
2021-03-19 14:39 ` Stefan Hajnoczi
2021-03-19 16:11   ` Ilya Maximets
2021-03-19 16:45     ` Ilya Maximets
2021-03-24 20:56       ` Maxime Coquelin
2021-03-24 21:39         ` Ilya Maximets
2021-03-24 21:51           ` Maxime Coquelin
2021-03-24 22:17             ` Ilya Maximets
2023-06-30  3:45 ` Stephen Hemminger

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAKLkqD4qBZ=keMb0z4kyagbhrhLC=F2qexiSov3CVN+PsHfXdg@mail.gmail.com' \
    --to=bmcfall@redhat.com \
    --cc=amorenoz@redhat.com \
    --cc=berrange@redhat.com \
    --cc=chenbo.xia@intel.com \
    --cc=dev@dpdk.org \
    --cc=i.maximets@ovn.org \
    --cc=jusual@redhat.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=maxime.coquelin@redhat.com \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).