From: Ilya Maximets
To: Stefan Hajnoczi, Ilya Maximets
Cc: Billy McFall, Adrian Moreno, Maxime Coquelin, Chenbo Xia, dev@dpdk.org,
 Julia Suvorova, Marc-André Lureau, Daniel Berrange
Date: Thu, 25 Mar 2021 18:58:56 +0100
Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and
 virtio-user.

On 3/25/21 5:43 PM, Stefan Hajnoczi wrote:
> On Thu, Mar 25, 2021 at 12:00:11PM +0100, Ilya Maximets wrote:
>> On 3/25/21 10:35 AM, Stefan Hajnoczi wrote:
>>> On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
>>>> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
>>>>> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
>>>>>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets wrote:
>>>>>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
>>>>>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>>>>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>> - How to get this fd again after an OVS restart?  The CNI will not be
>>>>   invoked at that point to pass a new fd.
>>>>
>>>> - If the application closes the connection for any reason (restart,
>>>>   some reconfiguration internal to the application) and OVS is
>>>>   restarted at the same time, the abstract socket will be gone.  A
>>>>   persistent daemon is needed to hold it.
>>>
>>> I remembered that these two points can be solved by sd_notify(3)
>>> FDSTORE=1.  This requires that OVS runs as a systemd service.  Not sure
>>> if this is the case (at least in the CNI use case)?
>>>
>>> https://www.freedesktop.org/software/systemd/man/sd_notify.html
>>
>> IIUC, these file descriptors are only passed back on a restart of the
>> service, so the port-del + port-add scenario is not covered (and this is
>> a very common use case: users implement some configuration changes this
>> way, and it is also a possible internal scenario, e.g. this sequence is
>> triggered internally to change the OpenFlow port number).  port-del
>> releases all the resources, including the listening socket.  Keeping the
>> fd around for later use is not an option, because OVS will not know
>> whether the port will be added back, and fds are a limited resource.
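
(For illustration only: a minimal sketch of the fd-store mechanism
discussed above, assuming the process runs as a systemd service with
FileDescriptorStoreMax set and links against libsystemd.  The helper
names are made up, and, as noted, this only helps across a service
restart, not across port-del + port-add.)

/*
 * Sketch only: store a vhost-user listening socket in the systemd fd
 * store so that it survives a service restart, and find it again after
 * the restart.
 */
#include <stdio.h>
#include <string.h>
#include <systemd/sd-daemon.h>

static int store_listener_fd(int listen_fd, const char *port_name)
{
    char state[128];

    /* FDNAME= lets the fd be matched back to its port after restart. */
    snprintf(state, sizeof state, "FDSTORE=1\nFDNAME=%s", port_name);

    /* pid 0 means the calling process; 0 keeps $NOTIFY_SOCKET set. */
    return sd_pid_notify_with_fds(0, 0, state, &listen_fd, 1);
}

static int recover_listener_fd(const char *port_name)
{
    char **names = NULL;
    int n, i, fd = -1;

    /* Stored fds are handed back starting at SD_LISTEN_FDS_START (3). */
    n = sd_listen_fds_with_names(0, &names);
    for (i = 0; i < n; i++) {
        if (names && names[i] && strcmp(names[i], port_name) == 0)
            fd = SD_LISTEN_FDS_START + i;
    }
    return fd;  /* A real implementation would also free 'names'. */
}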
>
> If users of the CNI plugin are reasonably expected to do this then it
> sounds like a blocker for the sd_notify(3) approach. Maybe it could be
> fixed by introducing an atomic port-rename (?) operation, but this is
> starting to sound too invasive.

It's hard to implement, actually.  Things like 'port-rename' would be
implemented internally as del+add in most cases; anything else would
require a significant rework of OVS internals.  There are things that can
be adjusted on the fly, but some fundamental parts, like the OpenFlow
port number that every other part depends on, are not easy to change.

>
>> It's also unclear how to map these file descriptors to the particular
>> ports they belong to after a restart.
>
> The first fd would be a memfd containing a description of the remaining
> fds plus any other crash recovery state that OVS wants.

Yeah, I saw that it's possible to assign names to fds, so from this
perspective it's not a big problem.

>
>> OVS could run as a system pod or as a systemd service.  It differs from
>> one setup to another, so it might not be controlled by systemd.
>
> Does the CNI plugin allow both configurations?

The CNI itself runs as a DaemonSet (a pod on each node), and it doesn't
matter to it whether OVS runs on the host or in a different pod; they
share part of the filesystem (/var/run/openvswitch/ and some other
paths).  For example, OVN-K8s CNI provides an OVS DaemonSet:

https://github.com/ovn-org/ovn-kubernetes/blob/master/dist/templates/ovs-node.yaml.j2

Users can use it, but it's not required, and it makes no difference from
the CNI point of view.  Everything is a pod in k8s, but you can run some
parts on the host if you wish.  In general, the CNI plugin only needs a
network connection to the ovsdb-server process.  In practice, most CNI
plugins connect via the control socket in /var/run/openvswitch.

>
> It's impossible to come up with one approach that works for everyone in
> the general case (beyond the CNI plugin, beyond Kubernetes).

If we're looking for a way for OVS to store abstract sockets somewhere,
it's hard to come up with something generic: it will depend on a specific
init system anyway.  OTOH, the broker solution works for all cases. :)
One may think of the broker as a service that supplies abstract sockets
to processes in different namespaces, already connected for convenience.
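
(For reference: underneath, such a broker can rely on plain SCM_RIGHTS fd
passing over a UNIX socket.  The sketch below is illustrative only and is
not the actual SocketPair Broker protocol from this RFC; the abstract
name handling, the SOCK_SEQPACKET choice and the single-message exchange
are assumptions.)

/*
 * Illustrative only: a client connects to a broker-like service on an
 * abstract-namespace UNIX socket and receives one already-connected
 * socket end via SCM_RIGHTS.
 */
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int receive_connected_fd(const char *abstract_name)
{
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct sockaddr_un addr;
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    socklen_t len;
    char byte;
    int sock, fd = -1;

    sock = socket(AF_UNIX, SOCK_SEQPACKET, 0);
    if (sock < 0)
        return -1;

    /* Abstract socket: sun_path starts with a NUL byte, no fs entry. */
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path + 1, abstract_name, sizeof addr.sun_path - 2);
    len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(abstract_name);

    if (connect(sock, (struct sockaddr *) &addr, len) < 0)
        goto out;

    /* A real client would identify itself / name the requested pair here. */

    memset(&msg, 0, sizeof msg);
    iov.iov_base = &byte;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    if (recvmsg(sock, &msg, 0) <= 0)
        goto out;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET
        && cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof fd);

out:
    close(sock);
    return fd;  /* One end of a socketpair, already connected to the peer. */
}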
> I think we
> need to enumerate use cases and decide which ones are currently not
> addressed satisfactorily.
>
>> Also, it behaves as an old-style daemon, so it closes all the file
>> descriptors, forks and so on.  This might be adjusted, though, with
>> some rework of the daemonization procedure.
>
> Doesn't sound like fun but may be doable.

It really doesn't sound like fun, so I'd rather not do that unless we
have a solid use case.

>
>> On a side note, it may be interesting to allow the user application to
>> create a socket and pass a pollable file descriptor directly to
>> rte_vhost_driver_register() instead of a socket path.  This way the
>> application may choose an abstract socket, a file socket, or any future
>> type of socket connection.  It would also allow the application to
>> store these sockets somewhere, or receive them from systemd, an init
>> system, or other management software.
>
> Yes, sounds useful.
>
> Stefan
>
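
(To make the quoted side note concrete: a sketch of what an fd-based
registration could look like.  rte_vhost_driver_register_fd() is a
hypothetical name that does not exist in DPDK today; the existing
rte_vhost_driver_register() takes a socket path.  The application
creates, or receives, the listening socket itself and only hands a ready,
pollable fd to the vhost library.)

/*
 * Hypothetical sketch: rte_vhost_driver_register_fd() is NOT a real DPDK
 * API.  Here the application sets up an abstract-namespace listening
 * socket on its own and passes the fd instead of a path.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Imaginary fd-based counterpart of rte_vhost_driver_register(). */
int rte_vhost_driver_register_fd(int listen_fd, uint64_t flags);

static int register_abstract_vhost(const char *name, uint64_t flags)
{
    struct sockaddr_un addr;
    socklen_t len;
    int fd;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Abstract namespace: no filesystem entry to clean up or share. */
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path + 1, name, sizeof addr.sun_path - 2);
    len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);

    if (bind(fd, (struct sockaddr *) &addr, len) < 0 || listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }

    /* The fd could equally come from the systemd fd store or a broker. */
    return rte_vhost_driver_register_fd(fd, flags);
}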