From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 1664442B21; Tue, 16 May 2023 10:53:11 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 9A01840C35; Tue, 16 May 2023 10:53:10 +0200 (CEST) Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by mails.dpdk.org (Postfix) with ESMTP id EA4864068E for ; Tue, 16 May 2023 10:53:08 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1684227188; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=jQb+ZkTKaskuFrS1V32zawCHassTfFgztBC7QesLlD4=; b=IqagL+XKtA/XYQmQ+8fi8bfVeAm/l2fxmkz5a3zL+Z5/JYXcUo50FF4pNIdhI9uCwoZksy ES6JK7fupjSv+pFo9K1dwKaTwqoqDwrVB+pThGtk+SXXSRq8mvi9pjCzkodIaj3MT3Vjh1 uGxRL3bOtA2orUe6rdpBPyOZHyMmV5A= Received: from mail-ed1-f69.google.com (mail-ed1-f69.google.com [209.85.208.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-265-xneY-Wh0O5-APvorc2H6GQ-1; Tue, 16 May 2023 04:53:07 -0400 X-MC-Unique: xneY-Wh0O5-APvorc2H6GQ-1 Received: by mail-ed1-f69.google.com with SMTP id 4fb4d7f45d1cf-50b4bfb51b0so12264370a12.0 for ; Tue, 16 May 2023 01:53:06 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1684227185; x=1686819185; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=JhEqvbpuCjmjwITGH9dAYoz+2/o0nO6JX1wqkOU636w=; b=jjMOaYcbqeRII74C8lM6GKwCU4zopbOHvF3fn8wcPQGBwMT02iM5gcIIC56zQQ9pNt Geod+GgANljvtNRe6u4uLiLueETSiEYpZKKg4lkp6cBbFY8sEFqMp+3+RByTNJLuxZYa IRIPP9pboucuxNjthf+l7j3RntVKjHnmHtdbOd1DPE38N9RX/AJGICesxRIYL1omRU75 d2157Xp8Ij6HocCVupGc4MKC+u/e//qbxhc2ahlb+6ZtvZsl3aeHd3vn8xAfWnSF+A3m +SX2VUX0CACbM/roPLEeYBf+I9W+rPlQmd36/LpuZSdfpjhWbzn34YCAdeH7qsH6CF0E FiDg== X-Gm-Message-State: AC+VfDxBtuwUz7PC6hyP1VQjybFnXiUHOUgdhc0ywoLJEeHNRyD2vHag WoQX0B6D4QhgaivxJNoWKTbSvc7pBtLuZO1HG+YYW9IPUXA51XBWSq2qyXdcce4koS7TNhvFXqG owSM= X-Received: by 2002:a05:6402:541:b0:50b:d34c:4710 with SMTP id i1-20020a056402054100b0050bd34c4710mr27718045edx.5.1684227185194; Tue, 16 May 2023 01:53:05 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ7ajFeWzI7Nje8gsKAHa9xk9ywmsSazJJ/ZBYzucWq305Qn0BVp3AzwpnaY6mnSodE+bRR8xQ== X-Received: by 2002:a05:6402:541:b0:50b:d34c:4710 with SMTP id i1-20020a056402054100b0050bd34c4710mr27718019edx.5.1684227184613; Tue, 16 May 2023 01:53:04 -0700 (PDT) Received: from [10.39.192.202] (5920ab7b.static.cust.trined.nl. [89.32.171.123]) by smtp.gmail.com with ESMTPSA id s24-20020aa7d798000000b0050bc37ff74asm8133326edq.44.2023.05.16.01.53.03 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 16 May 2023 01:53:03 -0700 (PDT) From: Eelco Chaudron To: David Marchand Cc: maxime.coquelin@redhat.com, chenbo.xia@intel.com, dev@dpdk.org Subject: Re: [PATCH v2 3/3] vhost: add device op to offload the interrupt kick Date: Tue, 16 May 2023 10:53:00 +0200 X-Mailer: MailMate (1.14r5964) Message-ID: <683D0767-DCE8-45F4-B904-CD2DD574DE53@redhat.com> In-Reply-To: References: <168069838578.833254.4856666346839028593.stgit@ebuild.local> <168069850278.833254.17836553051894685658.stgit@ebuild.local> MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org On 10 May 2023, at 13:44, David Marchand wrote: > On Wed, Apr 5, 2023 at 2:42=E2=80=AFPM Eelco Chaudron wrote: >> >> This patch adds an operation callback which gets called every time the >> library wants to call eventfd_write(). This eventfd_write() call could >> result in a system call, which could potentially block the PMD thread. >> >> The callback function can decide whether it's ok to handle the >> eventfd_write() now or have the newly introduced function, >> rte_vhost_notify_guest(), called at a later time. >> >> This can be used by 3rd party applications, like OVS, to avoid system >> calls being called as part of the PMD threads. >> >> Signed-off-by: Eelco Chaudron >> --- >> lib/vhost/meson.build | 2 + >> lib/vhost/rte_vhost.h | 23 +++++++++++++++- >> lib/vhost/socket.c | 72 ++++++++++++++++++++++++++++++++++++++++++= ++++--- >> lib/vhost/version.map | 9 ++++++ >> lib/vhost/vhost.c | 38 ++++++++++++++++++++++++++ >> lib/vhost/vhost.h | 65 +++++++++++++++++++++++++++++++-----------= -- >> 6 files changed, 184 insertions(+), 25 deletions(-) >> >> diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build >> index 197a51d936..e93ba6b078 100644 >> --- a/lib/vhost/meson.build >> +++ b/lib/vhost/meson.build >> @@ -39,3 +39,5 @@ driver_sdk_headers =3D files( >> 'vdpa_driver.h', >> ) >> deps +=3D ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev'] >> + >> +use_function_versioning =3D true >> diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h >> index 58a5d4be92..7a10bc36cf 100644 >> --- a/lib/vhost/rte_vhost.h >> +++ b/lib/vhost/rte_vhost.h >> @@ -298,7 +298,13 @@ struct rte_vhost_device_ops { >> */ >> void (*guest_notified)(int vid); >> >> - void *reserved[1]; /**< Reserved for future extension */ >> + /** >> + * If this callback is registered, notification to the guest can >> + * be handled by the front-end calling rte_vhost_notify_guest(). >> + * If it's not handled, 'false' should be returned. This can be = used >> + * to remove the "slow" eventfd_write() syscall from the datapat= h. >> + */ >> + bool (*guest_notify)(int vid, uint16_t queue_id); >> }; >> >> /** >> @@ -433,6 +439,21 @@ void rte_vhost_log_used_vring(int vid, uint16_t vri= ng_idx, >> >> int rte_vhost_enable_guest_notification(int vid, uint16_t queue_id, int= enable); >> >> +/** >> + * @warning >> + * @b EXPERIMENTAL: this API may change, or be removed, without prior n= otice. >> + * >> + * Inject the offloaded interrupt into the vhost device's queue. For mo= re >> + * details see the 'guest_notify' vhost device operation. >> + * >> + * @param vid >> + * vhost device ID >> + * @param queue_id >> + * virtio queue index >> + */ >> +__rte_experimental >> +void rte_vhost_notify_guest(int vid, uint16_t queue_id); >> + >> /** >> * Register vhost driver. path could be different for multiple >> * instance support. >> diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c >> index 669c322e12..787d6bacf8 100644 >> --- a/lib/vhost/socket.c >> +++ b/lib/vhost/socket.c >> @@ -15,6 +15,7 @@ >> #include >> #include >> >> +#include >> #include >> >> #include "fd_man.h" >> @@ -43,6 +44,7 @@ struct vhost_user_socket { >> bool async_copy; >> bool net_compliant_ol_flags; >> bool stats_enabled; >> + bool alloc_notify_ops; >> >> /* >> * The "supported_features" indicates the feature bits the >> @@ -846,6 +848,14 @@ vhost_user_socket_mem_free(struct vhost_user_socket= *vsocket) >> vsocket->path =3D NULL; >> } >> >> + if (vsocket && vsocket->alloc_notify_ops) { >> +#pragma GCC diagnostic push >> +#pragma GCC diagnostic ignored "-Wcast-qual" >> + free((struct rte_vhost_device_ops *)vsocket->notify_ops)= ; >> +#pragma GCC diagnostic pop >> + vsocket->notify_ops =3D NULL; >> + } > > Rather than select the behavior based on a boolean (and here force the > compiler to close its eyes), I would instead add a non const pointer > to ops (let's say alloc_notify_ops) in vhost_user_socket. > The code can then unconditionnally call free(vsocket->alloc_notify_ops); Good idea, I will make the change in v3. >> + >> if (vsocket) { >> free(vsocket); >> vsocket =3D NULL; >> @@ -1099,21 +1109,75 @@ rte_vhost_driver_unregister(const char *path) >> /* >> * Register ops so that we can add/remove device to data core. >> */ >> -int >> -rte_vhost_driver_callback_register(const char *path, >> - struct rte_vhost_device_ops const * const ops) >> +static int >> +rte_vhost_driver_callback_register__(const char *path, > > No need for a rte_ prefix on static symbol. > And we can simply call this internal helper vhost_driver_callback_registe= r(). ACK. >> + struct rte_vhost_device_ops const * const ops, bool ops_allocate= d) >> { >> struct vhost_user_socket *vsocket; >> >> pthread_mutex_lock(&vhost_user.mutex); >> vsocket =3D find_vhost_user_socket(path); >> - if (vsocket) >> + if (vsocket) { >> + if (vsocket->alloc_notify_ops) { >> + vsocket->alloc_notify_ops =3D false; >> +#pragma GCC diagnostic push >> +#pragma GCC diagnostic ignored "-Wcast-qual" >> + free((struct rte_vhost_device_ops *)vsocket->not= ify_ops); >> +#pragma GCC diagnostic pop >> + } >> vsocket->notify_ops =3D ops; >> + if (ops_allocated) >> + vsocket->alloc_notify_ops =3D true; >> + } >> pthread_mutex_unlock(&vhost_user.mutex); >> >> return vsocket ? 0 : -1; >> } >> >> +int __vsym >> +rte_vhost_driver_callback_register_v24(const char *path, >> + struct rte_vhost_device_ops const * const ops) >> +{ >> + return rte_vhost_driver_callback_register__(path, ops, false); >> +} >> + >> +int __vsym >> +rte_vhost_driver_callback_register_v23(const char *path, >> + struct rte_vhost_device_ops const * const ops) >> +{ >> + int ret; >> + >> + /* >> + * Although the ops structure is a const structure, we do need t= o >> + * override the guest_notify operation. This is because with the >> + * previous APIs it was "reserved" and if any garbage value was = passed, >> + * it could crash the application. >> + */ >> + if (ops && !ops->guest_notify) { > > Hum, as described in the comment above, I don't think we should look > at ops->guest_notify value at all. > Checking ops !=3D NULL should be enough. Not sure I get you here. If the guest_notify passed by the user is NULL, it= means the previously =E2=80=98reserved[1]=E2=80=99 field is NULL, so we do= not need to use a new structure. I guess your comment would be true if we would introduce a new field in the= data structure, not replacing a reserved one. >> + struct rte_vhost_device_ops *new_ops; >> + >> + new_ops =3D malloc(sizeof(*new_ops)); > > Strictly speaking, we lose the numa affinity of "ops" by calling malloc. > I am unclear of the impact though. Don=E2=80=99t think there is a portable API that we can use to determine th= e NUMA for the ops memory and then allocate this on the same numa? Any thoughts or ideas on how to solve this? I hope most people will memset(= ) the ops structure and the reserved[1] part is zero, but it might be a pro= blem in the future if more extensions get added. >> + if (new_ops =3D=3D NULL) >> + return -1; >> + >> + memcpy(new_ops, ops, sizeof(*new_ops)); >> + new_ops->guest_notify =3D NULL; >> + >> + ret =3D rte_vhost_driver_callback_register__(path, ops, = true); >> + } else { >> + ret =3D rte_vhost_driver_callback_register__(path, ops, = false); >> + } >> + >> + return ret; >> +} >> + >> +/* Mark the v23 function as the old version, and v24 as the default ver= sion. */ >> +VERSION_SYMBOL(rte_vhost_driver_callback_register, _v23, 23); >> +BIND_DEFAULT_SYMBOL(rte_vhost_driver_callback_register, _v24, 24); >> +MAP_STATIC_SYMBOL(int rte_vhost_driver_callback_register(const char *pa= th, >> + struct rte_vhost_device_ops const * const ops), >> + rte_vhost_driver_callback_register_v24); >> + >> struct rte_vhost_device_ops const * >> vhost_driver_callback_get(const char *path) >> { >> diff --git a/lib/vhost/version.map b/lib/vhost/version.map >> index d322a4a888..7bcbfd12cf 100644 >> --- a/lib/vhost/version.map >> +++ b/lib/vhost/version.map >> @@ -64,6 +64,12 @@ DPDK_23 { >> local: *; >> }; >> >> +DPDK_24 { >> + global: >> + >> + rte_vhost_driver_callback_register; >> +} DPDK_23; >> + >> EXPERIMENTAL { >> global: >> >> @@ -98,6 +104,9 @@ EXPERIMENTAL { >> # added in 22.11 >> rte_vhost_async_dma_unconfigure; >> rte_vhost_vring_call_nonblock; >> + >> + # added in 23.07 >> + rte_vhost_notify_guest; >> }; >> >> INTERNAL { >> diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c >> index 8ff6434c93..79e88f986e 100644 >> --- a/lib/vhost/vhost.c >> +++ b/lib/vhost/vhost.c >> @@ -44,6 +44,10 @@ static const struct vhost_vq_stats_name_off vhost_vq_= stat_strings[] =3D { >> {"size_1024_1518_packets", offsetof(struct vhost_virtqueue, stat= s.size_bins[6])}, >> {"size_1519_max_packets", offsetof(struct vhost_virtqueue, stat= s.size_bins[7])}, >> {"guest_notifications", offsetof(struct vhost_virtqueue, stat= s.guest_notifications)}, >> + {"guest_notifications_offloaded", offsetof(struct vhost_virtqueu= e, >> + stats.guest_notifications_offloaded)}, >> + {"guest_notifications_error", offsetof(struct vhost_virtqueue, >> + stats.guest_notifications_error)}, >> {"iotlb_hits", offsetof(struct vhost_virtqueue, stat= s.iotlb_hits)}, >> {"iotlb_misses", offsetof(struct vhost_virtqueue, stat= s.iotlb_misses)}, >> {"inflight_submitted", offsetof(struct vhost_virtqueue, stat= s.inflight_submitted)}, >> @@ -1467,6 +1471,40 @@ rte_vhost_enable_guest_notification(int vid, uint= 16_t queue_id, int enable) >> return ret; >> } >> >> +void >> +rte_vhost_notify_guest(int vid, uint16_t queue_id) >> +{ >> + struct virtio_net *dev =3D get_device(vid); >> + struct vhost_virtqueue *vq; >> + >> + if (!dev || queue_id >=3D VHOST_MAX_VRING) >> + return; >> + >> + vq =3D dev->virtqueue[queue_id]; >> + if (!vq) >> + return; >> + >> + rte_rwlock_read_lock(&vq->access_lock); >> + >> + if (vq->callfd >=3D 0) { >> + int ret =3D eventfd_write(vq->callfd, (eventfd_t)1); >> + >> + if (ret) { >> + if (dev->flags & VIRTIO_DEV_STATS_ENABLED) >> + __atomic_fetch_add(&vq->stats.guest_noti= fications_error, >> + 1, __ATOMIC_RELAXED); >> + } else { >> + if (dev->flags & VIRTIO_DEV_STATS_ENABLED) >> + __atomic_fetch_add(&vq->stats.guest_noti= fications, >> + 1, __ATOMIC_RELAXED); >> + if (dev->notify_ops->guest_notified) >> + dev->notify_ops->guest_notified(dev->vid= ); >> + } >> + } >> + >> + rte_rwlock_read_unlock(&vq->access_lock); >> +} >> + >> void >> rte_vhost_log_write(int vid, uint64_t addr, uint64_t len) >> { >> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h >> index 37609c7c8d..0ab75b78b5 100644 >> --- a/lib/vhost/vhost.h >> +++ b/lib/vhost/vhost.h >> @@ -141,6 +141,8 @@ struct virtqueue_stats { >> uint64_t inflight_completed; >> /* Counters below are atomic, and should be incremented as such.= */ >> uint64_t guest_notifications; >> + uint64_t guest_notifications_offloaded; >> + uint64_t guest_notifications_error; >> }; >> >> /** >> @@ -884,6 +886,35 @@ vhost_need_event(uint16_t event_idx, uint16_t new_i= dx, uint16_t old) >> return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx = - old); >> } >> >> +static __rte_always_inline void >> +vhost_vring_inject_irq(struct virtio_net *dev, struct vhost_virtqueue *= vq) >> +{ >> + int ret; >> + >> + if (dev->notify_ops->guest_notify && >> + dev->notify_ops->guest_notify(dev->vid, vq->index)) { >> + if (dev->flags & VIRTIO_DEV_STATS_ENABLED) >> + __atomic_fetch_add(&vq->stats.guest_notification= s_offloaded, >> + 1, __ATOMIC_RELAXED); >> + return; >> + } >> + >> + ret =3D eventfd_write(vq->callfd, (eventfd_t) 1); >> + if (ret) { >> + if (dev->flags & VIRTIO_DEV_STATS_ENABLED) >> + __atomic_fetch_add(&vq->stats.guest_notification= s_error, >> + 1, __ATOMIC_RELAXED); >> + return; >> + } >> + >> + if (dev->flags & VIRTIO_DEV_STATS_ENABLED) >> + __atomic_fetch_add(&vq->stats.guest_notifications, >> + 1, __ATOMIC_RELAXED); >> + if (dev->notify_ops->guest_notified) >> + dev->notify_ops->guest_notified(dev->vid); >> +} >> + >> + > > One empty line is enough. Will remove. >> static __rte_always_inline void >> vhost_vring_call_split(struct virtio_net *dev, struct vhost_virtqueue *= vq) >> { >> @@ -903,26 +934,16 @@ vhost_vring_call_split(struct virtio_net *dev, str= uct vhost_virtqueue *vq) >> "%s: used_event_idx=3D%d, old=3D%d, new=3D%d\n", >> __func__, vhost_used_event(vq), old, new); >> >> - if ((vhost_need_event(vhost_used_event(vq), new, old) && >> - (vq->callfd >=3D 0)) || >> - unlikely(!signalled_used_valid)) { >> - eventfd_write(vq->callfd, (eventfd_t) 1); >> - if (dev->flags & VIRTIO_DEV_STATS_ENABLED) >> - __atomic_fetch_add(&vq->stats.guest_noti= fications, >> - 1, __ATOMIC_RELAXED); >> - if (dev->notify_ops->guest_notified) >> - dev->notify_ops->guest_notified(dev->vid= ); >> + if ((vhost_need_event(vhost_used_event(vq), new, old) || >> + unlikely(!signalled_used_valid)) && >> + =09=09 vq->callfd >=3D 0) { > > A bit hard to read (even before your change). > > After the change, indentation does not comply with dpdk coding style. > http://doc.dpdk.org/guides/contributing/coding_style.html#c-indentation Yes, I always mess up the DPDK indentation style :( > > But putting indentation aside, is this change equivalent? > - if ((vhost_need_event(vhost_used_event(vq), new, old) && > - (vq->callfd >=3D 0)) || > - unlikely(!signalled_used_valid)) { > + if ((vhost_need_event(vhost_used_event(vq), new, old) || > + unlikely(!signalled_used_valid)) && > + vq->callfd >=3D 0) { They are not equal, but in the past eventfd_write() should also not have be= en called with callfd < 0, guess this was an existing bug ;) >> + vhost_vring_inject_irq(dev, vq); >> } >> } else { >> /* Kick the guest if necessary. */ >> if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT) >> && (vq->callfd >=3D 0)) { >> - eventfd_write(vq->callfd, (eventfd_t)1); >> - if (dev->flags & VIRTIO_DEV_STATS_ENABLED) >> - __atomic_fetch_add(&vq->stats.guest_noti= fications, >> - 1, __ATOMIC_RELAXED); >> - if (dev->notify_ops->guest_notified) >> - dev->notify_ops->guest_notified(dev->vid= ); >> + vhost_vring_inject_irq(dev, vq); >> } >> } >> } >> @@ -974,11 +995,8 @@ vhost_vring_call_packed(struct virtio_net *dev, str= uct vhost_virtqueue *vq) >> if (vhost_need_event(off, new, old)) >> kick =3D true; >> kick: >> - if (kick) { >> - eventfd_write(vq->callfd, (eventfd_t)1); >> - if (dev->notify_ops->guest_notified) >> - dev->notify_ops->guest_notified(dev->vid); >> - } >> + if (kick && vq->callfd >=3D 0) >> + vhost_vring_inject_irq(dev, vq); >> } >> >> static __rte_always_inline void >> @@ -1017,4 +1035,11 @@ mbuf_is_consumed(struct rte_mbuf *m) >> >> uint64_t hua_to_alignment(struct rte_vhost_memory *mem, void *ptr); >> void mem_set_dump(void *ptr, size_t size, bool enable, uint64_t alignme= nt); >> + >> +/* Versioned functions */ >> +int rte_vhost_driver_callback_register_v23(const char *path, >> + struct rte_vhost_device_ops const * const ops); >> +int rte_vhost_driver_callback_register_v24(const char *path, >> + struct rte_vhost_device_ops const * const ops); >> + >> #endif /* _VHOST_NET_CDEV_H_ */ >> > > > --=20 > David Marchand