MIME-Version: 1.0
References: <20230222214342.1092333-1-mkp@redhat.com>
 <20230223043551.1108541-1-mkp@redhat.com>
 <e347d80e-9a8f-b2da-9d54-d87bf98d9cc5@redhat.com>
In-Reply-To: <e347d80e-9a8f-b2da-9d54-d87bf98d9cc5@redhat.com>
From: Mike Pattrick <mkp@redhat.com>
Date: Thu, 23 Feb 2023 11:57:28 -0500
Message-ID: <CAHcdBH4oPsK3f4wrrrGpvcY0Wq1btmx8w_X2=Mhpc3QRXiM_Qg@mail.gmail.com>
Subject: Re: [PATCH v2] vhost: fix madvise arguments alignment
To: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: dev@dpdk.org, david.marchand@redhat.com, chenbo.xia@intel.com
Content-Type: text/plain; charset="UTF-8"
List-Id: DPDK patches and discussions <dev.dpdk.org>

On Thu, Feb 23, 2023 at 11:12 AM Maxime Coquelin
<maxime.coquelin@redhat.com> wrote:
>
> Hi Mike,
>
> Thanks for looking into this issue.
>
> On 2/23/23 05:35, Mike Pattrick wrote:
> > The arguments passed to madvise should be aligned to the alignment of
> > the backing memory. Now we keep track of each region's alignment and use
> > it when setting coredump preferences. To facilitate this, a new member
> > was added to rte_vhost_mem_region. A new function was added to easily
> > translate a memory address back to its region's alignment. Unneeded calls to
> > madvise were reduced, as the cache removal case should already be
> > covered by the cache insertion case. The previously inline function
> > mem_set_dump was moved out of the header file and is no longer inline.
> >
> > Fixes: 338ad77c9ed3 ("vhost: exclude VM hugepages from coredumps")
> >
> > Signed-off-by: Mike Pattrick <mkp@redhat.com>
> > ---
> > Since v1:
> >   - Corrected a cast for 32bit compiles
> > ---
> >   lib/vhost/iotlb.c      |  9 +++---
> >   lib/vhost/rte_vhost.h  |  1 +
> >   lib/vhost/vhost.h      | 12 ++------
> >   lib/vhost/vhost_user.c | 63 +++++++++++++++++++++++++++++++++++-------
> >   4 files changed, 60 insertions(+), 25 deletions(-)
> >
> > diff --git a/lib/vhost/iotlb.c b/lib/vhost/iotlb.c
> > index a0b8fd7302..5293507b63 100644
> > --- a/lib/vhost/iotlb.c
> > +++ b/lib/vhost/iotlb.c
> > @@ -149,7 +149,6 @@ vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
> >       rte_rwlock_write_lock(&vq->iotlb_lock);
> >
> >       RTE_TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
> > -             mem_set_dump((void *)(uintptr_t)node->uaddr, node->size, true);
>
> Hmm, it should have been called with enable=false here since we are
> removing the entry from the IOTLB cache. It should be kept in order to
> "DONTDUMP" pages evicted from the cache.

Here I was thinking that if we add an entry and then remove a
different entry, they could be in the same page. But you're right, I
should have kept an enable=false in remove_all().

And now that I think about it again, I could just check whether there
are any active cache entries left in the page on every evict/remove;
the entries are sorted, so that should be an easy check. Unless there
are any objections, I'll go forward with that.
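Roughly what I have in mind, as a standalone sketch (hypothetical helper
name and a simplified entry struct; the real code would walk the nodes on
vq->iotlb_list and could bail out early thanks to the sort order):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an IOTLB cache entry (the real code keeps
 * struct vhost_iotlb_entry nodes on a sorted list). */
struct cache_entry {
	uint64_t uaddr;
	uint64_t size;
};

/*
 * Before marking the page range [pg_start, pg_end) of an evicted entry
 * as MADV_DONTDUMP, check whether any other live entry still overlaps
 * it. Entries are sorted by address, so a production version could stop
 * scanning once entries[i].uaddr >= pg_end.
 */
static bool
page_still_cached(const struct cache_entry *entries, size_t n, size_t skip,
		  uint64_t pg_start, uint64_t pg_end)
{
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t start, end;

		if (i == skip)
			continue;
		start = entries[i].uaddr;
		end = entries[i].uaddr + entries[i].size;
		/* Half-open interval overlap test. */
		if (start < pg_end && end > pg_start)
			return true;
	}
	return false;
}
```

Only if page_still_cached() returns false would the evict path actually
call madvise(..., MADV_DONTDUMP) on that page range.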

>
> >               TAILQ_REMOVE(&vq->iotlb_list, node, next);
> >               vhost_user_iotlb_pool_put(vq, node);
> >       }
> > @@ -171,7 +170,6 @@ vhost_user_iotlb_cache_random_evict(struct vhost_virtqueue *vq)
> >
> >       RTE_TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
> >               if (!entry_idx) {
> > -                     mem_set_dump((void *)(uintptr_t)node->uaddr, node->size, true);
>
> Same here.
>
> >                       TAILQ_REMOVE(&vq->iotlb_list, node, next);
> >                       vhost_user_iotlb_pool_put(vq, node);
> >                       vq->iotlb_cache_nr--;
> > @@ -224,14 +222,16 @@ vhost_user_iotlb_cache_insert(struct virtio_net *dev, struct vhost_virtqueue *vq
> >                       vhost_user_iotlb_pool_put(vq, new_node);
> >                       goto unlock;
> >               } else if (node->iova > new_node->iova) {
> > -                     mem_set_dump((void *)(uintptr_t)node->uaddr, node->size, true);
> > +                     mem_set_dump((void *)(uintptr_t)new_node->uaddr, new_node->size, true,
> > +                             hua_to_alignment(dev->mem, (void *)(uintptr_t)node->uaddr));
> >                       TAILQ_INSERT_BEFORE(node, new_node, next);
> >                       vq->iotlb_cache_nr++;
> >                       goto unlock;
> >               }
> >       }
> >
> > -     mem_set_dump((void *)(uintptr_t)node->uaddr, node->size, true);
> > +     mem_set_dump((void *)(uintptr_t)new_node->uaddr, new_node->size, true,
> > +             hua_to_alignment(dev->mem, (void *)(uintptr_t)new_node->uaddr));
> >       TAILQ_INSERT_TAIL(&vq->iotlb_list, new_node, next);
> >       vq->iotlb_cache_nr++;
> >
> > @@ -259,7 +259,6 @@ vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
> >                       break;
> >
> >               if (iova < node->iova + node->size) {
> > -                     mem_set_dump((void *)(uintptr_t)node->uaddr, node->size, true);
> >                       TAILQ_REMOVE(&vq->iotlb_list, node, next);
> >                       vhost_user_iotlb_pool_put(vq, node);
> >                       vq->iotlb_cache_nr--;
> > diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
> > index a395843fe9..c5c97ea67e 100644
> > --- a/lib/vhost/rte_vhost.h
> > +++ b/lib/vhost/rte_vhost.h
> > @@ -136,6 +136,7 @@ struct rte_vhost_mem_region {
> >       void     *mmap_addr;
> >       uint64_t mmap_size;
> >       int fd;
> > +     uint64_t alignment;
>
> It is not possible to do this as it breaks the ABI.
> You have to store the information somewhere else, or simply call
> get_blk_size() in hua_to_alignment() since the fd is not closed.
>

Sorry about that! You're right, checking the fd per operation should
be easy enough.

Thanks for the review,

M

> >   };
> >
> >   /**
> > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
> > index 5750f0c005..a2467ba509 100644
> > --- a/lib/vhost/vhost.h
> > +++ b/lib/vhost/vhost.h
> > @@ -1009,14 +1009,6 @@ mbuf_is_consumed(struct rte_mbuf *m)
> >       return true;
> >   }
> >
> > -static __rte_always_inline void
> > -mem_set_dump(__rte_unused void *ptr, __rte_unused size_t size, __rte_unused bool enable)
> > -{
> > -#ifdef MADV_DONTDUMP
> > -     if (madvise(ptr, size, enable ? MADV_DODUMP : MADV_DONTDUMP) == -1) {
> > -             rte_log(RTE_LOG_INFO, vhost_config_log_level,
> > -                     "VHOST_CONFIG: could not set coredump preference (%s).\n", strerror(errno));
> > -     }
> > -#endif
> > -}
> > +uint64_t hua_to_alignment(struct rte_vhost_memory *mem, void *ptr);
> > +void mem_set_dump(void *ptr, size_t size, bool enable, uint64_t alignment);
> >   #endif /* _VHOST_NET_CDEV_H_ */
> > diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> > index d702d082dd..6d09597fbe 100644
> > --- a/lib/vhost/vhost_user.c
> > +++ b/lib/vhost/vhost_user.c
> > @@ -737,6 +737,40 @@ log_addr_to_gpa(struct virtio_net *dev, struct vhost_virtqueue *vq)
> >       return log_gpa;
> >   }
> >
> > +uint64_t
> > +hua_to_alignment(struct rte_vhost_memory *mem, void *ptr)
> > +{
> > +     struct rte_vhost_mem_region *r;
> > +     uint32_t i;
> > +     uintptr_t hua = (uintptr_t)ptr;
> > +
> > +     for (i = 0; i < mem->nregions; i++) {
> > +             r = &mem->regions[i];
> > +             if (hua >= r->host_user_addr &&
> > +                     hua < r->host_user_addr + r->size) {
> > +                     return r->alignment;
> > +             }
> > +     }
> > +
> > +     /* If region isn't found, don't align at all */
> > +     return 1;
> > +}
> > +
> > +void
> > +mem_set_dump(void *ptr, size_t size, bool enable, uint64_t pagesz)
> > +{
> > +#ifdef MADV_DONTDUMP
> > +     void *start = RTE_PTR_ALIGN_FLOOR(ptr, pagesz);
> > +     uintptr_t end = RTE_ALIGN_CEIL((uintptr_t)ptr + size, pagesz);
> > +     size_t len = end - (uintptr_t)start;
> > +
> > +     if (madvise(start, len, enable ? MADV_DODUMP : MADV_DONTDUMP) == -1) {
> > +             rte_log(RTE_LOG_INFO, vhost_config_log_level,
> > +                     "VHOST_CONFIG: could not set coredump preference (%s).\n", strerror(errno));
> > +     }
> > +#endif
> > +}
> > +
> >   static void
> >   translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
> >   {
> > @@ -767,6 +801,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
> >                       return;
> >               }
> >
> > +             mem_set_dump(vq->desc_packed, len, true,
> > +                     hua_to_alignment(dev->mem, vq->desc_packed));
> >               numa_realloc(&dev, &vq);
> >               *pdev = dev;
> >               *pvq = vq;
> > @@ -782,6 +818,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
> >                       return;
> >               }
> >
> > +             mem_set_dump(vq->driver_event, len, true,
> > +                     hua_to_alignment(dev->mem, vq->driver_event));
> >               len = sizeof(struct vring_packed_desc_event);
> >               vq->device_event = (struct vring_packed_desc_event *)
> >                                       (uintptr_t)ring_addr_to_vva(dev,
> > @@ -793,9 +831,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
> >                       return;
> >               }
> >
> > -             mem_set_dump(vq->desc_packed, len, true);
> > -             mem_set_dump(vq->driver_event, len, true);
> > -             mem_set_dump(vq->device_event, len, true);
> > +             mem_set_dump(vq->device_event, len, true,
> > +                     hua_to_alignment(dev->mem, vq->device_event));
> >               vq->access_ok = true;
> >               return;
> >       }
> > @@ -812,6 +849,7 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
> >               return;
> >       }
> >
> > +     mem_set_dump(vq->desc, len, true, hua_to_alignment(dev->mem, vq->desc));
> >       numa_realloc(&dev, &vq);
> >       *pdev = dev;
> >       *pvq = vq;
> > @@ -827,6 +865,7 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
> >               return;
> >       }
> >
> > +     mem_set_dump(vq->avail, len, true, hua_to_alignment(dev->mem, vq->avail));
> >       len = sizeof(struct vring_used) +
> >               sizeof(struct vring_used_elem) * vq->size;
> >       if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX))
> > @@ -839,6 +878,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
> >               return;
> >       }
> >
> > +     mem_set_dump(vq->used, len, true, hua_to_alignment(dev->mem, vq->used));
> > +
> >       if (vq->last_used_idx != vq->used->idx) {
> >               VHOST_LOG_CONFIG(dev->ifname, WARNING,
> >                       "last_used_idx (%u) and vq->used->idx (%u) mismatches;\n",
> > @@ -849,9 +890,6 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
> >                       "some packets maybe resent for Tx and dropped for Rx\n");
> >       }
> >
> > -     mem_set_dump(vq->desc, len, true);
> > -     mem_set_dump(vq->avail, len, true);
> > -     mem_set_dump(vq->used, len, true);
> >       vq->access_ok = true;
> >
> >       VHOST_LOG_CONFIG(dev->ifname, DEBUG, "mapped address desc: %p\n", vq->desc);
> > @@ -1230,7 +1268,8 @@ vhost_user_mmap_region(struct virtio_net *dev,
> >       region->mmap_addr = mmap_addr;
> >       region->mmap_size = mmap_size;
> >       region->host_user_addr = (uint64_t)(uintptr_t)mmap_addr + mmap_offset;
> > -     mem_set_dump(mmap_addr, mmap_size, false);
> > +     region->alignment = alignment;
> > +     mem_set_dump(mmap_addr, mmap_size, false, alignment);
> >
> >       if (dev->async_copy) {
> >               if (add_guest_pages(dev, region, alignment) < 0) {
> > @@ -1535,7 +1574,6 @@ inflight_mem_alloc(struct virtio_net *dev, const char *name, size_t size, int *f
> >               return NULL;
> >       }
> >
> > -     mem_set_dump(ptr, size, false);
> >       *fd = mfd;
> >       return ptr;
> >   }
> > @@ -1566,6 +1604,7 @@ vhost_user_get_inflight_fd(struct virtio_net **pdev,
> >       uint64_t pervq_inflight_size, mmap_size;
> >       uint16_t num_queues, queue_size;
> >       struct virtio_net *dev = *pdev;
> > +     uint64_t alignment;
> >       int fd, i, j;
> >       int numa_node = SOCKET_ID_ANY;
> >       void *addr;
> > @@ -1628,6 +1667,8 @@ vhost_user_get_inflight_fd(struct virtio_net **pdev,
> >               dev->inflight_info->fd = -1;
> >       }
> >
> > +     alignment = get_blk_size(fd);
> > +     mem_set_dump(addr, mmap_size, false, alignment);
> >       dev->inflight_info->addr = addr;
> >       dev->inflight_info->size = ctx->msg.payload.inflight.mmap_size = mmap_size;
> >       dev->inflight_info->fd = ctx->fds[0] = fd;
> > @@ -1744,10 +1785,10 @@ vhost_user_set_inflight_fd(struct virtio_net **pdev,
> >               dev->inflight_info->fd = -1;
> >       }
> >
> > -     mem_set_dump(addr, mmap_size, false);
> >       dev->inflight_info->fd = fd;
> >       dev->inflight_info->addr = addr;
> >       dev->inflight_info->size = mmap_size;
> > +     mem_set_dump(addr, mmap_size, false, get_blk_size(fd));
> >
> >       for (i = 0; i < num_queues; i++) {
> >               vq = dev->virtqueue[i];
> > @@ -2242,6 +2283,7 @@ vhost_user_set_log_base(struct virtio_net **pdev,
> >       struct virtio_net *dev = *pdev;
> >       int fd = ctx->fds[0];
> >       uint64_t size, off;
> > +     uint64_t alignment;
> >       void *addr;
> >       uint32_t i;
> >
> > @@ -2280,6 +2322,7 @@ vhost_user_set_log_base(struct virtio_net **pdev,
> >        * fail when offset is not page size aligned.
> >        */
> >       addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> > +     alignment = get_blk_size(fd);
> >       close(fd);
> >       if (addr == MAP_FAILED) {
> >               VHOST_LOG_CONFIG(dev->ifname, ERR, "mmap log base failed!\n");
> > @@ -2296,7 +2339,7 @@ vhost_user_set_log_base(struct virtio_net **pdev,
> >       dev->log_addr = (uint64_t)(uintptr_t)addr;
> >       dev->log_base = dev->log_addr + off;
> >       dev->log_size = size;
> > -     mem_set_dump(addr, size, false);
> > +     mem_set_dump(addr, size + off, false, alignment);
> >
> >       for (i = 0; i < dev->nr_vring; i++) {
> >               struct vhost_virtqueue *vq = dev->virtqueue[i];
>