From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id AC8D643F31;
	Sun, 28 Apr 2024 05:22:17 +0200 (CEST)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id 3417840265;
	Sun, 28 Apr 2024 05:22:17 +0200 (CEST)
Received: from NAM11-BN8-obe.outbound.protection.outlook.com
 (mail-bn8nam11on2084.outbound.protection.outlook.com [40.107.236.84])
 by mails.dpdk.org (Postfix) with ESMTP id 60D7940263
 for <dev@dpdk.org>; Sun, 28 Apr 2024 05:22:15 +0200 (CEST)
ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none;
 b=hFI+VPhqnN/47bBYT+dUYHF4mQ1SPRKJKVJUDa2DW2emADdQhy2KjzBm5NeQ+UpURdryQE04xGBJpSFzQIxFBwj/7sXJ0mvR8FbQROodIanfMS+8N+JjZDodN2Sye0fvQ8J6sT2wOayppmihc72++GKSkkOHiJkkMibAfioBkyIh3vVOyJ1yW0qvr4COW7kokLjLszgzW2y3OfC/w6PpbtMRvpkDHzY9aDZlqGumlAzO+VJhxKWwJp9UE22tD3tLwPAwevqeTraokackJUW1Jva/qoIRjExnVABQL9Qo8lVJQBdX/hcdyAswFYm3kXWKQaszv1BB9n3RkOks/YrEhg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; 
 s=arcselector9901;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1;
 bh=VrhzUXDy7Q6TqaAUXKSEtg0DBrWeMTNDt4bQQUQGKmE=;
 b=hI0JLmUupRFeaWhyqOQqJiAfgTjiWbC9EtybjrulYFBCOa5FXR/bQ7FTN8epg5KGbxSZi23t0WOQBharw0mRePn6LvOuHnWLS6y6eKIa/aboox0EzY17j8ULWqg8cp1yGyVIWAhc0TXXavx2MkRvj1qjJ0+v6iTEtjWvtlS7qE4iXRvlC13cQkbYkET5fXu8rtgAwItM22x2gimvFpzWoaVxGuLHW6aci3qWxS5dBQHbjEuKecwBf2dc9n+euAYZQaW7oBQGBod710E8hBLo0tKGVn9TQ4x1G9QvoDo4B0FK4ANkAiurY+k4oCPHRPUrQPHPO+mJIuL+qCQYNi6d3A==
ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass
 smtp.mailfrom=nvidia.com; dmarc=pass action=none header.from=nvidia.com;
 dkim=pass header.d=nvidia.com; arc=none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Nvidia.com;
 s=selector2;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
 bh=VrhzUXDy7Q6TqaAUXKSEtg0DBrWeMTNDt4bQQUQGKmE=;
 b=mWTKATgF8jwwDMvCVoWvidXP6ukxk00i26txjoFWcK6247Vl+bGsSWUGm4LZsLdV2T8GOwTAP4QlVAg5fqxYMIP+4sZVXUD9dr9HRsNr8Y8y0FwMRvC//n3/0QShOKRw7Y+GNamDtDL0lkKZ8yOFYDhOvh5w/TjyvoUpvU1Kr7Qg1WhSciqMGekdyZfkPLEGOcTqDaj4lDAQqB1MMVuivlOkbQrlduKmH1wDveP53TvPBHE4kaTHaoAdVWmCC6PWJh4grhtHHhX/YeoEYqibdGdZgv4pFDBhzG1yo35swmKNcyTOYxwPXlVKIiYm+ui2vx1XijJAyBzm9AcZaidVGw==
Received: from MN0PR12MB6341.namprd12.prod.outlook.com (2603:10b6:208:3c2::13)
 by PH7PR12MB7355.namprd12.prod.outlook.com (2603:10b6:510:20e::16)
 with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7519.32; Sun, 28 Apr
 2024 03:22:10 +0000
Received: from MN0PR12MB6341.namprd12.prod.outlook.com
 ([fe80::4cbb:90c1:e4f:38d5]) by MN0PR12MB6341.namprd12.prod.outlook.com
 ([fe80::4cbb:90c1:e4f:38d5%4]) with mapi id 15.20.7519.023; Sun, 28 Apr 2024
 03:22:09 +0000
From: Chenbo Xia <chenbox@nvidia.com>
To: Maxime Coquelin <maxime.coquelin@redhat.com>
CC: "dev@dpdk.org" <dev@dpdk.org>, David Marchand <david.marchand@redhat.com>
Subject: Re: [PATCH v3 5/5] vhost: manage FD with epoll
Thread-Topic: [PATCH v3 5/5] vhost: manage FD with epoll
Thread-Index: AQHainPuLFmTTr3JTEaONiKPCrQY/bF9IYuA
Date: Sun, 28 Apr 2024 03:22:09 +0000
Message-ID: <B677E726-2C99-4734-9363-2A2D1C09611E@nvidia.com>
References: <20240409114845.1336403-1-maxime.coquelin@redhat.com>
 <20240409114845.1336403-6-maxime.coquelin@redhat.com>
In-Reply-To: <20240409114845.1336403-6-maxime.coquelin@redhat.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
authentication-results: dkim=none (message not signed)
 header.d=none;dmarc=none action=none header.from=nvidia.com;
x-ms-publictraffictype: Email
x-ms-traffictypediagnostic: MN0PR12MB6341:EE_|PH7PR12MB7355:EE_
x-ms-office365-filtering-correlation-id: c2df9f0e-0e3b-48db-9df6-08dc673263b0
x-ms-exchange-senderadcheck: 1
x-ms-exchange-antispam-relay: 0
x-microsoft-antispam: BCL:0; ARA:13230031|1800799015|376005|366007|38070700009;
x-microsoft-antispam-message-info: =?us-ascii?Q?eDsKkYIKmDbCB/2j43SAH8SgRGaMjzcEbDc8eV7R2E0LKYYfVcPbimRCIH5I?=
 =?us-ascii?Q?aH+/WNuIxPca7oAR5sfLKhSAeLep+olfSVufcaf2wPiDYRlt1GXtW23SnFlT?=
 =?us-ascii?Q?3ZG/2xu/EEzdo0V5RMW8y8Ao91q01oVfn8DdK9yRKLzKKC0pvx6QBbhQkmtg?=
 =?us-ascii?Q?9v8LBjVqtPh6r989dxM2X9rLGD+reBvgMmE13rKbNYBBBuYuEOVn0rsKw8sl?=
 =?us-ascii?Q?wzCinLdT0QaNFTz+JD6N/uAFwcUXVHHk6Q16b0jbUOYcmzgxqTUMNvOAOeG/?=
 =?us-ascii?Q?bSa197MYWPl/k9IBT8PM098deCn49UV3q+f4i4AkjQnldgE1FeRrmWrfNk8K?=
 =?us-ascii?Q?fq62N/SPVMIqn5kmNWztQHdXod9nag07QC79fy3KnZ3cgPmCBNXBeCp4HCpA?=
 =?us-ascii?Q?Fqg9K77skv1oIJIrSQxVBdGgnxOwx6gwcOwE5XXEnYaLvgcF0WMdx5GkbsFg?=
 =?us-ascii?Q?RwaaU5Q8PH0b1BoHC4x052D0jtvOXfDDMuFelqbAz/c5UATfxzuqZZPyU1Jv?=
 =?us-ascii?Q?bZJEy8d8g2APRN/T6lq7/dFBn2awS4DbK4Vr3DOlfB9d2kkWplJx/THq6WJ+?=
 =?us-ascii?Q?M1yctunJXVUIuinEZIQ7WiVd86uOz1HsVMKWGDGHMhDtvBWPxgAPDA5K0pQs?=
 =?us-ascii?Q?y7TEChb6m/WZkcpKJ4OeoaqhojZahtxr+9/IeieDzMI/3WaGkRnOWORj5nhZ?=
 =?us-ascii?Q?77pwOW5yUvp2vMbhtem5BLQioFYF1cn6gHN20U/MlW6rcUFGy5IbCrkGt+i3?=
 =?us-ascii?Q?0yo08Gk6irNPhud6UUuCqf5tlq+VDdIDSFAOhgDzGhV0gXrVCRoHOhTSnVIW?=
 =?us-ascii?Q?UcQc5MCqD+NTIzRCzwkpHvyQlc4qGZYAa6YG/0OONpFRJhBFL6eP94nNczIP?=
 =?us-ascii?Q?uT6n5OV2ekdtZOY2wfrDc5vCAzmcgUG2EohRC1jIwGq3xOI1rf99LFHJRSN/?=
 =?us-ascii?Q?r4u5ax8fvsVQtIqt52WZXJkZC3MFdf5c9tGX+6xfJqt5Rh5lQxY95P1LXF15?=
 =?us-ascii?Q?LwKF45tluyDaUyZ9pv4FNfFLBf60nQ0Rthqo39fgIOFamrkYqSK5j4ABkvxt?=
 =?us-ascii?Q?4uQZETtodF3Q/tecN42jYP5ARg86VY9btPOgfwJyopWmKvDsWhkQdDMNhj1G?=
 =?us-ascii?Q?qYy53wCMYx35cQZ0SLZ6AkAHxfRsQTCYFXci7GzCaRkf6MTX1N9MVsayP5Ar?=
 =?us-ascii?Q?1IfcejypMDEzIDdU+gjr1uvE6aKzDRuM9I0IDPXRNaOLQDLcJ5KeKOZ33dyy?=
 =?us-ascii?Q?JYYNvvxedk4zJx/Cp/OP5XtZvhnIg3fM1f1qnh5oq9mKklGV7wVhtFCa4igE?=
 =?us-ascii?Q?kQLsbOTV1yPZ8rIvZx3zBhGuewbBpu1ijLWELpxN6EVjdQ=3D=3D?=
x-forefront-antispam-report: CIP:255.255.255.255; CTRY:; LANG:en; SCL:1; SRV:;
 IPV:NLI; SFV:NSPM; H:MN0PR12MB6341.namprd12.prod.outlook.com; PTR:; CAT:NONE;
 SFS:(13230031)(1800799015)(376005)(366007)(38070700009); DIR:OUT; SFP:1101; 
x-ms-exchange-antispam-messagedata-chunkcount: 1
x-ms-exchange-antispam-messagedata-0: =?us-ascii?Q?Ach6Q/+NCclDLys4ERnS8r9HRKkSZmxalYagrkntXzZIha3hYXHABnq5Jgyv?=
 =?us-ascii?Q?JfOdeHy8yhdxNZR9BaLW4ltd2WjK9zaMWAgpr/O4uXMHImFUPzkE2tOAeg71?=
 =?us-ascii?Q?VvNSGaDaOlb6kN6zKF+zdKJxCyqLrT80V4lGxqpXBfahHBKYJ1AGTF/7hvEU?=
 =?us-ascii?Q?H99fCMgv1N18UA7gvxQcK7SDjUYKOt4hAHAnTYhIYC/bmMOSL1vkNchGrCzy?=
 =?us-ascii?Q?XdVlhdtfXxzr8xzeORjZqxWLgZXDc0PS+Y82oYOif47PbeOJq48yUWtxE2zp?=
 =?us-ascii?Q?KrYSYjiKk52y4aYT/Ee+kOrvDPWRkvCbSy+SACy4T1u6kfoXwAGK5neO/rSp?=
 =?us-ascii?Q?tknGeKU7WJCjwZfHfkuKigCQizY8xL2sPZFAaqbbOhpXP3HrOo2x/03Hnxkc?=
 =?us-ascii?Q?ppsuB81RQxbz3HqHI3K59jEaWBXGdsuhBkK5Z+3CyPJ9T1UT2HWL/okSWUsP?=
 =?us-ascii?Q?xhJwHZktqk5HjJCBe6Ra4xJ2dXN86Z9v/DHH6VwB8yAYGKATxmLv2D0nnfXv?=
 =?us-ascii?Q?tgSBAiT7MnkZAigorv7/blZcFEW+slnrNEmXwtqzGSDHV3SUZ1rTtZe5TVDo?=
 =?us-ascii?Q?4IVd9qPQzZzs9+vPMwT7C807SCX2sjtP6czu2XA6zF+QLHEtxMmdut+CfXZz?=
 =?us-ascii?Q?9FH/frpDfG8MRW3cUg+H2698xPot4kTyhUfWGsNXqKFZ+A9DC3JFVNn3Omi2?=
 =?us-ascii?Q?fP12ehj1T1KH+zCwPKmUTsJxuo87lhSpmtWCWIldDu5Zt/qNH37e+AL44okg?=
 =?us-ascii?Q?u9wZI4RlY3NmsVPmiDBNG0uC73lJjCqiCAzqunHlANdJYurmhsP0qxESwHa9?=
 =?us-ascii?Q?YL7PDvLuMo4eT2X685AOUDYAMGMFsseTEMWNQWtv+Ym5c5KVm6ZjEOmFtqds?=
 =?us-ascii?Q?uaky8/SUhIU3iHdWkUgw3Ik7msyEtqzB7Il2+o9BQDY1svoQ336r12EIvIC7?=
 =?us-ascii?Q?OLYmb3CRq2yRzaoUrMF3LFlbJsvO3gDM0H2BnLtEb1n6RApJSK3U5Y3r4sPE?=
 =?us-ascii?Q?SOHPn4Tnwbro+GFxfQyV4q/fr+WTPKRaa/pysjkIowCBVwz1O23o8tJSw4vB?=
 =?us-ascii?Q?CCF3tj97hny4XVI+uhjXrU2TfcWST8Vpxv4g3M0q9Hk9D3mvne4K7MfGuSX6?=
 =?us-ascii?Q?D06z8ur+qFw0G44UgenGZK+NY9hZbcn3sKBrk8Fs0ahaW0y4RJC2x109S0jy?=
 =?us-ascii?Q?4SSc90Da0bq32oLW3M2efUaAvYEKrNsz3FHkFkr8x++w6kXRJA92iWLYspJ3?=
 =?us-ascii?Q?pSij+APEoGt4U13QC8YzOTnSVoBgLHIZJQ4RRxJe4NI1A0WLngFP4vowdw6N?=
 =?us-ascii?Q?GRgfhR/PjSMVt5ufqCwjY8muvSxNgqiBLr5ubrWgu4QGwoLBdKbhcMrAbuBQ?=
 =?us-ascii?Q?cwBfpbKdjt5J42dprfJyAgsQC4BWgGl6h8BME2AnP8ILq/3khCrds9Cuuj8d?=
 =?us-ascii?Q?N9uVqsR5BxhgnP/3w2XGbsdibp8ffvTT7p6UKgSJ6+uOhP4S+alFsufyKzZO?=
 =?us-ascii?Q?1txUqM12EX+6vip7xGz6IDOZUNOESYkdMVncUbz6pVAHyJWzpjhCpI+7edYG?=
 =?us-ascii?Q?S6prmpX5g8tp2FHe5ClUDfmiWuBFAhE5viTLrkq6nyTkMtvpis8+Fz76VMOH?=
 =?us-ascii?Q?Hg=3D=3D?=
Content-Type: text/plain; charset="us-ascii"
Content-ID: <537ACE670CC111439BE781F419FEB26A@namprd12.prod.outlook.com>
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-OriginatorOrg: Nvidia.com
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-AuthSource: MN0PR12MB6341.namprd12.prod.outlook.com
X-MS-Exchange-CrossTenant-Network-Message-Id: c2df9f0e-0e3b-48db-9df6-08dc673263b0
X-MS-Exchange-CrossTenant-originalarrivaltime: 28 Apr 2024 03:22:09.8281 (UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Hosted
X-MS-Exchange-CrossTenant-id: 43083d15-7273-40c1-b7db-39efd9ccc17a
X-MS-Exchange-CrossTenant-mailboxtype: HOSTED
X-MS-Exchange-CrossTenant-userprincipalname: PDuwEZ+c6yd4bdxnNjO/l/ejBiHBbtmpjl5DhnZjtxYqpJQ7UWf2/XrXt8plaKNrvB3oDGdVpZgC5LKdm0aNBw==
X-MS-Exchange-Transport-CrossTenantHeadersStamped: PH7PR12MB7355
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

Hi Maxime,

> On Apr 9, 2024, at 19:48, Maxime Coquelin <maxime.coquelin@redhat.com> wr=
ote:
>=20
> External email: Use caution opening links or attachments
>=20
>=20
> From: David Marchand <david.marchand@redhat.com>
>=20
> Switch to epoll so that the concern over the poll() fd array
> is removed.
> Add a simple list of used entries and track the next free entry.
>=20
> epoll() is thread safe, we no more need a synchronization
> mechanism and so can remove the notification pipe.
>=20
> Signed-off-by: David Marchand <david.marchand@redhat.com>
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
> lib/vhost/fd_man.c | 399 ++++++++++++---------------------------------
> lib/vhost/fd_man.h |   5 +-
> 2 files changed, 106 insertions(+), 298 deletions(-)
>=20
> diff --git a/lib/vhost/fd_man.c b/lib/vhost/fd_man.c
> index 8b47c97d45..a4a2965da1 100644
> --- a/lib/vhost/fd_man.c
> +++ b/lib/vhost/fd_man.c
> @@ -3,9 +3,9 @@
>  */
>=20
> #include <errno.h>
> -#include <pthread.h>
> #include <stdio.h>
> #include <string.h>
> +#include <sys/epoll.h>
> #include <unistd.h>
>=20
> #include <rte_common.h>
> @@ -21,49 +21,34 @@ RTE_LOG_REGISTER_SUFFIX(vhost_fdset_logtype, fdset, I=
NFO);
> #define VHOST_FDMAN_LOG(level, ...) \
>        RTE_LOG_LINE(level, VHOST_FDMAN, "" __VA_ARGS__)
>=20
> -#define FDPOLLERR (POLLERR | POLLHUP | POLLNVAL)
> -
> struct fdentry {
>        int fd;         /* -1 indicates this entry is empty */
>        fd_cb rcb;      /* callback when this fd is readable. */
>        fd_cb wcb;      /* callback when this fd is writeable.*/
>        void *dat;      /* fd context */
>        int busy;       /* whether this entry is being used in cb. */
> +       LIST_ENTRY(fdentry) next;
> };
>=20
> struct fdset {
>        char name[RTE_THREAD_NAME_SIZE];
> -       struct pollfd rwfds[MAX_FDS];
> +       int epfd;
>        struct fdentry fd[MAX_FDS];
> +       LIST_HEAD(, fdentry) fdlist;
> +       int next_free_idx;
>        rte_thread_t tid;
>        pthread_mutex_t fd_mutex;
> -       pthread_mutex_t fd_polling_mutex;
> -       int num;        /* current fd number of this fdset */
> -
> -       union pipefds {
> -               struct {
> -                       int pipefd[2];
> -               };
> -               struct {
> -                       int readfd;
> -                       int writefd;
> -               };
> -       } u;
> -
> -       pthread_mutex_t sync_mutex;
> -       pthread_cond_t sync_cond;
> -       bool sync;
> +

Not sure this blank line is intended or not :)

>        bool destroy;
> };
>=20
> -static int fdset_add_no_sync(struct fdset *pfdset, int fd, fd_cb rcb, fd=
_cb wcb, void *dat);
> -static uint32_t fdset_event_dispatch(void *arg);
> -
> #define MAX_FDSETS 8
>=20
> static struct fdset *fdsets[MAX_FDSETS];
> pthread_mutex_t fdsets_mutex =3D PTHREAD_MUTEX_INITIALIZER;
>=20
> +static uint32_t fdset_event_dispatch(void *arg);
> +
> static struct fdset *
> fdset_lookup(const char *name)
> {
> @@ -96,166 +81,6 @@ fdset_insert(struct fdset *fdset)
>        return -1;
> }
>=20
> -static void
> -fdset_pipe_read_cb(int readfd, void *dat,
> -                  int *remove __rte_unused)
> -{
> -       char charbuf[16];
> -       struct fdset *fdset =3D dat;
> -       int r =3D read(readfd, charbuf, sizeof(charbuf));
> -       /*
> -        * Just an optimization, we don't care if read() failed
> -        * so ignore explicitly its return value to make the
> -        * compiler happy
> -        */
> -       RTE_SET_USED(r);
> -
> -       pthread_mutex_lock(&fdset->sync_mutex);
> -       fdset->sync =3D true;
> -       pthread_cond_broadcast(&fdset->sync_cond);
> -       pthread_mutex_unlock(&fdset->sync_mutex);
> -}
> -
> -static void
> -fdset_pipe_uninit(struct fdset *fdset)
> -{
> -       fdset_del(fdset, fdset->u.readfd);
> -       close(fdset->u.readfd);
> -       fdset->u.readfd =3D -1;
> -       close(fdset->u.writefd);
> -       fdset->u.writefd =3D -1;
> -}
> -
> -static int
> -fdset_pipe_init(struct fdset *fdset)
> -{
> -       int ret;
> -
> -       pthread_mutex_init(&fdset->sync_mutex, NULL);
> -       pthread_cond_init(&fdset->sync_cond, NULL);
> -
> -       if (pipe(fdset->u.pipefd) < 0) {
> -               VHOST_FDMAN_LOG(ERR,
> -                       "failed to create pipe for vhost fdset");
> -               return -1;
> -       }
> -
> -       ret =3D fdset_add_no_sync(fdset, fdset->u.readfd,
> -                       fdset_pipe_read_cb, NULL, fdset);
> -       if (ret < 0) {
> -               VHOST_FDMAN_LOG(ERR,
> -                       "failed to add pipe readfd %d into vhost server f=
dset",
> -                       fdset->u.readfd);
> -
> -               fdset_pipe_uninit(fdset);
> -               return -1;
> -       }
> -
> -       return 0;
> -}
> -
> -static void
> -fdset_sync(struct fdset *fdset)
> -{
> -       int ret;
> -
> -       pthread_mutex_lock(&fdset->sync_mutex);
> -
> -       fdset->sync =3D false;
> -       ret =3D write(fdset->u.writefd, "1", 1);
> -       if (ret < 0) {
> -               VHOST_FDMAN_LOG(ERR,
> -                       "Failed to write to notification pipe: %s",
> -                       strerror(errno));
> -               goto out_unlock;
> -       }
> -
> -       while (!fdset->sync)
> -               pthread_cond_wait(&fdset->sync_cond, &fdset->sync_mutex);
> -
> -out_unlock:
> -       pthread_mutex_unlock(&fdset->sync_mutex);
> -}
> -
> -static int
> -get_last_valid_idx(struct fdset *pfdset, int last_valid_idx)
> -{
> -       int i;
> -
> -       for (i =3D last_valid_idx; i >=3D 0 && pfdset->fd[i].fd =3D=3D -1=
; i--)
> -               ;
> -
> -       return i;
> -}
> -
> -static void
> -fdset_move(struct fdset *pfdset, int dst, int src)
> -{
> -       pfdset->fd[dst]    =3D pfdset->fd[src];
> -       pfdset->rwfds[dst] =3D pfdset->rwfds[src];
> -}
> -
> -static void
> -fdset_shrink_nolock(struct fdset *pfdset)
> -{
> -       int i;
> -       int last_valid_idx =3D get_last_valid_idx(pfdset, pfdset->num - 1=
);
> -
> -       for (i =3D 0; i < last_valid_idx; i++) {
> -               if (pfdset->fd[i].fd !=3D -1)
> -                       continue;
> -
> -               fdset_move(pfdset, i, last_valid_idx);
> -               last_valid_idx =3D get_last_valid_idx(pfdset, last_valid_=
idx - 1);
> -       }
> -       pfdset->num =3D last_valid_idx + 1;
> -}
> -
> -/*
> - * Find deleted fd entries and remove them
> - */
> -static void
> -fdset_shrink(struct fdset *pfdset)
> -{
> -       pthread_mutex_lock(&pfdset->fd_mutex);
> -       fdset_shrink_nolock(pfdset);
> -       pthread_mutex_unlock(&pfdset->fd_mutex);
> -}
> -
> -/**
> - * Returns the index in the fdset for a given fd.
> - * @return
> - *   index for the fd, or -1 if fd isn't in the fdset.
> - */
> -static int
> -fdset_find_fd(struct fdset *pfdset, int fd)
> -{
> -       int i;
> -
> -       for (i =3D 0; i < pfdset->num && pfdset->fd[i].fd !=3D fd; i++)
> -               ;
> -
> -       return i =3D=3D pfdset->num ? -1 : i;
> -}
> -
> -static void
> -fdset_add_fd(struct fdset *pfdset, int idx, int fd,
> -       fd_cb rcb, fd_cb wcb, void *dat)
> -{
> -       struct fdentry *pfdentry =3D &pfdset->fd[idx];
> -       struct pollfd *pfd =3D &pfdset->rwfds[idx];
> -
> -       pfdentry->fd  =3D fd;
> -       pfdentry->rcb =3D rcb;
> -       pfdentry->wcb =3D wcb;
> -       pfdentry->dat =3D dat;
> -
> -       pfd->fd =3D fd;
> -       pfd->events  =3D rcb ? POLLIN : 0;
> -       pfd->events |=3D wcb ? POLLOUT : 0;
> -       pfd->revents =3D 0;
> -}
> -
> struct fdset *
> fdset_init(const char *name)
> {
> @@ -284,16 +109,20 @@ fdset_init(const char *name)
>        rte_strscpy(fdset->name, name, RTE_THREAD_NAME_SIZE);
>=20
>        pthread_mutex_init(&fdset->fd_mutex, NULL);
> -       pthread_mutex_init(&fdset->fd_polling_mutex, NULL);
>=20
> -       for (i =3D 0; i < MAX_FDS; i++) {
> +       for (i =3D 0; i < (int)RTE_DIM(fdset->fd); i++) {
>                fdset->fd[i].fd =3D -1;
>                fdset->fd[i].dat =3D NULL;
>        }
> -       fdset->num =3D 0;
> +       LIST_INIT(&fdset->fdlist);
>=20
> -       if (fdset_pipe_init(fdset)) {
> -               VHOST_FDMAN_LOG(ERR, "Failed to init pipe for %s", name);
> +       /*
> +        * Any non-zero value would work (see man epoll_create),
> +        * but pass MAX_FDS for consistency.
> +        */
> +       fdset->epfd =3D epoll_create(MAX_FDS);
> +       if (fdset->epfd < 0) {
> +               VHOST_FDMAN_LOG(ERR, "failed to create epoll for %s fdset=
", name);

failed -> Failed like other logs

>                goto err_free;
>        }
>=20
> @@ -301,7 +130,7 @@ fdset_init(const char *name)
>                                        fdset_event_dispatch, fdset)) {
>                VHOST_FDMAN_LOG(ERR, "Failed to create %s event dispatch t=
hread",
>                                fdset->name);
> -               goto err_pipe;
> +               goto err_epoll;
>        }
>=20
>        if (fdset_insert(fdset)) {
> @@ -315,10 +144,9 @@ fdset_init(const char *name)
>=20
> err_thread:
>        fdset->destroy =3D true;
> -       fdset_sync(fdset);
>        rte_thread_join(fdset->tid, &val);
> -err_pipe:
> -       fdset_pipe_uninit(fdset);
> +err_epoll:
> +       close(fdset->epfd);
> err_free:
>        rte_free(fdset);
> err_unlock:
> @@ -330,78 +158,99 @@ fdset_init(const char *name)
> /**
>  * Register the fd in the fdset with read/write handler and context.
>  */
> -static int
> -fdset_add_no_sync(struct fdset *pfdset, int fd, fd_cb rcb, fd_cb wcb, vo=
id *dat)
> +int
> +fdset_add(struct fdset *pfdset, int fd, fd_cb rcb, fd_cb wcb, void *dat)
> {
> -       int i;
> +       struct fdentry *pfdentry;
> +       struct epoll_event ev;
>=20
>        if (pfdset =3D=3D NULL || fd =3D=3D -1)
>                return -1;
>=20
>        pthread_mutex_lock(&pfdset->fd_mutex);
> -       i =3D pfdset->num < MAX_FDS ? pfdset->num++ : -1;
> -       if (i =3D=3D -1) {
> -               pthread_mutex_lock(&pfdset->fd_polling_mutex);
> -               fdset_shrink_nolock(pfdset);
> -               pthread_mutex_unlock(&pfdset->fd_polling_mutex);
> -               i =3D pfdset->num < MAX_FDS ? pfdset->num++ : -1;
> -               if (i =3D=3D -1) {
> -                       pthread_mutex_unlock(&pfdset->fd_mutex);
> -                       return -2;
> -               }
> +       if (pfdset->next_free_idx >=3D (int)RTE_DIM(pfdset->fd)) {
> +               pthread_mutex_unlock(&pfdset->fd_mutex);
> +               return -2;
>        }
>=20
> -       fdset_add_fd(pfdset, i, fd, rcb, wcb, dat);
> +       pfdentry =3D &pfdset->fd[pfdset->next_free_idx];
> +       pfdentry->fd  =3D fd;
> +       pfdentry->rcb =3D rcb;
> +       pfdentry->wcb =3D wcb;
> +       pfdentry->dat =3D dat;
> +
> +       LIST_INSERT_HEAD(&pfdset->fdlist, pfdentry, next);
> +
> +       /* Find next free slot */
> +       pfdset->next_free_idx++;
> +       for (; pfdset->next_free_idx < (int)RTE_DIM(pfdset->fd); pfdset->=
next_free_idx++) {
> +               if (pfdset->fd[pfdset->next_free_idx].fd !=3D -1)
> +                       continue;
> +               break;
> +       }
>        pthread_mutex_unlock(&pfdset->fd_mutex);
>=20
> +       ev.events =3D EPOLLERR;
> +       ev.events |=3D rcb ? EPOLLIN : 0;
> +       ev.events |=3D wcb ? EPOLLOUT : 0;
> +       ev.data.fd =3D fd;
> +
> +       if (epoll_ctl(pfdset->epfd, EPOLL_CTL_ADD, fd, &ev) =3D=3D -1)
> +               VHOST_FDMAN_LOG(ERR, "could not add %d fd to %d epfd: %s"=
,
> +                       fd, pfdset->epfd, strerror(errno));

Should not return 0 if this fails ?

> +
>        return 0;
> }
>=20
> -int
> -fdset_add(struct fdset *pfdset, int fd, fd_cb rcb, fd_cb wcb, void *dat)
> +static struct fdentry *
> +fdset_find_entry_locked(struct fdset *pfdset, int fd)
> {
> -       int ret;
> +       struct fdentry *pfdentry;
>=20
> -       ret =3D fdset_add_no_sync(pfdset, fd, rcb, wcb, dat);
> -       if (ret < 0)
> -               return ret;
> +       LIST_FOREACH(pfdentry, &pfdset->fdlist, next) {
> +               if (pfdentry->fd !=3D fd)
> +                       continue;
> +               return pfdentry;
> +       }
>=20
> -       fdset_sync(pfdset);
> +       return NULL;
> +}
>=20
> -       return 0;
> +static void
> +fdset_del_locked(struct fdset *pfdset, struct fdentry *pfdentry)
> +{
> +       int entry_idx;
> +
> +       if (epoll_ctl(pfdset->epfd, EPOLL_CTL_DEL, pfdentry->fd, NULL) =
=3D=3D -1)
> +               VHOST_FDMAN_LOG(ERR, "could not remove %d fd from %d epfd=
: %s",
> +                       pfdentry->fd, pfdset->epfd, strerror(errno));
> +
> +       pfdentry->fd =3D -1;
> +       pfdentry->rcb =3D pfdentry->wcb =3D NULL;
> +       pfdentry->dat =3D NULL;
> +       entry_idx =3D pfdentry - pfdset->fd;
> +       if (entry_idx < pfdset->next_free_idx)
> +               pfdset->next_free_idx =3D entry_idx;
> +       LIST_REMOVE(pfdentry, next);
> }
>=20
> -/**
> - *  Unregister the fd from the fdset.
> - *  Returns context of a given fd or NULL.
> - */
> -void *
> +void
> fdset_del(struct fdset *pfdset, int fd)
> {
> -       int i;
> -       void *dat =3D NULL;
> +       struct fdentry *pfdentry;
>=20
>        if (pfdset =3D=3D NULL || fd =3D=3D -1)
> -               return NULL;
> +               return;
>=20
>        do {
>                pthread_mutex_lock(&pfdset->fd_mutex);
> -
> -               i =3D fdset_find_fd(pfdset, fd);
> -               if (i !=3D -1 && pfdset->fd[i].busy =3D=3D 0) {
> -                       /* busy indicates r/wcb is executing! */
> -                       dat =3D pfdset->fd[i].dat;
> -                       pfdset->fd[i].fd =3D -1;
> -                       pfdset->fd[i].rcb =3D pfdset->fd[i].wcb =3D NULL;
> -                       pfdset->fd[i].dat =3D NULL;
> -                       i =3D -1;
> +               pfdentry =3D fdset_find_entry_locked(pfdset, fd);
> +               if (pfdentry !=3D NULL && pfdentry->busy =3D=3D 0) {
> +                       fdset_del_locked(pfdset, pfdentry);
> +                       pfdentry =3D NULL;
>                }
>                pthread_mutex_unlock(&pfdset->fd_mutex);
> -       } while (i !=3D -1);
> -
> -       fdset_sync(pfdset);
> -
> -       return dat;
> +       } while (pfdentry !=3D NULL);
> }
>=20
> /**
> @@ -415,28 +264,22 @@ fdset_del(struct fdset *pfdset, int fd)
> int
> fdset_try_del(struct fdset *pfdset, int fd)
> {
> -       int i;
> +       struct fdentry *pfdentry;
>=20
>        if (pfdset =3D=3D NULL || fd =3D=3D -1)
>                return -2;
>=20
>        pthread_mutex_lock(&pfdset->fd_mutex);
> -       i =3D fdset_find_fd(pfdset, fd);
> -       if (i !=3D -1 && pfdset->fd[i].busy) {
> +       pfdentry =3D fdset_find_entry_locked(pfdset, fd);
> +       if (pfdentry !=3D NULL && pfdentry->busy !=3D 0) {
>                pthread_mutex_unlock(&pfdset->fd_mutex);
>                return -1;
>        }
>=20
> -       if (i !=3D -1) {
> -               pfdset->fd[i].fd =3D -1;
> -               pfdset->fd[i].rcb =3D pfdset->fd[i].wcb =3D NULL;
> -               pfdset->fd[i].dat =3D NULL;
> -       }
> +       if (pfdentry !=3D NULL)
> +               fdset_del_locked(pfdset, pfdentry);
>=20
>        pthread_mutex_unlock(&pfdset->fd_mutex);
> -
> -       fdset_sync(pfdset);
> -
>        return 0;
> }
>=20
> @@ -453,53 +296,29 @@ static uint32_t
> fdset_event_dispatch(void *arg)
> {
>        int i;
> -       struct pollfd *pfd;
> -       struct fdentry *pfdentry;
>        fd_cb rcb, wcb;
>        void *dat;
>        int fd, numfds;
>        int remove1, remove2;
> -       int need_shrink;
>        struct fdset *pfdset =3D arg;
> -       int val;
>=20
>        if (pfdset =3D=3D NULL)
>                return 0;
>=20
>        while (1) {
> +               struct epoll_event events[MAX_FDS];
> +               struct fdentry *pfdentry;
>=20
> -               /*
> -                * When poll is blocked, other threads might unregister
> -                * listenfds from and register new listenfds into fdset.
> -                * When poll returns, the entries for listenfds in the fd=
set
> -                * might have been updated. It is ok if there is unwanted=
 call
> -                * for new listenfds.
> -                */
> -               pthread_mutex_lock(&pfdset->fd_mutex);
> -               numfds =3D pfdset->num;
> -               pthread_mutex_unlock(&pfdset->fd_mutex);
> -
> -               pthread_mutex_lock(&pfdset->fd_polling_mutex);
> -               val =3D poll(pfdset->rwfds, numfds, 1000 /* millisecs */)=
;
> -               pthread_mutex_unlock(&pfdset->fd_polling_mutex);
> -               if (val < 0)
> +               numfds =3D epoll_wait(pfdset->epfd, events, RTE_DIM(event=
s), 1000);
> +               if (numfds < 0)
>                        continue;
>=20
> -               need_shrink =3D 0;
>                for (i =3D 0; i < numfds; i++) {
>                        pthread_mutex_lock(&pfdset->fd_mutex);
>=20
> -                       pfdentry =3D &pfdset->fd[i];
> -                       fd =3D pfdentry->fd;
> -                       pfd =3D &pfdset->rwfds[i];
> -
> -                       if (fd < 0) {
> -                               need_shrink =3D 1;
> -                               pthread_mutex_unlock(&pfdset->fd_mutex);
> -                               continue;
> -                       }
> -
> -                       if (!pfd->revents) {
> +                       fd =3D events[i].data.fd;
> +                       pfdentry =3D fdset_find_entry_locked(pfdset, fd);
> +                       if (pfdentry =3D=3D NULL) {
>                                pthread_mutex_unlock(&pfdset->fd_mutex);
>                                continue;
>                        }
> @@ -513,9 +332,9 @@ fdset_event_dispatch(void *arg)
>=20
>                        pthread_mutex_unlock(&pfdset->fd_mutex);
>=20
> -                       if (rcb && pfd->revents & (POLLIN | FDPOLLERR))
> +                       if (rcb && events[i].events & (EPOLLIN | EPOLLERR=
 | EPOLLHUP))
>                                rcb(fd, dat, &remove1);
> -                       if (wcb && pfd->revents & (POLLOUT | FDPOLLERR))
> +                       if (wcb && events[i].events & (EPOLLOUT | EPOLLER=
R | EPOLLHUP))
>                                wcb(fd, dat, &remove2);
>                        pfdentry->busy =3D 0;
>                        /*
> @@ -524,23 +343,13 @@ fdset_event_dispatch(void *arg)
>                         * directly.
>                         */
>                        /*
> -                        * When we are to clean up the fd from fdset,
> -                        * because the fd is closed in the cb,
> -                        * the old fd val could be reused by when creates=
 new
> -                        * listen fd in another thread, we couldn't call
> -                        * fdset_del.
> +                        * A concurrent fdset_del may have been waiting f=
or the
> +                        * fdentry not to be busy, so we can't call
> +                        * fdset_del_locked().
>                         */
> -                       if (remove1 || remove2) {
> -                               pfdentry->fd =3D -1;
> -                               need_shrink =3D 1;
> -                       }
> +                       if (remove1 || remove2)
> +                               fdset_del(pfdset, fd);
>                }
> -
> -               if (need_shrink)
> -                       fdset_shrink(pfdset);
> -
> -               if (pfdset->destroy)
> -                       break;

I guess we want to keep the destroy logic

Thanks,
Chenbo

>        }
>=20
>        return 0;
> diff --git a/lib/vhost/fd_man.h b/lib/vhost/fd_man.h
> index 079fa0155f..6398343a6a 100644
> --- a/lib/vhost/fd_man.h
> +++ b/lib/vhost/fd_man.h
> @@ -6,7 +6,7 @@
> #define _FD_MAN_H_
> #include <pthread.h>
> #include <poll.h>
> -#include <stdbool.h>
> +#include <sys/queue.h>
>=20
> struct fdset;
>=20
> @@ -19,8 +19,7 @@ struct fdset *fdset_init(const char *name);
> int fdset_add(struct fdset *pfdset, int fd,
>        fd_cb rcb, fd_cb wcb, void *dat);
>=20
> -void *fdset_del(struct fdset *pfdset, int fd);
> -
> +void fdset_del(struct fdset *pfdset, int fd);
> int fdset_try_del(struct fdset *pfdset, int fd);
>=20
> #endif
> --
> 2.44.0
>=20