From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Marchand
Date: Tue, 10 Mar 2020 14:04:34 +0100
To: "Van Haaren, Harry"
Cc: Aaron Conole, dev
Subject: Re: [dpdk-dev] [RFC] service: stop lcore threads before 'finalize'
List-Id: DPDK patches and discussions

On Fri, Feb 21, 2020 at 1:28 PM Van Haaren, Harry wrote:
>
> > -----Original Message-----
> > From: David Marchand
> > Sent: Thursday, February 20, 2020 1:25 PM
> > To: Van Haaren, Harry
> > Cc: Aaron Conole; dev
> > Subject: Re: [RFC] service: stop lcore threads before 'finalize'
> >
> > On Mon, Feb 10, 2020 at 3:16 PM Van Haaren, Harry wrote:
> > > > > We need a fix for this issue.
> > > >
> > > > +1
> > > >
> > > > > Interestingly, Stephen's patch that joins all pthreads at
> > > > > rte_eal_cleanup [1] makes this issue disappear.
> > > > > So my understanding is that we are missing an API (well, I could not
> > > > > find a way) to synchronously stop service lcores.
> > > >
> > > > Maybe we can take that patch as a fix. I hate to see this segfault
> > > > in the field. I need to figure out what I missed in my cleanup
> > > > (probably missed a synchronization point).
> > >
> > > I haven't easily reproduced this yet - so I'll investigate a way to
> > > reproduce with a close to 100% rate, then we can identify the root cause
> > > and actually get a clean fix. If you have pointers to reproduce easily,
> > > please let me know.
> >
> > Ping.
> > I want a fix in 20.05, or I will start considering how to drop this thing.
>
> Hi David,
>
> I have been attempting to reproduce, unfortunately without success.
>
> I attempted your suggested meson test approach (thanks for suggesting!), but
> I haven't had a segfault with that approach (yet, and it's done a lot of
> iterations..)

I reproduced it on the first try, just now.
Travis catches it every once in a while (look at the ovsrobot).

For the reproduction: this is on my laptop (Core i7-8650U), bare metal,
no fancy stuff.
FWIW, the cores run with the "powersave" governor.
I can see the frequency oscillating between 3.5 GHz and 3.7 GHz, while
the max frequency is 4.2 GHz.

Travis runs virtual machines with 2 cores, and there must be quite some
overprovisioning on those servers.
We can expect some cycles being stolen, or at least something happening
on the various cores.

> I've made the service-cores unit tests delay before exit, in an attempt
> to have them access previously rte_free()-ed memory, but no luck
> reproducing.

OK, let's forget about the segfault; what do you think of the backtrace
I caught?
A service lcore thread is still in the service loop.
The master thread of the application is in the libc exiting code.
This is what I get in all crashes.

> Thinking perhaps we need it on exit, I've also PoCed a unit test that
> deliberately leaves service cores active on exit, to try to have them
> poll after exit; still no luck.
>
> Simplifying the problem and using the hello-world sample app with an
> rte_eal_cleanup() call at the end also doesn't easily trigger the
> problem.
>
> From code inspection, I agree there is an issue. It seems like a call to
> rte_service_lcore_reset_all() from rte_service_finalize() is enough...
> But without a reproducer it is hard to have good confidence in a fix.

You promised a doc update on the services API.
Thanks.

-- 
David Marchand
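
For illustration, below is a minimal sketch of the shutdown ordering the
RFC argues for, assuming an application that started the service lcores
itself. The helper name app_stop_service_lcores() is invented for this
example, and the busy-wait on rte_service_may_be_active() is a
simplification; this is not the fix under discussion, only the ordering
it argues for: no service lcore should still be inside its run loop when
rte_eal_cleanup() reaches rte_service_finalize() and frees the internal
service state.

/*
 * Hypothetical helper, not part of DPDK: stop the services, then stop
 * and join the service lcores, so none of them is still in its run loop
 * when rte_eal_cleanup()/rte_service_finalize() frees the service state.
 */
#include <rte_common.h>
#include <rte_cycles.h>
#include <rte_eal.h>
#include <rte_launch.h>
#include <rte_lcore.h>
#include <rte_service.h>

static void
app_stop_service_lcores(void)
{
	uint32_t ids[RTE_MAX_LCORE];
	uint32_t s;
	int32_t n, i;

	/* Ask every registered service to stop... */
	for (s = 0; s < rte_service_get_count(); s++)
		rte_service_runstate_set(s, 0);

	/* ...and wait until none of them may still be executing. */
	for (s = 0; s < rte_service_get_count(); s++)
		while (rte_service_may_be_active(s) == 1)
			rte_delay_ms(1);

	/* Stop the service lcores; no -EBUSY since services are stopped. */
	n = rte_service_lcore_list(ids, RTE_DIM(ids));
	for (i = 0; i < n; i++)
		rte_service_lcore_stop(ids[i]);

	/*
	 * Join the lcore threads: rte_eal_wait_lcore() returns only once
	 * an lcore has left the function it was launched with, i.e. the
	 * service run loop, which is the synchronization point missing in
	 * the backtrace above.
	 */
	for (i = 0; i < n; i++)
		rte_eal_wait_lcore(ids[i]);
}

int
main(int argc, char **argv)
{
	if (rte_eal_init(argc, argv) < 0)
		return -1;

	/* ... set up services and service lcores, run the application ... */

	app_stop_service_lcores();
	return rte_eal_cleanup();
}

Doing the equivalent inside EAL, rather than in every application, is
what Harry's remark above points at: having rte_service_finalize() stop
and wait for the service lcores itself, e.g. via
rte_service_lcore_reset_all().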