Date: Thu, 14 Apr 2022 13:28:34 -0700
From: Stephen Hemminger <stephen@networkplumber.org>
To: Thomas Monjalon <thomas@monjalon.net>
Cc: anatoly.burakov@intel.com, stable@dpdk.org, dev@dpdk.org,
 david.marchand@redhat.com
Subject: Re: [PATCH] eal: fix data race in multi-process support
Message-ID: <20220414132834.5c073dad@hermes.local>
In-Reply-To: <9400637.ag9G3TJQzC@thomas>
References: <20211217181649.154972-1-stephen@networkplumber.org>
 <20211217182922.159503-1-stephen@networkplumber.org>
 <9400637.ag9G3TJQzC@thomas>

On Sun, 13 Feb 2022 12:39:59 +0100
Thomas Monjalon <thomas@monjalon.net> wrote:

> 17/12/2021 19:29, Stephen Hemminger:
> > If DPDK is built with the thread sanitizer, it reports a race
> > in the setting of the multi-process file descriptor. The fix is to
> > use atomic operations when updating mp_fd.
> 
> Could you please explain the conditions of the race in more detail?
> Is it between init and cleanup of the same file descriptor?
> How does atomic help here?
> 
> 
> > 
> > Simple example:
> > $ dpdk-testpmd -l 1-3 --no-huge
> > ...
> > EAL: Error - exiting with code: 1
> >   Cause: Creation of mbuf pool for socket 0 failed: Cannot allocate memory
> > ==================
> > WARNING: ThreadSanitizer: data race (pid=83054)
> >   Write of size 4 at 0x55e3b7fce450 by main thread:
> >     #0 rte_mp_channel_cleanup <null> (dpdk-testpmd+0x160d79c)
> >     #1 rte_eal_cleanup <null> (dpdk-testpmd+0x1614fb5)
> >     #2 rte_exit <null> (dpdk-testpmd+0x15ec97a)
> >     #3 mbuf_pool_create.cold <null> (dpdk-testpmd+0x242e1a)
> >     #4 main <null> (dpdk-testpmd+0x5ab05d)
> > 
> >   Previous read of size 4 at 0x55e3b7fce450 by thread T2:
> >     #0 mp_handle <null> (dpdk-testpmd+0x160c979)
> >     #1 ctrl_thread_init <null> (dpdk-testpmd+0x15ff76e)
> > 
> >   As if synchronized via sleep:
> >     #0 nanosleep ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:362 (libtsan.so.0+0x5cd8e)
> >     #1 get_tsc_freq <null> (dpdk-testpmd+0x1622889)
> >     #2 set_tsc_freq <null> (dpdk-testpmd+0x15ffb9c)
> >     #3 rte_eal_timer_init <null> (dpdk-testpmd+0x1622a34)
> >     #4 rte_eal_init.cold <null> (dpdk-testpmd+0x26b314)
> >     #5 main <null> (dpdk-testpmd+0x5aab45)
> > 
> >   Location is global 'mp_fd' of size 4 at 0x55e3b7fce450 (dpdk-testpmd+0x0000027c7450)
> > 
> >   Thread T2 'rte_mp_handle' (tid=83057, running) created by main thread at:
> >     #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:962 (libtsan.so.0+0x58ba2)
> >     #1 rte_ctrl_thread_create <null> (dpdk-testpmd+0x15ff870)
> >     #2 rte_mp_channel_init.cold <null> (dpdk-testpmd+0x269986)
> >     #3 rte_eal_init <null> (dpdk-testpmd+0x1615b28)
> >     #4 main <null> (dpdk-testpmd+0x5aab45)  
> 

The issue is that two threads share a global variable without any barrier or atomic
operation. The variable mp_fd is set by the main thread in rte_mp_channel_init()
and rte_mp_channel_cleanup(), but read by the control thread that services
multi-process requests (mp_handle).

Sharing global data without a barrier or lock like this is undefined behavior, and
can break on weakly ordered CPUs such as ARM.

I am kind of surprised we have not seen a bug from this already: since mp_fd is a
plain int, the compiler could decide it is invariant inside mp_handle(), hoist the
test out of the loop, and leave the thread running forever.
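
Concretely, something like the following is what I have in mind. This is a minimal
sketch using the GCC/Clang __atomic builtins, not the actual patch; the function
names mirror the real ones in eal_common_proc.c, but the bodies are illustrative:

#include <pthread.h>
#include <unistd.h>

static int mp_fd = -1; /* shared: written by main, read by control thread */

/* Control thread loop, as in mp_handle(). The atomic load forces a
 * fresh read of mp_fd on every iteration; a plain read could legally
 * be hoisted out of the loop by the compiler.
 */
static void *mp_handle(void *arg)
{
	int fd;

	(void)arg;
	while ((fd = __atomic_load_n(&mp_fd, __ATOMIC_RELAXED)) >= 0) {
		/* ... blocking recvmsg() on fd, dispatch the request ... */
	}
	return NULL;
}

/* Main thread, as in rte_mp_channel_cleanup(): atomically publish the
 * "closed" state and retrieve the old descriptor in a single step, so
 * the control thread can never observe a half-torn-down channel.
 */
static void mp_channel_cleanup(void)
{
	int fd = __atomic_exchange_n(&mp_fd, -1, __ATOMIC_RELAXED);

	if (fd >= 0)
		close(fd);
}

Whether relaxed ordering is sufficient, or acquire/release is needed, depends on
what else is published through mp_fd; either way it removes the data race.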

This bug has been present since the beginning of MP support in DPDK.