Date: Thu, 13 Apr 2017 16:59:10 +0200
From: Adrien Mazarguil
To: jigsaw
Cc: "dev@dpdk.org"
Message-ID: <20170413145910.GG3790@6wind.com>
Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection

Hi Qinglai,

On Thu, Apr 13, 2017 at 04:38:19PM +0300, jigsaw wrote:
> Hi,
>
> I have a proposal for mbuf race condition detection and I would like to
> get your opinions before committing any patch.
>
> A race condition is the worst bug I can think of, as it causes a crash
> long after the first crime scene, and the consequences are unpredictable
> and difficult to understand.
>
> Besides, race conditions are very difficult to reproduce. Usually we get
> a coredump from a live network, but the coredump itself delivers nearly
> zero info. We just know that the mbuf is somehow broken, and that it was
> perhaps overwritten by another core at some moment.
>
> There are tools such as Valgrind and ThreadSanitizer that can capture
> this fault, but the overhead brought by these tools is too high for them
> to be deployed in performance testing, not to mention in a live network.
> Yet race conditions rarely happen under low pressure.
>
> Even worse, even in theory, the tools mentioned are not capable of
> capturing this scenario, because they require explicit calls to locking
> primitives such as a pthread mutex. Without explicit lock/unlock calls,
> the semantics are not understood by the tools.
>
> With such known restrictions, I have a proposal roughly as below.
>
> The idea is to ask the coder to do an explicit lock/unlock on each mbuf
> that matters. The pseudo code is as below:
>
> RTE_MBUF_LOCK(m);
> /* use mbuf as usual */
> ...
> RTE_MBUF_UNLOCK(m);
>
> During the protected critical section, only the core which holds the
> lock can access the mbuf; other cores, if they dare to use the mbuf,
> will be punished by a segfault.
>
> Since the proposal shall be feasible at least in performance testing,
> the overhead of locking and punishment must be small, and access to the
> mbuf must remain as usual.
>
> Hence, RTE_MBUF_LOCK is actually doing the following (pseudo code):
>
> RTE_MBUF_LOCK(m)
> {
>     store_per_core_cache(m, m->buf_addr);
>     m->buf_addr = NULL;
>     mb(); // memory barrier
> }
>
> And RTE_MBUF_UNLOCK is simply the reverse:
>
> RTE_MBUF_UNLOCK(m)
> {
>     purge_per_core_cache(m);
>     m->buf_addr = load_per_core_cache(m);
>     mb();
> }
>
> As we can see, the idea is to overwrite m->buf_addr with NULL and store
> the original value in a per-core cache. Other cores trying to access the
> mbuf during the critical section will certainly be punished.
>
> Then of course we need to keep the owner core working. This is done by
> making changes to mtod, as below:
>
> #define rte_pktmbuf_mtod_offset(m, t, o) \
>     ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + (m)->data_off + (o)))
>
> The per-core cache of mbufs works like a key-value database organized as
> a direct-mapped cache. If an eviction happens (a newly arriving mbuf must
> kick out an old one), then we restore the buf_addr of the evicted mbuf.
> RTE_MBUF_UNLOCK will of course take care of this situation as well.
>
> Note that we expect LOCK/UNLOCK to keep working even when the mbuf is
> freed and reallocated, since we don't trust the refcnt. A double-free is
> actually very rare, but a data race can occur without a double-free.
> Therefore, we need to survive mbuf allocation, that is rte_pktmbuf_init,
> which resets buf_addr.
>
> Further, other dereferences of buf_addr in rte_mbuf.h need to be
> revised, but that is not a big deal (I hope).
>
> The runtime overhead of this implementation is very lightweight, and it
> can probably be turned on even in a live network. It is not intrusive at
> all: no change is needed in struct rte_mbuf; we just need O(N) space
> (where N is the number of cores) and O(1) runtime complexity.
>
> Alright, that is basically what is in my mind. Before any patch is
> provided or any perf analysis is made, I would like to know what the
> risks are from a design point of view. Please leave your feedback.

Your proposal makes sense, but I'm not sure where developers should call
RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK(). Is it targeted at user applications
only? Is it to be used internally by DPDK as well? Does it include PMDs?

I think any overhead outside of debugging mode, as minimal as it is, is
too much of a "punishment" for well-behaved applications. The cost of a
memory barrier can be extremely high, and this is one of the reasons DPDK
processes mbufs in bulk every time it gets the chance.
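For reference, here is a minimal sketch of what such helpers could look
like if built literally from the pseudocode quoted above. The per-core
table, its size, the slot selection and the lower-case function names are
assumptions made for this sketch only; beyond rte_lcore_id(), rte_smp_mb(),
RTE_MAX_LCORE and struct rte_mbuf, nothing below is existing DPDK API.

    #include <stdint.h>
    #include <rte_atomic.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    #define MBUF_LOCK_CACHE_SIZE 64 /* direct-mapped slots per core (assumed) */

    struct mbuf_lock_entry {
            struct rte_mbuf *m; /* key: currently locked mbuf */
            void *buf_addr;     /* value: saved buffer address */
    };

    static struct mbuf_lock_entry
    mbuf_lock_cache[RTE_MAX_LCORE][MBUF_LOCK_CACHE_SIZE];

    static inline struct mbuf_lock_entry *
    mbuf_lock_slot(const struct rte_mbuf *m)
    {
            /* Direct mapping: the mbuf address selects the slot. */
            unsigned int slot = ((uintptr_t)m >> 6) % MBUF_LOCK_CACHE_SIZE;

            return &mbuf_lock_cache[rte_lcore_id()][slot];
    }

    static inline void
    rte_mbuf_lock(struct rte_mbuf *m) /* the proposed RTE_MBUF_LOCK() */
    {
            struct mbuf_lock_entry *e = mbuf_lock_slot(m);

            /* Eviction: restore the previous occupant before reusing the slot. */
            if (e->m != NULL && e->m != m)
                    e->m->buf_addr = e->buf_addr;
            e->m = m;
            e->buf_addr = m->buf_addr;
            m->buf_addr = NULL; /* other cores now fault on mtod() */
            rte_smp_mb();
    }

    static inline void
    rte_mbuf_unlock(struct rte_mbuf *m) /* the proposed RTE_MBUF_UNLOCK() */
    {
            struct mbuf_lock_entry *e = mbuf_lock_slot(m);

            if (e->m == m) {
                    m->buf_addr = e->buf_addr;
                    e->m = NULL;
                    e->buf_addr = NULL;
            }
            rte_smp_mb();
    }

The owner core keeps working only because the modified
rte_pktmbuf_mtod_offset() quoted above adds the per-core cached address
back in, while any other core ends up dereferencing an address near NULL.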
RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK() should thus exist under some
compilation option disabled by default, and be thought of as an additional
debugging tool.

Also, the implementation would probably be more expensive than your
suggestion: in my opinion, the reference count must be taken into account
and/or RTE_MBUF_LOCK() must be recursive, because this is how mbufs work.
Freeing mbufs multiple times is perfectly valid in many cases. By the way,
how can one lock several mbufs at once?

Since it affects performance, this can only make sense as a debugging tool
that developers use when they encounter some corruption issue they cannot
identify, somewhat like poisoning buffers before they are freed. Not sure
it's worth the trouble.

Regards,

--
Adrien Mazarguil
6WIND
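To make the "disabled by default" point above concrete, here is one way
the helpers from the earlier sketch could be gated at build time, together
with one possible way to make the lock recursive. RTE_MBUF_RACE_DEBUG is a
hypothetical option name, not an existing DPDK build flag, and the depth
field is only an illustration.

    /*
     * Hypothetical build option (not an existing DPDK flag): when it is
     * disabled, the helpers compile to nothing and well-behaved
     * applications pay no cost at all.
     */
    #ifdef RTE_MBUF_RACE_DEBUG
    #define RTE_MBUF_LOCK(m) rte_mbuf_lock(m)
    #define RTE_MBUF_UNLOCK(m) rte_mbuf_unlock(m)
    #else
    #define RTE_MBUF_LOCK(m) do { (void)(m); } while (0)
    #define RTE_MBUF_UNLOCK(m) do { (void)(m); } while (0)
    #endif

    /*
     * One way to make the lock recursive: count nested locks in the
     * per-core slot and only restore buf_addr when the count drops back
     * to zero.
     */
    struct mbuf_lock_entry {
            struct rte_mbuf *m;
            void *buf_addr;
            unsigned int depth; /* outstanding RTE_MBUF_LOCK() calls on this core */
    };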