From mboxrd@z Thu Jan 1 00:00:00 1970
From: jigsaw
Date: Fri, 14 Apr 2017 11:51:26 +0300
To: Stephen Hemminger
Cc: "Ananyev, Konstantin", Adrien Mazarguil, dev@dpdk.org
Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
In-Reply-To: <20170413170306.0fd9c8d7@xeon-e3>
References: <20170413145910.GG3790@6wind.com>
 <2601191342CEEE43887BDE71AB9772583FAE8F95@IRSMSX109.ger.corp.intel.com>
 <20170413170306.0fd9c8d7@xeon-e3>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
List-Id: DPDK patches and discussions

Hi all,

Thanks for the comments. It seems the proposal is too restricted for wide
use in applications.

thx &
rgds,
-qinglai

On Fri, Apr 14, 2017 at 3:03 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:

> On Thu, 13 Apr 2017 23:19:45 +0000
> "Ananyev, Konstantin" wrote:
>
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of jigsaw
> > > Sent: Thursday, April 13, 2017 9:59 PM
> > > To: Adrien Mazarguil
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
> > >
> > > Hi Adrien,
> > >
> > > Thanks for your comment.
> > >
> > > The LOCK/UNLOCK may be called by the user application only. There are
> > > several reasons.
> > >
> > > 1. If the lib calls LOCK, the user application may be punished
> > > unexpectedly.
> > > Consider what happens if the Rx burst function calls LOCK on core #1,
> > > and the mbuf is then handed over from core #1 to core #2 through an
> > > enqueue/dequeue operation, as below:
> > >
> > > Rx(1) --> Enqueue(1) --> Dequeue(2)
> > > LOCKED                   Panic!
> > >
> > > Core #2 will then panic, because the mbuf is owned by core #1 and was
> > > never released.
> > >
> > > 2. Rx and Tx both usually work in burst mode, combined with prefetch
> > > operations. LOCK and UNLOCK cannot work well with prefetch, because
> > > they require data access to the mbuf header.
> > >
> > > 3. The critical section tends to be small. Here we (the user
> > > application) need to find a balance: a longer critical section protects
> > > more code, but it also leads to more complicated business logic.
> > >
> > > Hence, we urge the user application to use LOCK/UNLOCK with the common
> > > sense of critical sections.
> > >
> > > Take examples/l3fwd for example. A proper LOCK/UNLOCK pair is shown
> > > below:
> > >
> > > ==========================================
> > > diff --git a/examples/l3fwd/l3fwd_em_hlm_sse.h b/examples/l3fwd/l3fwd_em_hlm_sse.h
> > > index 7714a20..9db0190 100644
> > > --- a/examples/l3fwd/l3fwd_em_hlm_sse.h
> > > +++ b/examples/l3fwd/l3fwd_em_hlm_sse.h
> > > @@ -299,6 +299,15 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> > >
> > >          for (j = 0; j < n; j += 8) {
> > >
> > > +                RTE_MBUF_LOCK(pkts_burst[j]);
> > > +                RTE_MBUF_LOCK(pkts_burst[j+1]);
> > > +                RTE_MBUF_LOCK(pkts_burst[j+2]);
> > > +                RTE_MBUF_LOCK(pkts_burst[j+3]);
> > > +                RTE_MBUF_LOCK(pkts_burst[j+4]);
> > > +                RTE_MBUF_LOCK(pkts_burst[j+5]);
> > > +                RTE_MBUF_LOCK(pkts_burst[j+6]);
> > > +                RTE_MBUF_LOCK(pkts_burst[j+7]);
> > > +
> > >                  uint32_t pkt_type =
> > >                          pkts_burst[j]->packet_type &
> > >                          pkts_burst[j+1]->packet_type &
> > > @@ -333,8 +342,10 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> > >                  }
> > >          }
> > >
> > > -        for (; j < nb_rx; j++)
> > > +        for (; j < nb_rx; j++) {
> > > +                RTE_MBUF_LOCK(pkts_burst[j]);
> > >                  dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
> > > +        }
> > >
> > >          send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
> > >
> > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > index 1afa1f0..2938558 100644
> > > --- a/examples/l3fwd/l3fwd_sse.h
> > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > @@ -322,6 +322,9 @@ send_packetsx4(struct lcore_conf *qconf, uint8_t port, struct rte_mbuf *m[],
> > >
> > >          len = qconf->tx_mbufs[port].len;
> > >
> > > +        for (j = 0; j < num; ++j)
> > > +                RTE_MBUF_UNLOCK(m[j]);
> > > +
> > >          /*
> > >           * If TX buffer for that queue is empty, and we have enough packets,
> > >           * then send them straightway.
> > > @@ -492,8 +495,10 @@ send_packets_multi(struct lcore_conf *qconf, struct rte_mbuf **pkts_burst,
> > >                  if (likely(pn != BAD_PORT))
> > >                          send_packetsx4(qconf, pn, pkts_burst + j, k);
> > >                  else
> > > -                        for (m = j; m != j + k; m++)
> > > +                        for (m = j; m != j + k; m++) {
> > > +                                RTE_MBUF_UNLOCK(pkts_burst[m]);
> > >                                  rte_pktmbuf_free(pkts_burst[m]);
> > > +                        }
> > >
> > >          }
> > >  }
> > > ==========================================
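> > >
> > > For the pair above to be harmless in normal builds, RTE_MBUF_LOCK and
> > > RTE_MBUF_UNLOCK would be compiled out unless the debug option is
> > > enabled. A minimal sketch of the gating (the option name
> > > RTE_MBUF_DEBUG_RACE and the rte_mbuf_race_lock/unlock helpers are
> > > hypothetical, not an existing DPDK API):
> > >
> > > #include <rte_mbuf.h>
> > >
> > > #ifdef RTE_MBUF_DEBUG_RACE /* hypothetical build option, off by default */
> > > /* Hypothetical helpers; their logic is the pseudo code discussed
> > >  * further down the thread. */
> > > void rte_mbuf_race_lock(struct rte_mbuf *m);
> > > void rte_mbuf_race_unlock(struct rte_mbuf *m);
> > > #define RTE_MBUF_LOCK(m)   rte_mbuf_race_lock(m)
> > > #define RTE_MBUF_UNLOCK(m) rte_mbuf_race_unlock(m)
> > > #else
> > > /* Normal builds: the macros expand to nothing, so zero overhead. */
> > > #define RTE_MBUF_LOCK(m)   do { (void)(m); } while (0)
> > > #define RTE_MBUF_UNLOCK(m) do { (void)(m); } while (0)
> > > #endif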
> > >
> > > Note that a data race may or may not have a visible consequence. If two
> > > cores unconsciously process the same mbuf at different times, they may
> > > never notice it; but if two cores access the same mbuf at the same
> > > physical time, the consequence is usually visible (a crash). We don't
> > > seek a solution that captures every potential data race; instead, we
> > > seek a solution that captures data races happening simultaneously on
> > > two or more cores. Therefore, we do not need to extend the borders of
> > > locking as wide as possible. We will apply locking only while we are
> > > focused on processing a single mbuf.
> > >
> > > In a real-life application, a core will spend quite some time on each
> > > mbuf. The interval ranges from a few hundred cycles to a few thousand
> > > cycles, and usually no more than a handful of mbufs are involved. This
> > > is an ideal use case for locking mbufs.
> > >
> > > I agree that the race detection shall not be compiled in by default,
> > > since mtod is called often, and every mtod would then imply a visit to
> > > the local cache. Further, recursive calls of LOCK/UNLOCK shall be
> > > supported as well. But I don't think the refcnt logic should be taken
> > > into account; the two are orthogonal designs, IMO. ***** Pls correct me
> > > if I am wrong here. *****
> > >
> > > Neither is LOCK/UNLOCK aware of mbuf alloc/free, for the same reason.
> > > That is why I said LOCK/UNLOCK needs to survive mbuf allocation and
> > > initialization.
> > >
> > > Of course we need to support locking multiple mbufs at the same time.
> > > For each core, we will reserve, say, 8 slots, which work exactly like a
> > > direct-mapped cache. That is, we can use 3 bits of the mbuf address to
> > > locate its slot. If the slot is already occupied, we do an eviction;
> > > that is, the new mbuf takes the place of the old one, and the old one
> > > is then, unfortunately, UNLOCKed.
> > >
> > > Honestly, I have not yet tried this approach in a real-life
> > > application. But I have been thinking about the problem of data race
> > > detection for a long time, and I find that the restrictions and
> > > requirements make this solution the only viable one. Hundreds of papers
> > > have been published on data race detection, but the lightest of the
> > > options carries at least a 5x performance penalty, not to mention the
> > > space complexity, which makes them inapplicable in practice.
> > >
> > > Again, pls, anyone who has had the same painful experience with data
> > > race bugs, share your concerns with me. It would be nice to come up
> > > with some practical device to address this problem. I believe 6WIND and
> > > other IP stack vendors must share the same feeling and opinion.
> > >
> > > thx &
> > > rgds,
> > >
> > > Qinglai
> > >
> > >
> > > On Thu, Apr 13, 2017 at 5:59 PM, Adrien Mazarguil
> > > <adrien.mazarguil@6wind.com> wrote:
> > >
> > > > Hi Qinglai,
> > > >
> > > > On Thu, Apr 13, 2017 at 04:38:19PM +0300, jigsaw wrote:
> > > > > Hi,
> > > > >
> > > > > I have a proposal for mbuf race condition detection and I would
> > > > > like to get your opinions before committing any patch.
> > > > >
> > > > > A race condition is the worst bug I can think of, as it causes a
> > > > > crash long after the first crime scene, and the consequence is
> > > > > unpredictable and difficult to understand.
> > > > >
> > > > > Besides, race conditions are very difficult to reproduce.
> > > > > Usually we get a coredump from the live network, but the coredump
> > > > > itself delivers nearly zero info. We just know that the mbuf is
> > > > > somehow broken, and that it was perhaps overwritten by another core
> > > > > at some moment.
> > > > >
> > > > > There are tools such as Valgrind and ThreadSanitizer to capture
> > > > > this kind of fault, but the overhead they bring is too high for
> > > > > them to be deployed in performance testing, not to mention in the
> > > > > live network. Yet race conditions rarely happen under low pressure.
> > > > >
> > > > > Even worse, the tools mentioned are not capable of capturing this
> > > > > scenario even in theory, because they require explicit calls to
> > > > > locking primitives such as pthread mutexes. Without explicit
> > > > > lock/unlock, the semantics are not understood by the tools.
> > > > >
> > > > > With these known restrictions, I have a proposal, roughly as below.
> > > > >
> > > > > The idea is to ask the coder to do an explicit lock/unlock on each
> > > > > mbuf that matters. The pseudo code is as below:
> > > > >
> > > > > RTE_MBUF_LOCK(m);
> > > > > /* use mbuf as usual */
> > > > > ...
> > > > > RTE_MBUF_UNLOCK(m);
> > > > >
> > > > > During the protected critical section, only the core which holds
> > > > > the lock can access the mbuf; other cores, if they dare to use the
> > > > > mbuf, will be punished with a segfault.
> > > > >
> > > > > Since the proposal shall be feasible at least in performance
> > > > > testing, the overhead of locking and punishment must be small, and
> > > > > access to the mbuf must remain as usual.
> > > > >
> > > > > Hence, RTE_MBUF_LOCK actually does the following (pseudo code):
> > > > >
> > > > > RTE_MBUF_LOCK(m)
> > > > > {
> > > > >     store_per_core_cache(m, m->buf_addr);
> > > > >     m->buf_addr = NULL;
> > > > >     mb(); /* memory barrier */
> > > > > }
> > > > >
> > > > > And RTE_MBUF_UNLOCK is simply the reverse:
> > > > >
> > > > > RTE_MBUF_UNLOCK(m)
> > > > > {
> > > > >     purge_per_core_cache(m);
> > > > >     m->buf_addr = load_per_core_cache(m);
> > > > >     mb();
> > > > > }
> > > > >
> > > > > As we can see, the idea is to overwrite m->buf_addr with NULL and
> > > > > store the real value in a per-core cache. Any other core trying to
> > > > > access the mbuf during the critical section will certainly be
> > > > > punished.
> > > > >
> > > > > Then, of course, we need to keep the owner core working. This is
> > > > > done by making changes to mtod, as below (load_per_core_cache
> > > > > yields the saved buf_addr for a locked mbuf, and 0 otherwise, so
> > > > > the sum is always the real buffer address):
> > > > >
> > > > > #define rte_pktmbuf_mtod_offset(m, t, o) \
> > > > >     ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + \
> > > > >         (m)->data_off + (o)))
> > > > >
> > > > > The per-core cache of mbufs works like a key-value store laid out
> > > > > as a direct-mapped cache. If an eviction happens (a newly arriving
> > > > > mbuf must kick out an old one), then we restore the buf_addr of the
> > > > > evicted mbuf. RTE_MBUF_UNLOCK will then take care of this
> > > > > situation.
> > > > >
> > > > > Note that we expect LOCK/UNLOCK to work even across mbuf free and
> > > > > allocation, since we don't trust the refcnt. A double-free is
> > > > > actually very rare, but a data race can occur without a
> > > > > double-free. Therefore, we need to survive the allocation of the
> > > > > mbuf, that is, rte_pktmbuf_init, which resets the buf_addr.
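> > > > >
> > > > > To make the per-core cache concrete, below is a rough sketch of how
> > > > > the helpers could be implemented. Everything here is illustrative:
> > > > > the names, the 8-slot direct-mapped layout, and the choice of
> > > > > address bits used as the index (the low 6 bits are dropped on the
> > > > > assumption that mbufs are at least cache-line aligned):
> > > > >
> > > > > #include <stdint.h>
> > > > > #include <rte_mbuf.h>
> > > > > #include <rte_atomic.h>
> > > > > #include <rte_branch_prediction.h>
> > > > >
> > > > > #define MBUF_LOCK_SLOTS 8 /* direct-mapped slots per core */
> > > > >
> > > > > struct mbuf_lock_slot {
> > > > >     struct rte_mbuf *m; /* key: locked mbuf, NULL if slot free */
> > > > >     void *buf_addr;     /* value: the saved buf_addr */
> > > > > };
> > > > >
> > > > > static __thread struct mbuf_lock_slot lock_cache[MBUF_LOCK_SLOTS];
> > > > >
> > > > > /* Map an mbuf to its slot using a few low bits of its address. */
> > > > > static inline struct mbuf_lock_slot *
> > > > > mbuf_lock_slot_of(const struct rte_mbuf *m)
> > > > > {
> > > > >     return &lock_cache[((uintptr_t)m >> 6) & (MBUF_LOCK_SLOTS - 1)];
> > > > > }
> > > > >
> > > > > static inline void
> > > > > rte_mbuf_race_lock(struct rte_mbuf *m)
> > > > > {
> > > > >     struct mbuf_lock_slot *s = mbuf_lock_slot_of(m);
> > > > >
> > > > >     if (s->m == m)
> > > > >         return; /* already locked by this core */
> > > > >     if (unlikely(s->m != NULL)) {
> > > > >         /* Eviction: restore the old mbuf, which is then,
> > > > >          * unfortunately, UNLOCKed. */
> > > > >         s->m->buf_addr = s->buf_addr;
> > > > >     }
> > > > >     s->m = m;
> > > > >     s->buf_addr = m->buf_addr;
> > > > >     m->buf_addr = NULL; /* other cores now crash on access */
> > > > >     rte_smp_mb();
> > > > > }
> > > > >
> > > > > static inline void
> > > > > rte_mbuf_race_unlock(struct rte_mbuf *m)
> > > > > {
> > > > >     struct mbuf_lock_slot *s = mbuf_lock_slot_of(m);
> > > > >
> > > > >     if (likely(s->m == m)) {
> > > > >         m->buf_addr = s->buf_addr; /* restore the pointer */
> > > > >         s->m = NULL;               /* purge the slot */
> > > > >     }
> > > > >     rte_smp_mb();
> > > > > }
> > > > >
> > > > > (A real version would also keep a recursion count per slot, since
> > > > > recursive LOCK/UNLOCK must be supported.)
> > > > >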
> > > > > Further, other dereferences of buf_addr in rte_mbuf.h need to be
> > > > > revised. But that is not a big deal (I hope).
> > > > >
> > > > > The runtime overhead of this implementation is very lightweight,
> > > > > and it could probably be turned on even in the live network. And it
> > > > > is not intrusive at all: no change is needed in struct rte_mbuf; we
> > > > > just need O(N) space (where N is the number of cores) and O(1)
> > > > > runtime complexity.
> > > > >
> > > > > Alright, that is basically what is in my mind. Before any patch is
> > > > > provided, or any perf analysis is made, I would like to know what
> > > > > the risks are from a design point of view. Please leave your
> > > > > feedback.
> > > >
> > > > Your proposal makes sense but I'm not sure where developers should
> > > > call RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK(). Is it targeted at user
> > > > applications only? Is it to be used internally by DPDK as well? Does
> > > > it include PMDs?
> > > >
> > > > I think any overhead outside of debugging mode, as minimal as it is,
> > > > is too much of a "punishment" for well-behaving applications. The
> > > > cost of a memory barrier can be extremely high, and this is one of
> > > > the reasons DPDK processes mbufs in bulk every time it gets the
> > > > chance. RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK() should thus exist under
> > > > some compilation option disabled by default, and be thought of as an
> > > > additional debugging tool.
> > > >
> > > > Also, the implementation would probably be more expensive than your
> > > > suggestion; in my opinion the reference count must be taken into
> > > > account and/or RTE_MBUF_LOCK() must be recursive, because this is how
> > > > mbufs work. Freeing mbufs multiple times is perfectly valid in many
> > > > cases.
> > > >
> > > > How can one lock several mbufs at once, by the way?
> > > >
> > > > Since it affects performance, this can only make sense as a debugging
> > > > tool developers can use when they encounter some corruption issue
> > > > they cannot identify, somewhat like poisoning buffers before they are
> > > > freed. Not sure it's worth the trouble.
> >
> > I agree with Adrien here - it doesn't look worth it:
> > On one side, it adds quite a lot of overhead and complexity to the mbuf
> > code. On the other side, its usage seems pretty limited - there would be
> > a lot of cases where it wouldn't be able to detect the race condition
> > anyway.
> > So my opinion is NACK.
> > BTW, if your app is intentionally trying to read/write the same mbufs
> > simultaneously from different threads without explicit synchronization,
> > then most likely there is something wrong with it.
> > Konstantin
>
> This is a one-off debug thing. If you have these kinds of issues, you
> really need to rethink your threading design. The examples have a clear
> thread model. If you follow the documentation about the process model,
> and don't try to reinvent your own, then this kind of hand-holding is
> not necessary.
>
> NAK as well.