* [dpdk-dev] Proposal of mbuf Race Condition Detection
@ 2017-04-13 13:38 jigsaw
  2017-04-13 14:59 ` Adrien Mazarguil
  0 siblings, 1 reply; 6+ messages in thread
From: jigsaw @ 2017-04-13 13:38 UTC (permalink / raw)
To: dev

Hi,

I have a proposal for mbuf race condition detection and I would like to
get your opinions before committing any patch.

A race condition is the worst bug I can think of, as it causes a crash
long after the first crime scene, and the consequences are unpredictable
and difficult to understand.

Besides, race conditions are very difficult to reproduce. Usually we get
a coredump from a live network, but the coredump itself delivers nearly
zero information. We just know that the mbuf is somehow broken and was
perhaps overwritten by another core at some point.

There are tools such as Valgrind and ThreadSanitizer that can capture
this fault, but the overhead they bring is too high to deploy in
performance testing, not to mention in a live network. Yet race
conditions rarely happen under low pressure.

Even worse, even in theory these tools are not capable of capturing the
scenario, because they require explicit calls to locking primitives such
as pthread mutexes; without explicit lock/unlock, the semantics are not
understood by the tools.

With these known restrictions, I have a proposal, roughly as below.

The idea is to ask the coder to do an explicit lock/unlock on each mbuf
that matters. The pseudo code is as below:

    RTE_MBUF_LOCK(m);
    /* use mbuf as usual */
    ...
    RTE_MBUF_UNLOCK(m);

During the protected critical section, only the core which holds the
lock can access the mbuf; other cores, if they dare to use the mbuf, are
punished with a segfault.

Since the proposal must be feasible at least in performance testing, the
overhead of locking and punishment must be small, and access to the mbuf
must stay as usual.

Hence, RTE_MBUF_LOCK actually does the following (pseudo code):

    RTE_MBUF_LOCK(m)
    {
        store_per_core_cache(m, m->buf_addr);
        m->buf_addr = NULL;
        mb(); // memory barrier
    }

And RTE_MBUF_UNLOCK is simply the reverse:

    RTE_MBUF_UNLOCK(m)
    {
        m->buf_addr = load_per_core_cache(m);
        purge_per_core_cache(m);
        mb();
    }

As you can see, the idea is to overwrite m->buf_addr with NULL and store
the original value in a per-core cache. Other cores trying to access the
mbuf during the critical section are then certainly punished.

Then, of course, we need to keep the owner core working. This is done by
making changes to mtod, as below:

    #define rte_pktmbuf_mtod_offset(m, t, o) \
        ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + (m)->data_off + (o)))

The per-core cache of mbufs works like a key-value store organized as a
direct-mapped cache. If an eviction happens (a newly arriving mbuf must
kick out an old one), then we restore the buf_addr of the evicted mbuf;
RTE_MBUF_UNLOCK takes care of this situation.

Note that we expect LOCK/UNLOCK to work even when the mbuf is freed and
allocated, since we don't trust the refcnt. A double free is actually
very rare, but a data race can occur without a double free. Therefore,
we need to survive the allocation of the mbuf, that is,
rte_pktmbuf_init, which resets the buf_addr.

Further, other dereferences of buf_addr in rte_mbuf.h need to be
revised, but that is not a big deal (I hope).

The runtime overhead of this implementation is very light, and it can
probably be turned on even in a live network. And it is not intrusive at
all: no change is needed in struct rte_mbuf; we just need O(N) space
(where N is the number of cores) and O(1) runtime complexity.
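To make this concrete, here is a minimal sketch of one way the per-core
cache and the two macros could look. This is only an illustration of the
pseudo code above, not an existing API: struct rte_mbuf, rte_smp_mb()
and rte_pktmbuf_mtod() are real DPDK names, while the slot count, the
address hash and all helper names are arbitrary choices of mine.

==========================================
#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_atomic.h>

#define MBUF_LOCK_SLOTS 16	/* per-core, direct-mapped */

struct mbuf_lock_slot {
	struct rte_mbuf *m;	/* locked mbuf, NULL if the slot is free */
	void *buf_addr;		/* saved buffer pointer */
};

/* one small table per core; __thread stands in for a per-lcore variable */
static __thread struct mbuf_lock_slot lock_slots[MBUF_LOCK_SLOTS];

static inline struct mbuf_lock_slot *
slot_of(const struct rte_mbuf *m)
{
	/* a few address bits select the slot, as in a direct-mapped cache */
	return &lock_slots[((uintptr_t)m >> 7) & (MBUF_LOCK_SLOTS - 1)];
}

/* Returns the saved buf_addr if this core holds the lock, else 0, so
 * the revised mtod() above works unchanged for unlocked mbufs. */
static inline uintptr_t
load_per_core_cache(const struct rte_mbuf *m)
{
	struct mbuf_lock_slot *s = slot_of(m);

	return (s->m == m) ? (uintptr_t)s->buf_addr : 0;
}

static inline void
store_per_core_cache(struct rte_mbuf *m, void *buf_addr)
{
	struct mbuf_lock_slot *s = slot_of(m);

	if (s->m != NULL && s->m != m)	/* eviction: restore the old owner */
		s->m->buf_addr = s->buf_addr;
	s->m = m;
	s->buf_addr = buf_addr;
}

static inline void
purge_per_core_cache(struct rte_mbuf *m)
{
	struct mbuf_lock_slot *s = slot_of(m);

	if (s->m == m)
		s->m = NULL;
}

#define RTE_MBUF_LOCK(m) do {						\
	if (load_per_core_cache(m) == 0) {	/* recursion: no-op */	\
		store_per_core_cache((m), (m)->buf_addr);		\
		(m)->buf_addr = NULL;					\
		rte_smp_mb();						\
	}								\
} while (0)

#define RTE_MBUF_UNLOCK(m) do {						\
	uintptr_t ba = load_per_core_cache(m);				\
	if (ba != 0) {			/* only the holder restores */	\
		(m)->buf_addr = (void *)ba;				\
		purge_per_core_cache(m);				\
		rte_smp_mb();						\
	}								\
} while (0)
==========================================

With load_per_core_cache() defined this way, the revised
rte_pktmbuf_mtod_offset() resolves to the saved address on the owner
core (NULL plus the cached value), to the real buf_addr anywhere for
unlocked mbufs, and, for a core touching somebody else's locked mbuf, to
an address near NULL, hence the segfault.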
Alright, that is basically what is in my mind. Before any patch is
provided or any performance analysis is made, I would like to know what
the risks are from a design point of view. Please leave your feedback.

Thanks & Regards,
Qinglai
* Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
  2017-04-13 13:38 [dpdk-dev] Proposal of mbuf Race Condition Detection jigsaw
@ 2017-04-13 14:59 ` Adrien Mazarguil
  2017-04-13 20:58   ` jigsaw
  0 siblings, 1 reply; 6+ messages in thread
From: Adrien Mazarguil @ 2017-04-13 14:59 UTC (permalink / raw)
To: jigsaw; +Cc: dev

Hi Qinglai,

On Thu, Apr 13, 2017 at 04:38:19PM +0300, jigsaw wrote:
> Hi,
>
> I have a proposal for mbuf race condition detection and I would like
> to get your opinions before committing any patch.
> [...]

Your proposal makes sense, but I'm not sure where developers should call
RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK(). Is it targeted at user
applications only? Is it to be used internally by DPDK as well? Does it
include PMDs?

I think any overhead outside of debugging mode, as minimal as it is, is
too much of a "punishment" for well-behaving applications. The cost of a
memory barrier can be extremely high, and this is one of the reasons
DPDK processes mbufs in bulk every time it gets the chance.
RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK() should thus exist under some
compilation option disabled by default, and be thought of as an
additional debugging tool.

Also, the implementation would probably be more expensive than your
suggestion: in my opinion, the reference count must be taken into
account and/or RTE_MBUF_LOCK() must be recursive, because this is how
mbufs work. Freeing mbufs multiple times is perfectly valid in many
cases.

How can one lock several mbufs at once, by the way?

Since it affects performance, this can only make sense as a debugging
tool developers can use when they encounter some corruption issue they
cannot identify, somewhat like poisoning buffers before they are freed.
Not sure it's worth the trouble.

Regards,

-- 
Adrien Mazarguil
6WIND
* Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
  2017-04-13 14:59 ` Adrien Mazarguil
@ 2017-04-13 20:58   ` jigsaw
  2017-04-13 23:19     ` Ananyev, Konstantin
  0 siblings, 1 reply; 6+ messages in thread
From: jigsaw @ 2017-04-13 20:58 UTC (permalink / raw)
To: Adrien Mazarguil; +Cc: dev

Hi Adrien,

Thanks for your comment.

LOCK/UNLOCK may be called by the user application only. There are
several reasons.

1. If the lib called LOCK, the user application could be punished
unexpectedly. Consider what happens if the Rx burst function calls LOCK
on core #1, and the mbuf is then handed over from core #1 to core #2
through an enqueue/dequeue operation, as below:

    Rx(1) --> Enqueue(1) --> Dequeue(2)
    LOCKED                   Panic!

Core #2 will panic because the mbuf is owned by core #1 without having
been released.

2. Rx and Tx both usually work in burst mode, combined with prefetch
operations. LOCK and UNLOCK cannot work well with prefetch, because they
require data access to the mbuf header.

3. The critical section tends to be small. Here we (the user
application) need to find a balance: with a longer critical section we
get more secured code, but a longer duration leads to more complicated
business logic.

Hence, we urge the user application to use LOCK/UNLOCK with the common
sense of critical sections.

Take examples/l3fwd as an example. A proper LOCK/UNLOCK pairing is shown
below:

==========================================
diff --git a/examples/l3fwd/l3fwd_em_hlm_sse.h b/examples/l3fwd/l3fwd_em_hlm_sse.h
index 7714a20..9db0190 100644
--- a/examples/l3fwd/l3fwd_em_hlm_sse.h
+++ b/examples/l3fwd/l3fwd_em_hlm_sse.h
@@ -299,6 +299,15 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 
 	for (j = 0; j < n; j += 8) {
 
+		RTE_MBUF_LOCK(pkts_burst[j]);
+		RTE_MBUF_LOCK(pkts_burst[j+1]);
+		RTE_MBUF_LOCK(pkts_burst[j+2]);
+		RTE_MBUF_LOCK(pkts_burst[j+3]);
+		RTE_MBUF_LOCK(pkts_burst[j+4]);
+		RTE_MBUF_LOCK(pkts_burst[j+5]);
+		RTE_MBUF_LOCK(pkts_burst[j+6]);
+		RTE_MBUF_LOCK(pkts_burst[j+7]);
+
 		uint32_t pkt_type =
 			pkts_burst[j]->packet_type &
 			pkts_burst[j+1]->packet_type &
@@ -333,8 +342,10 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 		}
 	}
 
-	for (; j < nb_rx; j++)
+	for (; j < nb_rx; j++) {
+		RTE_MBUF_LOCK(pkts_burst[j]);
 		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
+	}
 
 	send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
 
diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
index 1afa1f0..2938558 100644
--- a/examples/l3fwd/l3fwd_sse.h
+++ b/examples/l3fwd/l3fwd_sse.h
@@ -322,6 +322,9 @@ send_packetsx4(struct lcore_conf *qconf, uint8_t port, struct rte_mbuf *m[],
 
 	len = qconf->tx_mbufs[port].len;
 
+	for (j = 0; j < num; ++j)
+		RTE_MBUF_UNLOCK(m[j]);
+
 	/*
 	 * If TX buffer for that queue is empty, and we have enough packets,
 	 * then send them straightway.
@@ -492,8 +495,10 @@ send_packets_multi(struct lcore_conf *qconf, struct rte_mbuf **pkts_burst,
 		if (likely(pn != BAD_PORT))
 			send_packetsx4(qconf, pn, pkts_burst + j, k);
 		else
-			for (m = j; m != j + k; m++)
+			for (m = j; m != j + k; m++) {
+				RTE_MBUF_UNLOCK(pkts_burst[m]);
 				rte_pktmbuf_free(pkts_burst[m]);
+			}
 
 	}
 }
==========================================

Note that a data race may or may not have a visible consequence. If two
cores unknowingly process the same mbuf at different times, they may
never notice it; but if two cores access the same mbuf at the same
physical time, the consequence is usually visible (a crash).
We are not looking for a solution that captures every potential data
race; instead, we want one that can capture races that happen
simultaneously on two or more cores. Therefore, we do not need to extend
the locking scope as wide as possible; we apply locking only while we
are focused on processing a single mbuf.

In a real-life application a core spends quite some time on each mbuf,
anywhere from a few hundred to a few thousand cycles, and usually no
more than a handful of mbufs are involved. This is an ideal use case for
locking mbufs.

I agree that the race detection should not be compiled in by default,
since mtod is called often and every mtod call implies a visit to the
local cache. Further, recursive LOCK/UNLOCK calls shall be supported as
well. But I don't think the refcnt logic should be taken into account;
the two are orthogonal designs, IMO. ***** Pls correct me if I am wrong
here. *****

Neither is LOCK/UNLOCK aware of mbuf alloc/free, for the same reason.
That is why I said LOCK/UNLOCK needs to survive mbuf allocation and
initialization.

Of course we need to support locking multiple mbufs at the same time.
For each core we reserve, say, 8 slots. The slots work exactly like a
direct-mapped cacheline: we use 4 bits of the mbuf address to locate its
slot. If the slot is already occupied, we do an eviction; that is, the
new mbuf takes the place of the old one, and the old one is,
unfortunately, silently UNLOCKed.

Honestly, I have not yet tried this approach in a real-life application.
But I have been thinking about the problem of data race detection for a
long time, and I found that the restrictions and requirements make this
solution the only viable one. Hundreds of papers have been published in
the field of data race detection, but even the lightest of the options
carries at least a 5x performance penalty, not to mention the space
complexity, making them inapplicable in practice.

Again, please, anyone who has had the same painful experience of data
race bugs, share your concerns with me. It would be nice to come up with
some practical device to address this problem. I believe 6WIND and other
IP stack vendors must share the same feeling and opinion.
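To make the "punishment" concrete, below is a sketch of the failure
mode. It assumes the RTE_MBUF_LOCK/RTE_MBUF_UNLOCK sketch from my first
mail; rte_eal_remote_launch() and rte_pktmbuf_mtod() are real DPDK
calls, everything else is hypothetical:

==========================================
#include <rte_launch.h>
#include <rte_mbuf.h>

static struct rte_mbuf *shared_m;	/* deliberately shared, no sync */

static int
offender(void *arg)
{
	(void)arg;
	/* buf_addr is NULL while another core holds the lock, so this
	 * dereference segfaults right at the first crime scene. */
	return *rte_pktmbuf_mtod(shared_m, volatile char *);
}

static void
owner(struct rte_mbuf *m, unsigned int other_lcore)
{
	shared_m = m;
	RTE_MBUF_LOCK(m);
	/* the offender crashes as soon as it touches the locked mbuf */
	rte_eal_remote_launch(offender, NULL, other_lcore);
	/* ... the owner keeps using m through the revised mtod ... */
	RTE_MBUF_UNLOCK(m);
}
==========================================

The coredump then points at the offending core, instead of at some
innocent user of the corrupted mbuf much later.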
thx & rgds,

Qinglai

On Thu, Apr 13, 2017 at 5:59 PM, Adrien Mazarguil
<adrien.mazarguil@6wind.com> wrote:
> Hi Qinglai,
> [...]
* Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
  2017-04-13 20:58 ` jigsaw
@ 2017-04-13 23:19   ` Ananyev, Konstantin
  2017-04-14  0:03     ` Stephen Hemminger
  0 siblings, 1 reply; 6+ messages in thread
From: Ananyev, Konstantin @ 2017-04-13 23:19 UTC (permalink / raw)
To: jigsaw, Adrien Mazarguil; +Cc: dev

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of jigsaw
> Sent: Thursday, April 13, 2017 9:59 PM
> To: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
>
> Hi Adrien,
> [...]
> > Since it affects performance, this can only make sense as a debugging
> > tool developers can use when they encounter some corruption issue
> > they cannot identify, somewhat like poisoning buffers before they are
> > freed. Not sure it's worth the trouble.

I agree with Adrien here - it doesn't look worth it:
On one side, it adds quite a lot of overhead and complexity to the mbuf
code.
On the other side, its usage seems pretty limited - there would be a lot
of cases where it wouldn't be able to detect the race condition anyway.
So my opinion is NACK.
BTW, if your app is intentionally trying to read/write the same mbufs
simultaneously from different threads without some explicit
synchronization, then most likely there is something wrong with it.
Konstantin
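For reference, the conventional ownership handoff alluded to above can
be sketched as below: the producer gives the mbuf up when it enqueues,
and the consumer owns it exclusively after the dequeue, so no per-mbuf
locking is needed. rte_ring_enqueue() and rte_ring_dequeue() are the
real API; ring creation and the processing code are omitted.

==========================================
#include <rte_ring.h>
#include <rte_mbuf.h>

/* core 1: producer; must not touch the mbuf after a successful enqueue */
static void
hand_over(struct rte_ring *r, struct rte_mbuf *m)
{
	if (rte_ring_enqueue(r, m) != 0)
		rte_pktmbuf_free(m);	/* ring full: the mbuf is still ours */
}

/* core 2: consumer; sole owner of the mbuf after the dequeue */
static void
take_over(struct rte_ring *r)
{
	struct rte_mbuf *m;

	if (rte_ring_dequeue(r, (void **)&m) == 0)
		rte_pktmbuf_free(m);	/* process it, then free or forward */
}
==========================================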
* Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
  2017-04-13 23:19 ` Ananyev, Konstantin
@ 2017-04-14  0:03   ` Stephen Hemminger
  2017-04-14  8:51     ` jigsaw
  0 siblings, 1 reply; 6+ messages in thread
From: Stephen Hemminger @ 2017-04-14 0:03 UTC (permalink / raw)
To: Ananyev, Konstantin; +Cc: jigsaw, Adrien Mazarguil, dev

On Thu, 13 Apr 2017 23:19:45 +0000
"Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:

> [...]
> So my opinion is NACK.
> BTW, if your app is intentionally trying to read/write the same mbufs
> simultaneously from different threads without some explicit
> synchronization, then most likely there is something wrong with it.
> Konstantin

This is a one-off debugging thing. If you have this kind of issue, you
really need to rethink your threading design. The examples have a clear
thread model. If you follow the documentation about the process model,
and don't try to reinvent your own, then this kind of hand-holding is
not necessary.

NAK as well.
* Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
  2017-04-14  0:03 ` Stephen Hemminger
@ 2017-04-14  8:51   ` jigsaw
  0 siblings, 0 replies; 6+ messages in thread
From: jigsaw @ 2017-04-14 8:51 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Ananyev, Konstantin, Adrien Mazarguil, dev

Hi all,

Thanks for the comments. It seems the proposal is not fit for wide use
in applications.

thx & rgds,
-qinglai

On Fri, Apr 14, 2017 at 3:03 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> [...]
end of thread, other threads: [~2017-04-14 8:51 UTC | newest]

Thread overview: 6+ messages
2017-04-13 13:38 [dpdk-dev] Proposal of mbuf Race Condition Detection jigsaw
2017-04-13 14:59 ` Adrien Mazarguil
2017-04-13 20:58   ` jigsaw
2017-04-13 23:19     ` Ananyev, Konstantin
2017-04-14  0:03       ` Stephen Hemminger
2017-04-14  8:51         ` jigsaw