From: jigsaw
Date: Thu, 13 Apr 2017 16:38:19 +0300
To: "dev@dpdk.org"
Subject: [dpdk-dev] Proposal of mbuf Race Condition Detection
Hi,

I have a proposal for mbuf race condition detection and I would like to get your opinions before committing any patch.

A race condition is the worst kind of bug I can think of: it causes a crash long after the first crime scene, and the consequences are unpredictable and difficult to understand. Besides, race conditions are very difficult to reproduce. Usually we get a coredump from the live network, but the coredump itself delivers nearly zero information. We just know that the mbuf is somehow broken, and that it was perhaps overwritten by another core at some moment.

There are tools such as Valgrind and ThreadSanitizer that can capture this kind of fault, but the overhead they bring is too high to be deployed in performance testing, not to mention in the live network; and race conditions rarely happen under low pressure. Even worse, even in theory these tools are not capable of capturing this scenario, because they require explicit calls to locking primitives such as pthread mutexes; without explicit lock/unlock, the semantics are not understood by the tools.

With these known restrictions, I have a proposal, roughly as below. The idea is to ask the coder to do explicit lock/unlock on each mbuf that matters. The pseudo code is:

    RTE_MBUF_LOCK(m);
    /* use mbuf as usual */
    ...
    RTE_MBUF_UNLOCK(m);

During the protected critical section, only the core which holds the lock can access the mbuf; other cores, if they dare to use the mbuf, will be punished with a segfault. Since the proposal must be feasible at least in performance testing, the overhead of locking and punishment must be small, and access to the mbuf must remain as usual.
Hence, RTE_MBUF_LOCK actually does the following (pseudo code):

    RTE_MBUF_LOCK(m)
    {
        store_per_core_cache(m, m->buf_addr);
        m->buf_addr = NULL;
        mb(); /* memory barrier */
    }

And RTE_MBUF_UNLOCK is simply the reverse:

    RTE_MBUF_UNLOCK(m)
    {
        purge_per_core_cache(m);
        m->buf_addr = load_per_core_cache(m);
        mb();
    }

As we can see, the idea is to overwrite m->buf_addr with NULL and store the real value in a per-core cache. Other cores, trying to access the mbuf during the critical section, will certainly be punished.

Then of course we need to keep the owner core working. This is done by making a change to mtod, as below:

    #define rte_pktmbuf_mtod_offset(m, t, o) \
        ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + \
             (m)->data_off + (o)))

The per-core cache of mbufs works like a key-value database, implemented as a direct-mapped cache. If an eviction happens (a newly arriving mbuf must kick out an old one), we restore the buf_addr of the evicted mbuf; RTE_MBUF_UNLOCK then takes care of that situation.

Note that we expect LOCK/UNLOCK to keep working even when the mbuf is freed and re-allocated, since we don't trust the refcnt. A double free is actually very rare, but a data race can occur without one. Therefore we need to survive the allocation of the mbuf, that is, rte_pktmbuf_init, which resets buf_addr. Further, the other dereferences of buf_addr in rte_mbuf.h need to be revised, but that is not a big deal (I hope).

The runtime overhead of this implementation is very light, and it could probably be turned on even in the live network. And it is not intrusive at all: no change is needed in struct rte_mbuf; we just need O(N) space (where N is the number of cores) and O(1) runtime per operation.

Alright, that is basically what is in my mind. Before any patch is provided, or any perf analysis is made, I would like to know what the risks are from a design point of view.

Please leave your feedback.

Thanks & Regards,
Qinglai