Date: Thu, 21 Dec 2017 07:41:14 -0500 (EST)
From: Victor Kaplansky
To: Stephen Hemminger
Cc: dev@dpdk.org, stable@dpdk.org, Jens Freimann, Maxime Coquelin,
    Yuanhan Liu, Tiwei Bie, Jianfeng Tan, Chao Zhu, Roman Dementiev
Subject: Re: [dpdk-dev] [PATCH v4] vhost_user: protect active rings from async ring changes

----- Original Message -----
> From: "Stephen Hemminger"
> To: "Victor Kaplansky"
> Cc: dev@dpdk.org, stable@dpdk.org, "Jens Freimann", "Maxime Coquelin",
>     "Yuanhan Liu", "Tiwei Bie", "Jianfeng Tan"
> Sent: Wednesday, December 20, 2017 10:19:45 PM
> Subject: Re: [dpdk-dev] [PATCH v4] vhost_user: protect active rings from async ring changes
>
> On Wed, 20 Dec 2017 15:06:30 -0500 (EST)
> Victor Kaplansky wrote:
>
> > > Wrapping locking in inlines adds nothing and makes life harder
> > > for static analysis tools.
> >
> > Yep. In this case it hides the details of how the locking is
> > implemented (e.g. the name of the lock). It also facilitates
> > replacing the locking mechanism with another implementation.
> > See below.
>
> YAGNI: You aren't gonna need it.
>
> Don't build infrastructure for things that you foresee.

Good point, thanks. I'll simplify this.

> > > The bigger problem is that doing locking on all enqueue/dequeue
> > > can have a visible performance impact. Did you measure that?
> > >
> > > Could you invent an RCUish mechanism using compiler barriers?
> >
> > I've played a bit with measuring the performance impact. A successful
> > lock acquisition adds on average about 30 cycles on my Haswell CPU
> > (and it succeeds 99.999...% of the time).
> >
> > I can investigate it more, but my initial feeling is that adding a
> > memory barrier (a real one, not just a compiler barrier) would add
> > about the same overhead.
> >
> > By the way, the way update_queuing_status() in
> > drivers/net/vhost/rte_eth_vhost.c tries to avoid contention with
> > the active queue by playing with "allow_queuing" and "while_queuing"
> > seems to be broken, since memory barriers are missing.
>
> CPU cycles alone don't matter on modern x86.
> What matters is cache and instructions per cycle.
> In this case locking requires a locked instruction, which causes the CPU
> prefetching and instruction pipeline to stall.
>

I agree. I've measured the total overhead of the added lock/unlock pair,
and it comes out to around 28 cycles per pair on my 3.5 GHz Haswell
(a rough sketch of how such a measurement can be done is attached at the
end of this mail).

From "Intel® 64 and IA-32 Architectures Software Developer's Manual,
Volume 3A: System Programming Guide, Part 1":

    In the Pentium 4, Intel Xeon, and P6 family processors, the locking
    operation is handled with either a cache lock or bus lock. If a
    memory access is cacheable and affects only a single cache line, a
    cache lock is invoked and the system bus and the actual memory
    location in system memory are not locked during the operation.
    Here, other Pentium 4, Intel Xeon, or P6 family processors on the
    bus write-back any modified data and invalidate their caches as
    necessary to maintain system memory coherency. If the memory access
    is not cacheable and/or it crosses a cache line boundary, the
    processor's LOCK# signal is asserted and the processor does not
    respond to requests for bus control during the locked operation.

So the whole memory bus is locked only if the memory access crosses a
cache line boundary.

Anyway, I'm open to ways to reduce this overhead. This patch fixes a host
of critical bugs reported in Bugzilla, so if we can pull this fix in and
optimize it later with a subsequent patch, it would be great.

See below a quick test program I've used to test and measure the overhead.
It also demonstrates the problem I'm trying to fix.

Do you have any ideas on how to reduce the overhead, for example by using
RCU?

BTW, our implementation of rte_spinlock_unlock() could be slightly faster
if we used a regular move instead of an xchg instruction. Also, our
implementation of rte_spinlock_lock() could be faster if we optimized it
for the success path by making the conditional branch fall through, or,
even better, if we reimplemented the spinlock using gcc builtins.
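
Just to make that last point concrete, here is a minimal sketch of what
the same lock could look like on top of gcc's __atomic builtins. This is
only an illustration of the idea, not a proposed patch, and I haven't
benchmarked this variant; the names simply mirror the ones in my test
program below, not the actual DPDK implementation:

/* Sketch only: the spinlock from the test program below, rewritten
 * with gcc __atomic builtins instead of inline asm. */
typedef struct {
        volatile int locked; /* 0 = unlocked, 1 = locked */
} rte_spinlock_t;

static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
        /* Fast path: one atomic exchange; on x86 this still compiles
         * to an implicitly locked xchg. */
        while (__atomic_exchange_n(&sl->locked, 1, __ATOMIC_ACQUIRE)) {
                /* Contended path: spin with plain loads (no locked
                 * instructions) until the lock looks free, then retry. */
                while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
                        __builtin_ia32_pause(); /* x86 "pause" hint */
        }
}

static inline void
rte_spinlock_unlock(rte_spinlock_t *sl)
{
        /* Release store: on x86 this compiles to a regular mov, i.e.
         * exactly the cheaper unlock path mentioned above (no xchg). */
        __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}

With this shape the compiler also tends to lay out the acquire loop so
that the uncontended case falls straight through, which is the other
micro-optimization mentioned above.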
--
Victor

#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

/* build me with:
   gcc -march=native -std=gnu11 -O3 -lpthread tlock.c -o tlock
*/

typedef struct {
        volatile int locked; /**< lock status 0 = unlocked, 1 = locked */
} rte_spinlock_t;

static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
        int lock_val = 1;
        asm volatile (
                "1:\n"
                "xchg %[locked], %[lv]\n"
                "test %[lv], %[lv]\n"
                "jz 3f\n"
                "2:\n"
                "pause\n"
                "cmpl $0, %[locked]\n"
                "jnz 2b\n"
                "jmp 1b\n"
                "3:\n"
                : [locked] "=m" (sl->locked), [lv] "=q" (lock_val)
                : "[lv]" (lock_val)
                : "memory");
}

static inline void
rte_spinlock_unlock(rte_spinlock_t *sl)
{
        int unlock_val = 0;
        asm volatile (
                "xchg %[locked], %[ulv]\n"
                : [locked] "=m" (sl->locked), [ulv] "=q" (unlock_val)
                : "[ulv]" (unlock_val)
                : "memory");
}

/* Shared page written by main(); the worker thread periodically replaces it. */
static unsigned * volatile pointer;
static rte_spinlock_t reader_access;

void *
worker(void *unused)
{
        (void)unused;
        while (1) {
                /* Map a fresh page, publish it under the lock, then unmap
                 * the old one. Without the lock in main(), the reader could
                 * still be dereferencing the old (now unmapped) page. */
                unsigned *new_pointer = (unsigned *) mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                unsigned *old_pointer = pointer;

                rte_spinlock_lock(&reader_access);
                pointer = new_pointer;
                rte_spinlock_unlock(&reader_access);

                munmap(old_pointer, 4096);

                usleep(10000);
        }
        return 0;
}

int main()
{
        pthread_t t;
        pointer = (unsigned *) mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        pthread_create(&t, 0, worker, NULL);

        unsigned n = 400000000;

        /* Hot loop: every access to the shared page is protected by the
         * spinlock, so the worker can never unmap it from under us. */
        while (n--) {
                rte_spinlock_lock(&reader_access);
                *pointer = 1;
                rte_spinlock_unlock(&reader_access);
        }

        pthread_cancel(t);
        return 0;
}
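
And, as promised above, a rough sketch of the kind of loop that can be
used to estimate the per-pair overhead (illustrative only, not necessarily
the exact harness I used). It assumes it is pasted into tlock.c, so it
reuses rte_spinlock_lock()/rte_spinlock_unlock(), reader_access and
pointer from there, and it glosses over rdtsc serialization issues:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc() */

/* Time n iterations of the protected store vs. the bare store and
 * report the difference per iteration as the lock/unlock cost.
 * Run it before starting the worker thread for a quiet, uncontended
 * estimate. */
static double
measure_lock_overhead(unsigned n)
{
        uint64_t t0, locked_cycles, plain_cycles;
        unsigned i;

        t0 = __rdtsc();
        for (i = 0; i < n; i++) {
                rte_spinlock_lock(&reader_access);
                *pointer = 1;
                rte_spinlock_unlock(&reader_access);
        }
        locked_cycles = __rdtsc() - t0;

        t0 = __rdtsc();
        for (i = 0; i < n; i++)
                *pointer = 1;
        plain_cycles = __rdtsc() - t0;

        return (double)(locked_cycles - plain_cycles) / n;
}

Calling it with a large n (say, 100000000 iterations) and printing the
result gives an approximate cycles-per-pair figure for the uncontended
case.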