Date: Thu, 1 Oct 2015 14:23:17 +0300
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Bruce Richardson
Cc: dev@dpdk.org, Avi Kivity
Subject: Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
Message-ID: <20151001141124-mutt-send-email-mst@redhat.com>
In-Reply-To: <20151001110806.GA16248@bricha3-MOBL3>
References: <560C0171.7080507@scylladb.com>
 <20150930204016.GA29975@redhat.com>
 <20151001113828-mutt-send-email-mst@redhat.com>
 <560CF44A.60102@scylladb.com>
 <20151001120027-mutt-send-email-mst@redhat.com>
 <560CFB66.5050904@scylladb.com>
 <20151001124211-mutt-send-email-mst@redhat.com>
 <560D0413.5080401@scylladb.com>
 <20151001131754-mutt-send-email-mst@redhat.com>
 <20151001110806.GA16248@bricha3-MOBL3>

On Thu, Oct 01, 2015 at 12:08:07PM +0100, Bruce Richardson wrote:
> On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> > > > > It's easy to claim that
> > > > > a solution is around the corner, only no one was looking for it, but the
> > > > > reality is that kernel bypass has been a solution for years for high
> > > > > performance users,
> > > > 
> > > > I never said that it's trivial.
> > > > 
> > > > It's probably a lot of work. It's definitely more work than just abusing
> > > > sysfs.
> > > > 
> > > > But it looks like a write system call into an eventfd is about 1.5
> > > > microseconds on my laptop. Even with a system call per packet, system
> > > > call overhead is not what makes DPDK drivers outperform Linux ones.
> > > 
> > > 1.5 us = 0.6 Mpps per core limit.
> > 
> > Oh, I calculated it incorrectly. It's 0.15 us, so 6 Mpps.
> > But for RX, you can batch a lot of packets.
> > 
> > You can see by now I'm not that good at benchmarking.
> > Here's what I wrote:
> > 
> > #include <sys/eventfd.h>
> > #include <stdint.h>
> > #include <stdio.h>
> > #include <unistd.h>
> > 
> > int main(int argc, char **argv)
> > {
> > 	int e = eventfd(0, 0);
> > 	uint64_t v = 1;
> > 
> > 	int i;
> > 
> > 	for (i = 0; i < 10000000; ++i) {
> > 		write(e, &v, sizeof v);
> > 	}
> > }
> > 
> > This takes 1.5 seconds to run on my laptop:
> > 
> > $ time ./a.out
> > 
> > real	0m1.507s
> > user	0m0.179s
> > sys	0m1.328s
> > 
> > > dpdk performance is in the tens of
> > > millions of packets per system.
> > 
> > I think that's with a bunch of batching though.
> > 
> > > It's not just the lack of system calls, of course; the architecture is
> > > completely different.
> > 
> > Absolutely - I'm not saying move all of DPDK into kernel.
> > We just need to protect the RX rings so hardware does
> > not corrupt kernel memory.
> > 
> > Thinking about it some more, many devices
> > have separate rings for DMA: TX (device reads memory)
> > and RX (device writes memory).
> > With such devices, a mode where userspace can write the TX ring
> > but not the RX ring might make sense.
> > 
> > This will mean userspace might read kernel memory
> > through the device, but cannot corrupt it.
> > 
> > That's already a big win!
> > 
> > And RX buffers do not have to be added one at a time.
> > If we assume 0.2 usec per system call, batching some 100 buffers per
> > system call gives you 2 nanoseconds of overhead per buffer. That seems
> > quite reasonable.
> 
> Hi,
> 
> just to jump in a bit on this.
> 
> Batching of 100 packets is a very large batch, and will add to latency.

This is not on the transmit or receive path! It is only for re-adding
buffers to the receive ring, so the batching should not add latency at all:

	process rx:
		get packet
		packets[n] = alloc packet
		if (++n > 100) {
			system call: add bufs(packets, n);
		}

(A more concrete sketch of this refill loop follows at the end of this mail.)

> The
> standard batch size in DPDK right now is 32, and even that may be too high for
> applications in certain domains.
> 
> However, even with that 2ns of overhead calculation, I'd make a few additional
> points.
> * For DPDK, we are reasonably close to being able to do 40Gb/s of IO - both RX
> and TX - on a single thread. 10Gb/s of IO doesn't really stress a core any more.
> For 40Gb/s of small-packet traffic, a packet arrives every 16.8ns, so even with
> a huge batch size of 100 packets, the system call overhead on RX is taking
> almost 12% of our processing time. For a batch size of 32 this overhead would
> rise to over 35% of our packet processing time.

As I said: yes, it's measurable, but it's not breaking the bank, and that's
at 40Gb/s, which is still not widespread. With 10Gb/s and batches of 100
buffers, the overhead is only 3%.

> For 100G line rate, a packet arrives
> every 6.7ns...

Hypervisors still have time to get their act together and support IOMMUs
by the time 100G systems become widespread.

> * As well as this overhead from the system call itself, you are also omitting
> the overhead of scanning the RX descriptors.

I omit it because scanning descriptors can still be done in userspace:
just write-protect the RX ring page.

> This in itself is going to use up
> a good proportion of the processing time, and on top of that we have to spend
> cycles copying the descriptors from one ring in memory to another. Given that
> right now, with the vector ixgbe driver, the cycle cost per packet of RX is
> just a few dozen cycles on modern cores, every additional cycle (a fraction of
> a nanosecond) has an impact.
> 
> Regards,
> /Bruce

See above: there is no need for that on the data path.
Only re-adding buffers requires a system call.

-- 
MST
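
To make the refill loop above a little more concrete, here is a minimal
sketch of how a userspace receive path could batch the buffer re-add. It is
illustrative only: add_rx_bufs() stands in for a hypothetical "hand buffers
back to the RX ring" kernel interface (no such call exists today), and
get_packet()/process_packet() stand in for the application's fast path.

/*
 * Sketch of the batched RX-buffer refill described above.
 * add_rx_bufs() is a placeholder for a hypothetical kernel interface;
 * get_packet() and process_packet() are placeholders for the
 * application's receive fast path.
 */
#include <stdlib.h>

#define BATCH 100			/* refill threshold from the discussion above */

struct pkt {
	char data[2048];
};

/* Hypothetical system call: give n buffers back to the kernel RX ring. */
static void add_rx_bufs(struct pkt **bufs, int n)
{
	/* one kernel crossing per BATCH buffers: ~0.2us / 100 = ~2ns per packet */
	(void)bufs;
	(void)n;
}

/* Placeholders for receiving and handling one packet (no system calls). */
static struct pkt *get_packet(void) { return malloc(sizeof(struct pkt)); }
static void process_packet(struct pkt *p) { (void)p; }

int main(void)
{
	struct pkt *bufs[BATCH];
	int n = 0;

	for (;;) {
		struct pkt *p = get_packet();

		process_packet(p);		/* packet handled immediately */

		bufs[n++] = p;			/* recycle the buffer later */
		if (n == BATCH) {
			add_rx_bufs(bufs, n);	/* the only system call */
			n = 0;
		}
	}
}

Packets are processed as soon as they arrive; only the recycling of their
buffers is deferred, which is why the batching adds no packet latency.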
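
For reference, the percentages traded back and forth above (almost 12% at
40Gb/s with a batch of 100, over 35% with a batch of 32, about 3% at 10Gb/s)
follow from simple arithmetic, reproduced by the small program below. The
inputs - minimum-size 64-byte frames plus 20 bytes of preamble and
inter-frame gap, and a 0.2us system call - are assumptions taken from the
discussion, not measurements.

/*
 * Per-packet system call overhead vs. packet arrival interval for
 * minimum-size Ethernet frames at various line rates and batch sizes.
 */
#include <stdio.h>

int main(void)
{
	const double wire_bits = (64 + 20) * 8;		/* bits per min-size frame on the wire */
	const double syscall_ns = 200.0;		/* assumed 0.2us per system call */
	const double rates[] = { 10e9, 40e9, 100e9 };	/* line rates, bits per second */
	const int batches[] = { 32, 100 };

	for (int r = 0; r < 3; r++) {
		double arrival_ns = wire_bits / rates[r] * 1e9;

		printf("%3.0fG: one packet every %.1f ns\n",
		       rates[r] / 1e9, arrival_ns);
		for (int b = 0; b < 2; b++) {
			double per_pkt_ns = syscall_ns / batches[b];

			printf("    batch %3d: %.2f ns/packet = %.0f%% of the arrival interval\n",
			       batches[b], per_pkt_ns,
			       100.0 * per_pkt_ns / arrival_ns);
		}
	}
	return 0;
}

Under the same assumptions, 100Gb/s leaves a 6.7ns arrival interval, so even
a batch of 100 costs roughly 30% of the per-packet budget, which is the point
being made about 100G line rate above.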