Date: Thu, 1 Oct 2015 14:23:17 +0300
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Bruce Richardson
Cc: dev@dpdk.org, Avi Kivity
Subject: Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
Message-ID: <20151001141124-mutt-send-email-mst@redhat.com>
In-Reply-To: <20151001110806.GA16248@bricha3-MOBL3>
References: <560C0171.7080507@scylladb.com>
 <20150930204016.GA29975@redhat.com>
 <20151001113828-mutt-send-email-mst@redhat.com>
 <560CF44A.60102@scylladb.com>
 <20151001120027-mutt-send-email-mst@redhat.com>
 <560CFB66.5050904@scylladb.com>
 <20151001124211-mutt-send-email-mst@redhat.com>
 <560D0413.5080401@scylladb.com>
 <20151001131754-mutt-send-email-mst@redhat.com>
 <20151001110806.GA16248@bricha3-MOBL3>

On Thu, Oct 01, 2015 at 12:08:07PM +0100, Bruce Richardson wrote:
> On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> > > > > It's easy to claim that
> > > > > a solution is around the corner, only no one was looking for it, but the
> > > > > reality is that kernel bypass has been a solution for years for high
> > > > > performance users,
> > > > 
> > > > I never said that it's trivial.
> > > > 
> > > > It's probably a lot of work. It's definitely more work than just abusing
> > > > sysfs.
> > > > 
> > > > But it looks like a write system call into an eventfd is about 1.5
> > > > microseconds on my laptop. Even with a system call per packet, system
> > > > call overhead is not what makes DPDK drivers outperform Linux ones.
> > > 
> > > 1.5 us = 0.6 Mpps per core limit.
> > 
> > Oh, I calculated it incorrectly. It's 0.15 us, so 6 Mpps.
> > But for RX, you can batch a lot of packets.
> > 
> > You can see by now I'm not that good at benchmarking.
> > Here's what I wrote:
> > 
> > #include <sys/eventfd.h>
> > #include <stdint.h>
> > #include <stdio.h>
> > #include <unistd.h>
> > 
> > int main(int argc, char **argv)
> > {
> > 	int e = eventfd(0, 0);
> > 	uint64_t v = 1;
> > 
> > 	int i;
> > 
> > 	for (i = 0; i < 10000000; ++i) {
> > 		write(e, &v, sizeof v);
> > 	}
> > }
> > 
> > This takes 1.5 seconds to run on my laptop:
> > 
> > $ time ./a.out
> > 
> > real	0m1.507s
> > user	0m0.179s
> > sys	0m1.328s
> > 
> > > dpdk performance is in the tens of
> > > millions of packets per system.
> > 
> > I think that's with a bunch of batching though.
> > 
> > > It's not just the lack of system calls, of course; the architecture is
> > > completely different.
> > 
> > Absolutely - I'm not saying move all of DPDK into kernel.
> > We just need to protect the RX rings so hardware does
> > not corrupt kernel memory.
> > 
> > Thinking about it some more, many devices
> > have separate rings for DMA: TX (device reads memory)
> > and RX (device writes memory).
> > With such devices, a mode where userspace can write the TX ring
> > but not the RX ring might make sense.
> > 
> > This will mean userspace might read kernel memory
> > through the device, but cannot corrupt it.
> > 
> > That's already a big win!
> > 
> > And RX buffers do not have to be added one at a time.
> > If we assume 0.2 usec per system call, batching some 100 buffers per
> > system call gives you 2 nanoseconds of overhead per buffer. That seems
> > quite reasonable.
> 
> Hi,
> 
> just to jump in a bit on this.
> 
> Batching of 100 packets is a very large batch, and will add to latency.

This is not on the transmit or receive path! It is only for re-adding
buffers to the receive ring, so the batching should not add latency at all:

	process rx:
		get packet
		packets[n] = alloc packet
		if (++n > 100) {
			system call: add bufs(packets, n);
		}

(A more concrete sketch of this refill loop follows at the end of this mail.)

> The
> standard batch size in DPDK right now is 32, and even that may be too high for
> applications in certain domains.
> 
> However, even with that 2ns of overhead calculation, I'd make a few additional
> points.
> * For DPDK, we are reasonably close to being able to do 40Gb/s of IO - both RX
> and TX - on a single thread. 10Gb/s of IO doesn't really stress a core any more.
> For 40Gb/s of small-packet traffic, a packet arrives every 16.8ns, so even with
> a huge batch size of 100 packets, the system call overhead on RX is taking
> almost 12% of our processing time. For a batch size of 32 this overhead would
> rise to over 35% of our packet processing time.

As I said: yes, it's measurable, but it's not breaking the bank, and that's
at 40Gb/s, which is still not widespread. With 10Gb/s and batches of 100
buffers, the overhead is only 3%.

> For 100G line rate, a packet arrives
> every 6.7ns...

Hypervisors still have time to get their act together and support IOMMUs
by the time 100G systems become widespread.

> * As well as this overhead from the system call itself, you are also omitting
> the overhead of scanning the RX descriptors.

I omit it because scanning descriptors can still be done in userspace:
just write-protect the RX ring page.

> This in itself is going to use up
> a good proportion of the processing time, and on top of that we have to spend
> cycles copying the descriptors from one ring in memory to another. Given that
> right now, with the vector ixgbe driver, the cycle cost per packet of RX is
> just a few dozen cycles on modern cores, every additional cycle (a fraction of
> a nanosecond) has an impact.
> 
> Regards,
> /Bruce

See above: there is no need for that on the data path.
Only re-adding buffers requires a system call.

-- 
MST
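
To make the refill loop above a little more concrete, here is a minimal
sketch of how a userspace receive path could batch the buffer re-add. It is
illustrative only: add_rx_bufs() stands in for a hypothetical "hand buffers
back to the RX ring" kernel interface (no such call exists today), and
get_packet()/process_packet() stand in for the application's fast path.

/*
 * Sketch of the batched RX-buffer refill described above.
 * add_rx_bufs() is a placeholder for a hypothetical kernel interface;
 * get_packet() and process_packet() are placeholders for the
 * application's receive fast path.
 */
#include <stdlib.h>

#define BATCH 100			/* refill threshold from the discussion above */

struct pkt {
	char data[2048];
};

/* Hypothetical system call: give n buffers back to the kernel RX ring. */
static void add_rx_bufs(struct pkt **bufs, int n)
{
	/* one kernel crossing per BATCH buffers: ~0.2us / 100 = ~2ns per packet */
	(void)bufs;
	(void)n;
}

/* Placeholders for receiving and handling one packet (no system calls). */
static struct pkt *get_packet(void) { return malloc(sizeof(struct pkt)); }
static void process_packet(struct pkt *p) { (void)p; }

int main(void)
{
	struct pkt *bufs[BATCH];
	int n = 0;

	for (;;) {
		struct pkt *p = get_packet();

		process_packet(p);		/* packet handled immediately */

		bufs[n++] = p;			/* recycle the buffer later */
		if (n == BATCH) {
			add_rx_bufs(bufs, n);	/* the only system call */
			n = 0;
		}
	}
}

Packets are processed as soon as they arrive; only the recycling of their
buffers is deferred, which is why the batching adds no packet latency.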
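
For reference, the percentages traded back and forth above (almost 12% at
40Gb/s with a batch of 100, over 35% with a batch of 32, about 3% at 10Gb/s)
follow from simple arithmetic, reproduced by the small program below. The
inputs - minimum-size 64-byte frames plus 20 bytes of preamble and
inter-frame gap, and a 0.2us system call - are assumptions taken from the
discussion, not measurements.

/*
 * Per-packet system call overhead vs. packet arrival interval for
 * minimum-size Ethernet frames at various line rates and batch sizes.
 */
#include <stdio.h>

int main(void)
{
	const double wire_bits = (64 + 20) * 8;		/* bits per min-size frame on the wire */
	const double syscall_ns = 200.0;		/* assumed 0.2us per system call */
	const double rates[] = { 10e9, 40e9, 100e9 };	/* line rates, bits per second */
	const int batches[] = { 32, 100 };

	for (int r = 0; r < 3; r++) {
		double arrival_ns = wire_bits / rates[r] * 1e9;

		printf("%3.0fG: one packet every %.1f ns\n",
		       rates[r] / 1e9, arrival_ns);
		for (int b = 0; b < 2; b++) {
			double per_pkt_ns = syscall_ns / batches[b];

			printf("    batch %3d: %.2f ns/packet = %.0f%% of the arrival interval\n",
			       batches[b], per_pkt_ns,
			       100.0 * per_pkt_ns / arrival_ns);
		}
	}
	return 0;
}

Under the same assumptions, 100Gb/s leaves a 6.7ns arrival interval, so even
a batch of 100 costs roughly 30% of the per-packet budget, which is the point
being made about 100G line rate above.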