From: "Michael S. Tsirkin" <mst@redhat.com>
To: Bruce Richardson <bruce.richardson@intel.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, Avi Kivity <avi@scylladb.com>
Subject: Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
Date: Thu, 1 Oct 2015 14:23:17 +0300
Message-ID: <20151001141124-mutt-send-email-mst@redhat.com>
In-Reply-To: <20151001110806.GA16248@bricha3-MOBL3>
On Thu, Oct 01, 2015 at 12:08:07PM +0100, Bruce Richardson wrote:
> On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> > >
> > >
> > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> > > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> > > >>It's easy to claim that
> > > >>a solution is around the corner, only no one was looking for it, but the
> > > >>reality is that kernel bypass has been a solution for years for high
> > > >>performance users,
> > > >I never said that it's trivial.
> > > >
> > > >It's probably a lot of work. It's definitely more work than just abusing
> > > >sysfs.
> > > >
> > > >But it looks like a write system call into an eventfd is about 1.5
> > > >microseconds on my laptop. Even with a system call per packet, system
> > > >call overhead is not what makes DPDK drivers outperform Linux ones.
> > > >
> > >
> > > 1.5 us = 0.6 Mpps per core limit.
> >
> > Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.
> > But for RX, you can batch a lot of packets.
> >
> > You can see by now I'm not that good at benchmarking.
> > Here's what I wrote:
> >
> >
> > #include <stdbool.h>
> > #include <sys/eventfd.h>
> > #include <inttypes.h>
> > #include <unistd.h>
> >
> > int main(int argc, char **argv)
> > {
> > 	int e = eventfd(0, 0);
> > 	uint64_t v = 1;
> > 	int i;
> >
> > 	for (i = 0; i < 10000000; ++i) {
> > 		write(e, &v, sizeof v);
> > 	}
> > }
> >
> >
> > This takes 1.5 seconds to run on my laptop:
> >
> > $ time ./a.out
> >
> > real 0m1.507s
> > user 0m0.179s
> > sys 0m1.328s
> >
> >
> > > dpdk performance is in the tens of
> > > millions of packets per system.
> >
> > I think that's with a bunch of batching though.
> >
> > > It's not just the lack of system calls, of course, the architecture is
> > > completely different.
> >
> > Absolutely - I'm not saying move all of DPDK into kernel.
> > We just need to protect the RX rings so hardware does
> > not corrupt kernel memory.
> >
> >
> > Thinking about it some more, many devices
> > have separate rings for DMA: TX (device reads memory)
> > and RX (device writes memory).
> > With such devices, a mode where userspace can write TX ring
> > but not RX ring might make sense.
> >
> > This will mean userspace might read kernel memory
> > through the device, but can not corrupt it.
> >
> > That's already a big win!
> >
> > And RX buffers do not have to be added one at a time.
> > If we assume 0.2usec per system call, batching some 100 buffers per
> > system call gives you 2 nano seconds overhead. That seems quite
> > reasonable.
> >
> Hi,
>
> just to jump in a bit on this.
>
> Batching of 100 packets is a very large batch, and will add to latency.
This is not on the transmit or receive path!
This is only for re-adding buffers to the receive ring.
This batching should not add latency at all:

	process rx:
		get packet
		packets[n] = alloc packet
		if (++n >= 100) {
			system call: add_bufs(packets, n)
			n = 0
		}
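
A minimal sketch of that loop in C, assuming a hypothetical add_rx_bufs()
system call (or ioctl) through which the kernel validates the buffers and
places them on the device RX ring:

#include <stddef.h>

#define RX_BATCH 100

/* Hypothetical kernel interface: checks that the buffers belong to the
 * process and writes their addresses into the RX descriptor ring. */
extern int add_rx_bufs(int fd, void **bufs, size_t n);

/* Called from the userspace RX loop for each received packet.  Replacement
 * buffers are accumulated and handed to the kernel in batches, so the
 * system call cost is amortized over RX_BATCH packets. */
static void refill_rx(int fd, void *newbuf)
{
	static void *pending[RX_BATCH];
	static size_t n;

	pending[n++] = newbuf;
	if (n == RX_BATCH) {
		add_rx_bufs(fd, pending, n);	/* one syscall per 100 buffers */
		n = 0;
	}
}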
> The
> standard batch size in DPDK right now is 32, and even that may be too high for
> applications in certain domains.
>
> However, even with that 2ns of overhead calculation, I'd make a few additional
> points.
> * For DPDK, we are reasonably close to being able to do 40Gb of IO - both RX
> and TX on a single thread. 10Gb of IO doesn't really stress a core any more. For
> 40Gb of small-packet traffic, the packet inter-arrival time is 16.8ns, so even
> with a huge batch size of 100 packets, your system call overhead on RX is taking
> almost 12% of the processing time. For a batch size of 32 this overhead would
> rise to over 35% of the packet processing time.
As I said: yes, measurable, but not breaking the bank, and that's at 40G,
which is still not widespread.
With 10G and batches of 100 packets, the overhead is only about 3%
(2 ns against a ~67 ns packet inter-arrival time).
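
For reference, the back-of-envelope arithmetic behind these percentages,
as a small standalone program (the 0.2 usec syscall cost, the batch of 100
and the minimum-size frames are assumptions, not measurements):

#include <stdio.h>

int main(void)
{
	double syscall_ns = 200.0;		/* assumed cost of one system call */
	double batch = 100.0;			/* buffers re-added per call */
	double per_pkt_ns = syscall_ns / batch;	/* 2 ns amortized per packet */

	/* Minimum-size Ethernet frame: 64 bytes + 20 bytes preamble/IFG
	 * = 84 bytes = 672 bits on the wire; bits / (Gbit/s) gives ns. */
	double arrival_10g = 84 * 8 / 10.0;	/* ~67.2 ns between packets */
	double arrival_40g = 84 * 8 / 40.0;	/* ~16.8 ns between packets */

	printf("10G: %.1f%% overhead\n", 100 * per_pkt_ns / arrival_10g);
	printf("40G: %.1f%% overhead\n", 100 * per_pkt_ns / arrival_40g);
	return 0;
}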
> For 100G line rate, the packet inter-arrival
> time is just 6.7ns...
Hypervisors still have time to get their act together and support IOMMUs
before 100G systems become widespread.
> * As well as this overhead from the system call itself, you are also omitting
> the overhead of scanning the RX descriptors.
I omit it because descriptor scanning can still be done in userspace;
just write-protect the RX descriptor ring pages.
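
As an illustration of what write-protecting the RX ring could look like from
userspace (a sketch only: the device node and mmap offsets below are made up,
no such driver interface exists today), the kernel driver would export the TX
descriptor ring read-write and the RX descriptor ring read-only:

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#define RING_BYTES	4096	/* one page per descriptor ring, for illustration */
#define TX_RING_OFFSET	0	/* hypothetical offsets exported by the driver */
#define RX_RING_OFFSET	4096

int main(void)
{
	int fd = open("/dev/hypothetical_nic0", O_RDWR);
	if (fd < 0)
		return 1;

	/* TX descriptors: userspace fills them; the device only reads memory,
	 * so a bad descriptor can at most leak data, not corrupt it. */
	void *tx_ring = mmap(NULL, RING_BYTES, PROT_READ | PROT_WRITE,
			     MAP_SHARED, fd, TX_RING_OFFSET);

	/* RX descriptors: mapped read-only, so userspace can scan completions
	 * without a system call, but cannot point the device at arbitrary
	 * memory; only the kernel writes new buffer addresses here. */
	void *rx_ring = mmap(NULL, RING_BYTES, PROT_READ,
			     MAP_SHARED, fd, RX_RING_OFFSET);

	(void)tx_ring;
	(void)rx_ring;
	close(fd);
	return 0;
}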
> This in itself is going to use up
> a good proportion of the processing time, as well as that we have to spend cycles
> copying the descriptors from one ring in memory to another. Given that right now
> with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen
> cycles on modern cores, every additional cycle (fraction of a nanosecond) has
> an impact.
>
> Regards,
> /Bruce
See above. There is no need for that on the data path. Only re-adding
buffers requires a system call.
--
MST