Date: Wed, 16 Dec 2015 23:45:46 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC358AF776@smartserver.smartshare.dk>
In-Reply-To: <20151216131249.GC10020@bricha3-MOBL3>
From: Morten Brørup
To: "Bruce Richardson"
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

Bruce,

Matthew presented a very important point a few hours ago: we don't need tcpdump support for debugging the application in a lab; we already have plenty of other tools for debugging what we are developing. We need tcpdump support for debugging network issues in a production network.

In my "hardened network appliance" world, a solution designed purely for legacy applications (tcpdump, Wireshark etc.)
is useless because the network technician doesn't have access to these applications on the appliance.

While a PC system running a DPDK based application might have plenty of spare lcores for filtering, the SmartShare appliances are already using all lcores for dedicated purposes, so the runtime filtering has to be done by the IO lcores (otherwise we would have to rehash everything and reallocate some lcores for mirroring, which I strongly oppose). Our non-DPDK firmware has also always been filtering directly in the fast path.

If the filter is so complex that it unexpectedly degrades the normal traffic forwarding performance, the mirror still reflects all the forwarded network traffic, not just some of it. In many real-life network debugging scenarios this is better than the alternative: keeping the traffic forwarding up at full performance and having a network technician trying to understand a mirror output where some of the relevant packets are unexpectedly missing.

Although it is generally considered bad design if a system's behavior (or performance) changes unexpectedly when debugging features are being used, experienced network technicians have already grown accustomed to the performance of most non-trivial network equipment depending on the number of features enabled and how it is configured, so reality might beat theory here. (Still, other companies might prefer to keep their fast path performance unaffected and dedicate/reallocate some lcores for filtering.)

I am probably repeating myself here, but I would prefer if the DPDK provided the packet capturing framework in the form of a set of efficient libraries for 1. BPF filtering (e.g. a simple BPF interpreter or a DPDK variant of bpfjit), 2. scalable packet queueing for the mirrored packets (probably multi-producer, single or multi consumer), as well as 3.
high-resolution time stamping (preferably easily convertible to the pcap file packet timestamp format). Then the DPDK application can take care of interfacing to the attached application and outputting the mirrored packets to the appropriate destination, e.g. a pcap file, a Wireshark extcap named pipe, a dedicated RSPAN VLAN, or an ERSPAN tunnel. And an example application should show how to bind all this together in a tcpdump-like scenario for debugging a production network.

A note about timestamps: In theory, the captured packets should be time stamped as early as possible. In practice, though, it is probably sufficiently accurate to time stamp the accepted packets after filtering, especially if they are filtered by an IO lcore. Alternatively, they can be time stamped when consumed from the mirror output queue.

A note about packet ordering: Mirrored packets belonging to different flows are probably out of order because of RSS, where multiple lcores contribute to the mirror output. This packet ordering inaccuracy could also serve as a reason for not being too strict about the accuracy of the timestamps on the mirrored packets.

Med venlig hilsen / kind regards
- Morten Brørup

-----Original Message-----
From: Bruce Richardson [mailto:bruce.richardson@intel.com]
Sent: 16. december 2015 14:13
To: Morten Brørup
Cc: Matthew Hall; Kyle Larose; dev@dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

On Wed, Dec 16, 2015 at 01:26:11PM +0100, Morten Brørup wrote:
> Bruce,
>
> Please note that tcpdump is a stupid name for a packet capture application that supports much more than just TCP.
>
> I had missed the point about ethdev supporting virtual interfaces, so thank you for pointing that out. That covers my concerns about capturing packets inside tunnels.
>
> I will gladly admit that you Intel guys are probably much more competent in the field of DPDK performance and scalability than I am.
> So Matthew and I have been asking you to kindly ensure that your solution scales well at very high packet rates too, and pointing out that filtering before copying is probably cheaper than copying before filtering. You mention that it leads to an important choice about which lcores get to do the work of filtering the packets, so that might be worth some discussion.
>
> :-)
>
> Med venlig hilsen / kind regards
> - Morten Brørup
>

Thanks for your support.

We may look at having a certain amount of flexibility in the configuration of the setup, so as to avoid limiting the use of the functionality.

For scalability at very high packet rates, it's something we'll need you guys to give us pointers on too - what's acceptable or not inside an app, and what level of scalability is needed. I'd admit that most of our initial thinking in this area was for debugging apps at less than line rate, i.e. for functional testing.

For full line rate introspection, we'll have to see when we get some working code.

/Bruce

>
> -----Original Message-----
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: 16. december 2015 12:56
> To: Morten Brørup
> Cc: Matthew Hall; Kyle Larose; dev@dpdk.org
> Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3
>
> On Wed, Dec 16, 2015 at 12:40:43PM +0100, Morten Brørup wrote:
> > Bruce,
> >
> > This doesn't really sound like tcpdump to me; it sounds like port mirroring.
>
> It's actually a bit of both, in my opinion: it's designed to allow basic mirroring of traffic on a port, so that this traffic can be sent to a tcpdump destination.
> By going with a more generic approach, we hope to enable more possible use cases than just focusing on TCP.
>
> >
> > Your suggestion is limited to physical ports only, and cannot be attached further inside the application, e.g. for mirroring packets related to a specific VLAN.
>
> Yes, the lack of attachment inside the app is a limitation. There are two types of scenarios that could be considered for packet capture:
> * ones where the application can be modified to do its own filtering and capturing.
> * ones where you want a generic capture mechanism which can be used on any application without modification.
> We have chosen to focus more on the second one, as that is where a generic solution for DPDK is likely to lie. For the first case, the application writer himself knows the type of traffic and how best to capture and filter it, so I don't think a generic one-size-fits-all solution is possible. [Though a couple of helper libraries may be of use.]
>
> As for physical ports, the scheme should work for any ethdev - why do you see it only being limited to physical ports? What would you want to see monitored that we are missing?
>
> >
> > Furthermore, it doesn't sound like the filtering part scales well. Consider a fully loaded 40 Gbit/s port. You would need to copy all packets into a single rte_ring to the attached filtering process, which would then require its own set of lcores to probably discard most of these packets when filtering. I agree with Matthew that the filtering needs to happen as close to the source as possible, and must be scalable to multiple lcores.
>
> Without modifying the application itself to do its own filtering, I suspect scalability is always going to be a problem. That being said, there is no particular reason why a single rte_ring needs to be used - we could allow one ring per NIC queue, for instance. The trouble with filtering at the source itself is that you put extra load on the IO cores. By using a ring, we put the filtering load on extra cores in a secondary process, which can be scaled by the user without touching the main app.
>
> >
> > On the positive side, your idea has the advantage that the filter can be any application, and is not limited to BPF. However, if the purpose is "tcpdump", we should probably consider BPF, which is the type of filtering offered by tcpdump.
>
> Having this work with any application is one of our primary targets here. The app author should not have to worry too much about getting basic debug support.
> Even if it doesn't work at 40G small packet rates, you can get a lot of benefit from a scheme that provides functional debugging for an app. Obviously, though, we aim to make this as scalable as possible, which is why we want to allow filtering in userspace before sending packets externally to DPDK.
>
> >
> > I would prefer having a BPF library available that the application can use at any point, either at the lowest level (when receiving/transmitting Ethernet packets) or at a higher level (e.g. when working with packets that go into or come out of a tunnel). The BPF library should implement packet length and relevant ancillary data, such as SKF_AD_VLAN_TAG etc., based on metadata in the mbuf.
> >
> > Transferring a BPF filter from an outside application could be done by using a simple text format, e.g. the output format of "tcpdump -ddd". This also opens an easy roadmap for Wireshark integration by simply extending extcap to include such a BPF filter format.
> >
> >
> > Lots of negativity above. I very much like the idea of attaching the secondary process and going through an rte_ring. This allows the secondary process to pass the filtered and captured packets on in any format it likes to any destination it likes.
>
> Good, so we're not completely off-base here. :-)
>
> /Bruce
>
> >
> >
> > Med venlig hilsen / kind regards
> > - Morten Brørup
> >
> > -----Original Message-----
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: 16. december 2015 11:45
> >
> > Hi,
> >
> > we are currently doing some investigation and prototyping for this feature.
> > Our current thinking is the following:
> > * to allow dynamic control of the filtering, we are thinking of making use of
> >   the multi-process infrastructure in DPDK. A secondary process can attach to a
> >   primary at runtime and provide the packet filtering and dumping capability.
> > * ideally we want to create a generic packet mirroring callback inside the EAL,
> >   that can be set up to mirror packets going through Rx/Tx on an ethdev.
> > * using this, packets being received on the port to be monitored are sent via
> >   an rte_ring (ring ethdev) to the secondary process, which takes those packets
> >   and does any filtering on them. [This would be where BPF could fit into
> >   things, but it's not something we have looked at yet.]
> > * initially we plan to have the secondary process then write packets to a pcap
> >   file using a pcap PMD, but down the road if we get other PMDs, like a KNI PMD
> >   or a TAP device PMD, those could be used as targets instead.
> >
> > This implementation we hope should provide enough hooks to enable the standard tools to be used for monitoring and capturing packets. We will send out draft implementation code for various parts of this as soon as we have it.
> >
> > Additional feedback welcome, as always. :-)
> >
> > Regards,
> > /Bruce