Date: Wed, 16 Dec 2015 23:45:46 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC358AF776@smartserver.smartshare.dk>
In-Reply-To: <20151216131249.GC10020@bricha3-MOBL3>
From: Morten Brørup
To: "Bruce Richardson"
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

Bruce,

Matthew presented a very important point a few hours ago: we don't need tcpdump support for debugging the application in a lab; we already have plenty of other tools for debugging what we are developing. We need tcpdump support for debugging network issues in a production network.

In my "hardened network appliance" world, a solution designed purely for legacy applications (tcpdump, Wireshark etc.)
is useless because the network technician doesn't have access to these applications on the appliance.

While a PC system running a DPDK based application might have plenty of spare lcores for filtering, the SmartShare appliances are already using all lcores for dedicated purposes, so the runtime filtering has to be done by the IO lcores (otherwise we would have to rehash everything and reallocate some lcores for mirroring, which I strongly oppose). Our non-DPDK firmware has also always been filtering directly in the fast path.

If the filter is so complex that it unexpectedly degrades the normal traffic forwarding performance, the mirror still reflects all the forwarded network traffic, not just some of it. In many real-life network debugging scenarios this is better than the alternative: keeping the traffic forwarding up at full performance and having a network technician trying to understand a mirror output where some of the relevant packets are unexpectedly missing.

Although it is generally considered bad design if a system's behavior (or performance) changes unexpectedly when debugging features are being used, experienced network technicians have already grown accustomed to the performance of most non-trivial network equipment depending on the number of features enabled and how it is configured, so reality might beat theory here. (Still, other companies might prefer to keep their fast path performance unaffected and dedicate/reallocate some lcores for filtering.)

I am probably repeating myself here, but I would prefer if the DPDK provided the packet capturing framework in the form of a set of efficient libraries for 1. BPF filtering (e.g. a simple BPF interpreter or a DPDK variant of bpfjit), 2. scalable packet queueing for the mirrored packets (probably multi-producer, single or multi consumer), as well as 3.
high-resolution time stamping (preferably easily convertible to the pcap file packet timestamp format). Then the DPDK application can take care of interfacing to the attached application and outputting the mirrored packets to the appropriate destination, e.g. a pcap file, a Wireshark extcap named pipe, a dedicated RSPAN VLAN, or an ERSPAN tunnel. And an example application should show how to bind all this together in a tcpdump-like scenario for debugging a production network.

A note about timestamps: In theory, the captured packets should be time stamped as early as possible. In practice, though, it is probably sufficiently accurate to time stamp the accepted packets after filtering, especially if they are filtered by an IO lcore. Alternatively, they can be time stamped when consumed from the mirror output queue.

A note about packet ordering: Mirrored packets belonging to different flows are probably out of order because of RSS, where multiple lcores contribute to the mirror output. This packet ordering inaccuracy could also serve as a reason for not being too strict about the accuracy of the timestamps on the mirrored packets.

Med venlig hilsen / kind regards
- Morten Brørup

-----Original Message-----
From: Bruce Richardson [mailto:bruce.richardson@intel.com]
Sent: 16. december 2015 14:13
To: Morten Brørup
Cc: Matthew Hall; Kyle Larose; dev@dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

On Wed, Dec 16, 2015 at 01:26:11PM +0100, Morten Brørup wrote:
> Bruce,
>
> Please note that tcpdump is a stupid name for a packet capture application that supports much more than just TCP.
>
> I had missed the point about ethdev supporting virtual interfaces, so thank you for pointing that out. That covers my concerns about capturing packets inside tunnels.
>
> I will gladly admit that you Intel guys are probably much more competent in the field of DPDK performance and scalability than I am.
> So Matthew and I have been asking you to kindly ensure that your solution scales well at very high packet rates too, and pointing out that filtering before copying is probably cheaper than copying before filtering. You mention that it leads to an important choice about which lcores get to do the work of filtering the packets, so that might be worth some discussion.
>
> :-)
>
> Med venlig hilsen / kind regards
> - Morten Brørup
>

Thanks for your support.

We may look at having a certain amount of flexibility in the configuration of the setup, so as to avoid limiting the use of the functionality.

For scalability at very high packet rates, it's something we'll need you guys to give us pointers on too - what's acceptable or not inside an app, and what level of scalability is needed. I'd admit that most of our initial thinking in this area was for debugging apps at less than line rate, i.e. for functional testing.

For full line rate introspection, we'll have to see when we get some working code.

/Bruce

>
> -----Original Message-----
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: 16. december 2015 12:56
> To: Morten Brørup
> Cc: Matthew Hall; Kyle Larose; dev@dpdk.org
> Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3
>
> On Wed, Dec 16, 2015 at 12:40:43PM +0100, Morten Brørup wrote:
> > Bruce,
> >
> > This doesn't really sound like tcpdump to me; it sounds like port mirroring.
>
> It's actually a bit of both, in my opinion: it's designed to allow basic mirroring of traffic on a port, so that this traffic can be sent to a tcpdump destination.
> By going with a more generic approach, we hope to enable more possible use cases than just focusing on TCP.
>
> >
> > Your suggestion is limited to physical ports only, and cannot be attached further inside the application, e.g. for mirroring packets related to a specific VLAN.
>
> Yes, the lack of attachment inside the app is a limitation. There are two types of scenarios that could be considered for packet capture:
> * ones where the application can be modified to do its own filtering and capturing.
> * ones where you want a generic capture mechanism which can be used on any application without modification.
> We have chosen to focus more on the second one, as that is where a generic solution for DPDK is likely to lie. For the first case, the application writer himself knows the type of traffic and how best to capture and filter it, so I don't think a generic one-size-fits-all solution is possible. [Though a couple of helper libraries may be of use.]
>
> As for physical ports, the scheme should work for any ethdev - why do you see it only being limited to physical ports? What would you want to see monitored that we are missing?
>
> >
> > Furthermore, it doesn't sound like the filtering part scales well. Consider a fully loaded 40 Gbit/s port. You would need to copy all packets into a single rte_ring to the attached filtering process, which would then require its own set of lcores to probably discard most of these packets when filtering. I agree with Matthew that the filtering needs to happen as close to the source as possible, and must be scalable to multiple lcores.
>
> Without modifying the application itself to do its own filtering, I suspect scalability is always going to be a problem. That being said, there is no particular reason why a single rte_ring needs to be used - we could allow one ring per NIC queue, for instance. The trouble with filtering at the source itself is that you put extra load on the IO cores. By using a ring, we put the filtering load on extra cores in a secondary process, which can be scaled by the user without touching the main app.
>
> >
> > On the positive side, your idea has the advantage that the filter can be any application, and is not limited to BPF. However, if the purpose is "tcpdump", we should probably consider BPF, which is the type of filtering offered by tcpdump.
>
> Having this work with any application is one of our primary targets here. The app author should not have to worry too much about getting basic debug support.
> Even if it doesn't work at 40G small packet rates, you can get a lot of benefit from a scheme that provides functional debugging for an app. Obviously, though, we aim to make this as scalable as possible, which is why we want to allow filtering in userspace before sending packets externally to DPDK.
>
> >
> > I would prefer having a BPF library available that the application can use at any point, either at the lowest level (when receiving/transmitting Ethernet packets) or at a higher level (e.g. when working with packets that go into or come out of a tunnel). The BPF library should implement packet length and relevant ancillary data, such as SKF_AD_VLAN_TAG etc., based on metadata in the mbuf.
> >
> > Transferring a BPF filter from an outside application could be done by using a simple text format, e.g. the output format of "tcpdump -ddd". This also opens an easy roadmap for Wireshark integration by simply extending extcap to include such a BPF filter format.
> >
> >
> > Lots of negativity above. I very much like the idea of attaching the secondary process and going through an rte_ring. This allows the secondary process to pass the filtered and captured packets on in any format it likes to any destination it likes.
>
> Good, so we're not completely off-base here. :-)
>
> /Bruce
>
> >
> >
> > Med venlig hilsen / kind regards
> > - Morten Brørup
> >
> > -----Original Message-----
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: 16. december 2015 11:45
> >
> > Hi,
> >
> > we are currently doing some investigation and prototyping for this feature.
> > Our current thinking is the following:
> > * to allow dynamic control of the filtering, we are thinking of making use of
> >   the multi-process infrastructure in DPDK. A secondary process can attach to a
> >   primary at runtime and provide the packet filtering and dumping capability.
> > * ideally we want to create a generic packet mirroring callback inside the EAL,
> >   that can be set up to mirror packets going through Rx/Tx on an ethdev.
> > * using this, packets being received on the port to be monitored are sent via
> >   an rte_ring (ring ethdev) to the secondary process, which takes those packets
> >   and does any filtering on them. [This would be where BPF could fit into
> >   things, but it's not something we have looked at yet.]
> > * initially we plan to have the secondary process then write packets to a pcap
> >   file using a pcap PMD, but down the road if we get other PMDs, like a KNI PMD
> >   or a TAP device PMD, those could be used as targets instead.
> >
> > This implementation we hope should provide enough hooks to enable the standard tools to be used for monitoring and capturing packets. We will send out draft implementation code for various parts of this as soon as we have it.
> >
> > Additional feedback welcome, as always. :-)
> >
> > Regards,
> > /Bruce