[dpdk-dev] tcpdump support in DPDK 2.3

DPDK patches and discussions

* [dpdk-dev] tcpdump support in DPDK 2.3
@ 2015-12-14  9:57 Morten Brørup
  2015-12-14 15:45 ` Aaron Conole
  2015-12-14 18:29 ` Matthew Hall
  0 siblings, 2 replies; 26+ messages in thread
From: Morten Brørup @ 2015-12-14  9:57 UTC (permalink / raw)
  To: dev

I noticed a discussion about support for tcpdump in DPDK 2.3.

Please consider which scenarios you want to support:

1. Compatibility with legacy non-DPDK applications (e.g. a DHCP server application) that capture specific packets by opening RAW sockets and attaching BPF filters to these sockets?

I agree that KNI (or TUN/TAP for the non-KNI kernels) is a realistic and simple way to interact with the kernel regarding raw packet capture, which might be filtered by the kernel. In this case, all packets will be passed on from DPDK to the kernel, which will handle the BPF filtering, and then pass up the packets to the application.

2. Compatibility with Wireshark?

Check out the new "extcap" feature of Wireshark. It uses named pipes for the packets, already mentioned by Stephen Hemminger.

3. tcpdump/libpcap support?

Tcpdump is an open source application, so it should be possible to define an efficient interface between DPDK and tcpdump, and implement it in both DPDK and tcpdump. The same goes for libpcap. An efficient interface has a primary feature: passing packets from DPDK to tcpdump/libpcap without too much overhead. It possibly also has a secondary feature: passing a BPF program from tcpdump/libpcap to DPDK, so packets can be filtered in DPDK and don't need to be passed on to tcpdump/libpcap.

4. Efficient fast path packet filtering using BPF?

Technically, this has nothing to do with tcpdump. Just add a BPF library (librte_bpf) to DPDK, preferably with a compiler. The application initially calls the library's BPF compiler function once with the BPF program to compile it, and in the fast path the application calls a library function that takes an mbuf and the compiled BPF program and returns an integer value indicating how many bytes of the packet should be mirrored by the capturing application. +1 to Matthew Hall for taking this direction!
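
The fast-path call described above could look roughly like the following sketch. It is not an existing DPDK API: the instruction set is a hypothetical, reduced subset modelled on classic BPF, the names (struct insn, filter_run, udp_filter) are invented for illustration, and the compile step is skipped by writing the program by hand. The return value is the number of bytes to mirror, clamped to the packet length, with 0 meaning "do not capture".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced instruction set modelled on classic BPF. */
enum { OP_LDH_ABS, OP_LDB_ABS, OP_JEQ, OP_RET };

struct insn {
    uint8_t  op;
    uint8_t  jt;  /* instructions to skip when the comparison is true */
    uint8_t  jf;  /* instructions to skip when the comparison is false */
    uint32_t k;   /* packet offset, constant, or return value */
};

/* Hand-written equivalent of "ip and udp", mirroring at most 96 bytes. */
static const struct insn udp_filter[] = {
    { OP_LDH_ABS, 0, 0, 12 },     /* load EtherType */
    { OP_JEQ,     0, 3, 0x0800 }, /* IPv4? otherwise jump to the drop */
    { OP_LDB_ABS, 0, 0, 23 },     /* load IPv4 protocol field */
    { OP_JEQ,     0, 1, 17 },     /* UDP? otherwise jump to the drop */
    { OP_RET,     0, 0, 96 },     /* match: mirror up to 96 bytes */
    { OP_RET,     0, 0, 0 },      /* no match: mirror nothing */
};
#define UDP_FILTER_LEN (sizeof(udp_filter) / sizeof(udp_filter[0]))

/* Runs prog over the packet; returns how many bytes to mirror (0 = none). */
static uint32_t filter_run(const struct insn *prog, size_t n,
                           const uint8_t *pkt, uint32_t len)
{
    uint32_t acc = 0;

    assert(prog != NULL && pkt != NULL);
    for (size_t pc = 0; pc < n; pc++) {
        const struct insn *i = &prog[pc];

        switch (i->op) {
        case OP_LDH_ABS: /* 16-bit big-endian load, bounds checked */
            if (len < 2 || i->k > len - 2)
                return 0;
            acc = ((uint32_t)pkt[i->k] << 8) | pkt[i->k + 1];
            break;
        case OP_LDB_ABS: /* 8-bit load, bounds checked */
            if (i->k >= len)
                return 0;
            acc = pkt[i->k];
            break;
        case OP_JEQ:     /* forward-only relative jump */
            pc += (acc == i->k) ? i->jt : i->jf;
            break;
        case OP_RET:
            return i->k < len ? i->k : len;
        default:
            return 0;
        }
    }
    return 0; /* ran off the end of the program: treat as no match */
}
```

An application would call filter_run(udp_filter, UDP_FILTER_LEN, data, data_len) once per packet and copy only the returned number of bytes. A JIT such as the bpfjit mentioned in this thread would replace the interpreter loop, not the calling convention.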

5. Pcap formatted output?

The pcap file format contains a header in front of each packet, which is extremely simple. But it has a timestamp (which uses 32 bit for tv_sec and tv_usec in files), so it needs to be considered how to handle this efficiently.
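
For reference, here is a minimal sketch of the classic pcap layout: a 24-byte file header followed by a 16-byte header in front of each packet. The struct and function names are made up for this example; the field layout follows the classic pcap file format. Fields are written in host byte order, which readers detect from the magic number, and the caller is assumed to provide a large enough buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct pcap_file_hdr {
    uint32_t magic;         /* 0xa1b2c3d4: microsecond timestamps */
    uint16_t version_major; /* 2 */
    uint16_t version_minor; /* 4 */
    int32_t  thiszone;      /* GMT offset, normally 0 */
    uint32_t sigfigs;       /* timestamp accuracy, normally 0 */
    uint32_t snaplen;       /* maximum bytes stored per packet */
    uint32_t linktype;      /* 1 = Ethernet */
};

struct pcap_rec_hdr {
    uint32_t ts_sec;        /* 32-bit seconds since the epoch */
    uint32_t ts_usec;       /* 32-bit microseconds */
    uint32_t incl_len;      /* bytes stored in the file */
    uint32_t orig_len;      /* bytes on the wire */
};

_Static_assert(sizeof(struct pcap_file_hdr) == 24, "unexpected padding");
_Static_assert(sizeof(struct pcap_rec_hdr) == 16, "unexpected padding");

/* Writes the file header into buf; returns the number of bytes written. */
static size_t pcap_put_file_hdr(uint8_t *buf, uint32_t snaplen)
{
    const struct pcap_file_hdr h = {
        .magic = 0xa1b2c3d4u, .version_major = 2, .version_minor = 4,
        .thiszone = 0, .sigfigs = 0, .snaplen = snaplen, .linktype = 1,
    };

    assert(buf != NULL);
    memcpy(buf, &h, sizeof(h));
    return sizeof(h);
}

/* Writes one record (header plus truncated packet); returns bytes written.
 * The caller must make sure buf has room for 16 + min(len, snaplen) bytes. */
static size_t pcap_put_record(uint8_t *buf, uint32_t sec, uint32_t usec,
                              const uint8_t *pkt, uint32_t len,
                              uint32_t snaplen)
{
    const uint32_t incl = len < snaplen ? len : snaplen;
    const struct pcap_rec_hdr h = { sec, usec, incl, len };

    assert(buf != NULL && pkt != NULL);
    memcpy(buf, &h, sizeof(h));
    memcpy(buf + sizeof(h), pkt, incl);
    return sizeof(h) + incl;
}
```

The only per-packet cost beyond the copy is producing ts_sec and ts_usec, which is the efficiency question raised above.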

PS: Remember that the packets received on the port might be distributed to multiple lcores by RSS, and all these lcores need to write to a single queue (named pipe, TUN/TAP port, pcap file, or whatever).
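
To illustrate the many-writers, one-reader problem, below is a standalone C11 sketch of a bounded multi-producer, single-consumer ring using per-slot sequence numbers (the well-known bounded-queue scheme by Dmitry Vyukov). In a DPDK application rte_ring would normally fill this role; the names mpsc_ring, ring_enqueue and ring_dequeue are invented for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u /* must be a power of two */

struct slot {
    atomic_size_t seq; /* tells producers and the consumer whose turn it is */
    void *pkt;
};

struct mpsc_ring {
    struct slot slots[RING_SIZE];
    atomic_size_t head; /* next position claimed by producers (any lcore) */
    size_t tail;        /* next position read, touched by the consumer only */
};

static void ring_init(struct mpsc_ring *r)
{
    assert(r != NULL);
    for (size_t i = 0; i < RING_SIZE; i++)
        atomic_init(&r->slots[i].seq, i);
    atomic_init(&r->head, 0);
    r->tail = 0;
}

/* Callable from any lcore. Returns 0 on success, -1 if the ring is full,
 * in which case the caller simply skips capturing this packet. */
static int ring_enqueue(struct mpsc_ring *r, void *pkt)
{
    size_t pos = atomic_load_explicit(&r->head, memory_order_relaxed);

    for (;;) {
        struct slot *s = &r->slots[pos & (RING_SIZE - 1)];
        size_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)pos;

        if (diff == 0) {
            /* Slot is free for this lap: try to claim position pos. */
            if (atomic_compare_exchange_weak_explicit(&r->head, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                s->pkt = pkt;
                atomic_store_explicit(&s->seq, pos + 1, memory_order_release);
                return 0;
            }
            /* CAS failed: pos now holds the current head, retry. */
        } else if (diff < 0) {
            return -1; /* consumer has not freed this slot yet: full */
        } else {
            pos = atomic_load_explicit(&r->head, memory_order_relaxed);
        }
    }
}

/* Single consumer only. Returns NULL when nothing is ready. */
static void *ring_dequeue(struct mpsc_ring *r)
{
    struct slot *s = &r->slots[r->tail & (RING_SIZE - 1)];
    size_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
    void *pkt;

    if (seq != r->tail + 1)
        return NULL; /* empty, or a producer is still writing the slot */
    pkt = s->pkt;
    /* Hand the slot back to producers for the next lap. */
    atomic_store_explicit(&s->seq, r->tail + RING_SIZE, memory_order_release);
    r->tail++;
    return pkt;
}
```

The consumer (the thread writing to the named pipe, TAP port or pcap file) polls ring_dequeue, while each lcore calls ring_enqueue and drops the capture copy, not the packet itself, when the ring is full.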

PPS: Bruce Richardson suggested adding a port mirroring callback. If you want port mirroring or tcpdump support in your application, it belongs in your application. Callbacks come at a cost (especially if not used), so don't start adding callbacks and hooks for new features if not strictly required. You might also want port mirroring or tcpdump support for something further down the application's fast path, e.g. mirroring PPPoE tunneled packets after they come out of the PPPoE tunnel. In this case, you need to add it to your application anyway.

Kind regards

Morten Brørup
CTO
SmartShare Systems A/S
Tonsbakken 16-18
DK-2740 Skovlunde
Denmark

Office      +45 70 20 00 93
Direct      +45 89 93 50 22
Mobile      +45 25 40 82 12

mb@smartsharesystems.com
www.smartsharesystems.com


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-14  9:57 [dpdk-dev] tcpdump support in DPDK 2.3 Morten Brørup
@ 2015-12-14 15:45 ` Aaron Conole
  2015-12-14 15:48   ` Thomas Monjalon
  2015-12-14 18:29 ` Matthew Hall
  1 sibling, 1 reply; 26+ messages in thread
From: Aaron Conole @ 2015-12-14 15:45 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev

Morten Brørup <mb@smartsharesystems.com> writes:
> I noticed a discussion about support for tcpdump in DPDK 2.3.
>
> Please consider which scenarios you want to support:

Morten,

Thanks for your input here. I think there's a different way of
approaching this: "debuggability" (sorry, it's not grammatical).

The end goal of having tcpdump is not just for another feature checklist
that folks can just say "okay, welp we got that too!" When something is
going wrong with communications, being able to fire up tcpdump without
disturbing anything else is hugely important to isolating issues. I
think that's an important scenario, and may be enabled by one or more of
the features you've listed.

There are other scenarios as well, that you hinted at - using existing
applications built around libpcap. That is important to enable as well,
but I think the biggest hurdle to getting anyone to use a DPDK enabled
application will always be: "How much work do I have to do when
something goes wrong?"

There are certainly things that should belong in an application. But I
think easy enabling of a tcpdump capable mechanism is DPDK's
responsibility. After all, it's a networking stack, right?

Whichever combination of features is used, we shouldn't really
discourage them, I think. Any way the user can debug something using
familiar workflows and tools is a way that dpdk-dev doesn't need to get
involved.

Just my $.02, anyway.

Thanks,
-Aaron


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-14 15:45 ` Aaron Conole
@ 2015-12-14 15:48   ` Thomas Monjalon
  0 siblings, 0 replies; 26+ messages in thread
From: Thomas Monjalon @ 2015-12-14 15:48 UTC (permalink / raw)
  To: Aaron Conole; +Cc: dev, Morten Brørup

2015-12-14 10:45, Aaron Conole:
> After all, it's a networking stack, right?

No, not currently.
DPDK allows building specific lightweight stacks or more complete ones.


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-14  9:57 [dpdk-dev] tcpdump support in DPDK 2.3 Morten Brørup
  2015-12-14 15:45 ` Aaron Conole
@ 2015-12-14 18:29 ` Matthew Hall
  2015-12-14 19:14   ` Stephen Hemminger
  2015-12-14 19:17   ` Aaron Conole
  1 sibling, 2 replies; 26+ messages in thread
From: Matthew Hall @ 2015-12-14 18:29 UTC (permalink / raw)
  To: Morten B; +Cc: dev

FYI your last name comes in as a corrupt character for me. You might have to 
think about converting it from ISO 8859-1 / 8859-15 to UTF-8.

On Mon, Dec 14, 2015 at 10:57:10AM +0100, Morten B wrote:
> Check out the new "extcap" feature of Wireshark. It uses named pipes for the 
> packets, already mentioned by Stephen Hemminger.

I looked at it a bit. I wasn't 100% clear if there is a way to pass down the 
BPF expression for compilation and usage inside the DPDK application.

> Tcpdump is an open source application, so it should be possible to define an 
> efficient interface between DPDK and tcpdump, and implement it in both DPDK 
> and tcpdump. The same goes for libpcap.

Easier said than done. A whole ton of libpcap assumes it's talking to a very 
specific kernel interface, and the code is quite complicated.

> It possibly also has a secondary feature: passing a BPF program 
> from tcpdump/libpcap to DPDK, so packets can be filtered in DPDK and don't 
> need to be passed on to tcpdump/libpcap.

If we can figure out how to get this feature to work in extcap, I think that 
will be the winning solution by far.

> [A]dd a BPF library (librte_bpf) to DPDK, preferably with a compiler. The 
> application initially calls the library's BPF compiler function once with 
> the BPF program to compile it, and in the fast path the application calls a 
> library function that takes an mbuf and the compiled BPF program and returns 
> an integer value indicating how many bytes of the packet should be mirrored 
> by the capturing application. +1 to Matthew Hall for taking this direction!

Yes, performance wise I think this is the only way that will really work 100% 
of the time. Otherwise I think we end up in the very bad situation where the 
guy who tries to make a capture of a single flow for debugging on i40e ends up 
crashing his system or dropping all his traffic when the capture system 
unhelpfully redirects a storm of unfiltered traffic outside of DPDK to KNI or 
some pipe devices or another place it does not belong.

There is one complexity though... the list of BPF filters should probably be a 
linked list, where they get added and removed, or you can't do > 1 filter at a 
time. I know how to code some of this stuff but I only work on DPDK in my 
spare time so I don't have the cycles to do all of the work.
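
A minimal sketch of such a filter list follows. Each filter is represented here by a plain function pointer returning the number of bytes it wants, standing in for a compiled BPF program; all names are hypothetical. The sketch ignores synchronization: in a real fast path, adding or removing a node while lcores walk the list would need some protection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for a compiled BPF program: returns bytes wanted, 0 = no match. */
typedef uint32_t (*filter_fn)(const uint8_t *pkt, uint32_t len);

struct filter_node {
    int id;                  /* handle returned to whoever added the filter */
    filter_fn match;
    struct filter_node *next;
};

/* Adds a filter at the head of the list. Returns 0, or -1 on allocation failure. */
static int filter_add(struct filter_node **head, int id, filter_fn fn)
{
    struct filter_node *n;

    assert(head != NULL && fn != NULL);
    n = malloc(sizeof(*n));
    if (n == NULL)
        return -1;
    n->id = id;
    n->match = fn;
    n->next = *head;
    *head = n;
    return 0;
}

/* Removes the filter with the given id. Returns 0, or -1 if it is not found. */
static int filter_remove(struct filter_node **head, int id)
{
    for (struct filter_node **pp = head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            struct filter_node *dead = *pp;

            *pp = dead->next;
            free(dead);
            return 0;
        }
    }
    return -1;
}

/* Runs every active filter; returns the largest byte count any of them wants. */
static uint32_t filters_run(const struct filter_node *head,
                            const uint8_t *pkt, uint32_t len)
{
    uint32_t want = 0;

    for (; head != NULL; head = head->next) {
        uint32_t n = head->match(pkt, len);

        if (n > want)
            want = n;
    }
    return want;
}

/* Two example filters: up to 64 bytes of IPv4 frames, or the Ethernet header. */
static uint32_t want_ipv4(const uint8_t *pkt, uint32_t len)
{
    if (len < 14 || pkt[12] != 0x08 || pkt[13] != 0x00)
        return 0;
    return len < 64 ? len : 64;
}

static uint32_t want_eth_hdr(const uint8_t *pkt, uint32_t len)
{
    (void)pkt;
    return len >= 14 ? 14 : 0;
}
```

Taking the maximum over all filters means one copy of the packet can serve several simultaneous captures, each of which truncates to its own length later.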

> The pcap file format contains a header in front of each packet, which is 
> extremely simple. But it has a timestamp (which uses 32 bit for tv_sec and 
> tv_usec in files), so it needs to be considered how to handle this 
> efficiently.

I already wrote some C code for generating the original pcap format files a 
while ago which I think could be donated. For the timestamps to work at 
highest efficiency we'd need to run an rte_timer every X microseconds that 
updates a global volatile copy of tv_sec and tv_usec.

Alternatively, code that calculates the offset of rte_rdtsc from 01 January
1970 00:00:00 UTC and uses the TSC value to generate the right tv_sec and
tv_usec would also work fine.

Matthew.


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-14 18:29 ` Matthew Hall
@ 2015-12-14 19:14   ` Stephen Hemminger
  2015-12-14 22:23     ` Matthew Hall
  2015-12-14 19:17   ` Aaron Conole
  1 sibling, 1 reply; 26+ messages in thread
From: Stephen Hemminger @ 2015-12-14 19:14 UTC (permalink / raw)
  To: Matthew Hall; +Cc: dev, Morten B

On Mon, 14 Dec 2015 13:29:31 -0500
Matthew Hall <mhall@mhcomputing.net> wrote:

> FYI your last name comes in as a corrupt character for me. You might have to 
> think about converting it from ISO 8859-1 / 8859-15 to UTF-8.
> 
> On Mon, Dec 14, 2015 at 10:57:10AM +0100, Morten B wrote:
> > Check out the new "extcap" feature of Wireshark. It uses named pipes for the 
> > packets, already mentioned by Stephen Hemminger.
> 
> I looked at it a bit. I wasn't 100% clear if there is a way to pass down the 
> BPF expression for compilation and usage inside the DPDK application.
> 
> > Tcpdump is an open source application, so it should be possible to define an 
> > efficient interface between DPDK and tcpdump, and implement it in both DPDK 
> > and tcpdump. The same goes for libpcap.
> 
> Easier said than done. A whole ton of libpcap assumes it's talking to a very 
> specific kernel interface, and the code is quite complicated.
> 
> > It possibly also has a secondary feature: passing a BPF program 
> > from tcpdump/libpcap to DPDK, so packets can be filtered in DPDK and don't 
> > need to be passed on to tcpdump/libpcap.
> 
> If we can figure out how to get this feature to work in extcap, I think that 
> will be the winning solution by far.
> 
> > [A]dd a BPF library (librte_bpf) to DPDK, preferably with a compiler. The 
> > application initially calls the library's BPF compiler function once with 
> > the BPF program to compile it, and in the fast path the application calls a 
> > library function that takes an mbuf and the compiled BPF program and returns 
> > an integer value indicating how many bytes of the packet should be mirrored 
> > by the capturing application. +1 to Matthew Hall for taking this direction!

There are already several BPF libraries available. I would prefer DPDK not
start copying existing code.


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-14 18:29 ` Matthew Hall
  2015-12-14 19:14   ` Stephen Hemminger
@ 2015-12-14 19:17   ` Aaron Conole
  2015-12-14 21:29     ` Kyle Larose
  2015-12-14 22:25     ` Matthew Hall
  1 sibling, 2 replies; 26+ messages in thread
From: Aaron Conole @ 2015-12-14 19:17 UTC (permalink / raw)
  To: Matthew Hall; +Cc: dev, Morten B

Matthew Hall <mhall@mhcomputing.net> writes:
>> The pcap file format contains a header in front of each packet, which is 
>> extremely simple. But it has a timestamp (which uses 32 bit for tv_sec and 
>> tv_usec in files), so it needs to be considered how to handle this 
>> efficiently.
>
> I already wrote some C code for generating the original pcap format files a 
> while ago which I think could be donated. For the timestamps to work at 
> highest efficiency we'd need to run an rte_timer every X microseconds that 
> updates a global volatile copy of tv_sec and tv_usec.
>
> Or make some code that calculates the offset of rte_rdtsc from 01 January 1970 
> 00:00:00 UTC and uses TSC value to generate the right tv_sec and tv_usec would 
> also work fine.

Why not just use libpcap to write out pcap files? I bet it does a better
job than any of us will ;) It's BSD licensed, so there should be no
issues with linking against it (DPDK currently does for the pcap PMD), and
it supports both pcap and pcap-ng (although -ng support may not be 100%,
I expect it will get better).

No need to donate to the cause on this one, I think :) The issues
surrounding tcpdump are, imo, ones of library/application workflow. HOW
does the user enable tcpdump-like support? The current option is to
start up with a pcap PMD configured, capture to a file for a bit, then
stop. I think the issues being discussed are what other options to give
the user. Then again, I may have my signals crossed somewhere.

-Aaron

> Matthew.


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-14 19:17   ` Aaron Conole
@ 2015-12-14 21:29     ` Kyle Larose
  2015-12-14 22:36       ` Matthew Hall
  2015-12-14 22:25     ` Matthew Hall
  1 sibling, 1 reply; 26+ messages in thread
From: Kyle Larose @ 2015-12-14 21:29 UTC (permalink / raw)
  To: Aaron Conole; +Cc: dev, Morten B

On Mon, Dec 14, 2015 at 2:17 PM, Aaron Conole <aconole@redhat.com> wrote:

> No need to donate to the cause on this one, I think :) The issues
> surrounding tcpdump are, imo, ones of library/application workflow. HOW
> does the user enable tcpdump-like support? The current option is to
> start up with a pcap PMD configured, capture to a file for a bit, then
> stop. I think the issues being discussed are what other options to give
> the user. Then again, I may have my signals crossed somewhere.
>

I don't think you're crossing signals on giving options to users.
However, I think we're discussing more than just high level UI
options; we're getting into the details internal to any application
involved in capturing packets. While it's great to give options to the
user, we still need to get the captured packets to them. This poses a
few challenges, since we need to do it with low impact (e.g. don't just
write the packet to the HDD in the main packet processing loop), while
not hammering the system with a crazy flood that takes down the kernel
(e.g. by copying everything into some critical task). Both of these have
been discussed in earlier threads and earlier in this thread.

To me, these challenges boil down to:
1) Balancing a nice generic output interface with the most efficient
way to get packets out of the application.
2) Filtering as close to the capture point as possible.

Putting that together with giving options, we need to:
1) Give the users a convenient API to start a capture and provide a filter.
2) Balance a nice generic output interface with the most efficient way
to get packets out of the application.
3) Filter as close to the capture point as possible.

I've seen lots of ideas and options tossed around which would solve
some or all of the above items, but nobody actually committing to
anything. What can we do to actually agree on a solution to go and
implement? I'm relatively new to the community, so I don't really know
how this stuff works. Do people typically form a working group where
they go off and discuss the problem, and then come back to the main
community with a proposal? Or do people just submit RFCs independently
with their own ideas?

Thanks,

Kyle


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-14 19:14   ` Stephen Hemminger
@ 2015-12-14 22:23     ` Matthew Hall
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Hall @ 2015-12-14 22:23 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Morten B

On Mon, Dec 14, 2015 at 11:14:42AM -0800, Stephen Hemminger wrote:
> There are already several BPF libraries available. I would prefer DPDK not
> start copying existing code.

I didn't copy or duplicate any code. I was planning to use bpfjit from Alex 
Nasonov, but a userspace version instead of the kernel one. If somebody makes 
a shared library version, of course I could use that instead. But I haven't 
heard of one yet.

Matthew.


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-14 19:17   ` Aaron Conole
  2015-12-14 21:29     ` Kyle Larose
@ 2015-12-14 22:25     ` Matthew Hall
  1 sibling, 0 replies; 26+ messages in thread
From: Matthew Hall @ 2015-12-14 22:25 UTC (permalink / raw)
  To: Aaron Conole; +Cc: dev, Morten B

On Mon, Dec 14, 2015 at 02:17:12PM -0500, Aaron Conole wrote:
> Why not just use libpcap to write out pcap files? I bet it does a better
> job than any of us will ;) It's BSD licensed, so there should be no
> issues with linking against it (DPDK currently does for the pcap PMD), and
> it supports both pcap and pcap-ng (although -ng support may not be 100%,
> I expect it will get better).

It doesn't do things such as scatter-gather vector IO. So it causes a lot more 
system calls than needed. It's an issue if you are doing I40E and such. But I 
don't really care so much how it works.
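
As an illustration of the scatter-gather point, the sketch below writes a small batch of pcap records (header plus packet data each) with a single POSIX writev() call instead of one or two write() calls per packet. The names rec_hdr, pkt and write_batch and the batch size are hypothetical. Packet data is referenced in place rather than copied into a staging buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define BATCH 4 /* arbitrary batch size for the example */

/* Classic pcap per-packet header. */
struct rec_hdr {
    uint32_t ts_sec, ts_usec, incl_len, orig_len;
};

struct pkt {
    const void *data;
    uint32_t len;
};

/* Writes up to BATCH records to fd with one system call. Returns the
 * number of bytes written, or -1 on error, as reported by writev(). */
static ssize_t write_batch(int fd, const struct pkt *pkts, int n,
                           uint32_t sec, uint32_t usec)
{
    struct rec_hdr hdrs[BATCH];
    struct iovec iov[2 * BATCH];

    assert(pkts != NULL && n > 0 && n <= BATCH);
    for (int i = 0; i < n; i++) {
        hdrs[i] = (struct rec_hdr){ sec, usec, pkts[i].len, pkts[i].len };
        /* One iovec for the record header ... */
        iov[2 * i].iov_base = &hdrs[i];
        iov[2 * i].iov_len = sizeof(hdrs[i]);
        /* ... and one pointing straight at the packet data (never written). */
        iov[2 * i + 1].iov_base = (void *)(uintptr_t)pkts[i].data;
        iov[2 * i + 1].iov_len = pkts[i].len;
    }
    return writev(fd, iov, 2 * n);
}
```

A production version would also have to handle short writes and respect IOV_MAX. The sketch only shows how the system call count drops from per-packet to per-batch.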

> The current option is to start up with a pcap PMD configured, capture to a 
> file for a bit, then stop. I think the issues being discussed are what other 
> options to give the user. Then again, I may have my signals crossed 
> somewhere.

For me I think it's very important to make something that works even with 
tremendous load, not causing tons of writes and syscalls on packets that match 
no filters and are not even wanted. None of the solutions I saw so far could 
do this except bpfjit combined with extcap.

Matthew.


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-14 21:29     ` Kyle Larose
@ 2015-12-14 22:36       ` Matthew Hall
  2015-12-16 10:45         ` Bruce Richardson
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Hall @ 2015-12-14 22:36 UTC (permalink / raw)
  To: Kyle Larose; +Cc: dev, Morten B

On Mon, Dec 14, 2015 at 04:29:41PM -0500, Kyle Larose wrote:
> I've seen lots of ideas and options tossed around which would solve
> some or all of the above items, but nobody actually committing to
> anything. What can we do to actually agree on a solution to go and
> implement? I'm relatively new to the community, so I don't really know
> how this stuff works. Do people typically form a working group where
> they go off and discuss the problem, and then come back to the main
> community with a proposal? Or do people just submit RFCs independently
> with their own ideas?
> 
> Thanks,
> Kyle

I am getting the impression of a misplaced sense of urgency / panic. I don't 
think anybody came up with a reason why we have to answer all these questions 
tremendously quickly. It will take some more time, particularly with the 
holidays, for the developers to finish the last bug fixes on the current 
release before they have time to discuss 2.3 features.

When that happens, someone working on DPDK full time will be identified as the 
leader for the feature, who will lead the effort on PCAP and help us 
formulate the plan. Until then, what we really could use at this point is not 
necessarily more writings and speculation, but an answer on some key tech 
questions, particularly from some kernel guys:

1) How do we get the pcap filter string and/or BPF opcode vector from libpcap 
/ tcpdump / tshark / wireshark, into the DPDK application? There we can 
compile it using the user-space bpfjit, so we can filter the packets at very 
high speeds and not end up breaking everything doing a ton of stupid copies 
when somebody does a capture of one flow on his i40e device or such. libpcap 
is crappy about this, as it sends it all over syscalls which are always 
assuming the kernel is on the other end, which is a bad assumption on their 
part but many decades old and not so easy to fix.

2) How do we get the matched packets back out to the extcap or libpcap? From 
what I saw extcap is tshark / wireshark only, which are 1) GPL licensed in 
various ways, 2) not as widely used as libpcap. So using only extcap might be 
kind of crappy.

3) For libpcap to work, maybe it will help if some of our kernel guys can help 
us find out how to "detect" that the kernel has put a BPF capture filter onto 
a TUN / TAP interface, and copy that filter to the DPDK app. Then, take any 
matched packets and write them back onto the TUN / TAP. This would also be 
super efficient and work with more off-the-shelf tools besides just tshark / 
wireshark.

If we don't find the answers for these items I don't think we have a path to a 
working solution, forgetting about all the nice-to-have points such as UX 
issues, troubleshooting, debugging, etc.

Matthew.


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-14 22:36       ` Matthew Hall
@ 2015-12-16 10:45         ` Bruce Richardson
  2015-12-16 11:37           ` Arnon Warshavsky
  2015-12-16 11:40           ` Morten Brørup
  0 siblings, 2 replies; 26+ messages in thread
From: Bruce Richardson @ 2015-12-16 10:45 UTC (permalink / raw)
  To: Matthew Hall; +Cc: dev, Morten B

On Mon, Dec 14, 2015 at 05:36:13PM -0500, Matthew Hall wrote:
> On Mon, Dec 14, 2015 at 04:29:41PM -0500, Kyle Larose wrote:
> > I've seen lots of ideas and options tossed around which would solve
> > some or all of the above items, but nobody actually committing to
> > anything. What can we do to actually agree on a solution to go and
> > implement? I'm relatively new to the community, so I don't really know
> > how this stuff works. Do people typically form a working group where
> > they go off and discuss the problem, and then come back to the main
> > community with a proposal? Or do people just submit RFCs independently
> > with their own ideas?
> > 
> > Thanks,
> > Kyle
> 
> I am getting the impression of a misplaced sense of urgency / panic. I don't 
> think anybody came up with a reason why we have to answer all these questions 
> tremendously quickly. It will take some more time, particularly with the 
> holidays, for the developers to finish the last bug fixes on the current 
> release before they have time to discuss 2.3 features.
> 
> When that happens, someone working on DPDK full time will be identified as the 
> leader for the feature, that will lead the effort on PCAP, and help us 
> formulate the plan. Until then, what we really could use at this point is not 
> necessarily more writings and speculation, but an answer on some key tech 
> questions, particularly from some kernel guys:
> 
> 1) How do we get the pcap filter string and/or BPF opcode vector from libpcap 
> / tcpdump / tshark / wireshark, into the DPDK application? There we can 
> compile it using the user-space bpfjit, so we can filter the packets at very 
> high speeds and not end up breaking everything doing a ton of stupid copies 
> when somebody does a capture of one flow on his i40e device or such. libpcap 
> is crappy about this, as it sends it all over syscalls which are always 
> assuming the kernel is on the other end, which is a bad assumption on their 
> part but many decades old and not so easy to fix.
> 
> 2) How do we get the matched packets back out to the extcap or libpcap? From 
> what I saw extcap is tshark / wireshark only, which are 1) GPL licensed in 
> various ways, 2) not as widely used as libpcap. So using only extcap might be 
> kind of crappy.
> 
> 3) For libpcap to work, maybe it will help if some of our kernel guys can help 
> us find out how to "detect" the kernel put a BPF capture filter onto a TUN / 
> TAP interface, and copy that filter to the DPDK app. Then, take any matched 
> packets and write them back onto the TUN / TAP. This would also be super 
> efficient and work with more off-the-shelf tools besides just tshark / 
> wireshark.
> 
> If we don't find the answers for these items I don't think we have a path to a 
> working solution, forgetting about all the nice-to-have points such as UX 
> issues, troubleshooting, debugging, etc.
> 
> Matthew.

Hi,

We are currently doing some investigation and prototyping for this feature.
Our current thinking is the following:
* to allow dynamic control of the filtering, we are thinking of making use of
  the multi-process infrastructure in DPDK. A secondary process can attach to a
  primary at runtime and provide the packet filtering and dumping capability.
* ideally we want to create a generic packet mirroring callback inside the EAL,
  that can be set up to mirror packets going through Rx/Tx on an ethdev.
* using this, packets being received on the port to be monitored are sent via
  an rte_ring (ring ethdev) to the secondary process which takes those packets
  and does any filtering on them. [This would be where BPF could fit into
  things, but it's not something we have looked at yet.]
* initially we plan to have the secondary process then write packets to a pcap
  file using a pcap PMD, but down the road if we get other PMDs, like a KNI PMD
  or a TAP device PMD, those could be used as targets instead.

We hope this implementation will provide enough hooks to enable the standard
tools to be used for monitoring and capturing packets. We will send out draft
implementation code for various parts of this as soon as we have it.

Additional feedback welcome, as always. :-)

Regards,
/Bruce


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-16 10:45         ` Bruce Richardson
@ 2015-12-16 11:37           ` Arnon Warshavsky
  2015-12-16 11:56             ` Morten Brørup
  2015-12-16 11:40           ` Morten Brørup
  1 sibling, 1 reply; 26+ messages in thread
From: Arnon Warshavsky @ 2015-12-16 11:37 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev, Morten B

Two points from our experience in saving pcap files from a DPDK 10G fire hose:

1)
Our capture module provides a small "bit-vector" to the code that handles
the packets.
Since our packet processing code is already finding out basic stuff about
the packet traversing it (is it IPv4? v6? is it TCP? is it fragmented?
etc.), it sets the relevant bits ON as it goes, so that the capture module
can later quickly (by masking against the desired filters) decide if a
packet needs to be captured.
Point is: when a capture layer exposes a slim API that lets it utilize
info coming from other modules, it's easier and less expensive to handle
the fire hose.

2)
In many cases we are interested in capturing complete TCP flows, or at
least the first X packets of them.
In this case, a more expensive filter may be applied only on the SYN packet,
and when it matches, it turns ON a bit in the TCP flow's application context
that says we want to capture any packet falling under this tuple.
Point is: application filters of different costs are applied to different
packet types, utilizing the mask from the previous bullet.

Such a model would obviously need to be optional in a formal capture layer,
but when dealing with a fire hose, I find it very useful.
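
The two points can be sketched together as follows. This is an illustration under stated assumptions, not the actual module: the bit names, struct capture_cfg, struct flow_ctx and should_capture are all hypothetical, and the example "expensive" filter assumes an untagged Ethernet frame carrying IPv4 without options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits set by the packet processing code as it classifies a packet. */
enum {
    PKT_IPV4 = 1u << 0,
    PKT_IPV6 = 1u << 1,
    PKT_TCP  = 1u << 2,
    PKT_FRAG = 1u << 3,
    PKT_SYN  = 1u << 4,
};

struct capture_cfg {
    uint32_t require;   /* cheap filter: all of these bits must be set */
    uint32_t first_n;   /* capture the first N packets of a matching flow */
    int (*expensive)(const uint8_t *pkt, uint32_t len); /* run on SYN only */
};

/* Per-flow state kept in the application's existing TCP flow context. */
struct flow_ctx {
    uint8_t  capture;   /* set once the SYN matched the expensive filter */
    uint32_t remaining; /* packets of this flow still to be captured */
};

/* Returns 1 if this packet should be captured, 0 otherwise. */
static int should_capture(const struct capture_cfg *cfg, struct flow_ctx *flow,
                          uint32_t pkt_bits, const uint8_t *pkt, uint32_t len)
{
    assert(cfg != NULL && flow != NULL && cfg->expensive != NULL);
    /* Point 2: pay for the expensive filter only on SYN packets that
     * already pass the cheap mask test from point 1. */
    if ((pkt_bits & PKT_SYN) &&
        (pkt_bits & cfg->require) == cfg->require &&
        cfg->expensive(pkt, len)) {
        flow->capture = 1;
        flow->remaining = cfg->first_n;
    }
    /* Every other packet of the flow costs one flag test. */
    if (!flow->capture || flow->remaining == 0)
        return 0;
    flow->remaining--;
    return 1;
}

/* Example expensive filter: TCP destination port 80 (14-byte Ethernet
 * header plus 20-byte IPv4 header, so the port is at offsets 36 and 37). */
static int dst_port_is_80(const uint8_t *pkt, uint32_t len)
{
    return len >= 38 && pkt[36] == 0 && pkt[37] == 80;
}
```

With first_n set to 3, the SYN and the next two packets of a matching flow are captured, and everything else is rejected after a mask test or a flag test.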

/Arnon


-- 

*Arnon Warshavsky*
*Qwilt | work: +972-72-2221634 | mobile: +972-50-8583058 | arnon@qwilt.com*


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-16 10:45         ` Bruce Richardson
  2015-12-16 11:37           ` Arnon Warshavsky
@ 2015-12-16 11:40           ` Morten Brørup
  2015-12-16 11:56             ` Bruce Richardson
  1 sibling, 1 reply; 26+ messages in thread
From: Morten Brørup @ 2015-12-16 11:40 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Bruce,

This doesn't really sound like tcpdump to me; it sounds like port mirroring.

Your suggestion is limited to physical ports only, and cannot be attached further inside the application, e.g. for mirroring packets related to a specific VLAN.

Furthermore, it doesn't sound like the filtering part scales well. Consider a fully loaded 40 Gbit/s port. You would need to copy all packets into a single rte_ring to the attached filtering process, which would then require its own set of lcores to probably discard most of these packets when filtering. I agree with Matthew that the filtering needs to happen as close to the source as possible, and must be scalable to multiple lcores.
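To put rough numbers on the scaling concern, here is a small back-of-the-envelope sketch. The 1% match ratio is purely a hypothetical figure for illustration, and the function names are made up for this example; the only hard fact used is the 20 bytes of per-frame overhead on Ethernet (preamble, start-of-frame delimiter and inter-frame gap).

```python
ETH_OVERHEAD_BYTES = 20  # preamble (7) + start-of-frame delimiter (1) + inter-frame gap (12)


def line_rate_pps(link_bps, frame_bytes):
    """Maximum packets per second for back-to-back frames of the given size."""
    return link_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)


def ring_load_pps(link_bps, frame_bytes, match_ratio, filter_first):
    """Packets per second that have to be pushed into the mirror ring.

    match_ratio is the (assumed) fraction of packets the filter accepts.
    """
    total = line_rate_pps(link_bps, frame_bytes)
    return total * match_ratio if filter_first else total


if __name__ == "__main__":
    for filter_first in (False, True):
        pps = ring_load_pps(40e9, 64, 0.01, filter_first)
        print(f"filter_first={filter_first}: {pps / 1e6:.2f} Mpps into the ring")
```

With 64-byte frames a 40 Gbit/s port carries roughly 59.5 million packets per second, so copy-then-filter has to move all of them through the ring, while filter-then-copy only moves the matching fraction. The cost of running the filter on the IO lcores is not modelled here.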

On the positive side, your idea has the advantage that the filter can be any application, and is not limited to BPF. However if the purpose is "tcpdump", we should probably consider BPF, which is the type of filtering offered by tcpdump.

I would prefer having a BPF library available that the application can use at any point, either at the lowest level (when receiving/transmitting Ethernet packets) or at a higher level (e.g. when working with packets that go into or come out of a tunnel). The BPF library should implement packet length and relevant ancillary data, such as SKF_AD_VLAN_TAG etc. based on metadata in the mbuf.

Transferring a BPF filter from an outside application could be done by using a simple text format, e.g. the output format of "tcpdump -ddd". This also opens an easy roadmap for Wireshark integration by simply extending extcap to include such a BPF filter format.
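As a rough illustration of how little is needed on the receiving side, here is a minimal sketch in Python (not DPDK code) that parses the "tcpdump -ddd" text format (an instruction count, then one "code jt jf k" line per instruction, all decimal) and runs it with a tiny classic BPF interpreter. The function names are hypothetical, only a handful of opcodes are handled, and the sample program is the kind of output "tcpdump -ddd ip" produces on an Ethernet link (the snap length constant varies between tcpdump versions).

```python
import struct

# Classic BPF opcode values (as defined in <linux/filter.h>).
LD_ABS = {0x20: (">I", 4), 0x28: (">H", 2), 0x30: (">B", 1)}  # ld / ldh / ldb [k]
JA, JEQ_K, JGT_K, JGE_K, JSET_K = 0x05, 0x15, 0x25, 0x35, 0x45
RET_K, RET_A = 0x06, 0x16


def parse_ddd(text):
    """Parse `tcpdump -ddd` output into a list of (code, jt, jf, k) tuples."""
    rows = [line.split() for line in text.strip().splitlines()]
    count = int(rows[0][0])
    prog = [tuple(int(field) for field in row) for row in rows[1:]]
    if len(prog) != count or any(len(ins) != 4 for ins in prog):
        raise ValueError("malformed filter program")
    return prog


def run_filter(prog, pkt):
    """Return the number of bytes to capture; 0 means the packet is rejected."""
    acc = 0  # the BPF accumulator register
    pc = 0
    while pc < len(prog):
        code, jt, jf, k = prog[pc]
        pc += 1
        if code in LD_ABS:
            fmt, size = LD_ABS[code]
            if k + size > len(pkt):
                return 0  # a load past the end of the packet rejects it
            acc = struct.unpack_from(fmt, pkt, k)[0]
        elif code == JA:
            pc += k
        elif code in (JEQ_K, JGT_K, JGE_K, JSET_K):
            taken = {
                JEQ_K: acc == k,
                JGT_K: acc > k,
                JGE_K: acc >= k,
                JSET_K: (acc & k) != 0,
            }[code]
            pc += jt if taken else jf
        elif code == RET_K:
            return k
        elif code == RET_A:
            return acc
        else:
            raise NotImplementedError(f"opcode {code:#x} not handled in this sketch")
    return 0


# Sample program: accept IPv4 frames only (EtherType 0x0800 at offset 12).
IPV4_ONLY = """4
40 0 0 12
21 0 1 2048
6 0 0 262144
6 0 0 0
"""
```

The caller would capture min(return value, packet length) bytes. A real library would also need the remaining opcodes (index register, scratch memory, arithmetic), program validation, and the ancillary data mentioned above.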

Lots of negativity above. I very much like the idea of attaching the secondary process and going through an rte_ring. This allows the secondary process to pass the filtered and captured packets on in any format it likes to any destination it likes.

Med venlig hilsen / kind regards
- Morten Brørup

-----Original Message-----
From: Bruce Richardson [mailto:bruce.richardson@intel.com] 
Sent: 16. december 2015 11:45

Hi,

we are currently doing some investigation and prototyping for this feature.
Our current thinking is the following:
* to allow dynamic control of the filtering, we are thinking of making use of
  the multi-process infrastructure in DPDK. A secondary process can attach to a
  primary at runtime and provide the packet filtering and dumping capability.
* ideally we want to create a generic packet mirroring callback inside the EAL,
  that can be set up to mirror packets going through Rx/Tx on an ethdev.
* using this, packets being received on the port to be monitored are sent via
  an rte_ring (ring ethdev) to the secondary process which takes those packets
  and does any filtering on them. [This would be where BPF could fit into
  things, but it's not something we have looked at yet.]
* initially we plan to have the secondary process then write packets to a pcap
  file using a pcap PMD, but down the road if we get other PMDs, like a KNI PMD
  or a TAP device PMD, those could be used as targets instead.

This implementation we hope should provide enough hooks to enable the standard tools to be used for monitoring and capturing packets. We will send out draft implementation code for various parts of this as soon as we have it.

Additional feedback welcome, as always. :-)

Regards,
/Bruce


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-16 11:40           ` Morten Brørup
@ 2015-12-16 11:56             ` Bruce Richardson
  2015-12-16 12:26               ` Morten Brørup
  2015-12-16 18:15               ` Matthew Hall
  0 siblings, 2 replies; 26+ messages in thread
From: Bruce Richardson @ 2015-12-16 11:56 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev

On Wed, Dec 16, 2015 at 12:40:43PM +0100, Morten Brørup wrote:
> Bruce,
> 
> This doesn't really sound like tcpdump to me; it sounds like port mirroring.

It's actually a bit of both, in my opinion: it's designed to allow basic mirroring
of traffic on a port so that the traffic can be sent to a tcpdump destination.
By going with a more generic approach, we hope to enable more possible use
cases than just focusing on TCP.

> 
> Your suggestion is limited to physical ports only, and cannot be attached further inside the application, e.g. for mirroring packets related to a specific VLAN.

Yes, the lack of attachment inside the app is a limitation. There are two types
of scenarios that could be considered for packet capture:
* ones where the application can be modified to do its own filtering and
capturing.
* ones where you want a generic capture mechanism which can be used on any
application without modification.
We have chosen to focus more on the second one, as that is where a generic
solution for DPDK is likely to lie. For the first case, the application writer
knows the type of traffic and how best to capture and filter it, so I
don't think a generic one-size-fits-all solution is possible. [Though a couple
of helper libraries may be of use.]

As for physical ports, the scheme should work for any ethdev - why do you see
it only being limited to physical ports? What would you want to see monitored
that we are missing?

> 
> Furthermore, it doesn't sound like the filtering part scales well. Consider a fully loaded 40 Gbit/s port. You would need to copy all packets into a single rte_ring to the attached filtering process, which would then require its own set of lcores to probably discard most of these packets when filtering. I agree with Matthew that the filtering needs to happen as close to the source as possible, and must be scalable to multiple lcores.

Without modifying the application itself to do its own filtering I suspect
scalability is always going to be a problem. That being said, there is no
particular reason why a single rte_ring needs to be used - we could allow one
ring per NIC queue for instance. The trouble with filtering at the source itself
is that you put extra load on the IO cores. By using a ring, we put the filtering
load on extra cores in a secondary process which can be scaled by the user without
touching the main app.
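
To make the one-ring-per-queue idea concrete, here is a toy model in Python. It is only a sketch: MirrorRing stands in for an rte_ring, rx_callback for the mirroring hook, and drain_and_filter for the secondary process; all of these names are made up for illustration and none of them is an existing DPDK API.

```python
from collections import deque


class MirrorRing:
    """Bounded FIFO standing in for an rte_ring; drops when full instead of blocking."""

    def __init__(self, size):
        self.size = size
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, pkt):
        if len(self.queue) >= self.size:
            self.dropped += 1  # never stall the IO core; count the loss instead
            return False
        self.queue.append(pkt)
        return True

    def dequeue_burst(self, max_pkts):
        burst = []
        while self.queue and len(burst) < max_pkts:
            burst.append(self.queue.popleft())
        return burst


class PortMirror:
    """One mirror ring per NIC queue, so IO cores do not contend on a single ring."""

    def __init__(self, n_queues, ring_size):
        self.rings = [MirrorRing(ring_size) for _ in range(n_queues)]

    def rx_callback(self, queue_id, pkts):
        ring = self.rings[queue_id]
        for pkt in pkts:
            ring.enqueue(pkt)
        return pkts  # the forwarding path continues with the unmodified burst


def drain_and_filter(ring, keep, burst=32):
    """Secondary-process side: pull one burst and keep only matching packets."""
    return [pkt for pkt in ring.dequeue_burst(burst) if keep(pkt)]
```

The primary only pays for the enqueue; the filtering cost lands on whichever cores run drain_and_filter, and the user can add more of those (one per ring) without touching the main app.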

> 
> On the positive side, your idea has the advantage that the filter can be any application, and is not limited to BPF. However if the purpose is "tcpdump", we should probably consider BPF, which is the type of filtering offered by tcpdump.

Having this work with any application is one of our primary targets here. The
app author should not have to worry too much about getting basic debug support.
Even if it doesn't work at 40G small packet rates, you can get a lot of benefit
from a scheme that provides functional debugging for an app. Obviously, though,
we aim to make this as scalable as possible, which is why we want to allow filtering
in userspace before sending packets externally to DPDK.

> 
> I would prefer having a BPF library available that the application can use at any point, either at the lowest level (when receiving/transmitting Ethernet packets) or at a higher level (e.g. when working with packets that go into or come out of a tunnel). The BPF library should implement packet length and relevant ancillary data, such as SKF_AD_VLAN_TAG etc. based on metadata in the mbuf.
> 
> Transferring a BPF filter from an outside application could be done by using a simple text format, e.g. the output format of "tcpdump -ddd". This also opens an easy roadmap for Wireshark integration by simply extending extcap to include such a BPF filter format.
> 
> 
> Lots of negativity above. I very much like the idea of attaching the secondary process and going through an rte_ring. This allows the secondary process to pass the filtered and captured packets on in any format it likes to any destination it likes.

Good, so we're not completely off-base here. :-)

/Bruce



* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-16 11:37           ` Arnon Warshavsky
@ 2015-12-16 11:56             ` Morten Brørup
  0 siblings, 0 replies; 26+ messages in thread
From: Morten Brørup @ 2015-12-16 11:56 UTC (permalink / raw)
  To: Arnon Warshavsky, Bruce Richardson; +Cc: dev

Great idea, Arnon. Let’s look at existing use cases from the real world.

Our company makes network appliances. They are not running GNU/Linux or similar, so they do not offer a BASH prompt or any other BSD/Linux like command line interface.

Here’s a simplified description of how the user interacts with the packet capture feature in our appliances:

Our GUI allows you to input a filter, e.g. a MAC address, an IP address or a compiled BPF program as a single hexadecimal string (roughly “tcpdump -ddd” output), and start capturing. The captured packets can then be downloaded from the GUI in pcap format.

The other packet filters our appliance needs, e.g. DHCP, ARP etc., are not provided by the user (or by any other external interaction), but are hardcoded in C, just like any other part of our firmware.

Med venlig hilsen / kind regards

Morten Brørup
CTO
SmartShare Systems A/S
Tonsbakken 16-18
DK-2740 Skovlunde
Denmark

Office      +45 70 20 00 93
Direct      +45 89 93 50 22
Mobile      +45 25 40 82 12

mb@smartsharesystems.com
www.smartsharesystems.com

From: Arnon Warshavsky [mailto:arnon@qwilt.com] 
Sent: 16. december 2015 12:37
To: Bruce Richardson
Cc: Matthew Hall; dev@dpdk.org; Morten Brørup
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

Two points from our experience saving pcap files from a DPDK 10G fire hose:

1)
Our capture module provides a small "bit-vector" to the code that handles the packets.
Since our packet processing code is already finding out basic facts about the packet traversing it (is it IPv4 or IPv6? is it TCP? is it fragmented? etc.), it sets the relevant bits ON as it goes, so that the capture module can later quickly decide (by masking against the desired filters) whether a packet needs to be captured.

Point is - when a capture layer exposes a slim API that lets it use info coming from other modules, it's easier and less expensive to handle the fire hose.
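
A minimal sketch of what such a bit-vector check could look like, written in Python for brevity. The flag names and the required/forbidden rule shape are assumptions made for this example, not the actual module described above.

```python
from enum import IntFlag


class PktInfo(IntFlag):
    """Facts the packet processing code already learns while handling a packet."""
    IPV4 = 1
    IPV6 = 2
    TCP = 4
    UDP = 8
    FRAGMENT = 16


class CaptureRule:
    """A rule matches when all required bits are set and no forbidden bit is set."""

    def __init__(self, required, forbidden=PktInfo(0)):
        self.required = required
        self.forbidden = forbidden

    def matches(self, bits):
        return (bits & self.required) == self.required and not (bits & self.forbidden)


def should_capture(bits, rules):
    """Capture decision: two mask operations per rule, no packet parsing at all."""
    return any(rule.matches(bits) for rule in rules)
```

For example, a rule with required = IPV4 | TCP and forbidden = FRAGMENT selects unfragmented IPv4 TCP packets without the capture layer ever looking at the headers itself.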

2)
In many cases we are interested in capturing complete TCP flows, or at least the first X packets of them.

In this case, a more expensive filter may be applied only to the SYN packet; when it matches, it turns ON a bit in the TCP flow's application context that says we want to capture any packet falling under this tuple.

Point is - application-level filters of different costs are applied to different packet types, using the mask from the previous bullet.
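
The flow-level idea can be sketched roughly as follows (Python, illustrative only). The class name, the 4-tuple key and the per-flow packet budget are assumptions for the sake of the example; a real flow table would also need aging and removal of entries, which is left out here.

```python
class FlowCapture:
    """Run an expensive filter once per flow (on the SYN), then capture by flow lookup."""

    def __init__(self, syn_filter, max_packets):
        self.syn_filter = syn_filter    # expensive predicate, called only for new flows
        self.max_packets = max_packets  # capture at most this many packets per flow
        self.remaining = {}             # flow key -> packets still to capture

    @staticmethod
    def flow_key(src, sport, dst, dport):
        # Order the two endpoints so both directions map to the same key.
        a, b = (src, sport), (dst, dport)
        return (a, b) if a <= b else (b, a)

    def should_capture(self, src, sport, dst, dport, is_syn):
        key = self.flow_key(src, sport, dst, dport)
        if is_syn and key not in self.remaining:
            matched = self.syn_filter(src, sport, dst, dport)
            self.remaining[key] = self.max_packets if matched else 0
        left = self.remaining.get(key, 0)
        if left == 0:
            return False
        self.remaining[key] = left - 1
        return True
```

Every packet after the SYN costs one dictionary lookup, regardless of how expensive the SYN filter is.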

Such a model would obviously need to be optional in a formal capture layer, but when dealing with a fire hose I find it very useful.

/Arnon


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-16 11:56             ` Bruce Richardson
@ 2015-12-16 12:26               ` Morten Brørup
  2015-12-16 13:12                 ` Bruce Richardson
  2015-12-16 18:15               ` Matthew Hall
  1 sibling, 1 reply; 26+ messages in thread
From: Morten Brørup @ 2015-12-16 12:26 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Bruce,

Please note that tcpdump is a stupid name for a packet capture application that supports much more than just TCP.

I had missed the point about ethdev supporting virtual interfaces, so thank you for pointing that out. That covers my concerns about capturing packets inside tunnels.

I will gladly admit that you Intel guys are probably much more competent in the field of DPDK performance and scalability than I am. So Matthew and I have been asking you to kindly ensure that your solution scales well at very high packet rates too, and pointing out that filtering before copying is probably cheaper than copying before filtering. You mention that it leads to an important choice about which lcores get to do the work of filtering the packets, so that might be worth some discussion.

:-)

Med venlig hilsen / kind regards
- Morten Brørup



* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-16 12:26               ` Morten Brørup
@ 2015-12-16 13:12                 ` Bruce Richardson
  2015-12-16 22:45                   ` Morten Brørup
  0 siblings, 1 reply; 26+ messages in thread
From: Bruce Richardson @ 2015-12-16 13:12 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev

On Wed, Dec 16, 2015 at 01:26:11PM +0100, Morten Brørup wrote:
> Bruce,
> 
> Please note that tcpdump is a stupid name for a packet capture application that supports much more than just TCP.
> 
> I had missed the point about ethdev supporting virtual interfaces, so thank you for pointing that out. That covers my concerns about capturing packets inside tunnels.
> 
> I will gladly admit that you Intel guys are probably much more competent in the field of DPDK performance and scalability than I am. So Matthew and I have been asking you to kindly ensure that your solution scales well at very high packet rates too, and pointing out that filtering before copying is probably cheaper than copying before filtering. You mention that it leads to an important choice about which lcores get to do the work of filtering the packets, so that might be worth some discussion.
> 
> :-)
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 

Thanks for your support.

We may look at having a certain amount of flexibility in the configuration of
the setup, so as to avoid limiting the use of the functionality.

For scalability at very high packet rates, it's something we'll need you guys to
give us pointers on too - what's acceptable or not inside an app, and what
level of scalability is needed. I'd admit that most of our initial thinking in this
area was for debugging apps at less than line rate, i.e. for functional testing.
For full line rate introspection, we'll have to see when we get some working code.

/Bruce



* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-16 11:56             ` Bruce Richardson
  2015-12-16 12:26               ` Morten Brørup
@ 2015-12-16 18:15               ` Matthew Hall
  2015-12-21 15:39                 ` Bruce Richardson
  1 sibling, 1 reply; 26+ messages in thread
From: Matthew Hall @ 2015-12-16 18:15 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev, Morten Brørup

On Wed, Dec 16, 2015 at 11:56:11AM +0000, Bruce Richardson wrote:
> Having this work with any application is one of our primary targets here. 
> The app author should not have to worry too much about getting basic debug 
> support. Even if it doesn't work at 40G small packet rates, you can get a 
> lot of benefit from a scheme that provides functional debugging for an app. 

I think my issue is that I don't think I buy into this particular set of 
assumptions above.

I don't think a capture mechanism that doesn't work right in the real use 
cases of the apps actually buys us much. If all we care about is quickly 
dumping some frames to a pcap for occasional debugging, I already have some C 
code for that I can donate which is a lot less complicated than the trouble 
being proposed for "basic debug support". Or we could use libpcap's 
equivalent... but it's quite a lot more complicated than the code I have.

If we're going to assign engineers to this it's costing somebody a lot of time 
and money. So I'd prefer to get them focused on something that will always 
work even with high loads, such as real bpfjit support.

Matthew.


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-16 13:12                 ` Bruce Richardson
@ 2015-12-16 22:45                   ` Morten Brørup
  2015-12-16 23:38                     ` Matthew Hall
  0 siblings, 1 reply; 26+ messages in thread
From: Morten Brørup @ 2015-12-16 22:45 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Bruce,

Matthew presented a very important point a few hours ago: We don't need tcpdump support for debugging the application in a lab; we already have plenty of other tools for debugging what we are developing. We need tcpdump support for debugging network issues in a production network.

In my "hardened network appliance" world, a solution designed purely for legacy applications (tcpdump, Wireshark etc.) is useless because the network technician doesn't have access to these applications on the appliance.

While a PC system running a DPDK based application might have plenty of spare lcores for filtering, the SmartShare appliances are already using all lcores for dedicated purposes, so the runtime filtering has to be done by the IO lcores (otherwise we would have to rehash everything and reallocate some lcores for mirroring, which I strongly oppose). Our non-DPDK firmware has also always been filtering directly in the fast path.

If the filter is so complex that it unexpectedly degrades the normal traffic forwarding performance, the mirror still reflects all the forwarded network traffic, not just some of it. In many real life network debugging scenarios this is better than the alternative: keeping the traffic forwarding up at full performance and having a network technician trying to understand a mirror output where some of the relevant packets are unexpectedly missing.

Although it is generally considered bad design if a system's behavior (or performance) changes unexpectedly when debugging features are being used, experienced network technicians have already grown accustomed to the performance of most non-trivial network equipment depending on the number of features enabled and how it is configured, so reality might beat theory here. (Still, other companies might prefer to keep their fast path performance unaffected and dedicate/reallocate some lcores for filtering.)

I am probably repeating myself here, but I would prefer if the DPDK provided the packet capturing framework in the form of a set of efficient libraries for 1. BPF filtering (e.g. a simple BPF interpreter or a DPDK variant of bpfjit), 2. scalable packet queueing for the mirrored packets (probably multi producer, single or multi consumer), as well as 3. high resolution time stamping (preferably easily convertible to the pcap file packet timestamp format). Then the DPDK application can take care of interfacing to the attached application and outputting the mirrored packets to the appropriate destination, e.g. a pcap file, a Wireshark extcap named pipe, a dedicated RSPAN VLAN, or an ERSPAN tunnel. And an example application should show how to bind all this together in a tcpdump-like scenario for debugging a production network.
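
To make item 1 concrete, here is a minimal sketch of a classic BPF interpreter. It is written in Python for brevity rather than in DPDK's C, it handles only the four opcodes needed by a trivial filter, and the names parse_ddd and run_filter are hypothetical, not an existing DPDK or libpcap API. The input is the decimal text format printed by "tcpdump -ddd"; the program shown is the kind of output that command gives for the filter "ip" on Ethernet, where the snap length in the accept instruction varies between tcpdump versions.

```python
# Sketch only: a tiny classic BPF interpreter covering four opcodes.
# Names (parse_ddd, run_filter) are hypothetical, not a DPDK or libpcap API.

LDH_ABS = 0x28   # A = 16-bit big-endian load at absolute offset k
LDB_ABS = 0x30   # A = 8-bit load at absolute offset k
JEQ_K = 0x15     # if A == k skip jt instructions, else skip jf instructions
RET_K = 0x06     # return constant k (number of bytes to capture)


def parse_ddd(text):
    """Parse 'tcpdump -ddd' output: a count line, then 'code jt jf k' lines."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    count = int(lines[0][0])
    prog = [tuple(int(field) for field in ln) for ln in lines[1:]]
    if len(prog) != count:
        raise ValueError("instruction count does not match header line")
    return prog


def run_filter(prog, pkt):
    """Return how many bytes of pkt to mirror; 0 means do not mirror."""
    acc = 0
    pc = 0
    while pc < len(prog):
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == LDH_ABS:
            if k + 2 > len(pkt):
                return 0          # out-of-bounds load rejects the packet
            acc = int.from_bytes(pkt[k:k + 2], "big")
        elif code == LDB_ABS:
            if k + 1 > len(pkt):
                return 0
            acc = pkt[k]
        elif code == JEQ_K:
            pc += jt if acc == k else jf
        elif code == RET_K:
            return min(k, len(pkt))
        else:
            raise ValueError(f"opcode {code:#x} not handled in this sketch")
    return 0  # program ran off the end without a return


# Filter "ip": load the EtherType, compare with 0x0800, accept or reject.
IP_FILTER = parse_ddd("""4
40 0 0 12
21 0 1 2048
6 0 0 262144
6 0 0 0
""")

ipv4_frame = bytes(12) + b"\x08\x00" + bytes(46)   # EtherType 0x0800
arp_frame = bytes(12) + b"\x08\x06" + bytes(46)    # EtherType 0x0806
print(run_filter(IP_FILTER, ipv4_frame))  # whole 60-byte frame is mirrored
print(run_filter(IP_FILTER, arp_frame))   # 0, frame is not mirrored
```

A real library would of course compile or JIT the program once and run it against mbuf data; the sketch only shows the calling convention of "program plus packet in, capture length out".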

A note about timestamps: In theory, the captured packets should be time stamped as early as possible. In practice though, it is probably sufficiently accurate to time stamp the accepted packets after filtering, especially if they are filtered by an IO lcore. Alternatively, they can be time stamped when consumed from the mirror output queue.
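
As an illustration of "easily convertible to the pcap file packet timestamp format", here is a small sketch in Python. It assumes the packet is stamped with a free-running cycle counter, and that the counter frequency and a wall-clock reference taken at start-up are known. The function and parameter names are made up for the example. The classic pcap per-packet header stores seconds, microseconds, captured length and original length as four 32-bit fields.

```python
import struct


def cycles_to_pcap_ts(cycles, base_cycles, base_epoch_us, hz):
    """Convert a cycle counter reading to a pcap (ts_sec, ts_usec) pair.

    base_cycles and base_epoch_us are a counter reading and the matching
    wall-clock time in microseconds, both sampled once at start-up.
    """
    elapsed_us = (cycles - base_cycles) * 1_000_000 // hz
    return divmod(base_epoch_us + elapsed_us, 1_000_000)


def pcap_record_header(ts_sec, ts_usec, caplen, origlen):
    """Pack the 16-byte per-packet header (little-endian in this sketch)."""
    return struct.pack("<IIII", ts_sec, ts_usec, caplen, origlen)


hz = 2_500_000_000                        # assumed 2.5 GHz counter
base_cycles = 1_000
base_epoch_us = 1_450_306_000_000_000     # assumed start-up wall clock
stamp = base_cycles + 3_750_000_000       # 1.5 seconds after start-up

ts_sec, ts_usec = cycles_to_pcap_ts(stamp, base_cycles, base_epoch_us, hz)
print(ts_sec, ts_usec)
print(len(pcap_record_header(ts_sec, ts_usec, 60, 60)))
```

Because the conversion is plain integer arithmetic, it can be deferred to whoever consumes the mirror output, so the IO lcore only has to store the raw counter value.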

A note about packet ordering: Mirrored packets belonging to different flows are probably out of order because of RSS, where multiple lcores contribute to the mirror output. This packet ordering inaccuracy could also serve as a reason for not being too strict about the accuracy of the timestamps on the mirrored packets.
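
If a consumer does want the mirror output in timestamp order, one possible approach is sketched below in Python. It assumes one mirror queue per lcore, each already ordered by its own timestamps, and merges them when draining. This is an illustration only, not something proposed in this thread, and the record layout (timestamp, lcore id, packet) is hypothetical.

```python
import heapq


def merge_mirror_queues(per_lcore_queues):
    """Merge per-lcore lists of (timestamp, lcore_id, packet) by timestamp.

    Each input list must already be sorted by timestamp.
    """
    return list(heapq.merge(*per_lcore_queues, key=lambda rec: rec[0]))


lcore0 = [(100, 0, b"a1"), (130, 0, b"a2")]
lcore1 = [(110, 1, b"b1"), (120, 1, b"b2")]
merged = merge_mirror_queues([lcore0, lcore1])
print([ts for ts, _, _ in merged])
```

The merged order is only as good as the timestamps themselves, which is the point made above about not being too strict.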

Med venlig hilsen / kind regards
- Morten Brørup

-----Original Message-----
From: Bruce Richardson [mailto:bruce.richardson@intel.com] 
Sent: 16. december 2015 14:13
To: Morten Brørup
Cc: Matthew Hall; Kyle Larose; dev@dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

On Wed, Dec 16, 2015 at 01:26:11PM +0100, Morten Brørup wrote:
> Bruce,
> 
> Please note that tcpdump is a stupid name for a packet capture application that supports much more than just TCP.
> 
> I had missed the point about ethdev supporting virtual interfaces, so thank you for pointing that out. That covers my concerns about capturing packets inside tunnels.
> 
> I will gladly admit that you Intel guys are probably much more competent in the field of DPDK performance and scalability than I am. So Matthew and I have been asking you to kindly ensure that your solution scales well at very high packet rates too, and pointing out that filtering before copying is probably cheaper than copying before filtering. You mention that it leads to an important choice about which lcores get to do the work of filtering the packets, so that might be worth some discussion.
> 
> :-)
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 

Thanks for your support.

We may look at having a certain amount of flexibility in the configuration of the setup, so as to avoid limiting the use of the functionality.

For scalability at very high packet rates, it's something we'll need you guys to give us pointers on too - what's acceptable or not inside an app, and what level of scalability is needed. I'd admit that most of our initial thinking in this area was for debugging apps at less than line rate i.e. for functional testing.
For full line rate introspection, we'll have to see when we get some working code.

/Bruce

> 
> -----Original Message-----
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: 16. december 2015 12:56
> To: Morten Brørup
> Cc: Matthew Hall; Kyle Larose; dev@dpdk.org
> Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3
> 
> On Wed, Dec 16, 2015 at 12:40:43PM +0100, Morten Brørup wrote:
> > Bruce,
> > 
> > This doesn't really sound like tcpdump to me; it sounds like port mirroring.
> 
> It's actually a bit of both, in my opinion: it is designed to allow basic mirroring of traffic on a port so that the traffic can be sent to a tcpdump destination.
> By going with a more generic approach, we hope to enable more possible use cases than just focusing on TCP.
> 
> 
> > 
> > Your suggestion is limited to physical ports only, and cannot be attached further inside the application, e.g. for mirroring packets related to a specific VLAN.
> 
> Yes, the lack of attachment inside the app is a limitation. There are two types of scenarios that could be considered for packet capture:
> * ones where the application can be modified to do its own filtering and capturing.
> * ones where you want a generic capture mechanism which can be used on any application without modification.
> We have chosen to focus more on the second one, as that is where a 
> generic solution for DPDK is likely to lie. For the first case, the 
> application writer himself knows the type of traffic and how best to 
> capture and filter it, so I don't think a generic one-size-fits-all 
> solution is possible. [Though a couple of helper libraries may be of 
> use]
> 
> As for physical ports, the scheme should work for any ethdev - why do you see it only being limited to physical ports? What would you want to see monitored that we are missing?
> 
> > 
> > Furthermore, it doesn't sound like the filtering part scales well. Consider a fully loaded 40 Gbit/s port. You would need to copy all packets into a single rte_ring to the attached filtering process, which would then require its own set of lcores to probably discard most of these packets when filtering. I agree with Matthew that the filtering needs to happen as close to the source as possible, and must be scalable to multiple lcores.
> 
> Without modifying the application itself to do its own filtering I suspect scalability is always going to be a problem. That being said, there is no particular reason why a single rte_ring needs to be used - we could allow one ring per NIC queue for instance. The trouble with filtering at the source itself is that you put extra load on the IO cores. By using a ring, we put the filtering load on extra cores in a secondary process which can be scaled by the user without touching the main app.
> 
> > 
> > On the positive side, your idea has the advantage that the filter can be any application, and is not limited to BPF. However if the purpose is "tcpdump", we should probably consider BPF, which is the type of filtering offered by tcpdump.
> 
> Having this work with any application is one of our primary targets here. The app author should not have to worry too much about getting basic debug support.
> Even if it doesn't work at 40G small packet rates, you can get a lot of benefit from a scheme that provides functional debugging for an app. Obviously, though, we aim to make this as scalable as possible, which is why we want to allow filtering in userspace before sending packets externally to DPDK.
> 
> > 
> > I would prefer having a BPF library available that the application can use at any point, either at the lowest level (when receiving/transmitting Ethernet packets) or at a higher level (e.g. when working with packets that go into or come out of a tunnel). The BPF library should implement packet length and relevant ancillary data, such as SKF_AD_VLAN_TAG etc. based on metadata in the mbuf.
> > 
> > Transferring a BPF filter from an outside application could be done by using a simple text format, e.g. the output format of "tcpdump -ddd". This also opens an easy roadmap for Wireshark integration by simply extending extcap to include such a BPF filter format.
> > 
> > 
> > Lots of negativity above. I very much like the idea of attaching the secondary process and going through an rte_ring. This allows the secondary process to pass the filtered and captured packets on in any format it likes to any destination it likes.
> 
> Good, so we're not completely off-base here. :-)
> 
> /Bruce
> 
> > 
> > 
> > Med venlig hilsen / kind regards
> > - Morten Brørup
> > 
> > -----Original Message-----
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: 16. december 2015 11:45
> > 
> > Hi,
> > 
> > we are currently doing some investigation and prototyping for this feature.
> > Our current thinking is the following:
> > * to allow dynamic control of the filtering, we are thinking of making use of
> >   the multi-process infrastructure in DPDK. A secondary process can attach to a
> >   primary at runtime and provide the packet filtering and dumping capability.
> > * ideally we want to create a generic packet mirroring callback inside the EAL,
> >   that can be set up to mirror packets going through Rx/Tx on an ethdev.
> > * using this, packets being received on the port to be monitored are sent via
> >   an rte_ring (ring ethdev) to the secondary process which takes those packets
> >   and does any filtering on them. [This would be where BPF could fit into
> >   things, but it's not something we have looked at yet.]
> > * initially we plan to have the secondary process then write packets to a pcap
> >   file using a pcap PMD, but down the road if we get other PMDs, like a KNI PMD
> >   or a TAP device PMD, those could be used as targets instead.
> > 
> > This implementation we hope should provide enough hooks to enable the standard tools to be used for monitoring and capturing packets. We will send out draft implementation code for various parts of this as soon as we have it.
> > 
> > Additional feedback welcome, as always. :-)
> > 
> > Regards,
> > /Bruce
> > 
> > 
> 


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-16 22:45                   ` Morten Brørup
@ 2015-12-16 23:38                     ` Matthew Hall
  2015-12-17  5:59                       ` Arnon Warshavsky
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Hall @ 2015-12-16 23:38 UTC (permalink / raw)
  To: Morten B; +Cc: dev

On Wed, Dec 16, 2015 at 11:45:46PM +0100, Morten Brørup wrote:
> Matthew presented a very important point a few hours ago: We don't need 
> tcpdump support for debugging the application in a lab; we already have 
> plenty of other tools for debugging what we are developing. We need tcpdump 
> support for debugging network issues in a production network.

+1

> In my "hardened network appliance" world, a solution designed purely for 
> legacy applications (tcpdump, Wireshark etc.) is useless because the network 
> technician doesn't have access to these applications on the appliance.

Maybe that's true on one exact system. But I've used a whole ton of systems 
including appliances where this was not true. I really do want to find a way 
to support them, but according to my recent discussions w/ Alex Nasonov who 
made bpfjit, I don't think it is possible without really tearing apart 
libpcap. So for now the only good hope is Wireshark's Extcap support.

> While a PC system running a DPDK based application might have plenty of 
> spare lcores for filtering, the SmartShare appliances are already using all 
> lcores for dedicated purposes, so the runtime filtering has to be done by 
> the IO lcores (otherwise we would have to rehash everything and reallocate 
> some lcores for mirroring, which I strongly oppose). Our non-DPDK firmware 
> has also always been filtering directly in the fast path.

The shared process stuff and weird leftover lcore stuff seems way too complex 
for me whether or not there are any spare lcores. To me it seems easier if I 
just call some function and hand it mbufs, and it would quickly check them 
against a linked list of active filters if filters are present, or do nothing 
and return if no filter is active.
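
A rough sketch of that idea, in Python for brevity. The class and method names are hypothetical, a plain list stands in for the linked list, and each filter is assumed to return the number of bytes to capture, or 0 for no match:

```python
class MirrorTap:
    """Holds the active filters; the fast path calls check_burst()."""

    def __init__(self):
        self.filters = []   # (predicate, sink) pairs, empty when idle

    def add_filter(self, predicate, sink):
        self.filters.append((predicate, sink))

    def check_burst(self, pkts):
        """Mirror matching packets; return how many copies were made."""
        if not self.filters:
            return 0        # nothing active: return immediately
        mirrored = 0
        for pkt in pkts:
            for predicate, sink in self.filters:
                snap = predicate(pkt)
                if snap:
                    sink.append(pkt[:snap])   # copy only what was asked for
                    mirrored += 1
        return mirrored


def ipv4_first_64(pkt):
    """Example filter: keep up to 64 bytes of IPv4-over-Ethernet frames."""
    return 64 if pkt[12:14] == b"\x08\x00" else 0


tap = MirrorTap()
burst = [bytes(12) + b"\x08\x00" + bytes(86), bytes(12) + b"\x08\x06" + bytes(46)]
print(tap.check_burst(burst))        # no filter installed yet
captured = []
tap.add_filter(ipv4_first_64, captured)
print(tap.check_burst(burst))        # one IPv4 frame matches
print(len(captured[0]))
```

With no filter installed, the cost per burst is a single emptiness check.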

> If the filter is so complex that it unexpectedly degrades the normal traffic 
> forwarding performance

If bpfjit is used, I think it is very hard to affect the performance much. 
Unless you do something incredibly crazy.

> Although it is generally considered bad design if a system's behavior (or 
> performance) changes unexpectedly when debugging features are being used, 

I think we can keep the behavior change quite small using something like what 
I described.

> Other companies might prefer to keep their fast path performance unaffected 
> and dedicate/reallocate some lcores for filtering.

It always starts out unaffected... then goes back to accepting a bit of 
slowness when people are forced to re-learn how bad it is with no debugging. I 
have seen it again and again in many companies. Hence my proposal for 
efficient lightweight debugging support from the beginning.

> 1. BPF filtering (... a DPDK variant of bpfjit),

+1

> 2. scalable packet queueing for the mirrored packets (probably multi 
> producer, single or multi consumer)

I hate queueing. Queueing always reduces max possible throughput because 
queueing is inefficient. It is better just to put them where they need to go 
immediately (run to completion) while the mbufs are already prefetched.

> Then the DPDK application can take care of interfacing to 
> the attached application and outputting the mirrored packets to the 
> appropriate destination

Too complicated. Pcap and extcap should be working by default.

> A note about packet ordering: Mirrored packets belonging to different flows 
> are probably out of order because of RSS, where multiple lcores contribute 
> to the mirror output.

Where I worry is weird configurations where a flow can occur in >1 cores. But 
I think most users try not to do this.


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-16 23:38                     ` Matthew Hall
@ 2015-12-17  5:59                       ` Arnon Warshavsky
  0 siblings, 0 replies; 26+ messages in thread
From: Arnon Warshavsky @ 2015-12-17  5:59 UTC (permalink / raw)
  To: Matthew Hall; +Cc: dev, Morten B

Filtering and serializing are 2 different components.
No need to bind them by default, and nothing prevents you from calling them
both from the same context if that is what works for your use case.
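
A small sketch of what I mean, in Python for brevity and with made-up names: the filter and the serializer know nothing about each other, and a third function calls both from the same context. The serializer writes the classic pcap file format (24-byte global header, 16-byte record headers, little-endian here, link type 1 for Ethernet).

```python
import io
import struct


def accept_ipv4(pkt):
    """Filtering component: bytes to keep, or 0 to drop."""
    return len(pkt) if pkt[12:14] == b"\x08\x00" else 0


def write_pcap_header(out, snaplen=65535, linktype=1):
    """Serializing component, part 1: the pcap global header."""
    out.write(struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, snaplen, linktype))


def write_pcap_record(out, ts_sec, ts_usec, data, origlen):
    """Serializing component, part 2: one packet record."""
    out.write(struct.pack("<IIII", ts_sec, ts_usec, len(data), origlen))
    out.write(data)


def capture(pkts, accept, out):
    """Glue: runs the filter and the serializer in the same context."""
    write_pcap_header(out)
    kept = 0
    for ts_sec, ts_usec, pkt in pkts:
        snap = accept(pkt)
        if snap:
            write_pcap_record(out, ts_sec, ts_usec, pkt[:snap], len(pkt))
            kept += 1
    return kept


pkts = [
    (1450306000, 0, bytes(12) + b"\x08\x00" + bytes(46)),    # IPv4, kept
    (1450306000, 10, bytes(12) + b"\x08\x06" + bytes(46)),   # ARP, dropped
]
buf = io.BytesIO()
print(capture(pkts, accept_ipv4, buf))
print(len(buf.getvalue()))   # 24 + 16 + 60 bytes
```

Swapping accept_ipv4 for a BPF program, or the pcap writer for a ring enqueue, changes only the glue function.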


On Thu, Dec 17, 2015 at 1:38 AM, Matthew Hall <mhall@mhcomputing.net> wrote:

> On Wed, Dec 16, 2015 at 11:45:46PM +0100, Morten Brørup wrote:
> > Matthew presented a very important point a few hours ago: We don't need
> > tcpdump support for debugging the application in a lab; we already have
> > plenty of other tools for debugging what we are developing. We need tcpdump
> > support for debugging network issues in a production network.
>
> +1
>
> > In my "hardened network appliance" world, a solution designed purely for
> > legacy applications (tcpdump, Wireshark etc.) is useless because the network
> > technician doesn't have access to these applications on the appliance.
>
> Maybe that's true on one exact system. But I've used a whole ton of systems
> including appliances where this was not true. I really do want to find a way
> to support them, but according to my recent discussions w/ Alex Nasonov who
> made bpfjit, I don't think it is possible without really tearing apart
> libpcap. So for now the only good hope is Wireshark's Extcap support.
>
> > While a PC system running a DPDK based application might have plenty of
> > spare lcores for filtering, the SmartShare appliances are already using all
> > lcores for dedicated purposes, so the runtime filtering has to be done by
> > the IO lcores (otherwise we would have to rehash everything and reallocate
> > some lcores for mirroring, which I strongly oppose). Our non-DPDK firmware
> > has also always been filtering directly in the fast path.
>
> The shared process stuff and weird leftover lcore stuff seems way too complex
> for me whether or not there are any spare lcores. To me it seems easier if I
> just call some function and hand it mbufs, and it would quickly check them
> against a linked list of active filters if filters are present, or do nothing
> and return if no filter is active.
>
> > If the filter is so complex that it unexpectedly degrades the normal traffic
> > forwarding performance
>
> If bpfjit is used, I think it is very hard to affect the performance much.
> Unless you do something incredibly crazy.
>
> > Although it is generally considered bad design if a system's behavior (or
> > performance) changes unexpectedly when debugging features are being used,
>
> I think we can keep the behavior change quite small using something like what
> I described.
>
> > Other companies might prefer to keep their fast path performance unaffected
> > and dedicate/reallocate some lcores for filtering.
>
> It always starts out unaffected... then goes back to accepting a bit of
> slowness when people are forced to re-learn how bad it is with no debugging. I
> have seen it again and again in many companies. Hence my proposal for
> efficient lightweight debugging support from the beginning.
>
> > 1. BPF filtering (... a DPDK variant of bpfjit),
>
> +1
>
> > 2. scalable packet queueing for the mirrored packets (probably multi
> > producer, single or multi consumer)
>
> I hate queueing. Queueing always reduces max possible throughput because
> queueing is inefficient. It is better just to put them where they need to go
> immediately (run to completion) while the mbufs are already prefetched.
>
> > Then the DPDK application can take care of interfacing to
> > the attached application and outputting the mirrored packets to the
> > appropriate destination
>
> Too complicated. Pcap and extcap should be working by default.
>
> > A note about packet ordering: Mirrored packets belonging to different flows
> > are probably out of order because of RSS, where multiple lcores contribute
> > to the mirror output.
>
> Where I worry is weird configurations where a flow can occur in >1 cores. But
> I think most users try not to do this.
>



-- 

*Arnon Warshavsky*
*Qwilt | work: +972-72-2221634 | mobile: +972-50-8583058 | arnon@qwilt.com
<arnon@qwilt.com>*


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-16 18:15               ` Matthew Hall
@ 2015-12-21 15:39                 ` Bruce Richardson
  2015-12-21 16:08                   ` Morten Brørup
  2015-12-21 16:11                   ` Gray, Mark D
  0 siblings, 2 replies; 26+ messages in thread
From: Bruce Richardson @ 2015-12-21 15:39 UTC (permalink / raw)
  To: Matthew Hall; +Cc: dev, Morten Brørup

On Wed, Dec 16, 2015 at 01:15:57PM -0500, Matthew Hall wrote:
> On Wed, Dec 16, 2015 at 11:56:11AM +0000, Bruce Richardson wrote:
> > Having this work with any application is one of our primary targets here. 
> > The app author should not have to worry too much about getting basic debug 
> > support. Even if it doesn't work at 40G small packet rates, you can get a 
> > lot of benefit from a scheme that provides functional debugging for an app. 
> 
> I think my issue is that I don't think I buy into this particular set of 
> assumptions above.
> 
> I don't think a capture mechanism that doesn't work right in the real use 
> cases of the apps actually buys us much. If all we care about is quickly 
> dumping some frames to a pcap for occasional debugging, I already have some C 
> code for that I can donate which is a lot less complicated than the trouble 
> being proposed for "basic debug support". Or we could use libpcap's 
> equivalent... but it's quite a lot more complicated than the code I have.
> 
> If we're going to assign engineers to this it's costing somebody a lot of time 
> and money. So I'd prefer to get them focused on something that will always 
> work even with high loads, such as real bpfjit support.
> 
> Matthew.

Hi,

I think it basically boils down to the fact that we are trying to solve different
problems. Our current focus is the generic usability of all DPDK applications,
as discussed at the DPDK Userspace Summit. Our plan is to provide some way to
allow standard packet capture apps, such as tcpdump, to be used easily with
DPDK. This is something also being looked for by folks such as those working
on OVS e.g. called out at http://openvswitch.org/pipermail/dev/2015-August/058814.html

  "- Insight into the system and debuggability: nothing beats tcpdump for the
    kernel datapath.  Can something similar be done for the userspace
    datapath?

  - Consistency of the tools: some commands are slightly different for the
    userspace/kernel datapath.  Ideally there shouldn't be any difference."

Providing libraries for packet capture at high packet rates is a related, but
different problem, that we'll maybe look to investigate in the future - assuming
that nobody else solves it first.

/Bruce


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-21 15:39                 ` Bruce Richardson
@ 2015-12-21 16:08                   ` Morten Brørup
  2015-12-21 16:17                     ` Gray, Mark D
  2015-12-21 16:11                   ` Gray, Mark D
  1 sibling, 1 reply; 26+ messages in thread
From: Morten Brørup @ 2015-12-21 16:08 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Bruce,

Please reconsider your interpretation of the word "debuggability". Debugging is not only something that R&D staff does in a lab. Debuggability can also be interpreted as a network engineer's ability to debug what is happening in a production network.

Referring to the link you kindly provided (to the discussion on the OVS mailing list), in my eyes the context of the itemized requirements is a production environment, not a development environment. Daniele Di Proietto wrote:

>I think we can agree that there are a few rough spots that prevent it from being easily deployed and used.

>I was hoping to get some feedback from the community about those rough spots, i.e. areas where OVS+DPDK can/needs to improve to become more "production ready" and user-friendly.

Med venlig hilsen / kind regards
- Morten Brørup

-----Original Message-----
From: Bruce Richardson [mailto:bruce.richardson@intel.com] 
Sent: 21. december 2015 16:40
To: Matthew Hall
Cc: Morten Brørup; Kyle Larose; dev@dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

On Wed, Dec 16, 2015 at 01:15:57PM -0500, Matthew Hall wrote:
> On Wed, Dec 16, 2015 at 11:56:11AM +0000, Bruce Richardson wrote:
> > Having this work with any application is one of our primary targets here. 
> > The app author should not have to worry too much about getting basic 
> > debug support. Even if it doesn't work at 40G small packet rates, 
> > you can get a lot of benefit from a scheme that provides functional debugging for an app.
> 
> I think my issue is that I don't think I buy into this particular set 
> of assumptions above.
> 
> I don't think a capture mechanism that doesn't work right in the real 
> use cases of the apps actually buys us much. If all we care about is 
> quickly dumping some frames to a pcap for occasional debugging, I 
> already have some C code for that I can donate which is a lot less 
> complicated than the trouble being proposed for "basic debug support". 
> Or we could use libpcap's equivalent... but it's quite a lot more complicated than the code I have.
> 
> If we're going to assign engineers to this it's costing somebody a lot 
> of time and money. So I'd prefer to get them focused on something that 
> will always work even with high loads, such as real bpfjit support.
> 
> Matthew.

Hi,

I think it basically boils down to the fact that we are trying to solve different problems. Our current focus is the generic usability of all DPDK applications, as discussed at the DPDK Userspace Summit. Our plan is to provide some way to allow standard packet capture apps, such as tcpdump, to be used easily with DPDK. This is something also being looked for by folks such as those working on OVS e.g. called out at http://openvswitch.org/pipermail/dev/2015-August/058814.html

  "- Insight into the system and debuggability: nothing beats tcpdump for the
    kernel datapath.  Can something similar be done for the userspace
    datapath?

  - Consistency of the tools: some commands are slightly different for the
    userspace/kernel datapath.  Ideally there shouldn't be any difference."

Providing libraries for packet capture at high packet rates is a related, but different problem, that we'll maybe look to investigate in the future - assuming that nobody else solves it first.

/Bruce


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-21 15:39                 ` Bruce Richardson
  2015-12-21 16:08                   ` Morten Brørup
@ 2015-12-21 16:11                   ` Gray, Mark D
  1 sibling, 0 replies; 26+ messages in thread
From: Gray, Mark D @ 2015-12-21 16:11 UTC (permalink / raw)
  To: Richardson, Bruce, Matthew Hall; +Cc: dev, Morten Brørup

> This is something also being looked for by folks such as those
> working on OVS e.g. called out at
> http://openvswitch.org/pipermail/dev/2015-August/058814.html
> 
>   "- Insight into the system and debuggability: nothing beats tcpdump for the
>     kernel datapath.  Can something similar be done for the userspace
>     datapath?
> 
>   - Consistency of the tools: some commands are slightly different for the
>     userspace/kernel datapath.  Ideally there shouldn't be any difference."
> 

I had a painful experience with OVS-DPDK recently which may be representative
of a typical usability issue encountered. 

I was trying to connect two Openstack compute nodes together.  I had done
the configuration without DPDK first. It was easy to debug as I could use
tcpdump to look at the eth ports and see what type of traffic
was entering the compute node. I also needed to check if the traffic
was actually VxLAN traffic and what the VNI was in order to be able to
follow the traffic around the bridges in OVS. This all went quite well and
I was able to bring up my set up quite easily. 

Then I tried to set up the same thing with DPDK. I couldn't get traffic between
the compute nodes but I had no easy way to just dump the traffic coming into
(or out of) the compute node. Of course, there were some things I could do but,
for me, DPDK would be far more usable if I could just use tcpdump. As I know
DPDK to some extent, I can usually get around these problems but I suspect
that a new user to DPDK  would get very discouraged and frustrated by an 
experience like that. 

I'm not sure how often tcpdump is used in production environments but it is
very useful when debugging a live system without having to modify code. It would be
good if it could work at high rates and be really flexible but it probably makes
sense to focus on the basics first.


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-21 16:08                   ` Morten Brørup
@ 2015-12-21 16:17                     ` Gray, Mark D
  2015-12-21 17:22                       ` Matthew Hall
  0 siblings, 1 reply; 26+ messages in thread
From: Gray, Mark D @ 2015-12-21 16:17 UTC (permalink / raw)
  To: Morten Brørup, Richardson,  Bruce; +Cc: dev

> Bruce,
> 
> Please reconsider your interpretation of the word "debuggability".
> Debugging is not only something that R&D staff does in a lab. Debuggability
> can also be interpreted as a network engineer's ability to debug what is
> happening in a production network.

Is tcpdump used in large production cloud environments? I would have 
thought other less intrusive (and less manual) tools would be used? Isn't
that one of the benefits of SDN?


* Re: [dpdk-dev] tcpdump support in DPDK 2.3
  2015-12-21 16:17                     ` Gray, Mark D
@ 2015-12-21 17:22                       ` Matthew Hall
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Hall @ 2015-12-21 17:22 UTC (permalink / raw)
  To: Gray, Mark D; +Cc: dev, Morten Brørup

On Mon, Dec 21, 2015 at 04:17:26PM +0000, Gray, Mark D wrote:
> Is tcpdump used in large production cloud environments? I would have 
> thought other less intrusive (and less manual) tools would be used? Isn't
> that one of the benefits of SDN?

tcpdump, tshark, wireshark, libpcap, etc. have been used every single place I 
ever worked, including in production under heavy load.

This is because nobody wants to redo the library of many tens of thousands of 
hours of protocol dissectors.

This is also why I am trying to point out what is required to get a solution 
that I am confident will really work when people are counting on it, which I 
am concerned the current proposals do not cover.

Matthew.


Thread overview: 26+ messages
2015-12-14  9:57 [dpdk-dev] tcpdump support in DPDK 2.3 Morten Brørup
2015-12-14 15:45 ` Aaron Conole
2015-12-14 15:48   ` Thomas Monjalon
2015-12-14 18:29 ` Matthew Hall
2015-12-14 19:14   ` Stephen Hemminger
2015-12-14 22:23     ` Matthew Hall
2015-12-14 19:17   ` Aaron Conole
2015-12-14 21:29     ` Kyle Larose
2015-12-14 22:36       ` Matthew Hall
2015-12-16 10:45         ` Bruce Richardson
2015-12-16 11:37           ` Arnon Warshavsky
2015-12-16 11:56             ` Morten Brørup
2015-12-16 11:40           ` Morten Brørup
2015-12-16 11:56             ` Bruce Richardson
2015-12-16 12:26               ` Morten Brørup
2015-12-16 13:12                 ` Bruce Richardson
2015-12-16 22:45                   ` Morten Brørup
2015-12-16 23:38                     ` Matthew Hall
2015-12-17  5:59                       ` Arnon Warshavsky
2015-12-16 18:15               ` Matthew Hall
2015-12-21 15:39                 ` Bruce Richardson
2015-12-21 16:08                   ` Morten Brørup
2015-12-21 16:17                     ` Gray, Mark D
2015-12-21 17:22                       ` Matthew Hall
2015-12-21 16:11                   ` Gray, Mark D
2015-12-14 22:25     ` Matthew Hall
