DPDK patches and discussions
From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
To: "Morten Brørup" <mb@smartsharesystems.com>,
	"Stephen Hemminger" <stephen@networkplumber.org>
Cc: dpdk-dev <dev@dpdk.org>, Jerin Jacob <jerinj@marvell.com>
Subject: Re: [dpdk-dev] packet data access bug in bpf and pdump libs
Date: Thu, 10 Oct 2019 15:36:30 +0000	[thread overview]
Message-ID: <2601191342CEEE43887BDE71AB9772580191975172@irsmsx105.ger.corp.intel.com>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35C60B6B@smartserver.smartshare.dk>



> -----Original Message-----
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Thursday, October 10, 2019 8:30 AM
> To: Stephen Hemminger <stephen@networkplumber.org>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dpdk-dev <dev@dpdk.org>; Jerin Jacob <jerinj@marvell.com>
> Subject: RE: [dpdk-dev] packet data access bug in bpf and pdump libs
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
> > Sent: Wednesday, October 9, 2019 7:25 PM
> >
> > On Wed, 9 Oct 2019 17:20:58 +0200
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >
> > > > -----Original Message-----
> > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > Sent: Wednesday, October 9, 2019 5:15 PM
> > > >
> > > > On Wed, 9 Oct 2019 17:06:24 +0200
> > > > Morten Brørup <mb@smartsharesystems.com> wrote:
> > > >
> > > > > > -----Original Message-----
> > > > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > > > Sent: Wednesday, October 9, 2019 5:02 PM
> > > > > >
> > > > > > On Wed, 9 Oct 2019 11:11:46 +0000
> > > > > > "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> > > > > >
> > > > > > > Hi Morten,
> > > > > > >
> > > > > > > >
> > > > > > > > Hi Konstantin and Stephen,
> > > > > > > >
> > > > > > > > I just noticed the same bug in your bpf and pcap libraries:
> > > > > > > >
> > > > > > > > You are using rte_pktmbuf_mtod(), but should be using
> > > > > > rte_pktmbuf_read(). Otherwise you cannot read data across
> > multiple
> > > > > > segments.
> > > > > > >
> > > > > > > In plain data buffer mode expected input for BPF program is
> > start
> > > > of
> > > > > > first segment packet data.
> > > > > > > Other segments are simply not available to BPF program in
> > that
> > > > mode.
> > > > > > > AFAIK, cBPF uses the same model.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Med venlig hilsen / kind regards
> > > > > > > > - Morten Brørup
> > > > > > >
> > > > > >
> > > > > > For packet capture, the BPF program is only allowed to look at
> > > > first
> > > > > > segment.
> > > > > > pktmbuf_read is expensive and can cause a copy.
> > > > >
> > > > > It is only expensive if going beyond the first segment:
> > > > >
> > > > > static inline const void *rte_pktmbuf_read(const struct rte_mbuf
> > *m,
> > > > > 	uint32_t off, uint32_t len, void *buf)
> > > > > {
> > > > > 	if (likely(off + len <= rte_pktmbuf_data_len(m)))
> > > > > 		return rte_pktmbuf_mtod_offset(m, char *, off);
> > > > > 	else
> > > > > 		return __rte_pktmbuf_read(m, off, len, buf);
> > > > > }
> > > >
> > > > But it would mean potentially big buffer on the stack (in case)
> > >
> > > No, the buffer only needs to be the size of the accessed data. I use
> > it like this:
> > >
> > > char buffer[sizeof(uint32_t)];
> > >
> > > for (;; pc++) {
> > >     switch (pc->code) {
> > >         case BPF_LD_ABS_32:
> > >             p = rte_pktmbuf_read(m, pc->k, sizeof(uint32_t), buffer);
> > >             if (unlikely(p == NULL)) return 0; /* Attempting to read
> > beyond packet. Bail out. */
> > >             a = rte_be_to_cpu_32(*(const uint32_t *)p);
> > >             continue;
> > >         case BPF_LD_ABS_16:
> > >             p = rte_pktmbuf_read(m, pc->k, sizeof(uint16_t), buffer);
> > >             if (unlikely(p == NULL)) return 0; /* Attempting to read
> > beyond packet. Bail out. */
> > >             a = rte_be_to_cpu_16(*(const uint16_t *)p);
> > >             continue;
> > >
> >
> > Reading down the chain of mbuf segments to find a uint32_t (one that
> > potentially crosses a segment boundary)
> > seems like a waste.

+1
Again, just imagine how painful it would be to support that in the JIT...

Another thing: in librte_bpf, if RTE_BPF_ARG_PTR is used,
it means the input parameter for exec/JIT is just a plain data buffer, and the
library makes no assumptions about what backs it (mbuf or not, etc.).

That said, it is possible to access mbuf metadata
and call rte_pktmbuf_read() or other functions from your eBPF code,
but for that you need to specify RTE_BPF_ARG_PTR_MBUF at load time.
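
For illustration only - a minimal sketch, not code from this thread, of what loading
with RTE_BPF_ARG_PTR_MBUF looks like. The instruction array and xsym table names
(my_filter_ins, my_xsym, etc.) are placeholders for whatever your eBPF program provides:

#include <rte_bpf.h>
#include <rte_mbuf.h>

/* Placeholder eBPF instructions and external symbols (e.g. to expose
 * rte_pktmbuf_read() to the program); supplied by your own code or ELF object. */
extern const struct ebpf_insn my_filter_ins[];
extern const uint32_t my_filter_nb_ins;
extern const struct rte_bpf_xsym my_xsym[];
extern const uint32_t my_nb_xsym;

static struct rte_bpf *
load_mbuf_filter(void)
{
	struct rte_bpf_prm prm = {
		.ins = my_filter_ins,
		.nb_ins = my_filter_nb_ins,
		.xsym = my_xsym,
		.nb_xsym = my_nb_xsym,
		.prog_arg = {
			/* input argument is an mbuf, not a plain data buffer */
			.type = RTE_BPF_ARG_PTR_MBUF,
			.size = sizeof(struct rte_mbuf),
			.buf_size = RTE_MBUF_DEFAULT_BUF_SIZE,
		},
	};

	return rte_bpf_load(&prm);
}

/* The loaded program is then executed with the mbuf itself as context: */
static inline uint64_t
run_filter(const struct rte_bpf *bpf, struct rte_mbuf *pkt)
{
	return rte_bpf_exec(bpf, pkt);
}

With RTE_BPF_ARG_PTR the same program would instead be executed with
rte_pktmbuf_mtod(pkt, void *) as the context, which is exactly the
first-segment-only model described above.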

> >
> 
> Slow and painful is the only way to read beyond the first segment, I agree.
> 
> But when reading from the first segment, rte_pktmbuf_read() basically does the same as your code. So there shouldn't be any performance
> penalty from supporting both by using rte_pktmbuf_read() instead.
> 
> I think the modification in the pdump library is simple, as you already pass the mbuf. But the bpf library requires more work, as it passes a
> pointer to the data in the first segment to the processing function instead of passing the mbuf.
> 
> > The purpose of the filter is to look at packet headers.
> 
> Some might look deeper. So why prevent it? E.g. our StraightShaper appliance sometimes looks deeper, but for performance reasons we
> stopped using BPF for this a long time ago.
> 
> > Any driver
> > making mbufs that
> > are dribbles of data is broken.
> 
> I agree very much with you on this regarding general-purpose NICs! Although I know of an exception that confirms the rule... a few years
> ago we worked with a component vendor with some very clever performance optimizations doing exactly this for specific purposes.
> Unfortunately it's under NDA, so I can't go into details.
> 
> > chaining is really meant for case of jumbo or tso.
> >
> 
> Try thinking beyond PMD ingress. There are multiple use cases in egress. Here are a couple:
> 
> - IP Multicast to multiple subnets on a Layer 3 switch. The VLAN ID and Source MAC must be replaced in each packet; this can be done using
> segments.
> - Tunnel encapsulation. E.g. putting a packet into a VXLAN tunnel could be done using segments.
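
A minimal, hypothetical sketch of the second case (names and header layout are
placeholders, not an implementation from this thread): the outer
Ethernet/IP/UDP/VXLAN headers go into a freshly allocated first segment and the
original packet is chained behind it instead of being copied:

#include <string.h>
#include <rte_mbuf.h>

static struct rte_mbuf *
encap_with_segment(struct rte_mempool *hdr_pool, struct rte_mbuf *pkt,
		const void *outer_hdr, uint16_t hdr_len)
{
	struct rte_mbuf *head = rte_pktmbuf_alloc(hdr_pool);
	char *dst;

	if (head == NULL)
		return NULL;

	/* copy only the outer headers into the new first segment */
	dst = rte_pktmbuf_append(head, hdr_len);
	if (dst == NULL) {
		rte_pktmbuf_free(head);
		return NULL;
	}
	memcpy(dst, outer_hdr, hdr_len);

	/* chain the original, untouched packet behind the header segment */
	if (rte_pktmbuf_chain(head, pkt) != 0) {
		rte_pktmbuf_free(head);   /* pkt is still owned by the caller */
		return NULL;
	}

	return head;
}

The result is multi-segment by construction, which is why a filter that can only
see the first segment would only ever see the outer headers here.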


Thread overview: 11+ messages
2019-10-09 11:03 Morten Brørup
2019-10-09 11:11 ` Ananyev, Konstantin
2019-10-09 11:35   ` Morten Brørup
2019-10-09 15:02   ` Stephen Hemminger
2019-10-09 15:06     ` Morten Brørup
2019-10-09 15:14       ` Stephen Hemminger
2019-10-09 15:20         ` Morten Brørup
2019-10-09 17:24           ` Stephen Hemminger
2019-10-10  7:29             ` Morten Brørup
2019-10-10 15:36               ` Ananyev, Konstantin [this message]
2019-10-11  8:01                 ` Morten Brørup
