DPDK patches and discussions
* [dpdk-dev] packet data access bug in bpf and pdump libs
@ 2019-10-09 11:03 Morten Brørup
  2019-10-09 11:11 ` Ananyev, Konstantin
  0 siblings, 1 reply; 11+ messages in thread
From: Morten Brørup @ 2019-10-09 11:03 UTC (permalink / raw)
  To: Konstantin Ananyev, Stephen Hemminger; +Cc: dpdk-dev, Jerin Jacob

Hi Konstantin and Stephen,

I just noticed the same bug in your bpf and pcap libraries:

You are using rte_pktmbuf_mtod(), but should be using rte_pktmbuf_read(). Otherwise you cannot read data across multiple segments.


Med venlig hilsen / kind regards
- Morten Brørup



* Re: [dpdk-dev] packet data access bug in bpf and pdump libs
  2019-10-09 11:03 [dpdk-dev] packet data access bug in bpf and pdump libs Morten Brørup
@ 2019-10-09 11:11 ` Ananyev, Konstantin
  2019-10-09 11:35   ` Morten Brørup
  2019-10-09 15:02   ` Stephen Hemminger
  0 siblings, 2 replies; 11+ messages in thread
From: Ananyev, Konstantin @ 2019-10-09 11:11 UTC (permalink / raw)
  To: Morten Brørup, Stephen Hemminger; +Cc: dpdk-dev, Jerin Jacob

Hi Morten,

> 
> Hi Konstantin and Stephen,
> 
> I just noticed the same bug in your bpf and pcap libraries:
> 
> You are using rte_pktmbuf_mtod(), but should be using rte_pktmbuf_read(). Otherwise you cannot read data across multiple segments.

In plain data buffer mode, the expected input for the BPF program is the start of the first segment's packet data.
Other segments are simply not available to the BPF program in that mode.
AFAIK, cBPF uses the same model.

> 
> 
> Med venlig hilsen / kind regards
> - Morten Brørup



* Re: [dpdk-dev] packet data access bug in bpf and pdump libs
  2019-10-09 11:11 ` Ananyev, Konstantin
@ 2019-10-09 11:35   ` Morten Brørup
  2019-10-09 15:02   ` Stephen Hemminger
  1 sibling, 0 replies; 11+ messages in thread
From: Morten Brørup @ 2019-10-09 11:35 UTC (permalink / raw)
  To: Ananyev, Konstantin, Stephen Hemminger; +Cc: dpdk-dev, Jerin Jacob

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> Konstantin
> Sent: Wednesday, October 9, 2019 1:12 PM
> 
> Hi Morten,
> 
> >
> > Hi Konstantin and Stephen,
> >
> > I just noticed the same bug in your bpf and pcap libraries:
> >
> > You are using rte_pktmbuf_mtod(), but should be using
> rte_pktmbuf_read(). Otherwise you cannot read data across multiple
> segments.
> 
> In plain data buffer mode expected input for BPF program is start of
> first segment packet data.
> Other segments are simply not available to BPF program in that mode.

I understand the implementation, but I still consider this a bug, not a feature.

Why should a BPF program not be able to access all the data in a packet? It might be used for DPI.

What if header splitting is being used, so the first segment only contains the header? E.g. the first segment on egress could be really small in a multicast scenario.

Furthermore, VLAN information cannot be accessed unless the BPF runtime has access to the mbuf. E.g. BPF_STMT(BPF_LD | BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG) is supposed to read m->vlan_tci.

> AFAIK, cBPF uses the same model.
>

AFAIK, the Linux kernel can read across fragments.


> >
> >
> > Med venlig hilsen / kind regards
> > - Morten Brørup



* Re: [dpdk-dev] packet data access bug in bpf and pdump libs
  2019-10-09 11:11 ` Ananyev, Konstantin
  2019-10-09 11:35   ` Morten Brørup
@ 2019-10-09 15:02   ` Stephen Hemminger
  2019-10-09 15:06     ` Morten Brørup
  1 sibling, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2019-10-09 15:02 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: Morten Brørup, dpdk-dev, Jerin Jacob

On Wed, 9 Oct 2019 11:11:46 +0000
"Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:

> Hi Morten,
> 
> > 
> > Hi Konstantin and Stephen,
> > 
> > I just noticed the same bug in your bpf and pcap libraries:
> > 
> > You are using rte_pktmbuf_mtod(), but should be using rte_pktmbuf_read(). Otherwise you cannot read data across multiple segments.  
> 
> In plain data buffer mode expected input for BPF program is start of first segment packet data.
> Other segments are simply not available to BPF program in that mode.
> AFAIK, cBPF uses the same model.
> 
> > 
> > 
> > Med venlig hilsen / kind regards
> > - Morten Brørup  
> 

For packet capture, the BPF program is only allowed to look at the first segment.
rte_pktmbuf_read() is expensive and can cause a copy.


* Re: [dpdk-dev] packet data access bug in bpf and pdump libs
  2019-10-09 15:02   ` Stephen Hemminger
@ 2019-10-09 15:06     ` Morten Brørup
  2019-10-09 15:14       ` Stephen Hemminger
  0 siblings, 1 reply; 11+ messages in thread
From: Morten Brørup @ 2019-10-09 15:06 UTC (permalink / raw)
  To: Stephen Hemminger, Ananyev, Konstantin; +Cc: dpdk-dev, Jerin Jacob

> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, October 9, 2019 5:02 PM
> 
> On Wed, 9 Oct 2019 11:11:46 +0000
> "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> 
> > Hi Morten,
> >
> > >
> > > Hi Konstantin and Stephen,
> > >
> > > I just noticed the same bug in your bpf and pcap libraries:
> > >
> > > You are using rte_pktmbuf_mtod(), but should be using
> rte_pktmbuf_read(). Otherwise you cannot read data across multiple
> segments.
> >
> > In plain data buffer mode expected input for BPF program is start of
> first segment packet data.
> > Other segments are simply not available to BPF program in that mode.
> > AFAIK, cBPF uses the same model.
> >
> > >
> > >
> > > Med venlig hilsen / kind regards
> > > - Morten Brørup
> >
> 
> For packet capture, the BPF program is only allowed to look at first
> segment.
> pktmbuf_read is expensive and can cause a copy.

It is only expensive when going beyond the first segment:

static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
	uint32_t off, uint32_t len, void *buf)
{
	if (likely(off + len <= rte_pktmbuf_data_len(m)))
		return rte_pktmbuf_mtod_offset(m, char *, off);
	else
		return __rte_pktmbuf_read(m, off, len, buf);
}


* Re: [dpdk-dev] packet data access bug in bpf and pdump libs
  2019-10-09 15:06     ` Morten Brørup
@ 2019-10-09 15:14       ` Stephen Hemminger
  2019-10-09 15:20         ` Morten Brørup
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2019-10-09 15:14 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Ananyev, Konstantin, dpdk-dev, Jerin Jacob

On Wed, 9 Oct 2019 17:06:24 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > -----Original Message-----
> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Wednesday, October 9, 2019 5:02 PM
> > 
> > On Wed, 9 Oct 2019 11:11:46 +0000
> > "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> >   
> > > Hi Morten,
> > >  
> > > >
> > > > Hi Konstantin and Stephen,
> > > >
> > > > I just noticed the same bug in your bpf and pcap libraries:
> > > >
> > > > You are using rte_pktmbuf_mtod(), but should be using  
> > rte_pktmbuf_read(). Otherwise you cannot read data across multiple
> > segments.  
> > >
> > > In plain data buffer mode expected input for BPF program is start of  
> > first segment packet data.  
> > > Other segments are simply not available to BPF program in that mode.
> > > AFAIK, cBPF uses the same model.
> > >  
> > > >
> > > >
> > > > Med venlig hilsen / kind regards
> > > > - Morten Brørup  
> > >  
> > 
> > For packet capture, the BPF program is only allowed to look at first
> > segment.
> > pktmbuf_read is expensive and can cause a copy.  
> 
> It is only expensive if going beyond the first segment:
> 
> static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
> 	uint32_t off, uint32_t len, void *buf)
> {
> 	if (likely(off + len <= rte_pktmbuf_data_len(m)))
> 		return rte_pktmbuf_mtod_offset(m, char *, off);
> 	else
> 		return __rte_pktmbuf_read(m, off, len, buf);
> }

But it would mean a potentially big buffer on the stack (in the worst case).


* Re: [dpdk-dev] packet data access bug in bpf and pdump libs
  2019-10-09 15:14       ` Stephen Hemminger
@ 2019-10-09 15:20         ` Morten Brørup
  2019-10-09 17:24           ` Stephen Hemminger
  0 siblings, 1 reply; 11+ messages in thread
From: Morten Brørup @ 2019-10-09 15:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Ananyev, Konstantin, dpdk-dev, Jerin Jacob

> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, October 9, 2019 5:15 PM
> 
> On Wed, 9 Oct 2019 17:06:24 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > -----Original Message-----
> > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > Sent: Wednesday, October 9, 2019 5:02 PM
> > >
> > > On Wed, 9 Oct 2019 11:11:46 +0000
> > > "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> > >
> > > > Hi Morten,
> > > >
> > > > >
> > > > > Hi Konstantin and Stephen,
> > > > >
> > > > > I just noticed the same bug in your bpf and pcap libraries:
> > > > >
> > > > > You are using rte_pktmbuf_mtod(), but should be using
> > > rte_pktmbuf_read(). Otherwise you cannot read data across multiple
> > > segments.
> > > >
> > > > In plain data buffer mode expected input for BPF program is start
> of
> > > first segment packet data.
> > > > Other segments are simply not available to BPF program in that
> mode.
> > > > AFAIK, cBPF uses the same model.
> > > >
> > > > >
> > > > >
> > > > > Med venlig hilsen / kind regards
> > > > > - Morten Brørup
> > > >
> > >
> > > For packet capture, the BPF program is only allowed to look at
> first
> > > segment.
> > > pktmbuf_read is expensive and can cause a copy.
> >
> > It is only expensive if going beyond the first segment:
> >
> > static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
> > 	uint32_t off, uint32_t len, void *buf)
> > {
> > 	if (likely(off + len <= rte_pktmbuf_data_len(m)))
> > 		return rte_pktmbuf_mtod_offset(m, char *, off);
> > 	else
> > 		return __rte_pktmbuf_read(m, off, len, buf);
> > }
> 
> But it would mean potentially big buffer on the stack (in case)

No, the buffer only needs to be the size of the accessed data. I use it like this:

char buffer[sizeof(uint32_t)];

for (;; pc++) {
    switch (pc->code) {
        case BPF_LD_ABS_32:
            p = rte_pktmbuf_read(m, pc->k, sizeof(uint32_t), buffer);
            if (unlikely(p == NULL)) return 0; /* Attempting to read beyond packet. Bail out. */
            a = rte_be_to_cpu_32(*(const uint32_t *)p);
            continue;
        case BPF_LD_ABS_16:
            p = rte_pktmbuf_read(m, pc->k, sizeof(uint16_t), buffer);
            if (unlikely(p == NULL)) return 0; /* Attempting to read beyond packet. Bail out. */
            a = rte_be_to_cpu_16(*(const uint16_t *)p);
            continue;



* Re: [dpdk-dev] packet data access bug in bpf and pdump libs
  2019-10-09 15:20         ` Morten Brørup
@ 2019-10-09 17:24           ` Stephen Hemminger
  2019-10-10  7:29             ` Morten Brørup
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2019-10-09 17:24 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Ananyev, Konstantin, dpdk-dev, Jerin Jacob

On Wed, 9 Oct 2019 17:20:58 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > -----Original Message-----
> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Wednesday, October 9, 2019 5:15 PM
> > 
> > On Wed, 9 Oct 2019 17:06:24 +0200
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >   
> > > > -----Original Message-----
> > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > Sent: Wednesday, October 9, 2019 5:02 PM
> > > >
> > > > On Wed, 9 Oct 2019 11:11:46 +0000
> > > > "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> > > >  
> > > > > Hi Morten,
> > > > >  
> > > > > >
> > > > > > Hi Konstantin and Stephen,
> > > > > >
> > > > > > I just noticed the same bug in your bpf and pcap libraries:
> > > > > >
> > > > > > You are using rte_pktmbuf_mtod(), but should be using  
> > > > rte_pktmbuf_read(). Otherwise you cannot read data across multiple
> > > > segments.  
> > > > >
> > > > > In plain data buffer mode expected input for BPF program is start  
> > of  
> > > > first segment packet data.  
> > > > > Other segments are simply not available to BPF program in that  
> > mode.  
> > > > > AFAIK, cBPF uses the same model.
> > > > >  
> > > > > >
> > > > > >
> > > > > > Med venlig hilsen / kind regards
> > > > > > - Morten Brørup  
> > > > >  
> > > >
> > > > For packet capture, the BPF program is only allowed to look at  
> > first  
> > > > segment.
> > > > pktmbuf_read is expensive and can cause a copy.  
> > >
> > > It is only expensive if going beyond the first segment:
> > >
> > > static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
> > > 	uint32_t off, uint32_t len, void *buf)
> > > {
> > > 	if (likely(off + len <= rte_pktmbuf_data_len(m)))
> > > 		return rte_pktmbuf_mtod_offset(m, char *, off);
> > > 	else
> > > 		return __rte_pktmbuf_read(m, off, len, buf);
> > > }  
> > 
> > But it would mean potentially big buffer on the stack (in case)  
> 
> No, the buffer only needs to be the size of the accessed data. I use it like this:
> 
> char buffer[sizeof(uint32_t)];
> 
> for (;; pc++) {
>     switch (pc->code) {
>         case BPF_LD_ABS_32:
>             p = rte_pktmbuf_read(m, pc->k, sizeof(uint32_t), buffer);
>             if (unlikely(p == NULL)) return 0; /* Attempting to read beyond packet. Bail out. */
>             a = rte_be_to_cpu_32(*(const uint32_t *)p);
>             continue;
>         case BPF_LD_ABS_16:
>             p = rte_pktmbuf_read(m, pc->k, sizeof(uint16_t), buffer);
>             if (unlikely(p == NULL)) return 0; /* Attempting to read beyond packet. Bail out. */
>             a = rte_be_to_cpu_16(*(const uint16_t *)p);
>             continue;
> 

Reading down the chain of mbuf segments to find a uint32_t (one that potentially crosses a segment boundary)
seems like a waste.

The purpose of the filter is to look at packet headers. Any driver making mbufs that
are dribbles of data is broken. Chaining is really meant for the jumbo-frame or TSO case.




* Re: [dpdk-dev] packet data access bug in bpf and pdump libs
  2019-10-09 17:24           ` Stephen Hemminger
@ 2019-10-10  7:29             ` Morten Brørup
  2019-10-10 15:36               ` Ananyev, Konstantin
  0 siblings, 1 reply; 11+ messages in thread
From: Morten Brørup @ 2019-10-10  7:29 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Ananyev, Konstantin, dpdk-dev, Jerin Jacob

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
> Sent: Wednesday, October 9, 2019 7:25 PM
> 
> On Wed, 9 Oct 2019 17:20:58 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > -----Original Message-----
> > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > Sent: Wednesday, October 9, 2019 5:15 PM
> > >
> > > On Wed, 9 Oct 2019 17:06:24 +0200
> > > Morten Brørup <mb@smartsharesystems.com> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > > Sent: Wednesday, October 9, 2019 5:02 PM
> > > > >
> > > > > On Wed, 9 Oct 2019 11:11:46 +0000
> > > > > "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> > > > >
> > > > > > Hi Morten,
> > > > > >
> > > > > > >
> > > > > > > Hi Konstantin and Stephen,
> > > > > > >
> > > > > > > I just noticed the same bug in your bpf and pcap libraries:
> > > > > > >
> > > > > > > You are using rte_pktmbuf_mtod(), but should be using
> > > > > rte_pktmbuf_read(). Otherwise you cannot read data across
> multiple
> > > > > segments.
> > > > > >
> > > > > > In plain data buffer mode expected input for BPF program is
> start
> > > of
> > > > > first segment packet data.
> > > > > > Other segments are simply not available to BPF program in
> that
> > > mode.
> > > > > > AFAIK, cBPF uses the same model.
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Med venlig hilsen / kind regards
> > > > > > > - Morten Brørup
> > > > > >
> > > > >
> > > > > For packet capture, the BPF program is only allowed to look at
> > > first
> > > > > segment.
> > > > > pktmbuf_read is expensive and can cause a copy.
> > > >
> > > > It is only expensive if going beyond the first segment:
> > > >
> > > > static inline const void *rte_pktmbuf_read(const struct rte_mbuf
> *m,
> > > > 	uint32_t off, uint32_t len, void *buf)
> > > > {
> > > > 	if (likely(off + len <= rte_pktmbuf_data_len(m)))
> > > > 		return rte_pktmbuf_mtod_offset(m, char *, off);
> > > > 	else
> > > > 		return __rte_pktmbuf_read(m, off, len, buf);
> > > > }
> > >
> > > But it would mean potentially big buffer on the stack (in case)
> >
> > No, the buffer only needs to be the size of the accessed data. I use
> it like this:
> >
> > char buffer[sizeof(uint32_t)];
> >
> > for (;; pc++) {
> >     switch (pc->code) {
> >         case BPF_LD_ABS_32:
> >             p = rte_pktmbuf_read(m, pc->k, sizeof(uint32_t), buffer);
> >             if (unlikely(p == NULL)) return 0; /* Attempting to read
> beyond packet. Bail out. */
> >             a = rte_be_to_cpu_32(*(const uint32_t *)p);
> >             continue;
> >         case BPF_LD_ABS_16:
> >             p = rte_pktmbuf_read(m, pc->k, sizeof(uint16_t), buffer);
> >             if (unlikely(p == NULL)) return 0; /* Attempting to read
> beyond packet. Bail out. */
> >             a = rte_be_to_cpu_16(*(const uint16_t *)p);
> >             continue;
> >
> 
> Reading down the chain of mbuf segments to find a uint32_t (and that
> potentially crosses)
> seems like a waste.
> 
	
Slow and painful is the only way to read beyond the first segment, I agree.

But when reading from the first segment, rte_pktmbuf_read() basically does the same as your code. So there shouldn't be any performance penalty from supporting both by using rte_pktmbuf_read() instead.

I think the modification in the pdump library is simple, as you already pass the mbuf. But the bpf library requires more work, as it passes a pointer to the data in the first segment to the processing function instead of passing the mbuf.

> The purpose of the filter is to look at packet headers.

Some might look deeper. So why prevent it? E.g. our StraightShaper appliance sometimes looks deeper, but for performance reasons we stopped using BPF for this a long time ago.

> Any driver
> making mbufs that
> are dripples of data is broken. 

I agree very much with you on this regarding general-purpose NICs! Although I know of an exception that confirms the rule... a few years ago we worked with a component vendor on some very clever performance optimizations doing exactly this for specific purposes. Unfortunately it's under NDA, so I can't go into details.

> chaining is really meant for case of jumbo or tso.
> 

Try thinking beyond PMD ingress. There are multiple use cases in egress. Here are a couple:

- IP Multicast to multiple subnets on a Layer 3 switch. The VLAN ID and Source MAC must be replaced in each packet; this can be done using segments.
- Tunnel encapsulation. E.g. putting a packet into a VXLAN tunnel could be done using segments.



* Re: [dpdk-dev] packet data access bug in bpf and pdump libs
  2019-10-10  7:29             ` Morten Brørup
@ 2019-10-10 15:36               ` Ananyev, Konstantin
  2019-10-11  8:01                 ` Morten Brørup
  0 siblings, 1 reply; 11+ messages in thread
From: Ananyev, Konstantin @ 2019-10-10 15:36 UTC (permalink / raw)
  To: Morten Brørup, Stephen Hemminger; +Cc: dpdk-dev, Jerin Jacob



> -----Original Message-----
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Thursday, October 10, 2019 8:30 AM
> To: Stephen Hemminger <stephen@networkplumber.org>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dpdk-dev <dev@dpdk.org>; Jerin Jacob <jerinj@marvell.com>
> Subject: RE: [dpdk-dev] packet data access bug in bpf and pdump libs
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
> > Sent: Wednesday, October 9, 2019 7:25 PM
> >
> > On Wed, 9 Oct 2019 17:20:58 +0200
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >
> > > > -----Original Message-----
> > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > Sent: Wednesday, October 9, 2019 5:15 PM
> > > >
> > > > On Wed, 9 Oct 2019 17:06:24 +0200
> > > > Morten Brørup <mb@smartsharesystems.com> wrote:
> > > >
> > > > > > -----Original Message-----
> > > > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > > > Sent: Wednesday, October 9, 2019 5:02 PM
> > > > > >
> > > > > > On Wed, 9 Oct 2019 11:11:46 +0000
> > > > > > "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> > > > > >
> > > > > > > Hi Morten,
> > > > > > >
> > > > > > > >
> > > > > > > > Hi Konstantin and Stephen,
> > > > > > > >
> > > > > > > > I just noticed the same bug in your bpf and pcap libraries:
> > > > > > > >
> > > > > > > > You are using rte_pktmbuf_mtod(), but should be using
> > > > > > rte_pktmbuf_read(). Otherwise you cannot read data across
> > multiple
> > > > > > segments.
> > > > > > >
> > > > > > > In plain data buffer mode expected input for BPF program is
> > start
> > > > of
> > > > > > first segment packet data.
> > > > > > > Other segments are simply not available to BPF program in
> > that
> > > > mode.
> > > > > > > AFAIK, cBPF uses the same model.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Med venlig hilsen / kind regards
> > > > > > > > - Morten Brørup
> > > > > > >
> > > > > >
> > > > > > For packet capture, the BPF program is only allowed to look at
> > > > first
> > > > > > segment.
> > > > > > pktmbuf_read is expensive and can cause a copy.
> > > > >
> > > > > It is only expensive if going beyond the first segment:
> > > > >
> > > > > static inline const void *rte_pktmbuf_read(const struct rte_mbuf
> > *m,
> > > > > 	uint32_t off, uint32_t len, void *buf)
> > > > > {
> > > > > 	if (likely(off + len <= rte_pktmbuf_data_len(m)))
> > > > > 		return rte_pktmbuf_mtod_offset(m, char *, off);
> > > > > 	else
> > > > > 		return __rte_pktmbuf_read(m, off, len, buf);
> > > > > }
> > > >
> > > > But it would mean potentially big buffer on the stack (in case)
> > >
> > > No, the buffer only needs to be the size of the accessed data. I use
> > it like this:
> > >
> > > char buffer[sizeof(uint32_t)];
> > >
> > > for (;; pc++) {
> > >     switch (pc->code) {
> > >         case BPF_LD_ABS_32:
> > >             p = rte_pktmbuf_read(m, pc->k, sizeof(uint32_t), buffer);
> > >             if (unlikely(p == NULL)) return 0; /* Attempting to read
> > beyond packet. Bail out. */
> > >             a = rte_be_to_cpu_32(*(const uint32_t *)p);
> > >             continue;
> > >         case BPF_LD_ABS_16:
> > >             p = rte_pktmbuf_read(m, pc->k, sizeof(uint16_t), buffer);
> > >             if (unlikely(p == NULL)) return 0; /* Attempting to read
> > beyond packet. Bail out. */
> > >             a = rte_be_to_cpu_16(*(const uint16_t *)p);
> > >             continue;
> > >
> >
> > Reading down the chain of mbuf segments to find a uint32_t (and that
> > potentially crosses)
> > seems like a waste.

+1
Again just imagine how painful it would be to support it in JIT...

Another thing - in librte_bpf, if RTE_BPF_ARG_PTR is used,
it means that the input parameter for exec/jit is just a plain data buffer,
and the library doesn't make any assumptions about it (mbuf or not, etc.).

There is, however, a possibility to access mbuf metadata
and call rte_pktmbuf_read() or any other function from your eBPF code,
but for that you need to specify RTE_BPF_ARG_PTR_MBUF at the load stage.

> >
> 
> Slow and painful is the only way to read beyond the first segment, I agree.
> 
> But when reading from the first segment, rte_pktmbuf_read() basically does the same as your code. So there shouldn't be any performance
> penalty from supporting both by using rte_pktmbuf_read() instead.
> 
> I think the modification in the pdump library is simple, as you already pass the mbuf. But the bpf library requires more work, as it passes a
> pointer to the data in the first segment to the processing function instead of passing the mbuf.
> 
> > The purpose of the filter is to look at packet headers.
> 
> Some might look deeper. So why prevent it? E.g. our StraighShaper appliance sometimes looks deeper, but for performance reasons we
> stopped using BPF for this a long time ago.
> 
> > Any driver
> > making mbufs that
> > are dripples of data is broken.
> 
> I agree very much with you on this regarding general-purpose NICs! Although I know of an exception that confirms the rule... a few year
> ago we worked with a component vendor with some very clever performance optimizations doing exactly this for specific purposes.
> Unfortunately it's under NDA, so I can't go into details.
> 
> > chaining is really meant for case of jumbo or tso.
> >
> 
> Try thinking beyond PMD ingress. There are multiple use cases in egress. Here are a couple:
> 
> - IP Multicast to multiple subnets on a Layer 3 switch. The VLAN ID and Source MAC must be replaced in each packet; this can be done using
> segments.
> - Tunnel encapsulation. E.g. putting a packet into a VXLAN tunnel could be done using segments.



* Re: [dpdk-dev] packet data access bug in bpf and pdump libs
  2019-10-10 15:36               ` Ananyev, Konstantin
@ 2019-10-11  8:01                 ` Morten Brørup
  0 siblings, 0 replies; 11+ messages in thread
From: Morten Brørup @ 2019-10-11  8:01 UTC (permalink / raw)
  To: Ananyev, Konstantin, Stephen Hemminger; +Cc: dpdk-dev, Jerin Jacob

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> Konstantin
> Sent: Thursday, October 10, 2019 5:37 PM
> 
> > -----Original Message-----
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Thursday, October 10, 2019 8:30 AM
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen
> Hemminger
> > > Sent: Wednesday, October 9, 2019 7:25 PM
> > >
> > > On Wed, 9 Oct 2019 17:20:58 +0200
> > > Morten Brørup <mb@smartsharesystems.com> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > > Sent: Wednesday, October 9, 2019 5:15 PM
> > > > >
> > > > > On Wed, 9 Oct 2019 17:06:24 +0200
> > > > > Morten Brørup <mb@smartsharesystems.com> wrote:
> > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > > > > Sent: Wednesday, October 9, 2019 5:02 PM
> > > > > > >
> > > > > > > On Wed, 9 Oct 2019 11:11:46 +0000
> > > > > > > "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> > > > > > >
> > > > > > > > Hi Morten,
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi Konstantin and Stephen,
> > > > > > > > >
> > > > > > > > > I just noticed the same bug in your bpf and pcap
> libraries:
> > > > > > > > >
> > > > > > > > > You are using rte_pktmbuf_mtod(), but should be using
> > > > > > > rte_pktmbuf_read(). Otherwise you cannot read data across
> > > multiple
> > > > > > > segments.
> > > > > > > >
> > > > > > > > In plain data buffer mode expected input for BPF program
> is
> > > start
> > > > > of
> > > > > > > first segment packet data.
> > > > > > > > Other segments are simply not available to BPF program in
> > > that
> > > > > mode.
> > > > > > > > AFAIK, cBPF uses the same model.
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Med venlig hilsen / kind regards
> > > > > > > > > - Morten Brørup
> > > > > > > >
> > > > > > >
> > > > > > > For packet capture, the BPF program is only allowed to look
> at
> > > > > first
> > > > > > > segment.
> > > > > > > pktmbuf_read is expensive and can cause a copy.
> > > > > >
> > > > > > It is only expensive if going beyond the first segment:
> > > > > >
> > > > > > static inline const void *rte_pktmbuf_read(const struct
> rte_mbuf
> > > *m,
> > > > > > 	uint32_t off, uint32_t len, void *buf)
> > > > > > {
> > > > > > 	if (likely(off + len <= rte_pktmbuf_data_len(m)))
> > > > > > 		return rte_pktmbuf_mtod_offset(m, char *, off);
> > > > > > 	else
> > > > > > 		return __rte_pktmbuf_read(m, off, len, buf);
> > > > > > }
> > > > >
> > > > > But it would mean potentially big buffer on the stack (in case)
> > > >
> > > > No, the buffer only needs to be the size of the accessed data. I
> use
> > > it like this:
> > > >
> > > > char buffer[sizeof(uint32_t)];
> > > >
> > > > for (;; pc++) {
> > > >     switch (pc->code) {
> > > >         case BPF_LD_ABS_32:
> > > >             p = rte_pktmbuf_read(m, pc->k, sizeof(uint32_t),
> buffer);
> > > >             if (unlikely(p == NULL)) return 0; /* Attempting to
> read
> > > beyond packet. Bail out. */
> > > >             a = rte_be_to_cpu_32(*(const uint32_t *)p);
> > > >             continue;
> > > >         case BPF_LD_ABS_16:
> > > >             p = rte_pktmbuf_read(m, pc->k, sizeof(uint16_t),
> buffer);
> > > >             if (unlikely(p == NULL)) return 0; /* Attempting to
> read
> > > beyond packet. Bail out. */
> > > >             a = rte_be_to_cpu_16(*(const uint16_t *)p);
> > > >             continue;
> > > >
> > >
> > > Reading down the chain of mbuf segments to find a uint32_t (and
> that
> > > potentially crosses)
> > > seems like a waste.
> 
> +1
> Again just imagine how painful it would be to support it in JIT...
> 
Compiling the BPF_LD_ABS instruction into a function call (i.e. a series of instructions) instead of a single instruction doesn't seem like a showstopper to me. The major modification is probably in the bpf library, passing the mbuf instead of a pointer to the data of the first segment.

> Another thing - in librte_bpf, if RTE_BPF_ARG_PTR is used,
> it means that input parameter for exec/jit is just a plain data buffer
> and it doesn’t
> make any assumptions about it (mbuf or not, etc.).
> 
> Though there is a possibility to access mbuf metadata
> and call  rte_pktmbuf_read() or any other functions from your eBPF
> code,
> but for that you need to specify RTE_BPF_ARG_PTR_MBUF at load stage.
> 
So one way of solving it is making a cBPF-to-eBPF converter that does this trickery, if conversion to eBPF is the path taken. Although that could be harder to implement than simply adding a cBPF processor (or JIT compiler) directly in the bpf library.

C is not C++, and perhaps cBPF and eBPF do not have as many things in common as their names might suggest.

> > >
> >
> > Slow and painful is the only way to read beyond the first segment, I
> agree.
> >
> > But when reading from the first segment, rte_pktmbuf_read() basically
> does the same as your code. So there shouldn't be any performance
> > penalty from supporting both by using rte_pktmbuf_read() instead.
> >
> > I think the modification in the pdump library is simple, as you
> already pass the mbuf. But the bpf library requires more work, as it
> passes a
> > pointer to the data in the first segment to the processing function
> instead of passing the mbuf.
> >
> > > The purpose of the filter is to look at packet headers.
> >
> > Some might look deeper. So why prevent it? E.g. our StraighShaper
> appliance sometimes looks deeper, but for performance reasons we
> > stopped using BPF for this a long time ago.
> >
> > > Any driver
> > > making mbufs that
> > > are dripples of data is broken.
> >
> > I agree very much with you on this regarding general-purpose NICs!
> Although I know of an exception that confirms the rule... a few year
> > ago we worked with a component vendor with some very clever
> performance optimizations doing exactly this for specific purposes.
> > Unfortunately it's under NDA, so I can't go into details.
> >
> > > chaining is really meant for case of jumbo or tso.
> > >
> >
> > Try thinking beyond PMD ingress. There are multiple use cases in
> egress. Here are a couple:
> >
> > - IP Multicast to multiple subnets on a Layer 3 switch. The VLAN ID
> and Source MAC must be replaced in each packet; this can be done using
> > segments.
> > - Tunnel encapsulation. E.g. putting a packet into a VXLAN tunnel
> could be done using segments.

I applaud Stephen's initiative to add a pdump library with cBPF filtering! It is an important network appliance feature for network consultants and network operators in small/medium businesses - they are not programmers, so eBPF is not an option for them.

Although I have tried to make a point that support for multiple packet segments could be relevant, not supporting multiple packet segments is certainly better than not adding the pdump library! Just please don't take too many steps in a direction that makes it too hard to add support for multiple packet segments later.

Also thanks to Konstantin for the bpf library, its hooks and the possibilities it opens!

This discussion has clearly revealed that what I reported as being a "bug" is hard to correct in the bpf library, so let's downgrade it to a "limitation".


Med venlig hilsen / kind regards
- Morten Brørup




