DPDK patches and discussions
 help / color / mirror / Atom feed
From: Tudor Cornea <tudor.cornea@gmail.com>
To: Ferruh Yigit <ferruh.yigit@intel.com>
Cc: linville@tuxdriver.com, thomas@monjalon.net, dev@dpdk.org,
	 Mihai Pogonaru <pogonarumihai@gmail.com>
Subject: Re: [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx
Date: Mon, 6 Sep 2021 13:23:17 +0300	[thread overview]
Message-ID: <CAOuQ8vWDV30MYgeNJ11wVHs1UYHtPQQ09efe08v1n4pkDRi3_g@mail.gmail.com> (raw)
In-Reply-To: <3154f4e8-9f0e-b04f-6e8f-096dc39f489c@intel.com>

Hi Ferruh,

Would you mind separate timestamp status fix to its own patch? I think
> better to
> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
> patch.


Agreed. There are two issues solved by this patch. We will break it in two
different patches.

I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the
> kernel
> versions the bug exists, this flag is not set, but can you please confirm?


And does it only seen with veth, if so I wonder if we can ignore it, not
> sure
> how common to use af_packet PMD over veth interface, do you have this
> usecase?


We've seen the timestamping issue only when running af_packet over
veth interfaces. We have a particular use-case internally, in which we need
to run inside a Kubernetes cluster.
We've found the following resources [1] , [2] related to this behavior in
the kernel.

We believe that issue #2 (the ring getting full), can theoretically occur
on any type of NIC.
We managed to reproduce the bursty behavior on af_packet PMD over vmxnet3
interface, by Tx-ing packets at a low rate (e.g ~340 pps), and toggling the
interface on / off
ifconfig $iface_name down; sleep 10; ifconfig $iface_name up

We will attempt to give more context on the issue below, about what we
think happens:
- we have a 2048 queue shared between the kernel and the dpdk socket,
there's an index the queue in both the kernel and the dpdk driver
- the dpdk driver writes a packet or a burst, advances its idx and tells
the kernel to send the packets via a call to sendto() and the kernel sends
the packets and advances its idx
- once the interface is down the kernel can no longer send packets, but it
doesn't drop them, it just doesn't advance its idx
- for each packet there is header and in the header there is a status
integer which, among others, indicates the owner of the packet: the
userspace or the kernel - the userspace (dpdk driver) sets the status as
owned by the kernel when it adds another packet ; the kernel sets the
status back to owned by the userspace once it sends a packet
- the dpdk driver was ignoring this status bit and,  even after the queue
was full, it would continue to put packets in the queue - its idx would be
"after" that of the kernel
- once the interface is brought up, the kernel would send all the packets
in the queue (since they have the status of being owned by the kernel) on
the next call to sendto() and the idx would be back to where it was before
the interface was brought up (let's call it k1)
- the dpdk driver idx into the queue would point somewhere in the queue
(let's call it d1) and would continue to add packets at that point, but the
kernel wouldn't send any packet anymore since there is now a gap of packets
owned by the userspace between the kernel index (k1) and the dpdk driver
idx (d1)
- the dpdk idx would eventually reach k1 and packets would be transferred
at a normal rate until both the dpdk idx and the kernel idx would reach d1
again
- between d1 and k1 there are only packets with the status as owned by the
kernel - which where added by the dpdk driver while its index was between
d1 and k1 ; thus the kernel would burst all the packets till k1, while the
dpdk idx is at d1
- the cycle repeats

If a new traffic config comes (in our application) while this cycle is
happening, it could be that some of the packets of the old config are still
in queue (between d1 and k1) and will be bursted when the dpdk and kernel
idx reach d1 ; this would explain seeing packets from an old config, but
only in the first 2048 packets (which is the queue size)


[1] https://www.spinics.net/lists/kernel/msg3959391.html
[2] https://www.spinics.net/lists/netdev/msg739372.html

On Wed, 1 Sept 2021 at 19:34, Ferruh Yigit <ferruh.yigit@intel.com> wrote:

> On 8/20/2021 2:39 PM, Tudor Cornea wrote:
> > The poll call can return POLLERR which is ignored, or it can return
> > POLLOUT, even if there are no free frames in the mmap-ed area.
> >
> > We can account for both of these cases by re-checking if the next
> > frame is empty before writing into it.
> >
> > We also now eliminate the timestamp status from the frame status.
> >
>
> Hi Tudor,
>
> Would you mind separate timestamp status fix to its own patch? I think
> better to
> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
> patch.
>
> > Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com>
> > Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com>
> > ---
> >  drivers/net/af_packet/rte_eth_af_packet.c | 47
> +++++++++++++++++++++++++++++--
> >  1 file changed, 45 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
> b/drivers/net/af_packet/rte_eth_af_packet.c
> > index b73b211..3845df5 100644
> > --- a/drivers/net/af_packet/rte_eth_af_packet.c
> > +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> > @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> >       return num_rx;
> >  }
> >
> > +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status)
> > +{
> > +     return *tp_status &
> > +             ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
>
> I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the
> kernel
> versions the bug exists, this flag is not set, but can you please confirm?
>
> > +}
> > +
> >  /*
> >   * Callback to handle sending packets through a real NIC.
> >   */
> > @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> >                       }
> >               }
> >
> > +             /*
> > +              * We must eliminate the timestamp status from the packet
> > +              * status. This should only matter if timestamping is
> enabled
> > +              * on the socket, but there is a BUG in the kernel which is
> > +              * fixed in newer releases.
> > +
> > +              * For interfaces of type 'veth', the sent skb is forwarded
> > +              * to the peer and back into the network stack which
> timestamps
> > +              * it on the RX path if timestamping is enabled globally
> > +              * (which happens if any socket enables timestamping).
> > +
> > +              * When the skb is destructed, tpacket_destruct_skb() is
> called
> > +              * and it calls __packet_set_timestamp() which doesn't
> check
> > +              * the flags on the socket and returns the timestamp if it
> is
> > +              * set in the skb (and for veth it is, as mentioned above).
> > +              */
> > +
>
> Can you give some more details on this bug, any link etc..
>
> And does it only seen with veth, if so I wonder if we can ignore it, not
> sure
> how common to use af_packet PMD over veth interface, do you have this
> usecase?
>
> And if only specific kernel versions impacted from this bug, what do you
> think
> about adding kernel version check to reduce the scope of the fix in
> af_packet PMD.
>
> >               /* point at the next incoming frame */
> > -             if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > -                 (poll(&pfd, 1, -1) < 0))
> > +             if ((tx_ring_status_remove_ts(&ppd->tp_status)
> > +                     != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0))
> > +                     break;
> > +
> > +             /*
> > +              * Poll can return POLLERR if the interface is down or
> POLLOUT,
> > +              * even if there are no extra buffers available.
> > +              * This happens, because packet_poll() calls
> datagram_poll()
> > +              * which checks the space left in the socket buffer and in
> the
> > +              * case of packet_mmap the default socket buffer length
> > +              * doesn't match the requested size for the tx_ring so
> there
> > +              * is always space left in socket buffer, which doesn't
> seem
> > +              * to be correlated to the requested size for the tx_ring
> > +              * in packet_mmap.
> > +              */
> > +             if (tx_ring_status_remove_ts(&ppd->tp_status)
> > +                     != TP_STATUS_AVAILABLE)
> >                       break;
>
> In this case should we break or poll again?
>
> If 'POLLERR' is received when interface is down, it makes sense to break.
> Do you
> know if is there any other case that 'POLLERR' is returned?
>
> And for 'POLLOUT', when exactly this event sent? If 'POLLOUT' received,
> can we
> poll again to wait Tx ring available to send more packets?
>
> >
> >               /* copy the tx frame data */
> > @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> >               rte_pktmbuf_free(mbuf);
> >       }
> >
> > +     /*
> > +      * We might have to ignore a few more errnos here since the packets
> > +      * remain in the mmap-ed queue and will be sent later, presumably.
> > +      */
> > +
>
> Can you please describe above comment more? What errors ignored?
> And won't all packets sent will below sendto() call?
>
> >       /* kick-off transmits */
> >       if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 &&
> >                       errno != ENOBUFS && errno != EAGAIN) {
> >
>
>

  reply	other threads:[~2021-09-06 10:23 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-20 13:39 Tudor Cornea
2021-09-01 16:34 ` Ferruh Yigit
2021-09-06 10:23   ` Tudor Cornea [this message]
2021-09-20 17:11     ` Ferruh Yigit
2021-09-13 13:45 ` [dpdk-dev] [PATCH v2] " Tudor Cornea
2021-09-20 17:44   ` Ferruh Yigit
2021-09-29 10:03     ` Tudor Cornea
2021-10-05 15:11       ` Tudor Cornea
2021-10-26 14:30         ` Ferruh Yigit
2021-11-02 15:24           ` Tudor Cornea
2021-11-02 15:47   ` [dpdk-dev] [PATCH v3] " Tudor Cornea
2021-11-02 16:47     ` Ferruh Yigit
2021-11-03  9:31     ` [dpdk-dev] [PATCH v4] " Tudor Cornea
2021-11-04 12:07       ` Ferruh Yigit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAOuQ8vWDV30MYgeNJ11wVHs1UYHtPQQ09efe08v1n4pkDRi3_g@mail.gmail.com \
    --to=tudor.cornea@gmail.com \
    --cc=dev@dpdk.org \
    --cc=ferruh.yigit@intel.com \
    --cc=linville@tuxdriver.com \
    --cc=pogonarumihai@gmail.com \
    --cc=thomas@monjalon.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).