DPDK patches and discussions
 help / color / mirror / Atom feed
From: Ferruh Yigit <ferruh.yigit@intel.com>
To: Tudor Cornea <tudor.cornea@gmail.com>
Cc: <linville@tuxdriver.com>, <thomas@monjalon.net>, <dev@dpdk.org>,
	"Mihai Pogonaru" <pogonarumihai@gmail.com>
Subject: Re: [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx
Date: Mon, 20 Sep 2021 18:11:32 +0100	[thread overview]
Message-ID: <924853ed-d3fa-f4e5-fde7-96147de4ad83@intel.com> (raw)
In-Reply-To: <CAOuQ8vWDV30MYgeNJ11wVHs1UYHtPQQ09efe08v1n4pkDRi3_g@mail.gmail.com>

On 9/6/2021 11:23 AM, Tudor Cornea wrote:
> Hi Ferruh,
> 
> Would you mind separate timestamp status fix to its own patch? I think
>> better to
>> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
>> patch.
> 
> 
> Agreed. There are two issues solved by this patch. We will break it in two
> different patches.
> 
> I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the
>> kernel
>> versions the bug exists, this flag is not set, but can you please confirm?
> 
> 
> And does it only seen with veth, if so I wonder if we can ignore it, not
>> sure
>> how common to use af_packet PMD over veth interface, do you have this
>> usecase?
> 
> 
> We've seen the timestamping issue only when running af_packet over
> veth interfaces. We have a particular use-case internally, in which we need
> to run inside a Kubernetes cluster.
> We've found the following resources [1] , [2] related to this behavior in
> the kernel.
> 
> We believe that issue #2 (the ring getting full), can theoretically occur
> on any type of NIC.
> We managed to reproduce the bursty behavior on af_packet PMD over vmxnet3
> interface, by Tx-ing packets at a low rate (e.g ~340 pps), and toggling the
> interface on / off
> ifconfig $iface_name down; sleep 10; ifconfig $iface_name up
> 
> We will attempt to give more context on the issue below, about what we
> think happens:
> - we have a 2048 queue shared between the kernel and the dpdk socket,
> there's an index the queue in both the kernel and the dpdk driver
> - the dpdk driver writes a packet or a burst, advances its idx and tells
> the kernel to send the packets via a call to sendto() and the kernel sends
> the packets and advances its idx
> - once the interface is down the kernel can no longer send packets, but it
> doesn't drop them, it just doesn't advance its idx
> - for each packet there is header and in the header there is a status
> integer which, among others, indicates the owner of the packet: the
> userspace or the kernel - the userspace (dpdk driver) sets the status as
> owned by the kernel when it adds another packet ; the kernel sets the
> status back to owned by the userspace once it sends a packet
> - the dpdk driver was ignoring this status bit and,  even after the queue
> was full, it would continue to put packets in the queue - its idx would be
> "after" that of the kernel
> - once the interface is brought up, the kernel would send all the packets
> in the queue (since they have the status of being owned by the kernel) on
> the next call to sendto() and the idx would be back to where it was before
> the interface was brought up (let's call it k1)
> - the dpdk driver idx into the queue would point somewhere in the queue
> (let's call it d1) and would continue to add packets at that point, but the
> kernel wouldn't send any packet anymore since there is now a gap of packets
> owned by the userspace between the kernel index (k1) and the dpdk driver
> idx (d1)
> - the dpdk idx would eventually reach k1 and packets would be transferred
> at a normal rate until both the dpdk idx and the kernel idx would reach d1
> again
> - between d1 and k1 there are only packets with the status as owned by the
> kernel - which where added by the dpdk driver while its index was between
> d1 and k1 ; thus the kernel would burst all the packets till k1, while the
> dpdk idx is at d1
> - the cycle repeats
> 
> If a new traffic config comes (in our application) while this cycle is
> happening, it could be that some of the packets of the old config are still
> in queue (between d1 and k1) and will be bursted when the dpdk and kernel
> idx reach d1 ; this would explain seeing packets from an old config, but
> only in the first 2048 packets (which is the queue size)
> 
> 

Hi Tudor,

If there is an usage on of veth, OK to fix the timestamps issue.

What you described above looks like a ring buffer with single producer and
single consumer, and producer overwrites the not consumed items.

I assume this happens because af_packet (consumer) can't send the packets
because of the timestamp defect. (Also producer (dpdk app) should have checks to
prevent overwrite, but that is a different issue.)

I will comment to the new versions of the patches.

Our of curiosity, are you using an modified af_packet implementation in kernel
for above described usage?

> [1] https://www.spinics.net/lists/kernel/msg3959391.html
> [2] https://www.spinics.net/lists/netdev/msg739372.html
> 
> On Wed, 1 Sept 2021 at 19:34, Ferruh Yigit <ferruh.yigit@intel.com> wrote:
> 
>> On 8/20/2021 2:39 PM, Tudor Cornea wrote:
>>> The poll call can return POLLERR which is ignored, or it can return
>>> POLLOUT, even if there are no free frames in the mmap-ed area.
>>>
>>> We can account for both of these cases by re-checking if the next
>>> frame is empty before writing into it.
>>>
>>> We also now eliminate the timestamp status from the frame status.
>>>
>>
>> Hi Tudor,
>>
>> Would you mind separate timestamp status fix to its own patch? I think
>> better to
>> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
>> patch.
>>
>>> Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com>
>>> Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com>
>>> ---
>>>  drivers/net/af_packet/rte_eth_af_packet.c | 47
>> +++++++++++++++++++++++++++++--
>>>  1 file changed, 45 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>> index b73b211..3845df5 100644
>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>> @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf
>> **bufs, uint16_t nb_pkts)
>>>       return num_rx;
>>>  }
>>>
>>> +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status)
>>> +{
>>> +     return *tp_status &
>>> +             ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
>>
>> I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the
>> kernel
>> versions the bug exists, this flag is not set, but can you please confirm?
>>
>>> +}
>>> +
>>>  /*
>>>   * Callback to handle sending packets through a real NIC.
>>>   */
>>> @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
>> **bufs, uint16_t nb_pkts)
>>>                       }
>>>               }
>>>
>>> +             /*
>>> +              * We must eliminate the timestamp status from the packet
>>> +              * status. This should only matter if timestamping is
>> enabled
>>> +              * on the socket, but there is a BUG in the kernel which is
>>> +              * fixed in newer releases.
>>> +
>>> +              * For interfaces of type 'veth', the sent skb is forwarded
>>> +              * to the peer and back into the network stack which
>> timestamps
>>> +              * it on the RX path if timestamping is enabled globally
>>> +              * (which happens if any socket enables timestamping).
>>> +
>>> +              * When the skb is destructed, tpacket_destruct_skb() is
>> called
>>> +              * and it calls __packet_set_timestamp() which doesn't
>> check
>>> +              * the flags on the socket and returns the timestamp if it
>> is
>>> +              * set in the skb (and for veth it is, as mentioned above).
>>> +              */
>>> +
>>
>> Can you give some more details on this bug, any link etc..
>>
>> And does it only seen with veth, if so I wonder if we can ignore it, not
>> sure
>> how common to use af_packet PMD over veth interface, do you have this
>> usecase?
>>
>> And if only specific kernel versions impacted from this bug, what do you
>> think
>> about adding kernel version check to reduce the scope of the fix in
>> af_packet PMD.
>>
>>>               /* point at the next incoming frame */
>>> -             if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
>>> -                 (poll(&pfd, 1, -1) < 0))
>>> +             if ((tx_ring_status_remove_ts(&ppd->tp_status)
>>> +                     != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0))
>>> +                     break;
>>> +
>>> +             /*
>>> +              * Poll can return POLLERR if the interface is down or
>> POLLOUT,
>>> +              * even if there are no extra buffers available.
>>> +              * This happens, because packet_poll() calls
>> datagram_poll()
>>> +              * which checks the space left in the socket buffer and in
>> the
>>> +              * case of packet_mmap the default socket buffer length
>>> +              * doesn't match the requested size for the tx_ring so
>> there
>>> +              * is always space left in socket buffer, which doesn't
>> seem
>>> +              * to be correlated to the requested size for the tx_ring
>>> +              * in packet_mmap.
>>> +              */
>>> +             if (tx_ring_status_remove_ts(&ppd->tp_status)
>>> +                     != TP_STATUS_AVAILABLE)
>>>                       break;
>>
>> In this case should we break or poll again?
>>
>> If 'POLLERR' is received when interface is down, it makes sense to break.
>> Do you
>> know if is there any other case that 'POLLERR' is returned?
>>
>> And for 'POLLOUT', when exactly this event sent? If 'POLLOUT' received,
>> can we
>> poll again to wait Tx ring available to send more packets?
>>
>>>
>>>               /* copy the tx frame data */
>>> @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
>> **bufs, uint16_t nb_pkts)
>>>               rte_pktmbuf_free(mbuf);
>>>       }
>>>
>>> +     /*
>>> +      * We might have to ignore a few more errnos here since the packets
>>> +      * remain in the mmap-ed queue and will be sent later, presumably.
>>> +      */
>>> +
>>
>> Can you please describe above comment more? What errors ignored?
>> And won't all packets sent will below sendto() call?
>>
>>>       /* kick-off transmits */
>>>       if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 &&
>>>                       errno != ENOBUFS && errno != EAGAIN) {
>>>
>>
>>


  reply	other threads:[~2021-09-20 17:11 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-20 13:39 Tudor Cornea
2021-09-01 16:34 ` Ferruh Yigit
2021-09-06 10:23   ` Tudor Cornea
2021-09-20 17:11     ` Ferruh Yigit [this message]
2021-09-13 13:45 ` [dpdk-dev] [PATCH v2] " Tudor Cornea
2021-09-20 17:44   ` Ferruh Yigit
2021-09-29 10:03     ` Tudor Cornea
2021-10-05 15:11       ` Tudor Cornea
2021-10-26 14:30         ` Ferruh Yigit
2021-11-02 15:24           ` Tudor Cornea
2021-11-02 15:47   ` [dpdk-dev] [PATCH v3] " Tudor Cornea
2021-11-02 16:47     ` Ferruh Yigit
2021-11-03  9:31     ` [dpdk-dev] [PATCH v4] " Tudor Cornea
2021-11-04 12:07       ` Ferruh Yigit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=924853ed-d3fa-f4e5-fde7-96147de4ad83@intel.com \
    --to=ferruh.yigit@intel.com \
    --cc=dev@dpdk.org \
    --cc=linville@tuxdriver.com \
    --cc=pogonarumihai@gmail.com \
    --cc=thomas@monjalon.net \
    --cc=tudor.cornea@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).