From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 4051DA0C4D; Mon, 6 Sep 2021 12:23:30 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 049D9410E6; Mon, 6 Sep 2021 12:23:30 +0200 (CEST) Received: from mail-ej1-f54.google.com (mail-ej1-f54.google.com [209.85.218.54]) by mails.dpdk.org (Postfix) with ESMTP id 6CA1E40E32 for ; Mon, 6 Sep 2021 12:23:28 +0200 (CEST) Received: by mail-ej1-f54.google.com with SMTP id n27so12535579eja.5 for ; Mon, 06 Sep 2021 03:23:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=msNzXYsZ+44eLYUaIhJ3kB+G/eYSmiwxru625ilpItc=; b=TBpJQN9dNAwKHgjyT4xJbXQfdUmCbV+nmltntLJT70da0adoNsO6DoXVEV2ig7jMDn 5E/fKAbevKN8RVMDU0UlOGw1LB1R24J8EY4+oTg+c5nJLAbz0RizaHEj0rTU6/gbEnVn rCa+g64L/hUI0MKzV7cQGZ7/tka904IBImIQA+m4v9TWDoKv0RryN92Hu945OGTJ9dAe BbQx9UGCFzbfxq5Cp2fwWMuXESu3FnQHxEO7b8OmlKYnQTkqyvTSX1q2unWnN49uuugp mlfHaQAN5KtHsBSjhvX7C2Q9umYU3nqaQrrzFPJPenHpVI+xCJs0Q0B5OO9dDaZJExgj QeRw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=msNzXYsZ+44eLYUaIhJ3kB+G/eYSmiwxru625ilpItc=; b=iz2/a83z4Yz2QgOotpGDNyap3ibfMEZ5U7IF6kL8EcD5QGdDIf7wr6MmG7c19e7mZq q7yotLWjno159e273Qb2U/26r/eOnhzHzuxlkgOs0c9aljVKF929sHsGh5VcS42hTcQn tbzv9I1NU9VoyAKYr/rG4pzzGdOJ+Zcy2zFuCjTsnDqgNBSojNn9u/o6uyJdxslFxEul 3aBs3EAde1252L8I7boylz/F9t14nJpwvQg3pbOmh7dIdb3gbRqh6mmCQwVKr3Gpy8Zz 2Ci3ImaHFKW19/9orC8L2qxN/PSpPE9maxII+iNqxPd6gqFyKtp8TaGKEBw0B/m6FoJ9 YuDw== X-Gm-Message-State: AOAM532Vsz7S5klTpM5cYscq7jXU0eiJNCCiPoZxm2wyom7BmL30MDXV smcf418aHUUKE2dV2xxKJgozSHmfGfrs0B+xIFy4PPyqyGo= X-Google-Smtp-Source: ABdhPJyC0JL4Bois/NdaziXHJAdVjvfkkO5OaZb/n7rgFCqtFs4oztSTQGy+it0H0yqsWSbY4Z9KnYrzv6wtIOUt2Fo= X-Received: by 2002:a17:906:6cc:: with SMTP id v12mr12929720ejb.153.1630923808138; Mon, 06 Sep 2021 03:23:28 -0700 (PDT) MIME-Version: 1.0 References: <1629466761-127333-1-git-send-email-tudor.cornea@gmail.com> <3154f4e8-9f0e-b04f-6e8f-096dc39f489c@intel.com> In-Reply-To: <3154f4e8-9f0e-b04f-6e8f-096dc39f489c@intel.com> From: Tudor Cornea Date: Mon, 6 Sep 2021 13:23:17 +0300 Message-ID: To: Ferruh Yigit Cc: linville@tuxdriver.com, thomas@monjalon.net, dev@dpdk.org, Mihai Pogonaru Content-Type: text/plain; charset="UTF-8" X-Content-Filtered-By: Mailman/MimeDel 2.1.29 Subject: Re: [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Hi Ferruh, Would you mind separate timestamp status fix to its own patch? I think > better to > fix 'ignoring full Tx ring' first, to not make it dependent to timestamp > patch. Agreed. There are two issues solved by this patch. We will break it in two different patches. I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the > kernel > versions the bug exists, this flag is not set, but can you please confirm? And does it only seen with veth, if so I wonder if we can ignore it, not > sure > how common to use af_packet PMD over veth interface, do you have this > usecase? We've seen the timestamping issue only when running af_packet over veth interfaces. We have a particular use-case internally, in which we need to run inside a Kubernetes cluster. We've found the following resources [1] , [2] related to this behavior in the kernel. We believe that issue #2 (the ring getting full), can theoretically occur on any type of NIC. We managed to reproduce the bursty behavior on af_packet PMD over vmxnet3 interface, by Tx-ing packets at a low rate (e.g ~340 pps), and toggling the interface on / off ifconfig $iface_name down; sleep 10; ifconfig $iface_name up We will attempt to give more context on the issue below, about what we think happens: - we have a 2048 queue shared between the kernel and the dpdk socket, there's an index the queue in both the kernel and the dpdk driver - the dpdk driver writes a packet or a burst, advances its idx and tells the kernel to send the packets via a call to sendto() and the kernel sends the packets and advances its idx - once the interface is down the kernel can no longer send packets, but it doesn't drop them, it just doesn't advance its idx - for each packet there is header and in the header there is a status integer which, among others, indicates the owner of the packet: the userspace or the kernel - the userspace (dpdk driver) sets the status as owned by the kernel when it adds another packet ; the kernel sets the status back to owned by the userspace once it sends a packet - the dpdk driver was ignoring this status bit and, even after the queue was full, it would continue to put packets in the queue - its idx would be "after" that of the kernel - once the interface is brought up, the kernel would send all the packets in the queue (since they have the status of being owned by the kernel) on the next call to sendto() and the idx would be back to where it was before the interface was brought up (let's call it k1) - the dpdk driver idx into the queue would point somewhere in the queue (let's call it d1) and would continue to add packets at that point, but the kernel wouldn't send any packet anymore since there is now a gap of packets owned by the userspace between the kernel index (k1) and the dpdk driver idx (d1) - the dpdk idx would eventually reach k1 and packets would be transferred at a normal rate until both the dpdk idx and the kernel idx would reach d1 again - between d1 and k1 there are only packets with the status as owned by the kernel - which where added by the dpdk driver while its index was between d1 and k1 ; thus the kernel would burst all the packets till k1, while the dpdk idx is at d1 - the cycle repeats If a new traffic config comes (in our application) while this cycle is happening, it could be that some of the packets of the old config are still in queue (between d1 and k1) and will be bursted when the dpdk and kernel idx reach d1 ; this would explain seeing packets from an old config, but only in the first 2048 packets (which is the queue size) [1] https://www.spinics.net/lists/kernel/msg3959391.html [2] https://www.spinics.net/lists/netdev/msg739372.html On Wed, 1 Sept 2021 at 19:34, Ferruh Yigit wrote: > On 8/20/2021 2:39 PM, Tudor Cornea wrote: > > The poll call can return POLLERR which is ignored, or it can return > > POLLOUT, even if there are no free frames in the mmap-ed area. > > > > We can account for both of these cases by re-checking if the next > > frame is empty before writing into it. > > > > We also now eliminate the timestamp status from the frame status. > > > > Hi Tudor, > > Would you mind separate timestamp status fix to its own patch? I think > better to > fix 'ignoring full Tx ring' first, to not make it dependent to timestamp > patch. > > > Signed-off-by: Mihai Pogonaru > > Signed-off-by: Tudor Cornea > > --- > > drivers/net/af_packet/rte_eth_af_packet.c | 47 > +++++++++++++++++++++++++++++-- > > 1 file changed, 45 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c > b/drivers/net/af_packet/rte_eth_af_packet.c > > index b73b211..3845df5 100644 > > --- a/drivers/net/af_packet/rte_eth_af_packet.c > > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > > @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > return num_rx; > > } > > > > +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status) > > +{ > > + return *tp_status & > > + ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE); > > I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the > kernel > versions the bug exists, this flag is not set, but can you please confirm? > > > +} > > + > > /* > > * Callback to handle sending packets through a real NIC. > > */ > > @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > } > > } > > > > + /* > > + * We must eliminate the timestamp status from the packet > > + * status. This should only matter if timestamping is > enabled > > + * on the socket, but there is a BUG in the kernel which is > > + * fixed in newer releases. > > + > > + * For interfaces of type 'veth', the sent skb is forwarded > > + * to the peer and back into the network stack which > timestamps > > + * it on the RX path if timestamping is enabled globally > > + * (which happens if any socket enables timestamping). > > + > > + * When the skb is destructed, tpacket_destruct_skb() is > called > > + * and it calls __packet_set_timestamp() which doesn't > check > > + * the flags on the socket and returns the timestamp if it > is > > + * set in the skb (and for veth it is, as mentioned above). > > + */ > > + > > Can you give some more details on this bug, any link etc.. > > And does it only seen with veth, if so I wonder if we can ignore it, not > sure > how common to use af_packet PMD over veth interface, do you have this > usecase? > > And if only specific kernel versions impacted from this bug, what do you > think > about adding kernel version check to reduce the scope of the fix in > af_packet PMD. > > > /* point at the next incoming frame */ > > - if ((ppd->tp_status != TP_STATUS_AVAILABLE) && > > - (poll(&pfd, 1, -1) < 0)) > > + if ((tx_ring_status_remove_ts(&ppd->tp_status) > > + != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0)) > > + break; > > + > > + /* > > + * Poll can return POLLERR if the interface is down or > POLLOUT, > > + * even if there are no extra buffers available. > > + * This happens, because packet_poll() calls > datagram_poll() > > + * which checks the space left in the socket buffer and in > the > > + * case of packet_mmap the default socket buffer length > > + * doesn't match the requested size for the tx_ring so > there > > + * is always space left in socket buffer, which doesn't > seem > > + * to be correlated to the requested size for the tx_ring > > + * in packet_mmap. > > + */ > > + if (tx_ring_status_remove_ts(&ppd->tp_status) > > + != TP_STATUS_AVAILABLE) > > break; > > In this case should we break or poll again? > > If 'POLLERR' is received when interface is down, it makes sense to break. > Do you > know if is there any other case that 'POLLERR' is returned? > > And for 'POLLOUT', when exactly this event sent? If 'POLLOUT' received, > can we > poll again to wait Tx ring available to send more packets? > > > > > /* copy the tx frame data */ > > @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > rte_pktmbuf_free(mbuf); > > } > > > > + /* > > + * We might have to ignore a few more errnos here since the packets > > + * remain in the mmap-ed queue and will be sent later, presumably. > > + */ > > + > > Can you please describe above comment more? What errors ignored? > And won't all packets sent will below sendto() call? > > > /* kick-off transmits */ > > if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && > > errno != ENOBUFS && errno != EAGAIN) { > > > >