* [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx @ 2021-08-20 13:39 Tudor Cornea 2021-09-01 16:34 ` Ferruh Yigit 2021-09-13 13:45 ` [dpdk-dev] [PATCH v2] " Tudor Cornea 0 siblings, 2 replies; 14+ messages in thread From: Tudor Cornea @ 2021-08-20 13:39 UTC (permalink / raw) To: linville; +Cc: thomas, dev, Tudor Cornea, Mihai Pogonaru The poll call can return POLLERR which is ignored, or it can return POLLOUT, even if there are no free frames in the mmap-ed area. We can account for both of these cases by re-checking if the next frame is empty before writing into it. We also now eliminate the timestamp status from the frame status. Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 47 +++++++++++++++++++++++++++++-- 1 file changed, 45 insertions(+), 2 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..3845df5 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) return num_rx; } +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status) +{ + return *tp_status & + ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE); +} + /* * Callback to handle sending packets through a real NIC. */ @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } } + /* + * We must eliminate the timestamp status from the packet + * status. This should only matter if timestamping is enabled + * on the socket, but there is a BUG in the kernel which is + * fixed in newer releases. + + * For interfaces of type 'veth', the sent skb is forwarded + * to the peer and back into the network stack which timestamps + * it on the RX path if timestamping is enabled globally + * (which happens if any socket enables timestamping). + + * When the skb is destructed, tpacket_destruct_skb() is called + * and it calls __packet_set_timestamp() which doesn't check + * the flags on the socket and returns the timestamp if it is + * set in the skb (and for veth it is, as mentioned above). + */ + /* point at the next incoming frame */ - if ((ppd->tp_status != TP_STATUS_AVAILABLE) && - (poll(&pfd, 1, -1) < 0)) + if ((tx_ring_status_remove_ts(&ppd->tp_status) + != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0)) + break; + + /* + * Poll can return POLLERR if the interface is down or POLLOUT, + * even if there are no extra buffers available. + * This happens, because packet_poll() calls datagram_poll() + * which checks the space left in the socket buffer and in the + * case of packet_mmap the default socket buffer length + * doesn't match the requested size for the tx_ring so there + * is always space left in socket buffer, which doesn't seem + * to be correlated to the requested size for the tx_ring + * in packet_mmap. + */ + if (tx_ring_status_remove_ts(&ppd->tp_status) + != TP_STATUS_AVAILABLE) break; /* copy the tx frame data */ @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) rte_pktmbuf_free(mbuf); } + /* + * We might have to ignore a few more errnos here since the packets + * remain in the mmap-ed queue and will be sent later, presumably. + */ + /* kick-off transmits */ if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && errno != ENOBUFS && errno != EAGAIN) { -- 2.7.4 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx 2021-08-20 13:39 [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx Tudor Cornea @ 2021-09-01 16:34 ` Ferruh Yigit 2021-09-06 10:23 ` Tudor Cornea 2021-09-13 13:45 ` [dpdk-dev] [PATCH v2] " Tudor Cornea 1 sibling, 1 reply; 14+ messages in thread From: Ferruh Yigit @ 2021-09-01 16:34 UTC (permalink / raw) To: Tudor Cornea, linville; +Cc: thomas, dev, Mihai Pogonaru On 8/20/2021 2:39 PM, Tudor Cornea wrote: > The poll call can return POLLERR which is ignored, or it can return > POLLOUT, even if there are no free frames in the mmap-ed area. > > We can account for both of these cases by re-checking if the next > frame is empty before writing into it. > > We also now eliminate the timestamp status from the frame status. > Hi Tudor, Would you mind separate timestamp status fix to its own patch? I think better to fix 'ignoring full Tx ring' first, to not make it dependent to timestamp patch. > Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> > Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> > --- > drivers/net/af_packet/rte_eth_af_packet.c | 47 +++++++++++++++++++++++++++++-- > 1 file changed, 45 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c > index b73b211..3845df5 100644 > --- a/drivers/net/af_packet/rte_eth_af_packet.c > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) > return num_rx; > } > > +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status) > +{ > + return *tp_status & > + ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE); I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the kernel versions the bug exists, this flag is not set, but can you please confirm? > +} > + > /* > * Callback to handle sending packets through a real NIC. > */ > @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) > } > } > > + /* > + * We must eliminate the timestamp status from the packet > + * status. This should only matter if timestamping is enabled > + * on the socket, but there is a BUG in the kernel which is > + * fixed in newer releases. > + > + * For interfaces of type 'veth', the sent skb is forwarded > + * to the peer and back into the network stack which timestamps > + * it on the RX path if timestamping is enabled globally > + * (which happens if any socket enables timestamping). > + > + * When the skb is destructed, tpacket_destruct_skb() is called > + * and it calls __packet_set_timestamp() which doesn't check > + * the flags on the socket and returns the timestamp if it is > + * set in the skb (and for veth it is, as mentioned above). > + */ > + Can you give some more details on this bug, any link etc.. And does it only seen with veth, if so I wonder if we can ignore it, not sure how common to use af_packet PMD over veth interface, do you have this usecase? And if only specific kernel versions impacted from this bug, what do you think about adding kernel version check to reduce the scope of the fix in af_packet PMD. > /* point at the next incoming frame */ > - if ((ppd->tp_status != TP_STATUS_AVAILABLE) && > - (poll(&pfd, 1, -1) < 0)) > + if ((tx_ring_status_remove_ts(&ppd->tp_status) > + != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0)) > + break; > + > + /* > + * Poll can return POLLERR if the interface is down or POLLOUT, > + * even if there are no extra buffers available. > + * This happens, because packet_poll() calls datagram_poll() > + * which checks the space left in the socket buffer and in the > + * case of packet_mmap the default socket buffer length > + * doesn't match the requested size for the tx_ring so there > + * is always space left in socket buffer, which doesn't seem > + * to be correlated to the requested size for the tx_ring > + * in packet_mmap. > + */ > + if (tx_ring_status_remove_ts(&ppd->tp_status) > + != TP_STATUS_AVAILABLE) > break; In this case should we break or poll again? If 'POLLERR' is received when interface is down, it makes sense to break. Do you know if is there any other case that 'POLLERR' is returned? And for 'POLLOUT', when exactly this event sent? If 'POLLOUT' received, can we poll again to wait Tx ring available to send more packets? > > /* copy the tx frame data */ > @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) > rte_pktmbuf_free(mbuf); > } > > + /* > + * We might have to ignore a few more errnos here since the packets > + * remain in the mmap-ed queue and will be sent later, presumably. > + */ > + Can you please describe above comment more? What errors ignored? And won't all packets sent will below sendto() call? > /* kick-off transmits */ > if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && > errno != ENOBUFS && errno != EAGAIN) { > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx 2021-09-01 16:34 ` Ferruh Yigit @ 2021-09-06 10:23 ` Tudor Cornea 2021-09-20 17:11 ` Ferruh Yigit 0 siblings, 1 reply; 14+ messages in thread From: Tudor Cornea @ 2021-09-06 10:23 UTC (permalink / raw) To: Ferruh Yigit; +Cc: linville, thomas, dev, Mihai Pogonaru Hi Ferruh, Would you mind separate timestamp status fix to its own patch? I think > better to > fix 'ignoring full Tx ring' first, to not make it dependent to timestamp > patch. Agreed. There are two issues solved by this patch. We will break it in two different patches. I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the > kernel > versions the bug exists, this flag is not set, but can you please confirm? And does it only seen with veth, if so I wonder if we can ignore it, not > sure > how common to use af_packet PMD over veth interface, do you have this > usecase? We've seen the timestamping issue only when running af_packet over veth interfaces. We have a particular use-case internally, in which we need to run inside a Kubernetes cluster. We've found the following resources [1] , [2] related to this behavior in the kernel. We believe that issue #2 (the ring getting full), can theoretically occur on any type of NIC. We managed to reproduce the bursty behavior on af_packet PMD over vmxnet3 interface, by Tx-ing packets at a low rate (e.g ~340 pps), and toggling the interface on / off ifconfig $iface_name down; sleep 10; ifconfig $iface_name up We will attempt to give more context on the issue below, about what we think happens: - we have a 2048 queue shared between the kernel and the dpdk socket, there's an index the queue in both the kernel and the dpdk driver - the dpdk driver writes a packet or a burst, advances its idx and tells the kernel to send the packets via a call to sendto() and the kernel sends the packets and advances its idx - once the interface is down the kernel can no longer send packets, but it doesn't drop them, it just doesn't advance its idx - for each packet there is header and in the header there is a status integer which, among others, indicates the owner of the packet: the userspace or the kernel - the userspace (dpdk driver) sets the status as owned by the kernel when it adds another packet ; the kernel sets the status back to owned by the userspace once it sends a packet - the dpdk driver was ignoring this status bit and, even after the queue was full, it would continue to put packets in the queue - its idx would be "after" that of the kernel - once the interface is brought up, the kernel would send all the packets in the queue (since they have the status of being owned by the kernel) on the next call to sendto() and the idx would be back to where it was before the interface was brought up (let's call it k1) - the dpdk driver idx into the queue would point somewhere in the queue (let's call it d1) and would continue to add packets at that point, but the kernel wouldn't send any packet anymore since there is now a gap of packets owned by the userspace between the kernel index (k1) and the dpdk driver idx (d1) - the dpdk idx would eventually reach k1 and packets would be transferred at a normal rate until both the dpdk idx and the kernel idx would reach d1 again - between d1 and k1 there are only packets with the status as owned by the kernel - which where added by the dpdk driver while its index was between d1 and k1 ; thus the kernel would burst all the packets till k1, while the dpdk idx is at d1 - the cycle repeats If a new traffic config comes (in our application) while this cycle is happening, it could be that some of the packets of the old config are still in queue (between d1 and k1) and will be bursted when the dpdk and kernel idx reach d1 ; this would explain seeing packets from an old config, but only in the first 2048 packets (which is the queue size) [1] https://www.spinics.net/lists/kernel/msg3959391.html [2] https://www.spinics.net/lists/netdev/msg739372.html On Wed, 1 Sept 2021 at 19:34, Ferruh Yigit <ferruh.yigit@intel.com> wrote: > On 8/20/2021 2:39 PM, Tudor Cornea wrote: > > The poll call can return POLLERR which is ignored, or it can return > > POLLOUT, even if there are no free frames in the mmap-ed area. > > > > We can account for both of these cases by re-checking if the next > > frame is empty before writing into it. > > > > We also now eliminate the timestamp status from the frame status. > > > > Hi Tudor, > > Would you mind separate timestamp status fix to its own patch? I think > better to > fix 'ignoring full Tx ring' first, to not make it dependent to timestamp > patch. > > > Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> > > Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> > > --- > > drivers/net/af_packet/rte_eth_af_packet.c | 47 > +++++++++++++++++++++++++++++-- > > 1 file changed, 45 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c > b/drivers/net/af_packet/rte_eth_af_packet.c > > index b73b211..3845df5 100644 > > --- a/drivers/net/af_packet/rte_eth_af_packet.c > > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > > @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > return num_rx; > > } > > > > +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status) > > +{ > > + return *tp_status & > > + ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE); > > I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the > kernel > versions the bug exists, this flag is not set, but can you please confirm? > > > +} > > + > > /* > > * Callback to handle sending packets through a real NIC. > > */ > > @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > } > > } > > > > + /* > > + * We must eliminate the timestamp status from the packet > > + * status. This should only matter if timestamping is > enabled > > + * on the socket, but there is a BUG in the kernel which is > > + * fixed in newer releases. > > + > > + * For interfaces of type 'veth', the sent skb is forwarded > > + * to the peer and back into the network stack which > timestamps > > + * it on the RX path if timestamping is enabled globally > > + * (which happens if any socket enables timestamping). > > + > > + * When the skb is destructed, tpacket_destruct_skb() is > called > > + * and it calls __packet_set_timestamp() which doesn't > check > > + * the flags on the socket and returns the timestamp if it > is > > + * set in the skb (and for veth it is, as mentioned above). > > + */ > > + > > Can you give some more details on this bug, any link etc.. > > And does it only seen with veth, if so I wonder if we can ignore it, not > sure > how common to use af_packet PMD over veth interface, do you have this > usecase? > > And if only specific kernel versions impacted from this bug, what do you > think > about adding kernel version check to reduce the scope of the fix in > af_packet PMD. > > > /* point at the next incoming frame */ > > - if ((ppd->tp_status != TP_STATUS_AVAILABLE) && > > - (poll(&pfd, 1, -1) < 0)) > > + if ((tx_ring_status_remove_ts(&ppd->tp_status) > > + != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0)) > > + break; > > + > > + /* > > + * Poll can return POLLERR if the interface is down or > POLLOUT, > > + * even if there are no extra buffers available. > > + * This happens, because packet_poll() calls > datagram_poll() > > + * which checks the space left in the socket buffer and in > the > > + * case of packet_mmap the default socket buffer length > > + * doesn't match the requested size for the tx_ring so > there > > + * is always space left in socket buffer, which doesn't > seem > > + * to be correlated to the requested size for the tx_ring > > + * in packet_mmap. > > + */ > > + if (tx_ring_status_remove_ts(&ppd->tp_status) > > + != TP_STATUS_AVAILABLE) > > break; > > In this case should we break or poll again? > > If 'POLLERR' is received when interface is down, it makes sense to break. > Do you > know if is there any other case that 'POLLERR' is returned? > > And for 'POLLOUT', when exactly this event sent? If 'POLLOUT' received, > can we > poll again to wait Tx ring available to send more packets? > > > > > /* copy the tx frame data */ > > @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > rte_pktmbuf_free(mbuf); > > } > > > > + /* > > + * We might have to ignore a few more errnos here since the packets > > + * remain in the mmap-ed queue and will be sent later, presumably. > > + */ > > + > > Can you please describe above comment more? What errors ignored? > And won't all packets sent will below sendto() call? > > > /* kick-off transmits */ > > if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && > > errno != ENOBUFS && errno != EAGAIN) { > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx 2021-09-06 10:23 ` Tudor Cornea @ 2021-09-20 17:11 ` Ferruh Yigit 0 siblings, 0 replies; 14+ messages in thread From: Ferruh Yigit @ 2021-09-20 17:11 UTC (permalink / raw) To: Tudor Cornea; +Cc: linville, thomas, dev, Mihai Pogonaru On 9/6/2021 11:23 AM, Tudor Cornea wrote: > Hi Ferruh, > > Would you mind separate timestamp status fix to its own patch? I think >> better to >> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp >> patch. > > > Agreed. There are two issues solved by this patch. We will break it in two > different patches. > > I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the >> kernel >> versions the bug exists, this flag is not set, but can you please confirm? > > > And does it only seen with veth, if so I wonder if we can ignore it, not >> sure >> how common to use af_packet PMD over veth interface, do you have this >> usecase? > > > We've seen the timestamping issue only when running af_packet over > veth interfaces. We have a particular use-case internally, in which we need > to run inside a Kubernetes cluster. > We've found the following resources [1] , [2] related to this behavior in > the kernel. > > We believe that issue #2 (the ring getting full), can theoretically occur > on any type of NIC. > We managed to reproduce the bursty behavior on af_packet PMD over vmxnet3 > interface, by Tx-ing packets at a low rate (e.g ~340 pps), and toggling the > interface on / off > ifconfig $iface_name down; sleep 10; ifconfig $iface_name up > > We will attempt to give more context on the issue below, about what we > think happens: > - we have a 2048 queue shared between the kernel and the dpdk socket, > there's an index the queue in both the kernel and the dpdk driver > - the dpdk driver writes a packet or a burst, advances its idx and tells > the kernel to send the packets via a call to sendto() and the kernel sends > the packets and advances its idx > - once the interface is down the kernel can no longer send packets, but it > doesn't drop them, it just doesn't advance its idx > - for each packet there is header and in the header there is a status > integer which, among others, indicates the owner of the packet: the > userspace or the kernel - the userspace (dpdk driver) sets the status as > owned by the kernel when it adds another packet ; the kernel sets the > status back to owned by the userspace once it sends a packet > - the dpdk driver was ignoring this status bit and, even after the queue > was full, it would continue to put packets in the queue - its idx would be > "after" that of the kernel > - once the interface is brought up, the kernel would send all the packets > in the queue (since they have the status of being owned by the kernel) on > the next call to sendto() and the idx would be back to where it was before > the interface was brought up (let's call it k1) > - the dpdk driver idx into the queue would point somewhere in the queue > (let's call it d1) and would continue to add packets at that point, but the > kernel wouldn't send any packet anymore since there is now a gap of packets > owned by the userspace between the kernel index (k1) and the dpdk driver > idx (d1) > - the dpdk idx would eventually reach k1 and packets would be transferred > at a normal rate until both the dpdk idx and the kernel idx would reach d1 > again > - between d1 and k1 there are only packets with the status as owned by the > kernel - which where added by the dpdk driver while its index was between > d1 and k1 ; thus the kernel would burst all the packets till k1, while the > dpdk idx is at d1 > - the cycle repeats > > If a new traffic config comes (in our application) while this cycle is > happening, it could be that some of the packets of the old config are still > in queue (between d1 and k1) and will be bursted when the dpdk and kernel > idx reach d1 ; this would explain seeing packets from an old config, but > only in the first 2048 packets (which is the queue size) > > Hi Tudor, If there is an usage on of veth, OK to fix the timestamps issue. What you described above looks like a ring buffer with single producer and single consumer, and producer overwrites the not consumed items. I assume this happens because af_packet (consumer) can't send the packets because of the timestamp defect. (Also producer (dpdk app) should have checks to prevent overwrite, but that is a different issue.) I will comment to the new versions of the patches. Our of curiosity, are you using an modified af_packet implementation in kernel for above described usage? > [1] https://www.spinics.net/lists/kernel/msg3959391.html > [2] https://www.spinics.net/lists/netdev/msg739372.html > > On Wed, 1 Sept 2021 at 19:34, Ferruh Yigit <ferruh.yigit@intel.com> wrote: > >> On 8/20/2021 2:39 PM, Tudor Cornea wrote: >>> The poll call can return POLLERR which is ignored, or it can return >>> POLLOUT, even if there are no free frames in the mmap-ed area. >>> >>> We can account for both of these cases by re-checking if the next >>> frame is empty before writing into it. >>> >>> We also now eliminate the timestamp status from the frame status. >>> >> >> Hi Tudor, >> >> Would you mind separate timestamp status fix to its own patch? I think >> better to >> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp >> patch. >> >>> Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> >>> Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> >>> --- >>> drivers/net/af_packet/rte_eth_af_packet.c | 47 >> +++++++++++++++++++++++++++++-- >>> 1 file changed, 45 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c >> b/drivers/net/af_packet/rte_eth_af_packet.c >>> index b73b211..3845df5 100644 >>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>> @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf >> **bufs, uint16_t nb_pkts) >>> return num_rx; >>> } >>> >>> +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status) >>> +{ >>> + return *tp_status & >>> + ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE); >> >> I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the >> kernel >> versions the bug exists, this flag is not set, but can you please confirm? >> >>> +} >>> + >>> /* >>> * Callback to handle sending packets through a real NIC. >>> */ >>> @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf >> **bufs, uint16_t nb_pkts) >>> } >>> } >>> >>> + /* >>> + * We must eliminate the timestamp status from the packet >>> + * status. This should only matter if timestamping is >> enabled >>> + * on the socket, but there is a BUG in the kernel which is >>> + * fixed in newer releases. >>> + >>> + * For interfaces of type 'veth', the sent skb is forwarded >>> + * to the peer and back into the network stack which >> timestamps >>> + * it on the RX path if timestamping is enabled globally >>> + * (which happens if any socket enables timestamping). >>> + >>> + * When the skb is destructed, tpacket_destruct_skb() is >> called >>> + * and it calls __packet_set_timestamp() which doesn't >> check >>> + * the flags on the socket and returns the timestamp if it >> is >>> + * set in the skb (and for veth it is, as mentioned above). >>> + */ >>> + >> >> Can you give some more details on this bug, any link etc.. >> >> And does it only seen with veth, if so I wonder if we can ignore it, not >> sure >> how common to use af_packet PMD over veth interface, do you have this >> usecase? >> >> And if only specific kernel versions impacted from this bug, what do you >> think >> about adding kernel version check to reduce the scope of the fix in >> af_packet PMD. >> >>> /* point at the next incoming frame */ >>> - if ((ppd->tp_status != TP_STATUS_AVAILABLE) && >>> - (poll(&pfd, 1, -1) < 0)) >>> + if ((tx_ring_status_remove_ts(&ppd->tp_status) >>> + != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0)) >>> + break; >>> + >>> + /* >>> + * Poll can return POLLERR if the interface is down or >> POLLOUT, >>> + * even if there are no extra buffers available. >>> + * This happens, because packet_poll() calls >> datagram_poll() >>> + * which checks the space left in the socket buffer and in >> the >>> + * case of packet_mmap the default socket buffer length >>> + * doesn't match the requested size for the tx_ring so >> there >>> + * is always space left in socket buffer, which doesn't >> seem >>> + * to be correlated to the requested size for the tx_ring >>> + * in packet_mmap. >>> + */ >>> + if (tx_ring_status_remove_ts(&ppd->tp_status) >>> + != TP_STATUS_AVAILABLE) >>> break; >> >> In this case should we break or poll again? >> >> If 'POLLERR' is received when interface is down, it makes sense to break. >> Do you >> know if is there any other case that 'POLLERR' is returned? >> >> And for 'POLLOUT', when exactly this event sent? If 'POLLOUT' received, >> can we >> poll again to wait Tx ring available to send more packets? >> >>> >>> /* copy the tx frame data */ >>> @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf >> **bufs, uint16_t nb_pkts) >>> rte_pktmbuf_free(mbuf); >>> } >>> >>> + /* >>> + * We might have to ignore a few more errnos here since the packets >>> + * remain in the mmap-ed queue and will be sent later, presumably. >>> + */ >>> + >> >> Can you please describe above comment more? What errors ignored? >> And won't all packets sent will below sendto() call? >> >>> /* kick-off transmits */ >>> if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && >>> errno != ENOBUFS && errno != EAGAIN) { >>> >> >> ^ permalink raw reply [flat|nested] 14+ messages in thread
* [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx 2021-08-20 13:39 [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx Tudor Cornea 2021-09-01 16:34 ` Ferruh Yigit @ 2021-09-13 13:45 ` Tudor Cornea 2021-09-20 17:44 ` Ferruh Yigit 2021-11-02 15:47 ` [dpdk-dev] [PATCH v3] " Tudor Cornea 1 sibling, 2 replies; 14+ messages in thread From: Tudor Cornea @ 2021-09-13 13:45 UTC (permalink / raw) To: ferruh.yigit; +Cc: linville, thomas, pogonarumihai, dev, Tudor Cornea The poll call can return POLLERR which is ignored, or it can return POLLOUT, even if there are no free frames in the mmap-ed area. We can account for both of these cases by re-checking if the next frame is empty before writing into it. Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..087c196 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -216,6 +216,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) (poll(&pfd, 1, -1) < 0)) break; + /* + * Poll can return POLLERR if the interface is down + * + * It will almost always return POLLOUT, even if there + * are no extra buffers available + * + * This happens, because packet_poll() calls datagram_poll() + * which checks the space left in the socket buffer and, + * in the case of packet_mmap, the default socket buffer length + * doesn't match the requested size for the tx_ring. + * As such, there is almost always space left in socket buffer, + * which doesn't seem to be correlated to the requested size + * for the tx_ring in packet_mmap. + * + * This results in poll() returning POLLOUT. + */ + if (ppd->tp_status != TP_STATUS_AVAILABLE) + break; + /* copy the tx frame data */ pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - sizeof(struct sockaddr_ll); -- 2.7.4 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx 2021-09-13 13:45 ` [dpdk-dev] [PATCH v2] " Tudor Cornea @ 2021-09-20 17:44 ` Ferruh Yigit 2021-09-29 10:03 ` Tudor Cornea 2021-11-02 15:47 ` [dpdk-dev] [PATCH v3] " Tudor Cornea 1 sibling, 1 reply; 14+ messages in thread From: Ferruh Yigit @ 2021-09-20 17:44 UTC (permalink / raw) To: Tudor Cornea; +Cc: linville, thomas, pogonarumihai, dev On 9/13/2021 2:45 PM, Tudor Cornea wrote: > The poll call can return POLLERR which is ignored, or it can return > POLLOUT, even if there are no free frames in the mmap-ed area. > > We can account for both of these cases by re-checking if the next > frame is empty before writing into it. > > Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> > Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> > --- > drivers/net/af_packet/rte_eth_af_packet.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c > index b73b211..087c196 100644 > --- a/drivers/net/af_packet/rte_eth_af_packet.c > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > @@ -216,6 +216,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) > (poll(&pfd, 1, -1) < 0)) > break; > > + /* > + * Poll can return POLLERR if the interface is down > + * > + * It will almost always return POLLOUT, even if there > + * are no extra buffers available > + * > + * This happens, because packet_poll() calls datagram_poll() > + * which checks the space left in the socket buffer and, > + * in the case of packet_mmap, the default socket buffer length > + * doesn't match the requested size for the tx_ring. > + * As such, there is almost always space left in socket buffer, > + * which doesn't seem to be correlated to the requested size > + * for the tx_ring in packet_mmap. > + * > + * This results in poll() returning POLLOUT. > + */ > + if (ppd->tp_status != TP_STATUS_AVAILABLE) > + break; > + If 'POLLOUT' doesn't indicate that there is space in the buffer, what is the point of the 'poll()' at all? What can we test/reproduce the mentioned behavior? Or is there a way to fix the behavior of poll() or use an alternative of it? OK to break on the 'POLLERR', I guess it can be detected in the 'pfd.revent'. > /* copy the tx frame data */ > pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - > sizeof(struct sockaddr_ll); > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx 2021-09-20 17:44 ` Ferruh Yigit @ 2021-09-29 10:03 ` Tudor Cornea 2021-10-05 15:11 ` Tudor Cornea 0 siblings, 1 reply; 14+ messages in thread From: Tudor Cornea @ 2021-09-29 10:03 UTC (permalink / raw) To: Ferruh Yigit; +Cc: linville, Thomas Monjalon, Mihai Pogonaru, dev Hi Ferruh, What you described above looks like a ring buffer with single producer and > single consumer, and producer overwrites the not consumed items. Indeed. This is also my understanding of the bug. I am going to try to isolate the issue, and should probably be able to come up with a script in a few days. Our of curiosity, are you using an modified af_packet implementation in > kernel > for above described usage? We are currently using an Ubuntu-based distro with a 4.15 Linux kernel. We don't have any kernel patches for the af_packet implementation to my knowledge (probably excepting patches that are back-ported by Ubuntu maintainers from newer releases). On Mon, 20 Sept 2021 at 20:44, Ferruh Yigit <ferruh.yigit@intel.com> wrote: > On 9/13/2021 2:45 PM, Tudor Cornea wrote: > > The poll call can return POLLERR which is ignored, or it can return > > POLLOUT, even if there are no free frames in the mmap-ed area. > > > > We can account for both of these cases by re-checking if the next > > frame is empty before writing into it. > > > > Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> > > Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> > > --- > > drivers/net/af_packet/rte_eth_af_packet.c | 19 +++++++++++++++++++ > > 1 file changed, 19 insertions(+) > > > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c > b/drivers/net/af_packet/rte_eth_af_packet.c > > index b73b211..087c196 100644 > > --- a/drivers/net/af_packet/rte_eth_af_packet.c > > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > > @@ -216,6 +216,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > (poll(&pfd, 1, -1) < 0)) > > break; > > > > + /* > > + * Poll can return POLLERR if the interface is down > > + * > > + * It will almost always return POLLOUT, even if there > > + * are no extra buffers available > > + * > > + * This happens, because packet_poll() calls > datagram_poll() > > + * which checks the space left in the socket buffer and, > > + * in the case of packet_mmap, the default socket buffer > length > > + * doesn't match the requested size for the tx_ring. > > + * As such, there is almost always space left in socket > buffer, > > + * which doesn't seem to be correlated to the requested > size > > + * for the tx_ring in packet_mmap. > > + * > > + * This results in poll() returning POLLOUT. > > + */ > > + if (ppd->tp_status != TP_STATUS_AVAILABLE) > > + break; > > + > > If 'POLLOUT' doesn't indicate that there is space in the buffer, what is > the > point of the 'poll()' at all? > > What can we test/reproduce the mentioned behavior? Or is there a way to > fix the > behavior of poll() or use an alternative of it? > > > OK to break on the 'POLLERR', I guess it can be detected in the > 'pfd.revent'. > > > > /* copy the tx frame data */ > > pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - > > sizeof(struct sockaddr_ll); > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx 2021-09-29 10:03 ` Tudor Cornea @ 2021-10-05 15:11 ` Tudor Cornea 2021-10-26 14:30 ` Ferruh Yigit 0 siblings, 1 reply; 14+ messages in thread From: Tudor Cornea @ 2021-10-05 15:11 UTC (permalink / raw) To: Ferruh Yigit; +Cc: linville, Thomas Monjalon, Mihai Pogonaru, dev Hi Ferruh, I have attempted to narrow down the issue. I have the following bash script, which computes packet rates on an interface. [root@localhost ~]# cat compute-rates.sh #!/usr/bin/env bash if [[ ${#} -ne 2 ]]; then echo "Usage: ${0} <iface-name> <sleep-interval-seconds>" exit 1 fi IFACE_NAME="${1}" SLEEP_INTERVAL_SECONDS="${2}" TMP_STATS_FILE="/tmp/netstat" # Clear Previous stats file echo "0 0 0 0" > "${TMP_STATS_FILE}" echo "Press CTRL+C to exit..." while true; do export "RxB=0" "RxP=0" "TxB=0" "TxP=0" # Extract Rx{Bytes,Packets} and Tx{Bytes,Packets} and # format the output. Individual fields will be exported export $(\ ifconfig "${IFACE_NAME}" \ | grep 'packets' \ | awk '{print $5, $3}' \ | xargs echo \ | sed -E -e \ "s/([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+)/RxB=\1 RxP=\2 TxB=\3 TxP=\4/") # Print Packet and Byte Rates # Format: | Rx Bytes | Rx Packets | Tx Bytes | Tx Packets | echo "${RxB}" "${RxP}" "${TxB}" "${TxP}" $(cat "${TMP_STATS_FILE}") \ | awk '{print "RxB="$1-$5, "RxP="$2-$6, "TxB="$3-$7, "TxP="$4-$8}' # Save the new values echo "${RxB}" "${RxP}" "${TxB}" "${TxP}" > "${TMP_STATS_FILE}" sleep "${SLEEP_INTERVAL_SECONDS}" done On the transmit side, I'm using the engine behind [1] with the af_packet PMD. The configuration for the af_packet PMD is the following: --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192,framecnt=2048,qpairs=1,qdisc_bypass=0 I'm configuring a Tx rate of 335 packets / second and a packet size of 300 Bytes. These seem to be the values using which we seem to have better chances of seeing the problem. I suspect it might also be linked with the af_packet configuration. I'm starting traffic using the specified configuration, and in parallel, running the script that computes the rates as follows: ./compute-rates.sh eth1 0.1 Initially, the packet rates seem steady RxB=0 RxP=0 TxB=10952 TxP=37 RxB=0 RxP=0 TxB=10656 TxP=36 RxB=0 RxP=0 TxB=10656 TxP=36 RxB=0 RxP=0 TxB=10656 TxP=36 RxB=0 RxP=0 TxB=10952 TxP=37 RxB=0 RxP=0 TxB=10952 TxP=37 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10952 TxP=37 [...] After a while, we toggle the interface up / down with a sleep between the steps. I suspect the length of the sleep might be a variable in the equation. ifconfig eth1 down; sleep 7; ifconfig eth1 up What we see, is that even after the interface is toggled back up, the rates never seem to recover. RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=2072 TxP=7 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=521256 TxP=1761 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 [...] I've attempted to mirror the same behavior using dpdk-pktgen [2] on a different machine (Ubuntu 20.04). This time, af_packet runs on top of a Linux virtio_net interface. I seem to be getting a similar behavior. I have used the following dpdk-pktgen configuration and run-time settings pktgen \ -l 1-4 \ -n 4 \ --proc-type=primary \ --no-pci \ --no-telemetry \ --no-huge \ -m 512 \ --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192,framecnt=2048,qpairs=1,qdisc_bypass=0 \ -- \ -P \ -T \ -m "3.0" \ -f themes/black-yellow.theme set 0 size 300 set 0 rate 0.008 set 0 burst 1 start 0 [1] https://github.com/open-traffic-generator/ixia-c [2] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md On Wed, 29 Sept 2021 at 13:03, Tudor Cornea <tudor.cornea@gmail.com> wrote: > Hi Ferruh, > > What you described above looks like a ring buffer with single producer and >> single consumer, and producer overwrites the not consumed items. > > > Indeed. This is also my understanding of the bug. > I am going to try to isolate the issue, and should probably be able to > come up with a script in a few days. > > Our of curiosity, are you using an modified af_packet implementation in >> kernel >> for above described usage? > > > We are currently using an Ubuntu-based distro with a 4.15 Linux kernel. > We don't have any kernel patches for the af_packet implementation to my > knowledge (probably excepting patches that are back-ported by Ubuntu > maintainers from newer releases). > > > On Mon, 20 Sept 2021 at 20:44, Ferruh Yigit <ferruh.yigit@intel.com> > wrote: > >> On 9/13/2021 2:45 PM, Tudor Cornea wrote: >> > The poll call can return POLLERR which is ignored, or it can return >> > POLLOUT, even if there are no free frames in the mmap-ed area. >> > >> > We can account for both of these cases by re-checking if the next >> > frame is empty before writing into it. >> > >> > Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> >> > Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> >> > --- >> > drivers/net/af_packet/rte_eth_af_packet.c | 19 +++++++++++++++++++ >> > 1 file changed, 19 insertions(+) >> > >> > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c >> b/drivers/net/af_packet/rte_eth_af_packet.c >> > index b73b211..087c196 100644 >> > --- a/drivers/net/af_packet/rte_eth_af_packet.c >> > +++ b/drivers/net/af_packet/rte_eth_af_packet.c >> > @@ -216,6 +216,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf >> **bufs, uint16_t nb_pkts) >> > (poll(&pfd, 1, -1) < 0)) >> > break; >> > >> > + /* >> > + * Poll can return POLLERR if the interface is down >> > + * >> > + * It will almost always return POLLOUT, even if there >> > + * are no extra buffers available >> > + * >> > + * This happens, because packet_poll() calls >> datagram_poll() >> > + * which checks the space left in the socket buffer and, >> > + * in the case of packet_mmap, the default socket buffer >> length >> > + * doesn't match the requested size for the tx_ring. >> > + * As such, there is almost always space left in socket >> buffer, >> > + * which doesn't seem to be correlated to the requested >> size >> > + * for the tx_ring in packet_mmap. >> > + * >> > + * This results in poll() returning POLLOUT. >> > + */ >> > + if (ppd->tp_status != TP_STATUS_AVAILABLE) >> > + break; >> > + >> >> If 'POLLOUT' doesn't indicate that there is space in the buffer, what is >> the >> point of the 'poll()' at all? >> >> What can we test/reproduce the mentioned behavior? Or is there a way to >> fix the >> behavior of poll() or use an alternative of it? >> >> >> OK to break on the 'POLLERR', I guess it can be detected in the >> 'pfd.revent'. >> >> >> > /* copy the tx frame data */ >> > pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - >> > sizeof(struct sockaddr_ll); >> > >> >> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx 2021-10-05 15:11 ` Tudor Cornea @ 2021-10-26 14:30 ` Ferruh Yigit 2021-11-02 15:24 ` Tudor Cornea 0 siblings, 1 reply; 14+ messages in thread From: Ferruh Yigit @ 2021-10-26 14:30 UTC (permalink / raw) To: Tudor Cornea; +Cc: linville, Thomas Monjalon, Mihai Pogonaru, dev On 10/5/2021 4:11 PM, Tudor Cornea wrote: > Hi Ferruh, > > I have attempted to narrow down the issue. > I have the following bash script, which computes packet rates on an > interface. > > [root@localhost ~]# cat compute-rates.sh > #!/usr/bin/env bash > > if [[ ${#} -ne 2 ]]; then > echo "Usage: ${0} <iface-name> <sleep-interval-seconds>" > exit 1 > fi > > IFACE_NAME="${1}" > SLEEP_INTERVAL_SECONDS="${2}" > TMP_STATS_FILE="/tmp/netstat" > > # Clear Previous stats file > echo "0 0 0 0" > "${TMP_STATS_FILE}" > > echo "Press CTRL+C to exit..." > > while true; do > export "RxB=0" "RxP=0" "TxB=0" "TxP=0" > > # Extract Rx{Bytes,Packets} and Tx{Bytes,Packets} and > # format the output. Individual fields will be exported > export $(\ > ifconfig "${IFACE_NAME}" \ > | grep 'packets' \ > | awk '{print $5, $3}' \ > | xargs echo \ > | sed -E -e \ > "s/([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+)/RxB=\1 RxP=\2 > TxB=\3 TxP=\4/") > > # Print Packet and Byte Rates > # Format: | Rx Bytes | Rx Packets | Tx Bytes | Tx Packets | > > echo "${RxB}" "${RxP}" "${TxB}" "${TxP}" $(cat "${TMP_STATS_FILE}") \ > | awk '{print "RxB="$1-$5, "RxP="$2-$6, "TxB="$3-$7, "TxP="$4-$8}' > > # Save the new values > echo "${RxB}" "${RxP}" "${TxB}" "${TxP}" > "${TMP_STATS_FILE}" > > sleep "${SLEEP_INTERVAL_SECONDS}" > > done > > On the transmit side, I'm using the engine behind [1] with the af_packet > PMD. > > The configuration for the af_packet PMD is the following: > --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192,framecnt=2048,qpairs=1,qdisc_bypass=0 > > I'm configuring a Tx rate of 335 packets / second and a packet size of 300 > Bytes. > These seem to be the values using which we seem to have better chances of > seeing the problem. I suspect it might also be linked with the af_packet > configuration. > > I'm starting traffic using the specified configuration, and in parallel, > running the script that computes the rates as follows: > ./compute-rates.sh eth1 0.1 > > Initially, the packet rates seem steady > > RxB=0 RxP=0 TxB=10952 TxP=37 > RxB=0 RxP=0 TxB=10656 TxP=36 > RxB=0 RxP=0 TxB=10656 TxP=36 > RxB=0 RxP=0 TxB=10656 TxP=36 > RxB=0 RxP=0 TxB=10952 TxP=37 > RxB=0 RxP=0 TxB=10952 TxP=37 > RxB=0 RxP=0 TxB=10360 TxP=35 > RxB=0 RxP=0 TxB=10952 TxP=37 > > [...] > > After a while, we toggle the interface up / down with a sleep between the > steps. I suspect the length of the sleep might be a variable in the > equation. > > ifconfig eth1 down; sleep 7; ifconfig eth1 up > > > What we see, is that even after the interface is toggled back up, the rates > never seem to recover. > > RxB=0 RxP=0 TxB=0 TxP=0 > RxB=0 RxP=0 TxB=0 TxP=0 > RxB=0 RxP=0 TxB=0 TxP=0 > RxB=0 RxP=0 TxB=0 TxP=0 > RxB=0 RxP=0 TxB=2072 TxP=7 > RxB=0 RxP=0 TxB=10360 TxP=35 > RxB=0 RxP=0 TxB=10360 TxP=35 > RxB=0 RxP=0 TxB=10360 TxP=35 > RxB=0 RxP=0 TxB=10360 TxP=35 > RxB=0 RxP=0 TxB=10360 TxP=35 > RxB=0 RxP=0 TxB=10360 TxP=35 > RxB=0 RxP=0 TxB=10360 TxP=35 > RxB=0 RxP=0 TxB=10360 TxP=35 > RxB=0 RxP=0 TxB=521256 TxP=1761 > RxB=0 RxP=0 TxB=0 TxP=0 > RxB=0 RxP=0 TxB=0 TxP=0 > RxB=0 RxP=0 TxB=0 TxP=0 > > [...] > > > I've attempted to mirror the same behavior using dpdk-pktgen [2] on a > different machine (Ubuntu 20.04). This time, af_packet runs on top of > a Linux virtio_net interface. > > I seem to be getting a similar behavior. I have used the following > dpdk-pktgen configuration and run-time settings > > > pktgen \ > -l 1-4 \ > -n 4 \ > --proc-type=primary \ > --no-pci \ > --no-telemetry \ > --no-huge \ > -m 512 \ > --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192,framecnt=2048,qpairs=1,qdisc_bypass=0 > \ > -- \ > -P \ > -T \ > -m "3.0" \ > -f themes/black-yellow.theme > > set 0 size 300 > set 0 rate 0.008 > set 0 burst 1 > start 0 > > > [1] https://github.com/open-traffic-generator/ixia-c > [2] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md > > On Wed, 29 Sept 2021 at 13:03, Tudor Cornea <tudor.cornea@gmail.com> wrote: > Hi Tudor, I have used testpmd, 'txonly' forwarding. Tx recovers after interface up, but by adding some debug logs I can see 'poll()' returns with POLLOUT even there is no space in the buffer. According the logic in the PMD, when 'poll()' returns success, it expects to have some space in the Tx buffer. So I agree to add the check. Only have a question on the POLLERR, should we separate the POLLERR check to cover ifdown case, what do you think about following logic: if (!TP_STATUS_AVAILABLE) { if (poll() < 0) break; if (pfd.revents & POLLERR) break; } if (!TP_STATUS_AVAILABLE) break; >> Hi Ferruh, >> >> What you described above looks like a ring buffer with single producer and >>> single consumer, and producer overwrites the not consumed items. >> >> >> Indeed. This is also my understanding of the bug. >> I am going to try to isolate the issue, and should probably be able to >> come up with a script in a few days. >> >> Our of curiosity, are you using an modified af_packet implementation in >>> kernel >>> for above described usage? >> >> >> We are currently using an Ubuntu-based distro with a 4.15 Linux kernel. >> We don't have any kernel patches for the af_packet implementation to my >> knowledge (probably excepting patches that are back-ported by Ubuntu >> maintainers from newer releases). >> >> >> On Mon, 20 Sept 2021 at 20:44, Ferruh Yigit <ferruh.yigit@intel.com> >> wrote: >> >>> On 9/13/2021 2:45 PM, Tudor Cornea wrote: >>>> The poll call can return POLLERR which is ignored, or it can return >>>> POLLOUT, even if there are no free frames in the mmap-ed area. >>>> >>>> We can account for both of these cases by re-checking if the next >>>> frame is empty before writing into it. >>>> >>>> Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> >>>> Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> >>>> --- >>>> drivers/net/af_packet/rte_eth_af_packet.c | 19 +++++++++++++++++++ >>>> 1 file changed, 19 insertions(+) >>>> >>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c >>> b/drivers/net/af_packet/rte_eth_af_packet.c >>>> index b73b211..087c196 100644 >>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>>> @@ -216,6 +216,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf >>> **bufs, uint16_t nb_pkts) >>>> (poll(&pfd, 1, -1) < 0)) >>>> break; >>>> >>>> + /* >>>> + * Poll can return POLLERR if the interface is down >>>> + * >>>> + * It will almost always return POLLOUT, even if there >>>> + * are no extra buffers available >>>> + * >>>> + * This happens, because packet_poll() calls >>> datagram_poll() >>>> + * which checks the space left in the socket buffer and, >>>> + * in the case of packet_mmap, the default socket buffer >>> length >>>> + * doesn't match the requested size for the tx_ring. >>>> + * As such, there is almost always space left in socket >>> buffer, >>>> + * which doesn't seem to be correlated to the requested >>> size >>>> + * for the tx_ring in packet_mmap. >>>> + * >>>> + * This results in poll() returning POLLOUT. >>>> + */ >>>> + if (ppd->tp_status != TP_STATUS_AVAILABLE) >>>> + break; >>>> + >>> >>> If 'POLLOUT' doesn't indicate that there is space in the buffer, what is >>> the >>> point of the 'poll()' at all? >>> >>> What can we test/reproduce the mentioned behavior? Or is there a way to >>> fix the >>> behavior of poll() or use an alternative of it? >>> >>> >>> OK to break on the 'POLLERR', I guess it can be detected in the >>> 'pfd.revent'. >>> >>> >>>> /* copy the tx frame data */ >>>> pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - >>>> sizeof(struct sockaddr_ll); >>>> >>> >>> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx 2021-10-26 14:30 ` Ferruh Yigit @ 2021-11-02 15:24 ` Tudor Cornea 0 siblings, 0 replies; 14+ messages in thread From: Tudor Cornea @ 2021-11-02 15:24 UTC (permalink / raw) To: Ferruh Yigit; +Cc: linville, Thomas Monjalon, Mihai Pogonaru, dev On Tue, 26 Oct 2021 at 17:41, Ferruh Yigit <ferruh.yigit@intel.com> wrote: > Hi Tudor, > > I have used testpmd, 'txonly' forwarding. Tx recovers after interface up, > but by adding some debug logs I can see 'poll()' returns with POLLOUT even > there is no space in the buffer. > > According the logic in the PMD, when 'poll()' returns success, it expects > to have some space in the Tx buffer. > > So I agree to add the check. > > Only have a question on the POLLERR, should we separate the POLLERR check > to cover ifdown case, what do you think about following logic: > > if (!TP_STATUS_AVAILABLE) { > if (poll() < 0) > break; > if (pfd.revents & POLLERR) > break; > } > > if (!TP_STATUS_AVAILABLE) > break; > > Hi Ferruh, Thanks for the suggestion. I was thinking of adding this check, and intuitively, it seems correct. I tried to do some further testing. I tested with the POLLERR check, and I don't see the issue anymore. However, if I remove the second if (!TP_STATUS_AVAILABLE), I seem to get bursts of 2048 packets followed by periods of not sending anything when toggling the interface link state. Without the second if(!TP_STATUS_AVAILABLE) statement, the issue seems to reproduce, regardless if I add the check for POLLERR or not. RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=606208 TxP=2048 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 I will send an updated version of the patch. Thanks, Tudor ^ permalink raw reply [flat|nested] 14+ messages in thread
* [dpdk-dev] [PATCH v3] net/af_packet: fix ignoring full ring on tx 2021-09-13 13:45 ` [dpdk-dev] [PATCH v2] " Tudor Cornea 2021-09-20 17:44 ` Ferruh Yigit @ 2021-11-02 15:47 ` Tudor Cornea 2021-11-02 16:47 ` Ferruh Yigit 2021-11-03 9:31 ` [dpdk-dev] [PATCH v4] " Tudor Cornea 1 sibling, 2 replies; 14+ messages in thread From: Tudor Cornea @ 2021-11-02 15:47 UTC (permalink / raw) To: ferruh.yigit; +Cc: linville, thomas, pogonarumihai, dev, Tudor Cornea The poll call can return POLLERR which is ignored, or it can return POLLOUT, even if there are no free frames in the mmap-ed area. We can account for both of these cases by re-checking if the next frame is empty before writing into it. We have attempted to reproduce this issue with pktgen-dpdk, using the following configuration. pktgen -l 1-4 -n 4 --proc-type=primary --no-pci --no-telemetry \ --no-huge -m 512 \ --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192, \ framecnt=2048,qpairs=1,qdisc_bypass=0 \ -- \ -P \ -T \ -m "3.0" \ -f themes/black-yellow.theme We configure a low tx rate (~ 335 packets / second) and a small packet size, of about 300 Bytes from the pktgen CLI. set 0 size 300 set 0 rate 0.008 set 0 burst 1 start 0 After bringing the interface down, and up again, we seem to arrive in a state in which the tx rate is inconsistent, and does not recover. ifconfig eth1 down; sleep 7; ifconfig eth1 up [1] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> --- v2: * Added check for POLLERR * Used tx_ring_status_available() for checking TP_STATUS_AVAILABLE --- drivers/net/af_packet/rte_eth_af_packet.c | 26 ++++++++++++++++++++++++-- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 559f5a0..d3d3104 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -237,8 +237,30 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* point at the next incoming frame */ - if (!tx_ring_status_available(ppd->tp_status) && - poll(&pfd, 1, -1) < 0) + if (!tx_ring_status_available(ppd->tp_status)) { + if (poll(&pfd, 1, -1) < 0) + break; + if (pfd.revents & POLLERR) + break; + } + + /* + * Poll can return POLLERR if the interface is down + * + * It will almost always return POLLOUT, even if there + * are no extra buffers available + * + * This happens, because packet_poll() calls datagram_poll() + * which checks the space left in the socket buffer and, + * in the case of packet_mmap, the default socket buffer length + * doesn't match the requested size for the tx_ring. + * As such, there is almost always space left in socket buffer, + * which doesn't seem to be correlated to the requested size + * for the tx_ring in packet_mmap. + * + * This results in poll() returning POLLOUT. + */ + if (!tx_ring_status_available(ppd->tp_status)) break; /* copy the tx frame data */ -- 2.7.4 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH v3] net/af_packet: fix ignoring full ring on tx 2021-11-02 15:47 ` [dpdk-dev] [PATCH v3] " Tudor Cornea @ 2021-11-02 16:47 ` Ferruh Yigit 2021-11-03 9:31 ` [dpdk-dev] [PATCH v4] " Tudor Cornea 1 sibling, 0 replies; 14+ messages in thread From: Ferruh Yigit @ 2021-11-02 16:47 UTC (permalink / raw) To: Tudor Cornea; +Cc: linville, thomas, pogonarumihai, dev On 11/2/2021 3:47 PM, Tudor Cornea wrote: > The poll call can return POLLERR which is ignored, or it can return > POLLOUT, even if there are no free frames in the mmap-ed area. > > We can account for both of these cases by re-checking if the next > frame is empty before writing into it. > > We have attempted to reproduce this issue with pktgen-dpdk, using the > following configuration. > > pktgen -l 1-4 -n 4 --proc-type=primary --no-pci --no-telemetry \ > --no-huge -m 512 \ > --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192, \ > framecnt=2048,qpairs=1,qdisc_bypass=0 \ > -- \ > -P \ > -T \ > -m "3.0" \ > -f themes/black-yellow.theme > > We configure a low tx rate (~ 335 packets / second) and a small > packet size, of about 300 Bytes from the pktgen CLI. > > set 0 size 300 > set 0 rate 0.008 > set 0 burst 1 > start 0 > > After bringing the interface down, and up again, we seem to arrive > in a state in which the tx rate is inconsistent, and does not recover. > > ifconfig eth1 down; sleep 7; ifconfig eth1 up > > [1] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md > > Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> > Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> > > --- > v2: > * Added check for POLLERR > * Used tx_ring_status_available() for checking TP_STATUS_AVAILABLE > --- > drivers/net/af_packet/rte_eth_af_packet.c | 26 ++++++++++++++++++++++++-- > 1 file changed, 24 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c > index 559f5a0..d3d3104 100644 > --- a/drivers/net/af_packet/rte_eth_af_packet.c > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > @@ -237,8 +237,30 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) > } > > /* point at the next incoming frame */ > - if (!tx_ring_status_available(ppd->tp_status) && > - poll(&pfd, 1, -1) < 0) > + if (!tx_ring_status_available(ppd->tp_status)) { > + if (poll(&pfd, 1, -1) < 0) > + break; > + if (pfd.revents & POLLERR) > + break; > + } > + > + /* > + * Poll can return POLLERR if the interface is down > + * Above comment fits better to above check, or can remove it completely. Can you send a new version, or I can remove it while merging if you prefer? Except from above comment, Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com> > + * It will almost always return POLLOUT, even if there > + * are no extra buffers available > + * > + * This happens, because packet_poll() calls datagram_poll() > + * which checks the space left in the socket buffer and, > + * in the case of packet_mmap, the default socket buffer length > + * doesn't match the requested size for the tx_ring. > + * As such, there is almost always space left in socket buffer, > + * which doesn't seem to be correlated to the requested size > + * for the tx_ring in packet_mmap. > + * > + * This results in poll() returning POLLOUT. > + */ > + if (!tx_ring_status_available(ppd->tp_status)) > break; > > /* copy the tx frame data */ > ^ permalink raw reply [flat|nested] 14+ messages in thread
* [dpdk-dev] [PATCH v4] net/af_packet: fix ignoring full ring on tx 2021-11-02 15:47 ` [dpdk-dev] [PATCH v3] " Tudor Cornea 2021-11-02 16:47 ` Ferruh Yigit @ 2021-11-03 9:31 ` Tudor Cornea 2021-11-04 12:07 ` Ferruh Yigit 1 sibling, 1 reply; 14+ messages in thread From: Tudor Cornea @ 2021-11-03 9:31 UTC (permalink / raw) To: ferruh.yigit; +Cc: linville, thomas, pogonarumihai, dev, Tudor Cornea The poll call can return POLLERR which is ignored, or it can return POLLOUT, even if there are no free frames in the mmap-ed area. We can account for both of these cases by re-checking if the next frame is empty before writing into it. We have attempted to reproduce this issue with pktgen-dpdk, using the following configuration. pktgen -l 1-4 -n 4 --proc-type=primary --no-pci --no-telemetry \ --no-huge -m 512 \ --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192, \ framecnt=2048,qpairs=1,qdisc_bypass=0 \ -- \ -P \ -T \ -m "3.0" \ -f themes/black-yellow.theme We configure a low tx rate (~ 335 packets / second) and a small packet size, of about 300 bytes from the pktgen CLI. set 0 size 300 set 0 rate 0.008 set 0 burst 1 start 0 After bringing the interface down, and up again, we seem to arrive in a state in which the tx rate is inconsistent, and does not recover. ifconfig eth1 down; sleep 7; ifconfig eth1 up [1] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com> --- v4: * Updated comment regarding POLLERR Removed some extra details, and left notice about poll() returning POLLERR, in case of interface link down. v3: * Added check for POLLERR * Used tx_ring_status_available() for checking TP_STATUS_AVAILABLE --- drivers/net/af_packet/rte_eth_af_packet.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 559f5a0..dd6d165 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -237,8 +237,16 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* point at the next incoming frame */ - if (!tx_ring_status_available(ppd->tp_status) && - poll(&pfd, 1, -1) < 0) + if (!tx_ring_status_available(ppd->tp_status)) { + if (poll(&pfd, 1, -1) < 0) + break; + + /* poll() can return POLLERR if the interface is down */ + if (pfd.revents & POLLERR) + break; + } + + if (!tx_ring_status_available(ppd->tp_status)) break; /* copy the tx frame data */ -- 2.7.4 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH v4] net/af_packet: fix ignoring full ring on tx 2021-11-03 9:31 ` [dpdk-dev] [PATCH v4] " Tudor Cornea @ 2021-11-04 12:07 ` Ferruh Yigit 0 siblings, 0 replies; 14+ messages in thread From: Ferruh Yigit @ 2021-11-04 12:07 UTC (permalink / raw) To: Tudor Cornea; +Cc: linville, thomas, pogonarumihai, dev On 11/3/2021 9:31 AM, Tudor Cornea wrote: > The poll call can return POLLERR which is ignored, or it can return > POLLOUT, even if there are no free frames in the mmap-ed area. > > We can account for both of these cases by re-checking if the next > frame is empty before writing into it. > > We have attempted to reproduce this issue with pktgen-dpdk, using the > following configuration. > > pktgen -l 1-4 -n 4 --proc-type=primary --no-pci --no-telemetry \ > --no-huge -m 512 \ > --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192, \ > framecnt=2048,qpairs=1,qdisc_bypass=0 \ > -- \ > -P \ > -T \ > -m "3.0" \ > -f themes/black-yellow.theme > > We configure a low tx rate (~ 335 packets / second) and a small > packet size, of about 300 bytes from the pktgen CLI. > > set 0 size 300 > set 0 rate 0.008 > set 0 burst 1 > start 0 > > After bringing the interface down, and up again, we seem to arrive > in a state in which the tx rate is inconsistent, and does not recover. > > ifconfig eth1 down; sleep 7; ifconfig eth1 up > > [1] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md > > Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com> > Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com> > Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com> Used comment from v3, only POLLERR related part kept as in this patch. Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices") Cc: stable@dpdk.org Applied to dpdk-next-net/main, thanks. ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2021-11-04 12:07 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-08-20 13:39 [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx Tudor Cornea 2021-09-01 16:34 ` Ferruh Yigit 2021-09-06 10:23 ` Tudor Cornea 2021-09-20 17:11 ` Ferruh Yigit 2021-09-13 13:45 ` [dpdk-dev] [PATCH v2] " Tudor Cornea 2021-09-20 17:44 ` Ferruh Yigit 2021-09-29 10:03 ` Tudor Cornea 2021-10-05 15:11 ` Tudor Cornea 2021-10-26 14:30 ` Ferruh Yigit 2021-11-02 15:24 ` Tudor Cornea 2021-11-02 15:47 ` [dpdk-dev] [PATCH v3] " Tudor Cornea 2021-11-02 16:47 ` Ferruh Yigit 2021-11-03 9:31 ` [dpdk-dev] [PATCH v4] " Tudor Cornea 2021-11-04 12:07 ` Ferruh Yigit
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).