DPDK usage discussions
* [dpdk-users] DPDK TX problems
@ 2019-03-20  9:22 Hrvoje Habjanic
  2019-03-29  7:24 ` Hrvoje Habjanić
  0 siblings, 1 reply; 6+ messages in thread
From: Hrvoje Habjanic @ 2019-03-20  9:22 UTC (permalink / raw)
  To: users


Hi.

I wrote an application using DPDK 17.11 (I also tried 18.11), and when
doing some performance testing I'm seeing very odd behavior. To verify
that this is not caused by my app, I ran the same test with the l2fwd
example app, and I'm still confused by the results.

In short, I'm trying to push a lot of L2 packets through the DPDK
engine - packet processing is minimal. When testing, I start with a
small number of packets per second and then gradually increase it to
see where the limit is. At some point I do reach this limit - packets
start to get dropped. And this is when things get weird.
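
To give an idea of how little the app does per packet, the forwarding
path is essentially the same kind of rx/tx loop as l2fwd. A rough
sketch (not the exact code; port and queue ids are placeholders):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Rough sketch of the forwarding loop; port/queue ids are placeholders. */
static void
forward_loop(uint16_t rx_port, uint16_t tx_port)
{
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
                /* Pull a burst of packets from RX queue 0. */
                uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs,
                                                  BURST_SIZE);
                if (nb_rx == 0)
                        continue;

                /* Send them straight out - no per-packet processing. */
                uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);

                /* Free whatever the TX queue did not accept. */
                for (uint16_t i = nb_tx; i < nb_rx; i++)
                        rte_pktmbuf_free(bufs[i]);
        }
}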

When I reach the peak packet rate (at which packets start to get
dropped), I would expect that reducing the packet rate would make the
drops go away. But this is not the case. For example, let's assume the
peak packet rate is 3.5 Mpps. At this point everything works fine.
Increasing the rate to 4.0 Mpps causes a lot of dropped packets. When
reducing the rate back to 3.5 Mpps, the app is still broken - packets
are still dropped.

At this point I need to reduce the rate drastically (to 1.4 Mpps) to
make the dropped packets go away. Also, the app is unable to forward
anything beyond this 1.4M, despite the fact that in the beginning it
did forward 3.5M! The only way to recover is to restart the app.

Also, sometimes the app just stops forwarding any packets - packets
are received (as seen by the counters), but the app is unable to send
anything back.

As I mentioned, I'm seeing the same behavior with the l2fwd example
app. I tested both DPDK 17.11 and DPDK 18.11 - the results are the
same.

My test environment is an HP DL380 G8 with 82599ES 10GbE (ixgbe)
cards, connected to a Cisco Nexus 9300 switch. On the other side is an
Ixia test appliance. The application runs in a virtual machine (VM)
under KVM (OpenStack, with SR-IOV enabled and NUMA restrictions). I
checked that the VM uses only CPUs from the NUMA node to which the
network card is connected, so there is no cross-NUMA traffic.
OpenStack is Queens, the host runs Ubuntu Bionic, and the virtual
machine also runs Ubuntu Bionic.

I do not know how to debug this. Does anyone else have the same
observations?

Regards,

H.




* Re: [dpdk-users] DPDK TX problems
  2019-03-20  9:22 [dpdk-users] DPDK TX problems Hrvoje Habjanic
@ 2019-03-29  7:24 ` Hrvoje Habjanić
  2019-04-08  9:52   ` Hrvoje Habjanić
  0 siblings, 1 reply; 6+ messages in thread
From: Hrvoje Habjanić @ 2019-03-29  7:24 UTC (permalink / raw)
  To: users

> Hi.
>
> I wrote an application using DPDK 17.11 (I also tried 18.11), and when
> doing some performance testing I'm seeing very odd behavior. To verify
> that this is not caused by my app, I ran the same test with the l2fwd
> example app, and I'm still confused by the results.
>
> In short, I'm trying to push a lot of L2 packets through the DPDK
> engine - packet processing is minimal. When testing, I start with a
> small number of packets per second and then gradually increase it to
> see where the limit is. At some point I do reach this limit - packets
> start to get dropped. And this is when things get weird.
>
> When I reach the peak packet rate (at which packets start to get
> dropped), I would expect that reducing the packet rate would make the
> drops go away. But this is not the case. For example, let's assume the
> peak packet rate is 3.5 Mpps. At this point everything works fine.
> Increasing the rate to 4.0 Mpps causes a lot of dropped packets. When
> reducing the rate back to 3.5 Mpps, the app is still broken - packets
> are still dropped.
>
> At this point I need to reduce the rate drastically (to 1.4 Mpps) to
> make the dropped packets go away. Also, the app is unable to forward
> anything beyond this 1.4M, despite the fact that in the beginning it
> did forward 3.5M! The only way to recover is to restart the app.
>
> Also, sometimes the app just stops forwarding any packets - packets
> are received (as seen by the counters), but the app is unable to send
> anything back.
>
> As I mentioned, I'm seeing the same behavior with the l2fwd example
> app. I tested both DPDK 17.11 and DPDK 18.11 - the results are the
> same.
>
> My test environment is an HP DL380 G8 with 82599ES 10GbE (ixgbe)
> cards, connected to a Cisco Nexus 9300 switch. On the other side is an
> Ixia test appliance. The application runs in a virtual machine (VM)
> under KVM (OpenStack, with SR-IOV enabled and NUMA restrictions). I
> checked that the VM uses only CPUs from the NUMA node to which the
> network card is connected, so there is no cross-NUMA traffic.
> OpenStack is Queens, the host runs Ubuntu Bionic, and the virtual
> machine also runs Ubuntu Bionic.
>
> I do not know how to debug this. Does anyone else have the same
> observations?
>
> Regards,
>
> H.
There are some additional findings. It seems that when I reach the
peak pps rate, the application is not fast enough, and I can see rx
missed errors in the card statistics on the host. At the same time,
the TX side starts to show problems (tx burst reports that it did not
send all packets). Shortly after that, TX falls apart completely and
the top pps rate drops.

Since I did not disable pause frames, I can see that the "RX pause"
frame counter on the switch is increasing. On the other hand, if I
disable pause frames (on the NIC of the server), the host driver
(ixgbe) reports "TX unit hang" in dmesg and issues a card reset. Of
course, after the reset none of the DPDK apps in the VMs on this host
work anymore.

Is it possible that at the time of congestion DPDK does not release
mbufs back to the pool, and the TX ring becomes "filled" with zombie
packets (not sent by the card, but still holding a reference count as
if in use)?

Is there a way to check the mempool or the TX ring for "left-overs"?
Is it possible to somehow "flush" the TX ring and/or the mempool?
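
To be concrete, what I have in mind is something along these lines - a
rough sketch using the standard rte_mempool / rte_eth_stats calls (the
pool pointer and port id are placeholders); whether these counters are
enough to spot stuck mbufs is exactly what I'm asking:

#include <inttypes.h>
#include <stdio.h>

#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Rough sketch: dump mbuf pool usage and basic port counters. */
static void
dump_state(struct rte_mempool *pool, uint16_t port_id)
{
        struct rte_eth_stats stats;

        /* If "in use" stays high after traffic stops, mbufs are stuck. */
        printf("mbufs in use: %u, available: %u\n",
               rte_mempool_in_use_count(pool),
               rte_mempool_avail_count(pool));

        if (rte_eth_stats_get(port_id, &stats) == 0)
                printf("imissed: %" PRIu64 ", oerrors: %" PRIu64
                       ", rx_nombuf: %" PRIu64 "\n",
                       stats.imissed, stats.oerrors, stats.rx_nombuf);
}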

H.


* Re: [dpdk-users] DPDK TX problems
  2019-03-29  7:24 ` Hrvoje Habjanić
@ 2019-04-08  9:52   ` Hrvoje Habjanić
  2020-02-18  8:36     ` Hrvoje Habjanic
  0 siblings, 1 reply; 6+ messages in thread
From: Hrvoje Habjanić @ 2019-04-08  9:52 UTC (permalink / raw)
  To: users

On 29/03/2019 08:24, Hrvoje Habjanić wrote:
>> Hi.
>>
>> I wrote an application using DPDK 17.11 (I also tried 18.11), and when
>> doing some performance testing I'm seeing very odd behavior. To verify
>> that this is not caused by my app, I ran the same test with the l2fwd
>> example app, and I'm still confused by the results.
>>
>> In short, I'm trying to push a lot of L2 packets through the DPDK
>> engine - packet processing is minimal. When testing, I start with a
>> small number of packets per second and then gradually increase it to
>> see where the limit is. At some point I do reach this limit - packets
>> start to get dropped. And this is when things get weird.
>>
>> When I reach the peak packet rate (at which packets start to get
>> dropped), I would expect that reducing the packet rate would make the
>> drops go away. But this is not the case. For example, let's assume the
>> peak packet rate is 3.5 Mpps. At this point everything works fine.
>> Increasing the rate to 4.0 Mpps causes a lot of dropped packets. When
>> reducing the rate back to 3.5 Mpps, the app is still broken - packets
>> are still dropped.
>>
>> At this point I need to reduce the rate drastically (to 1.4 Mpps) to
>> make the dropped packets go away. Also, the app is unable to forward
>> anything beyond this 1.4M, despite the fact that in the beginning it
>> did forward 3.5M! The only way to recover is to restart the app.
>>
>> Also, sometimes the app just stops forwarding any packets - packets
>> are received (as seen by the counters), but the app is unable to send
>> anything back.
>>
>> As I mentioned, I'm seeing the same behavior with the l2fwd example
>> app. I tested both DPDK 17.11 and DPDK 18.11 - the results are the
>> same.
>>
>> My test environment is an HP DL380 G8 with 82599ES 10GbE (ixgbe)
>> cards, connected to a Cisco Nexus 9300 switch. On the other side is an
>> Ixia test appliance. The application runs in a virtual machine (VM)
>> under KVM (OpenStack, with SR-IOV enabled and NUMA restrictions). I
>> checked that the VM uses only CPUs from the NUMA node to which the
>> network card is connected, so there is no cross-NUMA traffic.
>> OpenStack is Queens, the host runs Ubuntu Bionic, and the virtual
>> machine also runs Ubuntu Bionic.
>>
>> I do not know how to debug this. Does anyone else have the same
>> observations?
>>
>> Regards,
>>
>> H.
> There are some additional findings. It seems that when I reach the
> peak pps rate, the application is not fast enough, and I can see rx
> missed errors in the card statistics on the host. At the same time,
> the TX side starts to show problems (tx burst reports that it did not
> send all packets). Shortly after that, TX falls apart completely and
> the top pps rate drops.
>
> Since I did not disable pause frames, I can see that the "RX pause"
> frame counter on the switch is increasing. On the other hand, if I
> disable pause frames (on the NIC of the server), the host driver
> (ixgbe) reports "TX unit hang" in dmesg and issues a card reset. Of
> course, after the reset none of the DPDK apps in the VMs on this host
> work anymore.
>
> Is it possible that at the time of congestion DPDK does not release
> mbufs back to the pool, and the TX ring becomes "filled" with zombie
> packets (not sent by the card, but still holding a reference count as
> if in use)?
>
> Is there a way to check the mempool or the TX ring for "left-overs"?
> Is it possible to somehow "flush" the TX ring and/or the mempool?
>
> H.

After a few more tests, things become even weirder - if I do not free
the mbufs which were not sent, but resend them instead, I can
"survive" the over-the-peak event! But then the peak rate starts to
drop gradually ...
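
To be clear about the two variants I'm comparing, here is a rough
sketch (not my actual code; the port and queue ids are placeholders):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Variant A: free whatever the TX queue did not accept. */
static void
tx_drop_unsent(uint16_t port, struct rte_mbuf **bufs, uint16_t nb_rx)
{
        uint16_t nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_rx);

        for (uint16_t i = nb_tx; i < nb_rx; i++)
                rte_pktmbuf_free(bufs[i]);
}

/* Variant B: keep retrying the unsent mbufs instead of freeing them.
 * Note that this spins forever if the TX queue is permanently stuck. */
static void
tx_retry_unsent(uint16_t port, struct rte_mbuf **bufs, uint16_t nb_rx)
{
        uint16_t sent = 0;

        while (sent < nb_rx)
                sent += rte_eth_tx_burst(port, 0, bufs + sent,
                                         nb_rx - sent);
}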

Could someone try this on their platform and report back? I would
really like to know whether this is a problem with my deployment, or
whether there is something wrong with DPDK.

The test should be simple - use l2fwd or l3fwd and determine the max
pps. Then drive the pps 30% over max, then go back down and confirm
that you can still get the max pps.

Thanks in advance.

H.



* Re: [dpdk-users] DPDK TX problems
  2019-04-08  9:52   ` Hrvoje Habjanić
@ 2020-02-18  8:36     ` Hrvoje Habjanic
  2020-03-26 20:54       ` Thomas Monjalon
  0 siblings, 1 reply; 6+ messages in thread
From: Hrvoje Habjanic @ 2020-02-18  8:36 UTC (permalink / raw)
  To: users

On 08. 04. 2019. 11:52, Hrvoje Habjanić wrote:
> On 29/03/2019 08:24, Hrvoje Habjanić wrote:
>>> Hi.
>>>
>>> I wrote an application using DPDK 17.11 (I also tried 18.11), and when
>>> doing some performance testing I'm seeing very odd behavior. To verify
>>> that this is not caused by my app, I ran the same test with the l2fwd
>>> example app, and I'm still confused by the results.
>>>
>>> In short, I'm trying to push a lot of L2 packets through the DPDK
>>> engine - packet processing is minimal. When testing, I start with a
>>> small number of packets per second and then gradually increase it to
>>> see where the limit is. At some point I do reach this limit - packets
>>> start to get dropped. And this is when things get weird.
>>>
>>> When I reach the peak packet rate (at which packets start to get
>>> dropped), I would expect that reducing the packet rate would make the
>>> drops go away. But this is not the case. For example, let's assume the
>>> peak packet rate is 3.5 Mpps. At this point everything works fine.
>>> Increasing the rate to 4.0 Mpps causes a lot of dropped packets. When
>>> reducing the rate back to 3.5 Mpps, the app is still broken - packets
>>> are still dropped.
>>>
>>> At this point I need to reduce the rate drastically (to 1.4 Mpps) to
>>> make the dropped packets go away. Also, the app is unable to forward
>>> anything beyond this 1.4M, despite the fact that in the beginning it
>>> did forward 3.5M! The only way to recover is to restart the app.
>>>
>>> Also, sometimes the app just stops forwarding any packets - packets
>>> are received (as seen by the counters), but the app is unable to send
>>> anything back.
>>>
>>> As I mentioned, I'm seeing the same behavior with the l2fwd example
>>> app. I tested both DPDK 17.11 and DPDK 18.11 - the results are the
>>> same.
>>>
>>> My test environment is an HP DL380 G8 with 82599ES 10GbE (ixgbe)
>>> cards, connected to a Cisco Nexus 9300 switch. On the other side is an
>>> Ixia test appliance. The application runs in a virtual machine (VM)
>>> under KVM (OpenStack, with SR-IOV enabled and NUMA restrictions). I
>>> checked that the VM uses only CPUs from the NUMA node to which the
>>> network card is connected, so there is no cross-NUMA traffic.
>>> OpenStack is Queens, the host runs Ubuntu Bionic, and the virtual
>>> machine also runs Ubuntu Bionic.
>>>
>>> I do not know how to debug this. Does anyone else have the same
>>> observations?
>>>
>>> Regards,
>>>
>>> H.
>> There are some additional findings. It seems that when I reach the
>> peak pps rate, the application is not fast enough, and I can see rx
>> missed errors in the card statistics on the host. At the same time,
>> the TX side starts to show problems (tx burst reports that it did not
>> send all packets). Shortly after that, TX falls apart completely and
>> the top pps rate drops.
>>
>> Since I did not disable pause frames, I can see that the "RX pause"
>> frame counter on the switch is increasing. On the other hand, if I
>> disable pause frames (on the NIC of the server), the host driver
>> (ixgbe) reports "TX unit hang" in dmesg and issues a card reset. Of
>> course, after the reset none of the DPDK apps in the VMs on this host
>> work anymore.
>>
>> Is it possible that at the time of congestion DPDK does not release
>> mbufs back to the pool, and the TX ring becomes "filled" with zombie
>> packets (not sent by the card, but still holding a reference count as
>> if in use)?
>>
>> Is there a way to check the mempool or the TX ring for "left-overs"?
>> Is it possible to somehow "flush" the TX ring and/or the mempool?
>>
>> H.
> After a few more tests, things become even weirder - if I do not free
> the mbufs which were not sent, but resend them instead, I can
> "survive" the over-the-peak event! But then the peak rate starts to
> drop gradually ...
>
> Could someone try this on their platform and report back? I would
> really like to know whether this is a problem with my deployment, or
> whether there is something wrong with DPDK.
>
> The test should be simple - use l2fwd or l3fwd and determine the max
> pps. Then drive the pps 30% over max, then go back down and confirm
> that you can still get the max pps.
>
> Thanks in advance.
>
> H.
>

I received a few mails from users facing this issue, asking how it
was resolved.

Unfortunately, there is no real fix. It seems that this issue is
related to the card and hardware used. I'm still not sure which is
more to blame, but the combination I had is definitely problematic.

Anyhow, in the end I concluded that the card driver has some issues
when it is saturated with packets. My suspicion is that the
driver/software does not properly free packets, the DPDK mempool then
becomes fragmented, and this causes the performance drops. Restarting
the software releases the pools and restores proper functionality.
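
For anyone who wants to check for the same thing on their setup, one
way would be to sample the pool usage periodically under a constant
offered load - a rough sketch (the pool pointer is a placeholder). If
the in-use count keeps climbing while the load stays the same, mbufs
are not being returned to the pool:

#include <stdio.h>
#include <unistd.h>

#include <rte_mempool.h>

/* Rough sketch: log pool usage once per second so a slow leak shows up. */
static void
watch_pool(struct rte_mempool *pool)
{
        for (;;) {
                printf("mbufs in use: %u of %u\n",
                       rte_mempool_in_use_count(pool),
                       (unsigned int)pool->size);
                sleep(1);
        }
}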

After no luck with ixgbe, we migrated to Mellanox (4LX), and now there
is no more of this permanent performance drop. With mlx, when the
limit is reached, reducing the number of packets restores packet
forwarding, and this limit seems to be stable.

Also, we moved to newer servers - DL380 G10 - and got a significant
performance increase. We also moved to a newer switch (also Cisco)
with 25G ports, which reduced latency - almost by a factor of 2!

I did not try the old ixgbe cards in the newer servers, but I did try
Intel's XL710, and it is not as happy as the Mellanox. It gives better
pps, but it is more unstable in terms of maximum bandwidth (it has
similar issues as ixgbe).

Regards,

H.




* Re: [dpdk-users] DPDK TX problems
  2020-02-18  8:36     ` Hrvoje Habjanic
@ 2020-03-26 20:54       ` Thomas Monjalon
  2020-03-27 18:25         ` [dpdk-users] [dpdk-ci] " Lincoln Lavoie
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Monjalon @ 2020-03-26 20:54 UTC (permalink / raw)
  To: Hrvoje Habjanic; +Cc: users, galco, asafp, olgas, ci

Thanks for the interesting feedback.
It seems we should test this performance use case in our labs.


18/02/2020 09:36, Hrvoje Habjanic:
> On 08. 04. 2019. 11:52, Hrvoje Habjanić wrote:
> > On 29/03/2019 08:24, Hrvoje Habjanić wrote:
> >>> Hi.
> >>>
> >>> I wrote an application using DPDK 17.11 (I also tried 18.11), and when
> >>> doing some performance testing I'm seeing very odd behavior. To verify
> >>> that this is not caused by my app, I ran the same test with the l2fwd
> >>> example app, and I'm still confused by the results.
> >>>
> >>> In short, I'm trying to push a lot of L2 packets through the DPDK
> >>> engine - packet processing is minimal. When testing, I start with a
> >>> small number of packets per second and then gradually increase it to
> >>> see where the limit is. At some point I do reach this limit - packets
> >>> start to get dropped. And this is when things get weird.
> >>>
> >>> When I reach the peak packet rate (at which packets start to get
> >>> dropped), I would expect that reducing the packet rate would make the
> >>> drops go away. But this is not the case. For example, let's assume the
> >>> peak packet rate is 3.5 Mpps. At this point everything works fine.
> >>> Increasing the rate to 4.0 Mpps causes a lot of dropped packets. When
> >>> reducing the rate back to 3.5 Mpps, the app is still broken - packets
> >>> are still dropped.
> >>>
> >>> At this point I need to reduce the rate drastically (to 1.4 Mpps) to
> >>> make the dropped packets go away. Also, the app is unable to forward
> >>> anything beyond this 1.4M, despite the fact that in the beginning it
> >>> did forward 3.5M! The only way to recover is to restart the app.
> >>>
> >>> Also, sometimes the app just stops forwarding any packets - packets
> >>> are received (as seen by the counters), but the app is unable to send
> >>> anything back.
> >>>
> >>> As I mentioned, I'm seeing the same behavior with the l2fwd example
> >>> app. I tested both DPDK 17.11 and DPDK 18.11 - the results are the
> >>> same.
> >>>
> >>> My test environment is an HP DL380 G8 with 82599ES 10GbE (ixgbe)
> >>> cards, connected to a Cisco Nexus 9300 switch. On the other side is an
> >>> Ixia test appliance. The application runs in a virtual machine (VM)
> >>> under KVM (OpenStack, with SR-IOV enabled and NUMA restrictions). I
> >>> checked that the VM uses only CPUs from the NUMA node to which the
> >>> network card is connected, so there is no cross-NUMA traffic.
> >>> OpenStack is Queens, the host runs Ubuntu Bionic, and the virtual
> >>> machine also runs Ubuntu Bionic.
> >>>
> >>> I do not know how to debug this. Does anyone else have the same
> >>> observations?
> >>>
> >>> Regards,
> >>>
> >>> H.
> >> There are some additional findings. It seems that when I reach the
> >> peak pps rate, the application is not fast enough, and I can see rx
> >> missed errors in the card statistics on the host. At the same time,
> >> the TX side starts to show problems (tx burst reports that it did not
> >> send all packets). Shortly after that, TX falls apart completely and
> >> the top pps rate drops.
> >>
> >> Since I did not disable pause frames, I can see that the "RX pause"
> >> frame counter on the switch is increasing. On the other hand, if I
> >> disable pause frames (on the NIC of the server), the host driver
> >> (ixgbe) reports "TX unit hang" in dmesg and issues a card reset. Of
> >> course, after the reset none of the DPDK apps in the VMs on this host
> >> work anymore.
> >>
> >> Is it possible that at the time of congestion DPDK does not release
> >> mbufs back to the pool, and the TX ring becomes "filled" with zombie
> >> packets (not sent by the card, but still holding a reference count as
> >> if in use)?
> >>
> >> Is there a way to check the mempool or the TX ring for "left-overs"?
> >> Is it possible to somehow "flush" the TX ring and/or the mempool?
> >>
> >> H.
> > After a few more tests, things become even weirder - if I do not free
> > the mbufs which were not sent, but resend them instead, I can
> > "survive" the over-the-peak event! But then the peak rate starts to
> > drop gradually ...
> >
> > Could someone try this on their platform and report back? I would
> > really like to know whether this is a problem with my deployment, or
> > whether there is something wrong with DPDK.
> >
> > The test should be simple - use l2fwd or l3fwd and determine the max
> > pps. Then drive the pps 30% over max, then go back down and confirm
> > that you can still get the max pps.
> >
> > Thanks in advance.
> >
> > H.
> >
>
> I received a few mails from users facing this issue, asking how it
> was resolved.
>
> Unfortunately, there is no real fix. It seems that this issue is
> related to the card and hardware used. I'm still not sure which is
> more to blame, but the combination I had is definitely problematic.
>
> Anyhow, in the end I concluded that the card driver has some issues
> when it is saturated with packets. My suspicion is that the
> driver/software does not properly free packets, the DPDK mempool then
> becomes fragmented, and this causes the performance drops. Restarting
> the software releases the pools and restores proper functionality.
>
> After no luck with ixgbe, we migrated to Mellanox (4LX), and now there
> is no more of this permanent performance drop. With mlx, when the
> limit is reached, reducing the number of packets restores packet
> forwarding, and this limit seems to be stable.
>
> Also, we moved to newer servers - DL380 G10 - and got a significant
> performance increase. We also moved to a newer switch (also Cisco)
> with 25G ports, which reduced latency - almost by a factor of 2!
>
> I did not try the old ixgbe cards in the newer servers, but I did try
> Intel's XL710, and it is not as happy as the Mellanox. It gives better
> pps, but it is more unstable in terms of maximum bandwidth (it has
> similar issues as ixgbe).
>
> Regards,
>
> H.






* Re: [dpdk-users] [dpdk-ci]  DPDK TX problems
  2020-03-26 20:54       ` Thomas Monjalon
@ 2020-03-27 18:25         ` Lincoln Lavoie
  0 siblings, 0 replies; 6+ messages in thread
From: Lincoln Lavoie @ 2020-03-27 18:25 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Hrvoje Habjanic, users, galco, asafp, olgas, ci

Hi Thomas,

I've captured this as https://bugs.dpdk.org/show_bug.cgi?id=429, so we can
add this to the list of development items for the testing, etc.

Cheers,
Lincoln

On Thu, Mar 26, 2020 at 4:54 PM Thomas Monjalon <thomas@monjalon.net> wrote:

> Thanks for the interesting feedback.
> It seems we should test this performance use case in our labs.
>
>
> 18/02/2020 09:36, Hrvoje Habjanic:
> > On 08. 04. 2019. 11:52, Hrvoje Habjanić wrote:
> > > On 29/03/2019 08:24, Hrvoje Habjanić wrote:
> > >>> Hi.
> > >>>
> > >>> I wrote an application using DPDK 17.11 (I also tried 18.11), and when
> > >>> doing some performance testing I'm seeing very odd behavior. To verify
> > >>> that this is not caused by my app, I ran the same test with the l2fwd
> > >>> example app, and I'm still confused by the results.
> > >>>
> > >>> In short, I'm trying to push a lot of L2 packets through the DPDK
> > >>> engine - packet processing is minimal. When testing, I start with a
> > >>> small number of packets per second and then gradually increase it to
> > >>> see where the limit is. At some point I do reach this limit - packets
> > >>> start to get dropped. And this is when things get weird.
> > >>>
> > >>> When I reach the peak packet rate (at which packets start to get
> > >>> dropped), I would expect that reducing the packet rate would make the
> > >>> drops go away. But this is not the case. For example, let's assume the
> > >>> peak packet rate is 3.5 Mpps. At this point everything works fine.
> > >>> Increasing the rate to 4.0 Mpps causes a lot of dropped packets. When
> > >>> reducing the rate back to 3.5 Mpps, the app is still broken - packets
> > >>> are still dropped.
> > >>>
> > >>> At this point I need to reduce the rate drastically (to 1.4 Mpps) to
> > >>> make the dropped packets go away. Also, the app is unable to forward
> > >>> anything beyond this 1.4M, despite the fact that in the beginning it
> > >>> did forward 3.5M! The only way to recover is to restart the app.
> > >>>
> > >>> Also, sometimes the app just stops forwarding any packets - packets
> > >>> are received (as seen by the counters), but the app is unable to send
> > >>> anything back.
> > >>>
> > >>> As I mentioned, I'm seeing the same behavior with the l2fwd example
> > >>> app. I tested both DPDK 17.11 and DPDK 18.11 - the results are the
> > >>> same.
> > >>>
> > >>> My test environment is an HP DL380 G8 with 82599ES 10GbE (ixgbe)
> > >>> cards, connected to a Cisco Nexus 9300 switch. On the other side is an
> > >>> Ixia test appliance. The application runs in a virtual machine (VM)
> > >>> under KVM (OpenStack, with SR-IOV enabled and NUMA restrictions). I
> > >>> checked that the VM uses only CPUs from the NUMA node to which the
> > >>> network card is connected, so there is no cross-NUMA traffic.
> > >>> OpenStack is Queens, the host runs Ubuntu Bionic, and the virtual
> > >>> machine also runs Ubuntu Bionic.
> > >>>
> > >>> I do not know how to debug this. Does anyone else have the same
> > >>> observations?
> > >>>
> > >>> Regards,
> > >>>
> > >>> H.
> > >> There are some additional findings. It seems that when I reach the
> > >> peak pps rate, the application is not fast enough, and I can see rx
> > >> missed errors in the card statistics on the host. At the same time,
> > >> the TX side starts to show problems (tx burst reports that it did not
> > >> send all packets). Shortly after that, TX falls apart completely and
> > >> the top pps rate drops.
> > >>
> > >> Since I did not disable pause frames, I can see that the "RX pause"
> > >> frame counter on the switch is increasing. On the other hand, if I
> > >> disable pause frames (on the NIC of the server), the host driver
> > >> (ixgbe) reports "TX unit hang" in dmesg and issues a card reset. Of
> > >> course, after the reset none of the DPDK apps in the VMs on this host
> > >> work anymore.
> > >>
> > >> Is it possible that at the time of congestion DPDK does not release
> > >> mbufs back to the pool, and the TX ring becomes "filled" with zombie
> > >> packets (not sent by the card, but still holding a reference count as
> > >> if in use)?
> > >>
> > >> Is there a way to check the mempool or the TX ring for "left-overs"?
> > >> Is it possible to somehow "flush" the TX ring and/or the mempool?
> > >>
> > >> H.
> > > After a few more tests, things become even weirder - if I do not free
> > > the mbufs which were not sent, but resend them instead, I can
> > > "survive" the over-the-peak event! But then the peak rate starts to
> > > drop gradually ...
> > >
> > > Could someone try this on their platform and report back? I would
> > > really like to know whether this is a problem with my deployment, or
> > > whether there is something wrong with DPDK.
> > >
> > > The test should be simple - use l2fwd or l3fwd and determine the max
> > > pps. Then drive the pps 30% over max, then go back down and confirm
> > > that you can still get the max pps.
> > >
> > > Thanks in advance.
> > >
> > > H.
> > >
> >
> > I received a few mails from users facing this issue, asking how it
> > was resolved.
> >
> > Unfortunately, there is no real fix. It seems that this issue is
> > related to the card and hardware used. I'm still not sure which is
> > more to blame, but the combination I had is definitely problematic.
> >
> > Anyhow, in the end I concluded that the card driver has some issues
> > when it is saturated with packets. My suspicion is that the
> > driver/software does not properly free packets, the DPDK mempool then
> > becomes fragmented, and this causes the performance drops. Restarting
> > the software releases the pools and restores proper functionality.
> >
> > After no luck with ixgbe, we migrated to Mellanox (4LX), and now there
> > is no more of this permanent performance drop. With mlx, when the
> > limit is reached, reducing the number of packets restores packet
> > forwarding, and this limit seems to be stable.
> >
> > Also, we moved to newer servers - DL380 G10 - and got a significant
> > performance increase. We also moved to a newer switch (also Cisco)
> > with 25G ports, which reduced latency - almost by a factor of 2!
> >
> > I did not try the old ixgbe cards in the newer servers, but I did try
> > Intel's XL710, and it is not as happy as the Mellanox. It gives better
> > pps, but it is more unstable in terms of maximum bandwidth (it has
> > similar issues as ixgbe).
> >
> > Regards,
> >
> > H.
>
>
>
>
>

-- 
*Lincoln Lavoie*
Senior Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
lylavoie@iol.unh.edu
https://www.iol.unh.edu
+1-603-674-2755 (m)


Thread overview: 6+ messages
2019-03-20  9:22 [dpdk-users] DPDK TX problems Hrvoje Habjanic
2019-03-29  7:24 ` Hrvoje Habjanić
2019-04-08  9:52   ` Hrvoje Habjanić
2020-02-18  8:36     ` Hrvoje Habjanic
2020-03-26 20:54       ` Thomas Monjalon
2020-03-27 18:25         ` [dpdk-users] [dpdk-ci] " Lincoln Lavoie
