DPDK usage discussions
 help / color / mirror / Atom feed
* [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
@ 2017-04-12 21:00 Shihabur Rahman Chowdhury
  2017-04-12 22:41 ` Wiles, Keith
  0 siblings, 1 reply; 12+ messages in thread
From: Shihabur Rahman Chowdhury @ 2017-04-12 21:00 UTC (permalink / raw)
  To: users

Hello,

We are running a simple DPDK application and observing quite low
throughput. We are currently testing a DPDK application with the following
setup

- 2 machines with 2xIntel Xeon E5-2620 CPUs
- Each machine with a Mellanox single port 10G ConnectX3 card
- Mellanox DPDK version 16.11
- Mellanox OFED 4.0-2.0.0.1 and latest firmware for ConnectX3

The application is doing almost nothing. It reads a batch of 64 packets
from a single rxq, swaps the MAC addresses of each packet and writes it
back to a single txq. Rx and Tx are handled by separate lcores on the
same NUMA socket. We are running pktgen on another machine. With 64B
packets we are seeing a ~14.8Mpps Tx rate and a ~7.3Mpps Rx rate in
pktgen. We checked the NIC on the machine running the DPDK application
(with ifconfig) and it looks like a large number of packets is being
dropped by the interface. Our ConnectX-3 card should theoretically be
able to handle 10Gbps Rx + 10Gbps Tx throughput (with channel width 4,
the theoretical max on PCIe 3.0 should be ~31.2Gbps). Interestingly,
when the Tx rate is reduced in pktgen (to ~9Mpps), the Rx rate increases
to ~9Mpps.
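
For reference, the per-packet work is essentially the loop below (a minimal
sketch, not our exact code; port/queue ids are illustrative, and in our setup
the rx and tx halves actually run on separate lcores connected by a ring,
which is omitted here):

    #include <rte_ethdev.h>
    #include <rte_ether.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 64

    /* Swap source and destination MAC addresses of one packet. */
    static inline void
    swap_mac(struct rte_mbuf *m)
    {
        struct ether_hdr *eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
        struct ether_addr tmp;

        ether_addr_copy(&eth->s_addr, &tmp);
        ether_addr_copy(&eth->d_addr, &eth->s_addr);
        ether_addr_copy(&tmp, &eth->d_addr);
    }

    static void
    fwd_loop(uint8_t port)
    {
        struct rte_mbuf *pkts[BURST_SIZE];
        uint16_t i, nb_rx, nb_tx;

        for (;;) {
            nb_rx = rte_eth_rx_burst(port, 0, pkts, BURST_SIZE);
            for (i = 0; i < nb_rx; i++)
                swap_mac(pkts[i]);
            nb_tx = rte_eth_tx_burst(port, 0, pkts, nb_rx);
            /* Free anything the NIC could not take. */
            for (i = nb_tx; i < nb_rx; i++)
                rte_pktmbuf_free(pkts[i]);
        }
    }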

We would highly appreciate some pointers as to what could possibly be
causing this mismatch between Rx and Tx. Ideally, we should be able to
see ~14Mpps Rx as well. Is it because we are using a single port? Or
something else?

FYI, we also ran the sample l2fwd application and testpmd and got
comparable results in the same setup.

Thanks
Shihab

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
  2017-04-12 21:00 [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK Shihabur Rahman Chowdhury
@ 2017-04-12 22:41 ` Wiles, Keith
  2017-04-13  0:06   ` Shihabur Rahman Chowdhury
  0 siblings, 1 reply; 12+ messages in thread
From: Wiles, Keith @ 2017-04-12 22:41 UTC (permalink / raw)
  To: Shihabur Rahman Chowdhury; +Cc: users


> On Apr 12, 2017, at 4:00 PM, Shihabur Rahman Chowdhury <shihab.buet@gmail.com> wrote:
> 
> Hello,
> 
> We are running a simple DPDK application and observing quite low
> throughput. We are currently testing a DPDK application with the following
> setup
> 
> - 2 machines with 2xIntel Xeon E5-2620 CPUs
> - Each machine with a Mellanox single port 10G ConnectX3 card
> - Mellanox DPDK version 16.11
> - Mellanox OFED 4.0-2.0.0.1 and latest firmware for ConnectX3
> 
> The application is doing almost nothing. It is reading a batch of 64
> packets from a single rxq, swapping the mac of each packet and writing it
> back to a single txq. The rx and tx is being handled by separate lcores on
> the same NUMA socket. We are running pktgen on another machine. With 64B
> sized packets we are seeing ~14.8Mpps Tx rate and ~7.3Mpps Rx rate in
> pktgen. We checked the NIC on the machine running the DPDK application
> (with ifconfig) and it looks like there is a large number of packets being
> dropped by the interface. Our connectx3 card should be theoretically be
> able to handle 10Gbps Rx + 10Gbps Tx throughput (with channel width 4, the
> theoretical max on PCIe 3.0 should be ~31.2Gbps). Interestingly, when Tx
> rate is reduced in pktgent (to ~9Mpps), the Rx rate increases to ~9Mpps.

Not sure what is going on here. When you drop the rate to 9Mpps, I assume you stop getting missed frames.
Do you have flow control enabled?

On the pktgen side are you seeing missed RX packets?
Did you loop the cable back to the other port on the pktgen machine, and did you get the same Rx/Tx performance in that configuration?
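
(For reference, pause/flow control can be checked and toggled from the kernel
side with "ethtool -a / -A <iface>", or from within DPDK if the PMD implements
the flow control ops -- a rough sketch, not verified against the mlx4 PMD:)

    #include <string.h>
    #include <rte_ethdev.h>

    struct rte_eth_fc_conf fc_conf;

    memset(&fc_conf, 0, sizeof(fc_conf));
    rte_eth_dev_flow_ctrl_get(port_id, &fc_conf);
    fc_conf.mode = RTE_FC_NONE;               /* disable rx and tx pause */
    rte_eth_dev_flow_ctrl_set(port_id, &fc_conf);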

> 
> We would highly appriciate if we could get some pointers as to what can be
> possibly causing this mismatch in Rx and Tx. Ideally, we should be able to
> see ~14Mpps Rx well. Is it because we are using a single port? Or something
> else?
> 
> FYI, we also ran the sample l2fwd application and test-pmd and got
> comparable results in the same setup.
> 
> Thanks
> Shihab

Regards,
Keith

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
  2017-04-12 22:41 ` Wiles, Keith
@ 2017-04-13  0:06   ` Shihabur Rahman Chowdhury
  2017-04-13  1:56     ` Dave Wallace
  2017-04-13 13:49     ` Wiles, Keith
  0 siblings, 2 replies; 12+ messages in thread
From: Shihabur Rahman Chowdhury @ 2017-04-13  0:06 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: users

We've disabled pause frames. That also disables flow control, I assume;
correct me if I am wrong.

On the pktgen side, the dropped and overrun fields for Rx keep
increasing in ifconfig. Btw, the overrun and dropped fields always have
exactly the same value.

Our ConnectX-3 NIC has just a single 10G port, so there is no way to
create such a loopback connection.
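
(For completeness, we can also read the counters from inside the DPDK
application rather than relying on ifconfig -- a sketch using the generic
stats API; needs <rte_ethdev.h>, <inttypes.h> and <stdio.h>:)

    struct rte_eth_stats stats;

    rte_eth_stats_get(port_id, &stats);
    printf("ipackets=%" PRIu64 " imissed=%" PRIu64
           " ierrors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
           stats.ipackets, stats.imissed, stats.ierrors, stats.rx_nombuf);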

Thanks

Shihabur Rahman Chowdhury
David R. Cheriton School of Computer Science
University of Waterloo



On Wed, Apr 12, 2017 at 6:41 PM, Wiles, Keith <keith.wiles@intel.com> wrote:

>
> > On Apr 12, 2017, at 4:00 PM, Shihabur Rahman Chowdhury <
> shihab.buet@gmail.com> wrote:
> >
> > Hello,
> >
> > We are running a simple DPDK application and observing quite low
> > throughput. We are currently testing a DPDK application with the
> following
> > setup
> >
> > - 2 machines with 2xIntel Xeon E5-2620 CPUs
> > - Each machine with a Mellanox single port 10G ConnectX3 card
> > - Mellanox DPDK version 16.11
> > - Mellanox OFED 4.0-2.0.0.1 and latest firmware for ConnectX3
> >
> > The application is doing almost nothing. It is reading a batch of 64
> > packets from a single rxq, swapping the mac of each packet and writing it
> > back to a single txq. The rx and tx is being handled by separate lcores
> on
> > the same NUMA socket. We are running pktgen on another machine. With 64B
> > sized packets we are seeing ~14.8Mpps Tx rate and ~7.3Mpps Rx rate in
> > pktgen. We checked the NIC on the machine running the DPDK application
> > (with ifconfig) and it looks like there is a large number of packets
> being
> > dropped by the interface. Our connectx3 card should be theoretically be
> > able to handle 10Gbps Rx + 10Gbps Tx throughput (with channel width 4,
> the
> > theoretical max on PCIe 3.0 should be ~31.2Gbps). Interestingly, when Tx
> > rate is reduced in pktgent (to ~9Mpps), the Rx rate increases to ~9Mpps.
>
> Not sure what is going on here, when you drop the rate to 9Mpps I assume
> you stop getting missed frames.
> Do you have flow control enabled?
>
> On the pktgen side are you seeing missed RX packets?
> Did you loopback the cable from pktgen machine to the other port on the
> pktgen machine and did you get the same Rx/Tx performance in that
> configuration?
>
> >
> > We would highly appriciate if we could get some pointers as to what can
> be
> > possibly causing this mismatch in Rx and Tx. Ideally, we should be able
> to
> > see ~14Mpps Rx well. Is it because we are using a single port? Or
> something
> > else?
> >
> > FYI, we also ran the sample l2fwd application and test-pmd and got
> > comparable results in the same setup.
> >
> > Thanks
> > Shihab
>
> Regards,
> Keith
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
  2017-04-13  0:06   ` Shihabur Rahman Chowdhury
@ 2017-04-13  1:56     ` Dave Wallace
  2017-04-13  1:57       ` Shihabur Rahman Chowdhury
  2017-04-13 13:49     ` Wiles, Keith
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Wallace @ 2017-04-13  1:56 UTC (permalink / raw)
  To: Shihabur Rahman Chowdhury, Wiles, Keith; +Cc: users

I have encountered a similar issue in the past on a system configuration 
where the PCI interface to the NIC was on the other NUMA node.
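
(One quick way to verify this from inside the application -- a sketch, with
port_id as a placeholder; needs <rte_ethdev.h> and <rte_lcore.h>:)

    /* Both calls should report the same NUMA socket. */
    printf("port socket=%d, lcore socket=%u\n",
           rte_eth_dev_socket_id(port_id),
           rte_lcore_to_socket_id(rte_lcore_id()));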

Something else to check...
-daw-


On 04/12/2017 08:06 PM, Shihabur Rahman Chowdhury wrote:
> We've disabled the pause frames. That also disables flow control I assume.
> Correct me if I am wrong.
>
> On the pktgen side, the dropped field and overrun field for Rx keeps
> increasing for ifconfig. Btw, both overrun and dropped fields have the
> exact same value always.
>
> Our ConnectX3 NIC has just a single 10G port. So there is no way to create
> such loopback connection.
>
> Thanks
>
> Shihabur Rahman Chowdhury
> David R. Cheriton School of Computer Science
> University of Waterloo
>
>
>
> On Wed, Apr 12, 2017 at 6:41 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>
>>> On Apr 12, 2017, at 4:00 PM, Shihabur Rahman Chowdhury <
>> shihab.buet@gmail.com> wrote:
>>> Hello,
>>>
>>> We are running a simple DPDK application and observing quite low
>>> throughput. We are currently testing a DPDK application with the
>> following
>>> setup
>>>
>>> - 2 machines with 2xIntel Xeon E5-2620 CPUs
>>> - Each machine with a Mellanox single port 10G ConnectX3 card
>>> - Mellanox DPDK version 16.11
>>> - Mellanox OFED 4.0-2.0.0.1 and latest firmware for ConnectX3
>>>
>>> The application is doing almost nothing. It is reading a batch of 64
>>> packets from a single rxq, swapping the mac of each packet and writing it
>>> back to a single txq. The rx and tx is being handled by separate lcores
>> on
>>> the same NUMA socket. We are running pktgen on another machine. With 64B
>>> sized packets we are seeing ~14.8Mpps Tx rate and ~7.3Mpps Rx rate in
>>> pktgen. We checked the NIC on the machine running the DPDK application
>>> (with ifconfig) and it looks like there is a large number of packets
>> being
>>> dropped by the interface. Our connectx3 card should be theoretically be
>>> able to handle 10Gbps Rx + 10Gbps Tx throughput (with channel width 4,
>> the
>>> theoretical max on PCIe 3.0 should be ~31.2Gbps). Interestingly, when Tx
>>> rate is reduced in pktgent (to ~9Mpps), the Rx rate increases to ~9Mpps.
>> Not sure what is going on here, when you drop the rate to 9Mpps I assume
>> you stop getting missed frames.
>> Do you have flow control enabled?
>>
>> On the pktgen side are you seeing missed RX packets?
>> Did you loopback the cable from pktgen machine to the other port on the
>> pktgen machine and did you get the same Rx/Tx performance in that
>> configuration?
>>
>>> We would highly appriciate if we could get some pointers as to what can
>> be
>>> possibly causing this mismatch in Rx and Tx. Ideally, we should be able
>> to
>>> see ~14Mpps Rx well. Is it because we are using a single port? Or
>> something
>>> else?
>>>
>>> FYI, we also ran the sample l2fwd application and test-pmd and got
>>> comparable results in the same setup.
>>>
>>> Thanks
>>> Shihab
>> Regards,
>> Keith
>>
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
  2017-04-13  1:56     ` Dave Wallace
@ 2017-04-13  1:57       ` Shihabur Rahman Chowdhury
  2017-04-13  5:19         ` Shahaf Shuler
  0 siblings, 1 reply; 12+ messages in thread
From: Shihabur Rahman Chowdhury @ 2017-04-13  1:57 UTC (permalink / raw)
  To: Dave Wallace; +Cc: Wiles, Keith, users

Hi Dave,

We've checked this and we are using the lcores on the NUMA node where the
PCI interface to the NIC is.

Shihabur Rahman Chowdhury
David R. Cheriton School of Computer Science
University of Waterloo



On Wed, Apr 12, 2017 at 9:56 PM, Dave Wallace <dwallacelf@gmail.com> wrote:

> I have encountered a similar issue in the past on a system configuration
> where the PCI interface to the NIC was on the other NUMA node.
>
> Something else to check...
> -daw-
>
>
>
> On 04/12/2017 08:06 PM, Shihabur Rahman Chowdhury wrote:
>
>> We've disabled the pause frames. That also disables flow control I assume.
>> Correct me if I am wrong.
>>
>> On the pktgen side, the dropped field and overrun field for Rx keeps
>> increasing for ifconfig. Btw, both overrun and dropped fields have the
>> exact same value always.
>>
>> Our ConnectX3 NIC has just a single 10G port. So there is no way to create
>> such loopback connection.
>>
>> Thanks
>>
>> Shihabur Rahman Chowdhury
>> David R. Cheriton School of Computer Science
>> University of Waterloo
>>
>>
>>
>> On Wed, Apr 12, 2017 at 6:41 PM, Wiles, Keith <keith.wiles@intel.com>
>> wrote:
>>
>> On Apr 12, 2017, at 4:00 PM, Shihabur Rahman Chowdhury <
>>>>
>>> shihab.buet@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are running a simple DPDK application and observing quite low
>>>> throughput. We are currently testing a DPDK application with the
>>>>
>>> following
>>>
>>>> setup
>>>>
>>>> - 2 machines with 2xIntel Xeon E5-2620 CPUs
>>>> - Each machine with a Mellanox single port 10G ConnectX3 card
>>>> - Mellanox DPDK version 16.11
>>>> - Mellanox OFED 4.0-2.0.0.1 and latest firmware for ConnectX3
>>>>
>>>> The application is doing almost nothing. It is reading a batch of 64
>>>> packets from a single rxq, swapping the mac of each packet and writing
>>>> it
>>>> back to a single txq. The rx and tx is being handled by separate lcores
>>>>
>>> on
>>>
>>>> the same NUMA socket. We are running pktgen on another machine. With 64B
>>>> sized packets we are seeing ~14.8Mpps Tx rate and ~7.3Mpps Rx rate in
>>>> pktgen. We checked the NIC on the machine running the DPDK application
>>>> (with ifconfig) and it looks like there is a large number of packets
>>>>
>>> being
>>>
>>>> dropped by the interface. Our connectx3 card should be theoretically be
>>>> able to handle 10Gbps Rx + 10Gbps Tx throughput (with channel width 4,
>>>>
>>> the
>>>
>>>> theoretical max on PCIe 3.0 should be ~31.2Gbps). Interestingly, when Tx
>>>> rate is reduced in pktgent (to ~9Mpps), the Rx rate increases to ~9Mpps.
>>>>
>>> Not sure what is going on here, when you drop the rate to 9Mpps I assume
>>> you stop getting missed frames.
>>> Do you have flow control enabled?
>>>
>>> On the pktgen side are you seeing missed RX packets?
>>> Did you loopback the cable from pktgen machine to the other port on the
>>> pktgen machine and did you get the same Rx/Tx performance in that
>>> configuration?
>>>
>>> We would highly appriciate if we could get some pointers as to what can
>>>>
>>> be
>>>
>>>> possibly causing this mismatch in Rx and Tx. Ideally, we should be able
>>>>
>>> to
>>>
>>>> see ~14Mpps Rx well. Is it because we are using a single port? Or
>>>>
>>> something
>>>
>>>> else?
>>>>
>>>> FYI, we also ran the sample l2fwd application and test-pmd and got
>>>> comparable results in the same setup.
>>>>
>>>> Thanks
>>>> Shihab
>>>>
>>> Regards,
>>> Keith
>>>
>>>
>>>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
  2017-04-13  1:57       ` Shihabur Rahman Chowdhury
@ 2017-04-13  5:19         ` Shahaf Shuler
  2017-04-13 14:21           ` Shihabur Rahman Chowdhury
  0 siblings, 1 reply; 12+ messages in thread
From: Shahaf Shuler @ 2017-04-13  5:19 UTC (permalink / raw)
  To: Shihabur Rahman Chowdhury, Dave Wallace, Olga Shern, Adrien Mazarguil
  Cc: Wiles, Keith, users

Thursday, April 13, 2017 4:58 AM, Shihabur Rahman Chowdhury:
[...]
> >>>
> >>>> setup
> >>>>
> >>>> - 2 machines with 2xIntel Xeon E5-2620 CPUs
> >>>> - Each machine with a Mellanox single port 10G ConnectX3 card
> >>>> - Mellanox DPDK version 16.11
> >>>> - Mellanox OFED 4.0-2.0.0.1 and latest firmware for ConnectX3
> >>>>
> >>>> The application is doing almost nothing. It is reading a batch of 64
> >>>> packets from a single rxq, swapping the mac of each packet and writing
> >>>> it
> >>>> back to a single txq. The rx and tx is being handled by separate lcores

Why did you choose such a configuration?
Such a configuration may cause high overhead in snoop cycles, as the first cache line of the packet
will first be on the Rx lcore and will then need to be invalidated when the Tx lcore swaps the MACs.

Since you are using 2 cores anyway, have you tried having each core do both Rx and Tx (run to completion)?
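
(A run-to-completion sketch, with port/queue ids as placeholders: each lcore
owns one rx queue and one tx queue and touches the packet only once.)

    /* Per-lcore loop; this lcore owns rx queue q and tx queue q of port 0. */
    struct rte_mbuf *pkts[64];

    for (;;) {
        uint16_t n = rte_eth_rx_burst(0, q, pkts, 64);
        uint16_t i, sent;

        /* ... swap MACs of pkts[0..n-1] in place ... */

        sent = rte_eth_tx_burst(0, q, pkts, n);
        for (i = sent; i < n; i++)
            rte_pktmbuf_free(pkts[i]);
    }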

> >>>>
> >>> on
> >>>
> >>>> the same NUMA socket. We are running pktgen on another machine.
> With 64B
> >>>> sized packets we are seeing ~14.8Mpps Tx rate and ~7.3Mpps Rx rate in
> >>>> pktgen. We checked the NIC on the machine running the DPDK
> application
> >>>> (with ifconfig) and it looks like there is a large number of packets
> >>>>
> >>> being
> >>>
> >>>> dropped by the interface. 

This might be because the scenario is SW bound: when the application doesn't process the packets fast enough, the NIC has to drop the ingress.

>>>>>Our connectx3 card should be theoretically
> be
> >>>> able to handle 10Gbps Rx + 10Gbps Tx throughput (with channel width
> 4,
> >>>>
> >>> the
> >>>
> >>>> theoretical max on PCIe 3.0 should be ~31.2Gbps). Interestingly, when
> Tx
> >>>> rate is reduced in pktgent (to ~9Mpps), the Rx rate increases to
> ~9Mpps.
> >>>>
> >>> Not sure what is going on here, when you drop the rate to 9Mpps I
> assume
> >>> you stop getting missed frames.
> >>> Do you have flow control enabled?
> >>>
> >>> On the pktgen side are you seeing missed RX packets?
> >>> Did you loopback the cable from pktgen machine to the other port on
> the
> >>> pktgen machine and did you get the same Rx/Tx performance in that
> >>> configuration?
> >>>
> >>> We would highly appriciate if we could get some pointers as to what can
> >>>>
> >>> be
> >>>
> >>>> possibly causing this mismatch in Rx and Tx. Ideally, we should be able
> >>>>
> >>> to
> >>>
> >>>> see ~14Mpps Rx well. Is it because we are using a single port? Or

Our "Hero number" for testpmd application which do i/o forwarding with ConnectX-3 is ~10Mpps for single core.
Dual core should reach ~14Mpps.
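
(For reference, a single-core testpmd i/o forwarding run looks roughly like
the following; the EAL core mask and the queue counts are only examples and
depend on the machine:)

    testpmd -c 0x5 -n 4 -- -i --rxq=1 --txq=1 --nb-cores=1 --burst=64
    testpmd> set fwd io
    testpmd> start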

> >>>>
> >>> something
> >>>
> >>>> else?
> >>>>
> >>>> FYI, we also ran the sample l2fwd application and test-pmd and got
> >>>> comparable results in the same setup.
> >>>>
> >>>> Thanks
> >>>> Shihab
> >>>>
> >>> Regards,
> >>> Keith
> >>>
> >>>
> >>>
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
  2017-04-13  0:06   ` Shihabur Rahman Chowdhury
  2017-04-13  1:56     ` Dave Wallace
@ 2017-04-13 13:49     ` Wiles, Keith
  2017-04-13 14:22       ` Shihabur Rahman Chowdhury
  1 sibling, 1 reply; 12+ messages in thread
From: Wiles, Keith @ 2017-04-13 13:49 UTC (permalink / raw)
  To: Shihabur Rahman Chowdhury; +Cc: users


> On Apr 12, 2017, at 7:06 PM, Shihabur Rahman Chowdhury <shihab.buet@gmail.com> wrote:
> 
> We've disabled the pause frames. That also disables flow control I assume. Correct me if I am wrong.
> 
> On the pktgen side, the dropped field and overrun field for Rx keeps increasing for ifconfig. Btw, both overrun and dropped fields have the exact same value always.

Are you using the Linux kernel pktgen or the DPDK Pktgen?

http://dpdk.org/browse/apps/pktgen-dpdk/


> 
> Our ConnectX3 NIC has just a single 10G port. So there is no way to create such loopback connection.
> 
> Thanks
> 
> Shihabur Rahman Chowdhury
> David R. Cheriton School of Computer Science
> University of Waterloo
> 
> 
> 
> On Wed, Apr 12, 2017 at 6:41 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
> 
> > On Apr 12, 2017, at 4:00 PM, Shihabur Rahman Chowdhury <shihab.buet@gmail.com> wrote:
> >
> > Hello,
> >
> > We are running a simple DPDK application and observing quite low
> > throughput. We are currently testing a DPDK application with the following
> > setup
> >
> > - 2 machines with 2xIntel Xeon E5-2620 CPUs
> > - Each machine with a Mellanox single port 10G ConnectX3 card
> > - Mellanox DPDK version 16.11
> > - Mellanox OFED 4.0-2.0.0.1 and latest firmware for ConnectX3
> >
> > The application is doing almost nothing. It is reading a batch of 64
> > packets from a single rxq, swapping the mac of each packet and writing it
> > back to a single txq. The rx and tx is being handled by separate lcores on
> > the same NUMA socket. We are running pktgen on another machine. With 64B
> > sized packets we are seeing ~14.8Mpps Tx rate and ~7.3Mpps Rx rate in
> > pktgen. We checked the NIC on the machine running the DPDK application
> > (with ifconfig) and it looks like there is a large number of packets being
> > dropped by the interface. Our connectx3 card should be theoretically be
> > able to handle 10Gbps Rx + 10Gbps Tx throughput (with channel width 4, the
> > theoretical max on PCIe 3.0 should be ~31.2Gbps). Interestingly, when Tx
> > rate is reduced in pktgent (to ~9Mpps), the Rx rate increases to ~9Mpps.
> 
> Not sure what is going on here, when you drop the rate to 9Mpps I assume you stop getting missed frames.
> Do you have flow control enabled?
> 
> On the pktgen side are you seeing missed RX packets?
> Did you loopback the cable from pktgen machine to the other port on the pktgen machine and did you get the same Rx/Tx performance in that configuration?
> 
> >
> > We would highly appriciate if we could get some pointers as to what can be
> > possibly causing this mismatch in Rx and Tx. Ideally, we should be able to
> > see ~14Mpps Rx well. Is it because we are using a single port? Or something
> > else?
> >
> > FYI, we also ran the sample l2fwd application and test-pmd and got
> > comparable results in the same setup.
> >
> > Thanks
> > Shihab
> 
> Regards,
> Keith
> 
> 

Regards,
Keith

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
  2017-04-13  5:19         ` Shahaf Shuler
@ 2017-04-13 14:21           ` Shihabur Rahman Chowdhury
  2017-04-13 15:49             ` Kyle Larose
  0 siblings, 1 reply; 12+ messages in thread
From: Shihabur Rahman Chowdhury @ 2017-04-13 14:21 UTC (permalink / raw)
  To: Shahaf Shuler
  Cc: Dave Wallace, Olga Shern, Adrien Mazarguil, Wiles, Keith, users

On Thu, Apr 13, 2017 at 1:19 AM, Shahaf Shuler <shahafs@mellanox.com> wrote:

> Why did you choose such configuration?
> Such configuration may cause high overhead in snoop cycles, as the first
> cache line of the packet
> Will first be on the Rx lcore and then it will need to be invalidated when
> the Tx lcore swaps the macs.
>
> Since you are using 2 cores anyway, have you tried that each core will do
> both Rx and Tx (run to completion)?
>

To give a bit more context, we are developing a set of packet processors
that can be deployed as separate processes and scaled out independently.
A batch of packets goes through a sequence of processes until at some
point it is written to the Tx queue or gets dropped because of some
processing decision. These packet processors run as secondary DPDK
processes, and Rx takes place in a primary process (since the Mellanox
PMD does not allow Rx from a secondary process). In this example
configuration, one primary process does the Rx and hands the packets
over to a secondary process through a shared ring, and that secondary
process swaps the MACs and writes the packets to the Tx queue. We expect
some performance drop because of the cache invalidation across lcores
(also, we cannot use the same lcore for different secondary processes
without risking mempool cache corruption), but 7.3Mpps is still 30+%
overhead.
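
(Roughly, the handoff between the primary and the secondary looks like the
sketch below; ring name and burst size are illustrative, and ring-full
handling is omitted:)

    /* Primary process: receive and hand off through a shared ring. */
    struct rte_mbuf *pkts[BURST_SIZE];
    uint16_t nb_rx = rte_eth_rx_burst(port, 0, pkts, BURST_SIZE);
    if (nb_rx > 0)
        rte_ring_enqueue_burst(ring, (void **)pkts, nb_rx);

    /* Secondary process: dequeue, swap MACs, transmit. */
    unsigned nb = rte_ring_dequeue_burst(ring, (void **)pkts, BURST_SIZE);
    unsigned i;
    for (i = 0; i < nb; i++)
        swap_mac(pkts[i]);                /* same MAC swap as in the first mail */
    uint16_t nb_tx = rte_eth_tx_burst(port, 0, pkts, nb);
    for (i = nb_tx; i < nb; i++)
        rte_pktmbuf_free(pkts[i]);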

As you suggested, we tried run-to-completion processing in the primary
process (i.e., Rx and Tx are now on the same lcore). We also configured
pktgen to handle Rx and Tx on the same lcore. With that we are now
getting ~9.9-10Mpps with 64B packets, and with our multi-process setup
that drops down to ~8.4Mpps. So it seems pktgen was not configured
properly before. This is a bit counter-intuitive, since on pktgen's side
doing Rx and Tx on different lcores should not cause any cache
invalidation (the sets of Rx and Tx packets are disjoint). So using
different lcores should theoretically be better for pktgen than handling
both Rx and Tx on the same lcore. Am I missing something here?
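
(By "Rx and Tx on the same lcore" in pktgen we mean the core mapping option,
roughly the difference between the two invocations below; lcore ids and EAL
options are only examples:)

    pktgen -c 0xe -n 4 -- -P -m "[2:3].0"    # rx on lcore 2, tx on lcore 3
    pktgen -c 0x6 -n 4 -- -P -m "2.0"        # rx and tx both on lcore 2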

Thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
  2017-04-13 13:49     ` Wiles, Keith
@ 2017-04-13 14:22       ` Shihabur Rahman Chowdhury
  2017-04-13 14:47         ` Wiles, Keith
  0 siblings, 1 reply; 12+ messages in thread
From: Shihabur Rahman Chowdhury @ 2017-04-13 14:22 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: users

On Thu, Apr 13, 2017 at 9:49 AM, Wiles, Keith <keith.wiles@intel.com> wrote:

> Are you using the Linux kernel pktgen or the DPDK Pktgen?
>
> http://dpdk.org/browse/apps/pktgen-dpdk/
>

Yes, we are using pktgen-dpdk, built with the latest Mellanox DPDK release.

--Shihab

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
  2017-04-13 14:22       ` Shihabur Rahman Chowdhury
@ 2017-04-13 14:47         ` Wiles, Keith
  0 siblings, 0 replies; 12+ messages in thread
From: Wiles, Keith @ 2017-04-13 14:47 UTC (permalink / raw)
  To: Shihabur Rahman Chowdhury; +Cc: users


> On Apr 13, 2017, at 9:22 AM, Shihabur Rahman Chowdhury <shihab.buet@gmail.com> wrote:
> 
> 
> On Thu, Apr 13, 2017 at 9:49 AM, Wiles, Keith <keith.wiles@intel.com> wrote:
> Are you using the Linux kernel pktgen or the DPDK Pktgen?
> 
> http://dpdk.org/browse/apps/pktgen-dpdk/
> 
> ​Yes, we are using dpdk-pktgen built with the latest Mellanox DPDK release.​

The reason I was asking is that you were using the ifconfig numbers for missed frames and not the numbers in Pktgen, at least that is what I read.

The reason I was asking about looping back the cable on the pktgen machine was to make sure the configuration on the pktgen machine was able to get to wire rate using two ports. If you do not have two ports in the pktgen machine, then my suggestion will not work :-)

Sorry, I do not have any other suggestions.

> 
> --Shihab
> 

Regards,
Keith


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
  2017-04-13 14:21           ` Shihabur Rahman Chowdhury
@ 2017-04-13 15:49             ` Kyle Larose
  2017-04-17 17:43               ` Shihabur Rahman Chowdhury
  0 siblings, 1 reply; 12+ messages in thread
From: Kyle Larose @ 2017-04-13 15:49 UTC (permalink / raw)
  To: Shihabur Rahman Chowdhury, Shahaf Shuler
  Cc: Dave Wallace, Olga Shern, Adrien Mazarguil, Wiles, Keith, users

Hey Shihab,


> -----Original Message-----
> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Shihabur Rahman
> Chowdhury
> Sent: Thursday, April 13, 2017 10:21 AM
> To: Shahaf Shuler
> Cc: Dave Wallace; Olga Shern; Adrien Mazarguil; Wiles, Keith; users@dpdk.org
> Subject: Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3
> card with DPDK
> 
>
> ​To give a bit more context, we are developing a set of packet processors
> that can be independently deployed as separate processes and can be scaled
> out independently as well. So a batch of packet goes through a sequence of
> processes until at some point they are written to the Tx queue or gets
> dropped because of some processing decision. These packet processors are
> running as secondary dpdk processes and the rx is being taking place at a
> primary process (since Mellanox PMD does not allow Rx from a secondary
> process). In this example configuration, one primary process is doing the
> Rx, handing over the packet to another secondary process through a shared
> ring and that secondary process is swapping the MAC and writing packets to
> Tx queue. We are expecting some performance drop because of the cache
> invalidation across lcores (also we cannot use the same lcore for different
> secondary process for mempool cache corruption), but again 7.3Mpps is ~30+%
> overhead.
> 
> Since you said, we tried the run to completion processing in the primary
> process (i.e., rx and tx is now on the same lcore). We also configured
> pktgent to handle rx and tx on the same lcore as well. With that we are now
> getting ~9.9-10Mpps with 64B packets. With our multi-process setup that
> drops down to ~8.4Mpps. So it seems like pktgen was not configured properly.
> It seems a bit counter-intuitive since from pktgen's side doing rx and tx on
> different lcore should not cause any cache invalidation (set of rx and tx
> packets are disjoint). So using different lcores should theoretically be
> better than handling both rx/tx in the same lcore for pkgetn. Am I missing
> something here?
> 
> Thanks

It sounds to me like your bottleneck is the primary -- the packet distributor. Consider the comment from Shahaf earlier: the best Mellanox was able to achieve with testpmd (which is extremely simple) is 10Mpps per core. I've always found that receiving is more expensive than transmitting, which means that if you're splitting your work on those dimensions, you'll need to allocate more CPU to the receiver than the transmitter. This may be one of the reasons run to completion works out -- the lower tx load on that core offsets the higher rx.

If you want to continue using the packet distribution model, why don't you try using RSS/multiqueue on the distributor, and allocate two cores to it? You'll need some entropy in the packets for it to distribute well, but hopefully that's not a problem. :)
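
(Something along these lines for the distributor port -- field names from memory, so treat it as a sketch: enable RSS, give the port two rx queues, and poll one queue from each of the two distributor cores.)

    static const struct rte_eth_conf port_conf = {
        .rxmode = {
            .mq_mode = ETH_MQ_RX_RSS,
        },
        .rx_adv_conf = {
            .rss_conf = {
                .rss_key = NULL,                       /* default RSS key */
                .rss_hf  = ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP,
            },
        },
    };

    /* Two rx queues, one per distributor core. */
    rte_eth_dev_configure(port_id, 2 /* rxq */, 1 /* txq */, &port_conf);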

Thanks,

Kyle

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK
  2017-04-13 15:49             ` Kyle Larose
@ 2017-04-17 17:43               ` Shihabur Rahman Chowdhury
  0 siblings, 0 replies; 12+ messages in thread
From: Shihabur Rahman Chowdhury @ 2017-04-17 17:43 UTC (permalink / raw)
  To: Kyle Larose
  Cc: Shahaf Shuler, Dave Wallace, Olga Shern, Adrien Mazarguil, Wiles,
	Keith, users

Thanks for the suggestions.

We'll definitely try RSS on the distributor. In the meantime we
implemented one optimization similar to the l3fwd example. Before
processing the batch, we prefetch a cache line from the first few
packets (currently 8). Then, while processing those packets, we prefetch
a cache line for the rest of the batch, and then process the prefetched
packets. This, along with running pktgen Rx/Tx on the same logical core,
improved throughput to ~8.76Mpps for 64B packets.
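
(The prefetch pattern is essentially the one from l3fwd -- a sketch, where
process_packet() stands in for our MAC swap and the rest of the pipeline:)

    #define PREFETCH_OFFSET 8

    /* Prefetch the first packets of the batch. */
    for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
        rte_prefetch0(rte_pktmbuf_mtod(pkts[j], void *));

    /* Prefetch ahead while processing the already-prefetched packets. */
    for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
        rte_prefetch0(rte_pktmbuf_mtod(pkts[j + PREFETCH_OFFSET], void *));
        process_packet(pkts[j]);
    }

    /* Process the remaining packets. */
    for (; j < nb_rx; j++)
        process_packet(pkts[j]);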

Shihabur Rahman Chowdhury
David R. Cheriton School of Computer Science
University of Waterloo



On Thu, Apr 13, 2017 at 11:49 AM, Kyle Larose <klarose@sandvine.com> wrote:

> Hey Shihab,
>
>
> > -----Original Message-----
> > From: users [mailto:users-bounces@dpdk.org] On Behalf Of Shihabur Rahman
> > Chowdhury
> > Sent: Thursday, April 13, 2017 10:21 AM
> > To: Shahaf Shuler
> > Cc: Dave Wallace; Olga Shern; Adrien Mazarguil; Wiles, Keith;
> users@dpdk.org
> > Subject: Re: [dpdk-users] Low Rx throughput when using Mellanox
> ConnectX-3
> > card with DPDK
> >
> >
> > ​To give a bit more context, we are developing a set of packet processors
> > that can be independently deployed as separate processes and can be
> scaled
> > out independently as well. So a batch of packet goes through a sequence
> of
> > processes until at some point they are written to the Tx queue or gets
> > dropped because of some processing decision. These packet processors are
> > running as secondary dpdk processes and the rx is being taking place at a
> > primary process (since Mellanox PMD does not allow Rx from a secondary
> > process). In this example configuration, one primary process is doing the
> > Rx, handing over the packet to another secondary process through a shared
> > ring and that secondary process is swapping the MAC and writing packets
> to
> > Tx queue. We are expecting some performance drop because of the cache
> > invalidation across lcores (also we cannot use the same lcore for
> different
> > secondary process for mempool cache corruption), but again 7.3Mpps is
> ~30+%
> > overhead.
> >
> > Since you said, we tried the run to completion processing in the primary
> > process (i.e., rx and tx is now on the same lcore). We also configured
> > pktgent to handle rx and tx on the same lcore as well. With that we are
> now
> > getting ~9.9-10Mpps with 64B packets. With our multi-process setup that
> > drops down to ~8.4Mpps. So it seems like pktgen was not configured
> properly.
> > It seems a bit counter-intuitive since from pktgen's side doing rx and
> tx on
> > different lcore should not cause any cache invalidation (set of rx and tx
> > packets are disjoint). So using different lcores should theoretically be
> > better than handling both rx/tx in the same lcore for pkgetn. Am I
> missing
> > something here?
> >
> > Thanks
>
> It sounds to me like your bottleneck is the primary -- the packet
> distributor. Consider the comment from Shahaf earlier: the best Mellanox
> was able to achieve with testpmd (which is extremely simple) is 10Mpps per
> core. I've always found that receiving is more expensive than transmitting,
> which means that if you're splitting your work on those dimensions, you'll
> need to allocate more CPU to the receiver than the transmitter. This may be
> one of the reasons run to completion works out -- the lower tx load on that
> core offsets the higher rx.
>
> If you want to continue using the packet distribution model, why don't you
> try using RSS/multiqueue on the distributor, and allocate two cores to it?
> You'll need some entropy in the packets for it to distribute well, but
> hopefully that's not a problem. :)
>
> Thanks,
>
> Kyle
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-04-17 17:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-12 21:00 [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK Shihabur Rahman Chowdhury
2017-04-12 22:41 ` Wiles, Keith
2017-04-13  0:06   ` Shihabur Rahman Chowdhury
2017-04-13  1:56     ` Dave Wallace
2017-04-13  1:57       ` Shihabur Rahman Chowdhury
2017-04-13  5:19         ` Shahaf Shuler
2017-04-13 14:21           ` Shihabur Rahman Chowdhury
2017-04-13 15:49             ` Kyle Larose
2017-04-17 17:43               ` Shihabur Rahman Chowdhury
2017-04-13 13:49     ` Wiles, Keith
2017-04-13 14:22       ` Shihabur Rahman Chowdhury
2017-04-13 14:47         ` Wiles, Keith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).