DPDK patches and discussions
* Re: [dpdk-dev] IXGBE throughput loss with 4+ cores
@ 2018-09-06  6:10 Saber Rezvani
  2018-09-06 17:48 ` Wiles, Keith
  0 siblings, 1 reply; 11+ messages in thread
From: Saber Rezvani @ 2018-09-06  6:10 UTC (permalink / raw)
  To: Wiles,  Keith; +Cc: Stephen Hemminger, dev

On 08/29/2018 11:22 PM, Wiles, Keith wrote:
>
>> On Aug 29, 2018, at 12:19 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>
>>
>>
>> On 08/29/2018 01:39 AM, Wiles, Keith wrote:
>>>> On Aug 28, 2018, at 2:16 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>>>
>>>>
>>>>
>>>> On 08/28/2018 11:39 PM, Wiles, Keith wrote:
>>>>> Which version of Pktgen? I just pushed a patch in 3.5.3 to fix a performance problem.
>>>> I use Pktgen verion 3.0.0, indeed it is O.k as far as I have one core. (10 Gb/s) but when I increase the number of core (one core per queue) then I loose some performance (roughly 8.5 Gb/s for 8-core). In my scenario Pktgen shows it is generating at line rate, but receiving 8.5 Gb/s.
>>>> Is it because of Pktgen???
>>> Normally Pktgen can receive at line rate up to 10G 64 byte frames, which means Pktgen should not be the problem. You can verify that by looping the cable from one port to another on the pktgen machine to create a external loopback. Then send traffic what ever you can send from one port you should be able to receive those packets unless something is configured wrong.
>>>
>>> Please send me the command line for pktgen.
>>>
>>>
>>> In pktgen if you have this config -m “[1-4:5-8].0” then you have 4 cores sending traffic and 4 core receiving packets.
>>>
>>> In this case the TX cores will be sending the packets on all 4 lcores to the same port. On the rx side you have 4 cores polling 4 rx queues. The rx queues are controlled by RSS, which means the RX traffic 5 tuples hash must divide the inbound packets across all 4 queues to make sure each core is doing the same amount of work. If you are sending only a single packet on the Tx cores then only one rx queue be used.
>>>
>>> I hope that makes sense.
>> I think there is a misunderstanding of the problem. Indeed the problem is not the Pktgen.
>> Here is my command --> ./app/app/x86_64-native-linuxapp-gcc/pktgen -c ffc0000 -n 4 -w 84:00.0 -w 84:00.1 --file-prefix pktgen_F2 --socket-mem 1000,2000,1000,1000 -- -T -P -m "[18-19:20-21].0, [22:23].1"
>>
>> The problem is when I run the symmetric_mp example for $numberOfProcesses=8 cores, then I have less throughput (roughly 8.4 Gb/s). but when I run it for $numberOfProcesses=3 cores throughput is 10G.
>> for i in `seq $numberOfProcesses`;
>>      do
>>              .... some calculation goes here.....
>>               symmetric_mp -c $coremask -n 2 --proc-type=auto -w 0b:00.0 -w 0b:00.1 --file-prefix sm --socket-mem 4000,1000,1000,1000 -- -p 3 --num-procs=$numberOfProcesses --proc-id=$procid";
>>               .....
>>      done
> Most NICs have a limited amount of memory on the NIC and when you start to segment that memory because you are using more queues it can effect performance.
>
> In one of the NICs if you go over say 6 or 5 queues the memory per queue for Rx/Tx packets starts to become a bottle neck as you do not have enough memory in the Tx/Rx queues to hold enough packets. This can cause the NIC to drop Rx packets because the host can not pull the data from the NIC or Rx ring on the host fast enough. This seems to be the problem as the amount of time to process a packet on the host has not changed only the amount of buffer space in the NIC as you increase queues.
>
> I am not sure this is your issue, but I figured I would state this point.
What you said sounded logical, but is there away that I can be sure? I mean are there some registers at NIC which show the number of packet loss on NIC? or does DPDK have an API which shows the number of packet loss at NIC level?
>
>> I am trying find out what makes this loss!
>>
>>
>>>>>> On Aug 28, 2018, at 12:05 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 08/28/2018 08:31 PM, Stephen Hemminger wrote:
>>>>>>> On Tue, 28 Aug 2018 17:34:27 +0430
>>>>>>> Saber Rezvani <irsaber@zoho.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> I have run multi_process/symmetric_mp example in DPDK example directory.
>>>>>>>> For a one process its throughput is line rate but as I increase the
>>>>>>>> number of cores I see decrease in throughput. For example, If the number
>>>>>>>> of queues set to 4 and each queue assigns to a single core, then the
>>>>>>>> throughput will be something about 9.4. if 8 queues, then throughput
>>>>>>>> will be 8.5.
>>>>>>>>
>>>>>>>> I have read the following, but it was not convincing.
>>>>>>>>
>>>>>>>> http://mails.dpdk.org/archives/dev/2015-October/024960.html
>>>>>>>>
>>>>>>>>
>>>>>>>> I am eagerly looking forward to hearing from you, all.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best wishes,
>>>>>>>>
>>>>>>>> Saber
>>>>>>>>
>>>>>>>>
>>>>>>> Not completely surprising. If you have more cores than packet line rate
>>>>>>> then the number of packets returned for each call to rx_burst will be less.
>>>>>>> With large number of cores, most of the time will be spent doing reads of
>>>>>>> PCI registers for no packets!
>>>>>> Indeed pktgen says it is generating traffic at line rate, but receiving less than 10 Gb/s. So, it that case there should be something that causes the reduction in throughput :(
>>>>>>
>>>>>>
>>>>> Regards,
>>>>> Keith
>>>>>
>>>>
>>> Regards,
>>> Keith
>>>
>> Best regards,
>> Saber
> Regards,
> Keith
>
Best regards,
Saber
From: Xiaolong Ye <xiaolong.ye@intel.com>
To: dev@dpdk.org, Maxime Coquelin <maxime.coquelin@redhat.com>,
 Tiwei Bie <tiwei.bie@intel.com>, Zhihong Wang <zhihong.wang@intel.com>
Cc: xiao.w.wang@intel.com,
	Xiaolong Ye <xiaolong.ye@intel.com>
Date: Thu,  6 Sep 2018 21:16:52 +0800
Message-Id: <20180906131653.10752-1-xiaolong.ye@intel.com>
Subject: [dpdk-dev] [PATCH v1 1/2] vhost: introduce rte_vdpa_get_device_num api

Signed-off-by: Xiaolong Ye <xiaolong.ye@intel.com>
---
 lib/librte_vhost/rte_vdpa.h | 3 +++
 lib/librte_vhost/vdpa.c     | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/lib/librte_vhost/rte_vdpa.h b/lib/librte_vhost/rte_vdpa.h
index 90465ca26..b8223e337 100644
--- a/lib/librte_vhost/rte_vdpa.h
+++ b/lib/librte_vhost/rte_vdpa.h
@@ -84,4 +84,7 @@ rte_vdpa_find_device_id(struct rte_vdpa_dev_addr *addr);
 struct rte_vdpa_device * __rte_experimental
 rte_vdpa_get_device(int did);

+/* Get current available vdpa device number */
+int __rte_experimental
+rte_vdpa_get_device_num(void);
 #endif /* _RTE_VDPA_H_ */
diff --git a/lib/librte_vhost/vdpa.c b/lib/librte_vhost/vdpa.c
index c82fd4370..c2c5dff1d 100644
--- a/lib/librte_vhost/vdpa.c
+++ b/lib/librte_vhost/vdpa.c
@@ -113,3 +113,9 @@ rte_vdpa_get_device(int did)

 	return vdpa_devices[did];
 }
+
+int
+rte_vdpa_get_device_num(void)
+{
+	return vdpa_device_num;
+}
--
2.18.0.rc1.1.g6f333ff2f
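
For context, a minimal sketch of how a caller could combine the new API with the existing rte_vdpa_get_device(); the function name and iteration pattern below are illustrative only, and it assumes device ids are assigned contiguously from zero (which holds while no device has been unregistered):

#include <stdio.h>
#include <rte_vdpa.h>

/* Sketch: report and walk the registered vDPA devices.
 * Build with -DALLOW_EXPERIMENTAL_API since both calls are experimental. */
static void
dump_vdpa_devices(void)
{
	int i, num = rte_vdpa_get_device_num();

	printf("%d vDPA device(s) registered\n", num);
	for (i = 0; i < num; i++) {
		struct rte_vdpa_device *dev = rte_vdpa_get_device(i);

		if (dev != NULL) {
			/* inspect the device here */
		}
	}
}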

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] IXGBE throughput loss with 4+ cores
  2018-09-06  6:10 [dpdk-dev] IXGBE throughput loss with 4+ cores Saber Rezvani
@ 2018-09-06 17:48 ` Wiles, Keith
  0 siblings, 0 replies; 11+ messages in thread
From: Wiles, Keith @ 2018-09-06 17:48 UTC (permalink / raw)
  To: Saber Rezvani; +Cc: Stephen Hemminger, dev



> On Sep 6, 2018, at 7:10 AM, Saber Rezvani <irsaber@zoho.com> wrote:
> 
> 
> 
> On 08/29/2018 11:22 PM, Wiles, Keith wrote: 
> > 
> >> On Aug 29, 2018, at 12:19 PM, Saber Rezvani <irsaber@zoho.com> wrote: 
> >> 
> >> 
> >> 
> >> On 08/29/2018 01:39 AM, Wiles, Keith wrote: 
> >>>> On Aug 28, 2018, at 2:16 PM, Saber Rezvani <irsaber@zoho.com> wrote: 
> >>>> 
> >>>> 
> >>>> 
> >>>> On 08/28/2018 11:39 PM, Wiles, Keith wrote: 
> >>>>> Which version of Pktgen? I just pushed a patch in 3.5.3 to fix a performance problem. 
> >>>> I use Pktgen verion 3.0.0, indeed it is O.k as far as I have one core. (10 Gb/s) but when I increase the number of core (one core per queue) then I loose some performance (roughly 8.5 Gb/s for 8-core). In my scenario Pktgen shows it is generating at line rate, but receiving 8.5 Gb/s. 
> >>>> Is it because of Pktgen??? 
> >>> Normally Pktgen can receive at line rate up to 10G 64 byte frames, which means Pktgen should not be the problem. You can verify that by looping the cable from one port to another on the pktgen machine to create a external loopback. Then send traffic what ever you can send from one port you should be able to receive those packets unless something is configured wrong. 
> >>> 
> >>> Please send me the command line for pktgen. 
> >>> 
> >>> 
> >>> In pktgen if you have this config -m “[1-4:5-8].0” then you have 4 cores sending traffic and 4 core receiving packets. 
> >>> 
> >>> In this case the TX cores will be sending the packets on all 4 lcores to the same port. On the rx side you have 4 cores polling 4 rx queues. The rx queues are controlled by RSS, which means the RX traffic 5 tuples hash must divide the inbound packets across all 4 queues to make sure each core is doing the same amount of work. If you are sending only a single packet on the Tx cores then only one rx queue be used. 
> >>> 
> >>> I hope that makes sense. 
> >> I think there is a misunderstanding of the problem. Indeed the problem is not the Pktgen. 
> >> Here is my command --> ./app/app/x86_64-native-linuxapp-gcc/pktgen -c ffc0000 -n 4 -w 84:00.0 -w 84:00.1 --file-prefix pktgen_F2 --socket-mem 1000,2000,1000,1000 -- -T -P -m "[18-19:20-21].0, [22:23].1" 
> >> 
> >> The problem is when I run the symmetric_mp example for $numberOfProcesses=8 cores, then I have less throughput (roughly 8.4 Gb/s). but when I run it for $numberOfProcesses=3 cores throughput is 10G. 
> >> for i in `seq $numberOfProcesses`; 
> >> do 
> >> .... some calculation goes here..... 
> >> symmetric_mp -c $coremask -n 2 --proc-type=auto -w 0b:00.0 -w 0b:00.1 --file-prefix sm --socket-mem 4000,1000,1000,1000 -- -p 3 --num-procs=$numberOfProcesses --proc-id=$procid"; 
> >> ..... 
> >> done 
> > Most NICs have a limited amount of memory on the NIC and when you start to segment that memory because you are using more queues it can effect performance. 
> > 
> > In one of the NICs if you go over say 6 or 5 queues the memory per queue for Rx/Tx packets starts to become a bottle neck as you do not have enough memory in the Tx/Rx queues to hold enough packets. This can cause the NIC to drop Rx packets because the host can not pull the data from the NIC or Rx ring on the host fast enough. This seems to be the problem as the amount of time to process a packet on the host has not changed only the amount of buffer space in the NIC as you increase queues. 
> > 
> > I am not sure this is your issue, but I figured I would state this point. 
> What you said sounded logical, but is there away that I can be sure? I 
> mean are there some registers at NIC which show the number of packet 
> loss on NIC? or does DPDK have an API which shows the number of packet 
> loss at NIC level? 

Yes, if you look in the docs at readthedocs.org/projects/dpdk you can find the API; it is something like rte_eth_stats_get().
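
For example, a minimal sketch of reading those counters (the port id handling is illustrative; imissed counts packets the NIC dropped because the host did not drain the RX rings fast enough, and rx_nombuf counts mbuf allocation failures):

#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

/* Sketch: print the NIC-level receive and drop counters for one port. */
static void
print_drop_stats(uint16_t port_id)
{
	struct rte_eth_stats stats;

	if (rte_eth_stats_get(port_id, &stats) == 0)
		printf("port %u: ipackets=%" PRIu64 " imissed=%" PRIu64
		       " ierrors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
		       port_id, stats.ipackets, stats.imissed,
		       stats.ierrors, stats.rx_nombuf);
}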

> > 
> >> I am trying find out what makes this loss! 
> >> 
> >> 
> >>>>>> On Aug 28, 2018, at 12:05 PM, Saber Rezvani <irsaber@zoho.com> wrote: 
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> On 08/28/2018 08:31 PM, Stephen Hemminger wrote: 
> >>>>>>> On Tue, 28 Aug 2018 17:34:27 +0430 
> >>>>>>> Saber Rezvani <irsaber@zoho.com> wrote: 
> >>>>>>> 
> >>>>>>>> Hi, 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> I have run multi_process/symmetric_mp example in DPDK example directory. 
> >>>>>>>> For a one process its throughput is line rate but as I increase the 
> >>>>>>>> number of cores I see decrease in throughput. For example, If the number 
> >>>>>>>> of queues set to 4 and each queue assigns to a single core, then the 
> >>>>>>>> throughput will be something about 9.4. if 8 queues, then throughput 
> >>>>>>>> will be 8.5. 
> >>>>>>>> 
> >>>>>>>> I have read the following, but it was not convincing. 
> >>>>>>>> 
> >>>>>>>> http://mails.dpdk.org/archives/dev/2015-October/024960.html 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> I am eagerly looking forward to hearing from you, all. 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> Best wishes, 
> >>>>>>>> 
> >>>>>>>> Saber 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>> Not completely surprising. If you have more cores than packet line rate 
> >>>>>>> then the number of packets returned for each call to rx_burst will be less. 
> >>>>>>> With large number of cores, most of the time will be spent doing reads of 
> >>>>>>> PCI registers for no packets! 
> >>>>>> Indeed pktgen says it is generating traffic at line rate, but receiving less than 10 Gb/s. So, it that case there should be something that causes the reduction in throughput :( 
> >>>>>> 
> >>>>>> 
> >>>>> Regards, 
> >>>>> Keith 
> >>>>> 
> >>>> 
> >>> Regards, 
> >>> Keith 
> >>> 
> >> Best regards, 
> >> Saber 
> > Regards, 
> > Keith 
> > 
> Best regards, 
> Saber 
> 

Regards,
Keith


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] IXGBE throughput loss with 4+ cores
  2018-08-29 18:52             ` Wiles, Keith
@ 2018-08-30  4:08               ` Saber Rezvani
  0 siblings, 0 replies; 11+ messages in thread
From: Saber Rezvani @ 2018-08-30  4:08 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: Stephen Hemminger, dev



On 08/29/2018 11:22 PM, Wiles, Keith wrote:
>
>> On Aug 29, 2018, at 12:19 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>
>>
>>
>> On 08/29/2018 01:39 AM, Wiles, Keith wrote:
>>>> On Aug 28, 2018, at 2:16 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>>>
>>>>
>>>>
>>>> On 08/28/2018 11:39 PM, Wiles, Keith wrote:
>>>>> Which version of Pktgen? I just pushed a patch in 3.5.3 to fix a  performance problem.
>>>> I use Pktgen verion 3.0.0, indeed it is O.k as far as I  have one core. (10 Gb/s) but when I increase the number of core (one core per queue) then I loose some performance (roughly 8.5 Gb/s for 8-core). In my scenario Pktgen shows it is generating at line rate, but receiving 8.5 Gb/s.
>>>> Is it because of Pktgen???
>>> Normally Pktgen can receive at line rate up to 10G 64 byte frames, which means Pktgen should not be the problem. You can verify that by looping the cable from one port to another on the pktgen machine to create a external loopback. Then send traffic what ever you can send from one port you should be able to receive those packets unless something is configured wrong.
>>>
>>> Please send me the command line for pktgen.
>>>
>>>
>>> In pktgen if you have this config -m “[1-4:5-8].0” then you have 4 cores sending traffic and 4 core receiving packets.
>>>
>>> In this case the TX cores will be sending the packets on all 4 lcores to the same port. On the rx side you have 4 cores polling 4 rx queues. The rx queues are controlled by RSS, which means the RX traffic 5 tuples hash must divide the inbound packets across all 4 queues to make sure each core is doing the same amount of work. If you are sending only a single packet on the Tx cores then only one rx queue be used.
>>>
>>> I hope that makes sense.
>> I think there is a misunderstanding of the problem. Indeed the problem is not the Pktgen.
>> Here is my command --> ./app/app/x86_64-native-linuxapp-gcc/pktgen -c ffc0000 -n 4 -w 84:00.0 -w 84:00.1 --file-prefix pktgen_F2 --socket-mem 1000,2000,1000,1000 -- -T -P -m "[18-19:20-21].0, [22:23].1"
>>
>> The problem is when I run the symmetric_mp example for $numberOfProcesses=8 cores, then I have less throughput (roughly 8.4 Gb/s). but when I run it for $numberOfProcesses=3 cores throughput is 10G.
>> for i in `seq $numberOfProcesses`;
>>      do
>>              .... some calculation goes here.....
>>               symmetric_mp -c $coremask -n 2 --proc-type=auto -w 0b:00.0 -w 0b:00.1 --file-prefix sm --socket-mem 4000,1000,1000,1000 -- -p 3 --num-procs=$numberOfProcesses --proc-id=$procid";
>>               .....
>>      done
> Most NICs have a limited amount of memory on the NIC and when you start to segment that memory because you are using more queues it can effect performance.
>
> In one of the NICs if you go over say 6 or 5 queues the memory per queue for Rx/Tx packets starts to become a bottle neck as you do not have enough memory in the Tx/Rx queues to hold enough packets. This can cause the NIC to drop Rx packets because the host can not pull the data from the NIC or Rx ring on the host fast enough. This seems to be the problem as the amount of time to process a packet on the host has not changed only the amount of buffer space in the NIC as you increase queues.
>
> I am not sure this is your issue, but I figured I would state this point.
What you said sounds logical, but is there a way I can be sure? I mean, 
are there registers on the NIC which show the number of packets lost at 
the NIC? Or does DPDK have an API which reports the packet loss at the 
NIC level?
>
>> I am trying find out what makes this loss!
>>
>>
>>>>>> On Aug 28, 2018, at 12:05 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 08/28/2018 08:31 PM, Stephen Hemminger wrote:
>>>>>>> On Tue, 28 Aug 2018 17:34:27 +0430
>>>>>>> Saber Rezvani <irsaber@zoho.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> I have run multi_process/symmetric_mp example in DPDK example directory.
>>>>>>>> For a one process its throughput is line rate but as I increase the
>>>>>>>> number of cores I see decrease in throughput. For example, If the number
>>>>>>>> of queues set to 4 and each queue assigns to a single core, then the
>>>>>>>> throughput will be something about 9.4. if 8 queues, then throughput
>>>>>>>> will be 8.5.
>>>>>>>>
>>>>>>>> I have read the following, but it was not convincing.
>>>>>>>>
>>>>>>>> http://mails.dpdk.org/archives/dev/2015-October/024960.html
>>>>>>>>
>>>>>>>>
>>>>>>>> I am eagerly looking forward to hearing from you, all.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best wishes,
>>>>>>>>
>>>>>>>> Saber
>>>>>>>>
>>>>>>>>
>>>>>>> Not completely surprising. If you have more cores than packet line rate
>>>>>>> then the number of packets returned for each call to rx_burst will be less.
>>>>>>> With large number of cores, most of the time will be spent doing reads of
>>>>>>> PCI registers for no packets!
>>>>>> Indeed pktgen says it is generating traffic at line rate, but receiving less than 10 Gb/s. So, it that case there should be something that causes the reduction in throughput :(
>>>>>>
>>>>>>
>>>>> Regards,
>>>>> Keith
>>>>>
>>>>
>>> Regards,
>>> Keith
>>>
>> Best regards,
>> Saber
> Regards,
> Keith
>
Best regards,
Saber

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] IXGBE throughput loss with 4+ cores
  2018-08-29 17:19           ` Saber Rezvani
@ 2018-08-29 18:52             ` Wiles, Keith
  2018-08-30  4:08               ` Saber Rezvani
  0 siblings, 1 reply; 11+ messages in thread
From: Wiles, Keith @ 2018-08-29 18:52 UTC (permalink / raw)
  To: Saber Rezvani; +Cc: Stephen Hemminger, dev



> On Aug 29, 2018, at 12:19 PM, Saber Rezvani <irsaber@zoho.com> wrote:
> 
> 
> 
> On 08/29/2018 01:39 AM, Wiles, Keith wrote:
>> 
>>> On Aug 28, 2018, at 2:16 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>> 
>>> 
>>> 
>>> On 08/28/2018 11:39 PM, Wiles, Keith wrote:
>>>> Which version of Pktgen? I just pushed a patch in 3.5.3 to fix a  performance problem.
>>> I use Pktgen verion 3.0.0, indeed it is O.k as far as I  have one core. (10 Gb/s) but when I increase the number of core (one core per queue) then I loose some performance (roughly 8.5 Gb/s for 8-core). In my scenario Pktgen shows it is generating at line rate, but receiving 8.5 Gb/s.
>>> Is it because of Pktgen???
>> Normally Pktgen can receive at line rate up to 10G 64 byte frames, which means Pktgen should not be the problem. You can verify that by looping the cable from one port to another on the pktgen machine to create a external loopback. Then send traffic what ever you can send from one port you should be able to receive those packets unless something is configured wrong.
>> 
>> Please send me the command line for pktgen.
>> 
>> 
>> In pktgen if you have this config -m “[1-4:5-8].0” then you have 4 cores sending traffic and 4 core receiving packets.
>> 
>> In this case the TX cores will be sending the packets on all 4 lcores to the same port. On the rx side you have 4 cores polling 4 rx queues. The rx queues are controlled by RSS, which means the RX traffic 5 tuples hash must divide the inbound packets across all 4 queues to make sure each core is doing the same amount of work. If you are sending only a single packet on the Tx cores then only one rx queue be used.
>> 
>> I hope that makes sense.
> I think there is a misunderstanding of the problem. Indeed the problem is not the Pktgen.
> Here is my command --> ./app/app/x86_64-native-linuxapp-gcc/pktgen -c ffc0000 -n 4 -w 84:00.0 -w 84:00.1 --file-prefix pktgen_F2 --socket-mem 1000,2000,1000,1000 -- -T -P -m "[18-19:20-21].0, [22:23].1"
> 
> The problem is when I run the symmetric_mp example for $numberOfProcesses=8 cores, then I have less throughput (roughly 8.4 Gb/s). but when I run it for $numberOfProcesses=3 cores throughput is 10G.
> for i in `seq $numberOfProcesses`;
>     do
>             .... some calculation goes here.....
>              symmetric_mp -c $coremask -n 2 --proc-type=auto -w 0b:00.0 -w 0b:00.1 --file-prefix sm --socket-mem 4000,1000,1000,1000 -- -p 3 --num-procs=$numberOfProcesses --proc-id=$procid";
>              .....
>     done

Most NICs have a limited amount of memory on the NIC, and when you start to segment that memory because you are using more queues it can affect performance.

In one of the NICs, if you go over say 5 or 6 queues the memory per queue for Rx/Tx packets starts to become a bottleneck, as you do not have enough memory in the Tx/Rx queues to hold enough packets. This can cause the NIC to drop Rx packets because the host cannot pull the data from the NIC or the Rx ring on the host fast enough. This seems to be the problem, since the amount of time to process a packet on the host has not changed; only the amount of buffer space in the NIC has, as you increase the number of queues.

I am not sure this is your issue, but I figured I would state this point.
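
As a rough illustration (not taken from symmetric_mp), the device's descriptor limits can be queried and the RX rings sized toward the hardware maximum, which at least rules out the ring size as the limit:

#include <rte_common.h>
#include <rte_mempool.h>
#include <rte_ethdev.h>

/* Sketch: size every RX ring toward the device maximum (values illustrative). */
static int
setup_rx_rings(uint16_t port_id, uint16_t nb_queues, struct rte_mempool *mp)
{
	struct rte_eth_dev_info info;
	uint16_t q, nb_desc;

	rte_eth_dev_info_get(port_id, &info);
	/* the driver reports its limit in rx_desc_lim.nb_max; cap it to 4096 here */
	nb_desc = RTE_MIN(info.rx_desc_lim.nb_max, (uint16_t)4096);

	for (q = 0; q < nb_queues; q++) {
		int ret = rte_eth_rx_queue_setup(port_id, q, nb_desc,
				rte_eth_dev_socket_id(port_id), NULL, mp);
		if (ret < 0)
			return ret;
	}
	return 0;
}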

> 
> I am trying find out what makes this loss!
> 
> 
>>>>> On Aug 28, 2018, at 12:05 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On 08/28/2018 08:31 PM, Stephen Hemminger wrote:
>>>>>> On Tue, 28 Aug 2018 17:34:27 +0430
>>>>>> Saber Rezvani <irsaber@zoho.com> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> 
>>>>>>> I have run multi_process/symmetric_mp example in DPDK example directory.
>>>>>>> For a one process its throughput is line rate but as I increase the
>>>>>>> number of cores I see decrease in throughput. For example, If the number
>>>>>>> of queues set to 4 and each queue assigns to a single core, then the
>>>>>>> throughput will be something about 9.4. if 8 queues, then throughput
>>>>>>> will be 8.5.
>>>>>>> 
>>>>>>> I have read the following, but it was not convincing.
>>>>>>> 
>>>>>>> http://mails.dpdk.org/archives/dev/2015-October/024960.html
>>>>>>> 
>>>>>>> 
>>>>>>> I am eagerly looking forward to hearing from you, all.
>>>>>>> 
>>>>>>> 
>>>>>>> Best wishes,
>>>>>>> 
>>>>>>> Saber
>>>>>>> 
>>>>>>> 
>>>>>> Not completely surprising. If you have more cores than packet line rate
>>>>>> then the number of packets returned for each call to rx_burst will be less.
>>>>>> With large number of cores, most of the time will be spent doing reads of
>>>>>> PCI registers for no packets!
>>>>> Indeed pktgen says it is generating traffic at line rate, but receiving less than 10 Gb/s. So, it that case there should be something that causes the reduction in throughput :(
>>>>> 
>>>>> 
>>>> Regards,
>>>> Keith
>>>> 
>>> 
>>> 
>> Regards,
>> Keith
>> 
> 
> Best regards,
> Saber

Regards,
Keith


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] IXGBE throughput loss with 4+ cores
  2018-08-28 21:09         ` Wiles, Keith
@ 2018-08-29 17:19           ` Saber Rezvani
  2018-08-29 18:52             ` Wiles, Keith
  0 siblings, 1 reply; 11+ messages in thread
From: Saber Rezvani @ 2018-08-29 17:19 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: Stephen Hemminger, dev



On 08/29/2018 01:39 AM, Wiles, Keith wrote:
>
>> On Aug 28, 2018, at 2:16 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>
>>
>>
>> On 08/28/2018 11:39 PM, Wiles, Keith wrote:
>>> Which version of Pktgen? I just pushed a patch in 3.5.3 to fix a  performance problem.
>> I use Pktgen verion 3.0.0, indeed it is O.k as far as I  have one core. (10 Gb/s) but when I increase the number of core (one core per queue) then I loose some performance (roughly 8.5 Gb/s for 8-core). In my scenario Pktgen shows it is generating at line rate, but receiving 8.5 Gb/s.
>> Is it because of Pktgen???
> Normally Pktgen can receive at line rate up to 10G 64 byte frames, which means Pktgen should not be the problem. You can verify that by looping the cable from one port to another on the pktgen machine to create a external loopback. Then send traffic what ever you can send from one port you should be able to receive those packets unless something is configured wrong.
>
> Please send me the command line for pktgen.
>
>
> In pktgen if you have this config -m “[1-4:5-8].0” then you have 4 cores sending traffic and 4 core receiving packets.
>
> In this case the TX cores will be sending the packets on all 4 lcores to the same port. On the rx side you have 4 cores polling 4 rx queues. The rx queues are controlled by RSS, which means the RX traffic 5 tuples hash must divide the inbound packets across all 4 queues to make sure each core is doing the same amount of work. If you are sending only a single packet on the Tx cores then only one rx queue be used.
>
> I hope that makes sense.
I think there is a misunderstanding of the problem. Indeed, the problem 
is not Pktgen.
Here is my command --> ./app/app/x86_64-native-linuxapp-gcc/pktgen -c 
ffc0000 -n 4 -w 84:00.0 -w 84:00.1 --file-prefix pktgen_F2 --socket-mem 
1000,2000,1000,1000 -- -T -P -m "[18-19:20-21].0, [22:23].1"

The problem is that when I run the symmetric_mp example with 
$numberOfProcesses=8 cores, I get less throughput (roughly 8.4 
Gb/s), but when I run it with $numberOfProcesses=3 cores the throughput is 10G.
for i in `seq $numberOfProcesses`;
     do
             .... some calculation goes here.....
              symmetric_mp -c $coremask -n 2 --proc-type=auto -w 0b:00.0 
-w 0b:00.1 --file-prefix sm --socket-mem 4000,1000,1000,1000 -- -p 3 
--num-procs=$numberOfProcesses --proc-id=$procid";
              .....
     done

I am trying to find out what causes this loss!


>>>> On Aug 28, 2018, at 12:05 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>>>
>>>>
>>>>
>>>> On 08/28/2018 08:31 PM, Stephen Hemminger wrote:
>>>>> On Tue, 28 Aug 2018 17:34:27 +0430
>>>>> Saber Rezvani <irsaber@zoho.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I have run multi_process/symmetric_mp example in DPDK example directory.
>>>>>> For a one process its throughput is line rate but as I increase the
>>>>>> number of cores I see decrease in throughput. For example, If the number
>>>>>> of queues set to 4 and each queue assigns to a single core, then the
>>>>>> throughput will be something about 9.4. if 8 queues, then throughput
>>>>>> will be 8.5.
>>>>>>
>>>>>> I have read the following, but it was not convincing.
>>>>>>
>>>>>> http://mails.dpdk.org/archives/dev/2015-October/024960.html
>>>>>>
>>>>>>
>>>>>> I am eagerly looking forward to hearing from you, all.
>>>>>>
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>> Saber
>>>>>>
>>>>>>
>>>>> Not completely surprising. If you have more cores than packet line rate
>>>>> then the number of packets returned for each call to rx_burst will be less.
>>>>> With large number of cores, most of the time will be spent doing reads of
>>>>> PCI registers for no packets!
>>>> Indeed pktgen says it is generating traffic at line rate, but receiving less than 10 Gb/s. So, it that case there should be something that causes the reduction in throughput :(
>>>>
>>>>
>>> Regards,
>>> Keith
>>>
>>
>>
> Regards,
> Keith
>

Best regards,
Saber

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] IXGBE throughput loss with 4+ cores
  2018-08-28 19:16       ` Saber Rezvani
@ 2018-08-28 21:09         ` Wiles, Keith
  2018-08-29 17:19           ` Saber Rezvani
  0 siblings, 1 reply; 11+ messages in thread
From: Wiles, Keith @ 2018-08-28 21:09 UTC (permalink / raw)
  To: Saber Rezvani; +Cc: Stephen Hemminger, dev



> On Aug 28, 2018, at 2:16 PM, Saber Rezvani <irsaber@zoho.com> wrote:
> 
> 
> 
> On 08/28/2018 11:39 PM, Wiles, Keith wrote:
>> Which version of Pktgen? I just pushed a patch in 3.5.3 to fix a  performance problem.
> I use Pktgen verion 3.0.0, indeed it is O.k as far as I  have one core. (10 Gb/s) but when I increase the number of core (one core per queue) then I loose some performance (roughly 8.5 Gb/s for 8-core). In my scenario Pktgen shows it is generating at line rate, but receiving 8.5 Gb/s.
> Is it because of Pktgen???

Normally Pktgen can receive at line rate up to 10G 64-byte frames, which means Pktgen should not be the problem. You can verify that by looping the cable from one port to another on the pktgen machine to create an external loopback. Then send whatever traffic you can from one port; you should be able to receive those packets on the other unless something is configured wrong.

Please send me the command line for pktgen.


In pktgen, if you have this config -m “[1-4:5-8].0” then you have 4 cores sending traffic and 4 cores receiving packets.

In this case the TX cores will be sending the packets on all 4 lcores to the same port. On the rx side you have 4 cores polling 4 rx queues. The rx queues are controlled by RSS, which means the RX traffic's 5-tuple hash must divide the inbound packets across all 4 queues to make sure each core is doing the same amount of work. If you are sending only a single packet (a single flow) from the Tx cores then only one rx queue will be used.

I hope that makes sense.
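
As a rough illustration of the receive side (not pktgen's actual code), the port configuration that makes RSS spread flows across all the RX queues looks something like this:

#include <rte_ethdev.h>

/* Sketch: enable RSS over IP/TCP/UDP so multi-flow traffic spreads
 * across every RX queue; a single flow still lands in one queue. */
static int
configure_port_rss(uint16_t port_id, uint16_t nb_rx_q, uint16_t nb_tx_q)
{
	struct rte_eth_conf conf = {
		.rxmode = { .mq_mode = ETH_MQ_RX_RSS },
		.rx_adv_conf = {
			.rss_conf = {
				.rss_key = NULL, /* use the driver default key */
				.rss_hf = ETH_RSS_IP | ETH_RSS_TCP | ETH_RSS_UDP,
			},
		},
	};

	return rte_eth_dev_configure(port_id, nb_rx_q, nb_tx_q, &conf);
}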

>> 
>>> On Aug 28, 2018, at 12:05 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>> 
>>> 
>>> 
>>> On 08/28/2018 08:31 PM, Stephen Hemminger wrote:
>>>> On Tue, 28 Aug 2018 17:34:27 +0430
>>>> Saber Rezvani <irsaber@zoho.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> 
>>>>> I have run multi_process/symmetric_mp example in DPDK example directory.
>>>>> For a one process its throughput is line rate but as I increase the
>>>>> number of cores I see decrease in throughput. For example, If the number
>>>>> of queues set to 4 and each queue assigns to a single core, then the
>>>>> throughput will be something about 9.4. if 8 queues, then throughput
>>>>> will be 8.5.
>>>>> 
>>>>> I have read the following, but it was not convincing.
>>>>> 
>>>>> http://mails.dpdk.org/archives/dev/2015-October/024960.html
>>>>> 
>>>>> 
>>>>> I am eagerly looking forward to hearing from you, all.
>>>>> 
>>>>> 
>>>>> Best wishes,
>>>>> 
>>>>> Saber
>>>>> 
>>>>> 
>>>> Not completely surprising. If you have more cores than packet line rate
>>>> then the number of packets returned for each call to rx_burst will be less.
>>>> With large number of cores, most of the time will be spent doing reads of
>>>> PCI registers for no packets!
>>> Indeed pktgen says it is generating traffic at line rate, but receiving less than 10 Gb/s. So, it that case there should be something that causes the reduction in throughput :(
>>> 
>>> 
>> Regards,
>> Keith
>> 
> 
> 
> 

Regards,
Keith


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] IXGBE throughput loss with 4+ cores
  2018-08-28 19:09     ` Wiles, Keith
@ 2018-08-28 19:16       ` Saber Rezvani
  2018-08-28 21:09         ` Wiles, Keith
  0 siblings, 1 reply; 11+ messages in thread
From: Saber Rezvani @ 2018-08-28 19:16 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: Stephen Hemminger, dev



On 08/28/2018 11:39 PM, Wiles, Keith wrote:
> Which version of Pktgen? I just pushed a patch in 3.5.3 to fix a  performance problem.
I use Pktgen version 3.0.0. Indeed, it is OK as long as I have one core 
(10 Gb/s), but when I increase the number of cores (one core per queue) 
I lose some performance (roughly 8.5 Gb/s for 8 cores). In my scenario 
Pktgen shows it is generating at line rate, but receiving 8.5 Gb/s.
Is it because of Pktgen?
>
>> On Aug 28, 2018, at 12:05 PM, Saber Rezvani <irsaber@zoho.com> wrote:
>>
>>
>>
>> On 08/28/2018 08:31 PM, Stephen Hemminger wrote:
>>> On Tue, 28 Aug 2018 17:34:27 +0430
>>> Saber Rezvani <irsaber@zoho.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> I have run multi_process/symmetric_mp example in DPDK example directory.
>>>> For a one process its throughput is line rate but as I increase the
>>>> number of cores I see decrease in throughput. For example, If the number
>>>> of queues set to 4 and each queue assigns to a single core, then the
>>>> throughput will be something about 9.4. if 8 queues, then throughput
>>>> will be 8.5.
>>>>
>>>> I have read the following, but it was not convincing.
>>>>
>>>> http://mails.dpdk.org/archives/dev/2015-October/024960.html
>>>>
>>>>
>>>> I am eagerly looking forward to hearing from you, all.
>>>>
>>>>
>>>> Best wishes,
>>>>
>>>> Saber
>>>>
>>>>
>>> Not completely surprising. If you have more cores than packet line rate
>>> then the number of packets returned for each call to rx_burst will be less.
>>> With large number of cores, most of the time will be spent doing reads of
>>> PCI registers for no packets!
>> Indeed pktgen says it is generating traffic at line rate, but receiving less than 10 Gb/s. So, it that case there should be something that causes the reduction in throughput :(
>>
>>
> Regards,
> Keith
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] IXGBE throughput loss with 4+ cores
  2018-08-28 17:05   ` Saber Rezvani
@ 2018-08-28 19:09     ` Wiles, Keith
  2018-08-28 19:16       ` Saber Rezvani
  0 siblings, 1 reply; 11+ messages in thread
From: Wiles, Keith @ 2018-08-28 19:09 UTC (permalink / raw)
  To: Saber Rezvani; +Cc: Stephen Hemminger, dev

Which version of Pktgen? I just pushed a patch in 3.5.3 to fix a  performance problem.

> On Aug 28, 2018, at 12:05 PM, Saber Rezvani <irsaber@zoho.com> wrote:
> 
> 
> 
> On 08/28/2018 08:31 PM, Stephen Hemminger wrote:
>> On Tue, 28 Aug 2018 17:34:27 +0430
>> Saber Rezvani <irsaber@zoho.com> wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> I have run multi_process/symmetric_mp example in DPDK example directory.
>>> For a one process its throughput is line rate but as I increase the
>>> number of cores I see decrease in throughput. For example, If the number
>>> of queues set to 4 and each queue assigns to a single core, then the
>>> throughput will be something about 9.4. if 8 queues, then throughput
>>> will be 8.5.
>>> 
>>> I have read the following, but it was not convincing.
>>> 
>>> http://mails.dpdk.org/archives/dev/2015-October/024960.html
>>> 
>>> 
>>> I am eagerly looking forward to hearing from you, all.
>>> 
>>> 
>>> Best wishes,
>>> 
>>> Saber
>>> 
>>> 
>> Not completely surprising. If you have more cores than packet line rate
>> then the number of packets returned for each call to rx_burst will be less.
>> With large number of cores, most of the time will be spent doing reads of
>> PCI registers for no packets!
> Indeed pktgen says it is generating traffic at line rate, but receiving less than 10 Gb/s. So, it that case there should be something that causes the reduction in throughput :(
> 
> 

Regards,
Keith

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] IXGBE throughput loss with 4+ cores
  2018-08-28 16:01 ` Stephen Hemminger
@ 2018-08-28 17:05   ` Saber Rezvani
  2018-08-28 19:09     ` Wiles, Keith
  0 siblings, 1 reply; 11+ messages in thread
From: Saber Rezvani @ 2018-08-28 17:05 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev



On 08/28/2018 08:31 PM, Stephen Hemminger wrote:
> On Tue, 28 Aug 2018 17:34:27 +0430
> Saber Rezvani <irsaber@zoho.com> wrote:
>
>> Hi,
>>
>>
>> I have run multi_process/symmetric_mp example in DPDK example directory.
>> For a one process its throughput is line rate but as I increase the
>> number of cores I see decrease in throughput. For example, If the number
>> of queues set to 4 and each queue assigns to a single core, then the
>> throughput will be something about 9.4. if 8 queues, then throughput
>> will be 8.5.
>>
>> I have read the following, but it was not convincing.
>>
>> http://mails.dpdk.org/archives/dev/2015-October/024960.html
>>
>>
>> I am eagerly looking forward to hearing from you, all.
>>
>>
>> Best wishes,
>>
>> Saber
>>
>>
> Not completely surprising. If you have more cores than packet line rate
> then the number of packets returned for each call to rx_burst will be less.
> With large number of cores, most of the time will be spent doing reads of
> PCI registers for no packets!
Indeed, pktgen says it is generating traffic at line rate, but receiving 
less than 10 Gb/s. So, in that case there should be something that 
causes the reduction in throughput :(

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] IXGBE throughput loss with 4+ cores
  2018-08-28 13:04 Saber Rezvani
@ 2018-08-28 16:01 ` Stephen Hemminger
  2018-08-28 17:05   ` Saber Rezvani
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2018-08-28 16:01 UTC (permalink / raw)
  To: Saber Rezvani; +Cc: dev

On Tue, 28 Aug 2018 17:34:27 +0430
Saber Rezvani <irsaber@zoho.com> wrote:

> Hi,
> 
> 
> I have run multi_process/symmetric_mp example in DPDK example directory.
> For a one process its throughput is line rate but as I increase the
> number of cores I see decrease in throughput. For example, If the number
> of queues set to 4 and each queue assigns to a single core, then the
> throughput will be something about 9.4. if 8 queues, then throughput
> will be 8.5.
> 
> I have read the following, but it was not convincing.
> 
> http://mails.dpdk.org/archives/dev/2015-October/024960.html
> 
> 
> I am eagerly looking forward to hearing from you, all.
> 
> 
> Best wishes,
> 
> Saber
> 
> 

Not completely surprising. If you have more cores than the packet line rate
needs, then the number of packets returned for each call to rx_burst will
be smaller. With a large number of cores, most of the time will be spent
doing reads of PCI registers that return no packets!
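
As a rough illustration (not code from symmetric_mp), an RX loop that counts empty polls makes this effect visible:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Sketch: count how often rte_eth_rx_burst() comes back empty; the ratio
 * of empty to total polls grows as more cores share the same line rate.
 * A real test would report polls/empty periodically. */
static void
rx_poll_loop(uint16_t port_id, uint16_t queue_id)
{
	struct rte_mbuf *bufs[BURST_SIZE];
	uint64_t polls = 0, empty = 0;
	uint16_t i, nb;

	for (;;) {
		nb = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);
		polls++;
		if (nb == 0) {
			empty++; /* a PCI read that returned no packets */
			continue;
		}
		for (i = 0; i < nb; i++)
			rte_pktmbuf_free(bufs[i]); /* or hand off for processing */
	}
}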

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [dpdk-dev] IXGBE throughput loss with 4+ cores
@ 2018-08-28 13:04 Saber Rezvani
  2018-08-28 16:01 ` Stephen Hemminger
  0 siblings, 1 reply; 11+ messages in thread
From: Saber Rezvani @ 2018-08-28 13:04 UTC (permalink / raw)
  To: dev

Hi,


I have run the multi_process/symmetric_mp example in the DPDK examples
directory. For one process the throughput is line rate, but as I increase
the number of cores I see a decrease in throughput. For example, if the
number of queues is set to 4 and each queue is assigned to a single core,
then the throughput is about 9.4 Gb/s; with 8 queues, the throughput
drops to 8.5 Gb/s.

I have read the following, but it was not convincing.

http://mails.dpdk.org/archives/dev/2015-October/024960.html


I am eagerly looking forward to hearing from you, all.


Best wishes,

Saber

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-09-06 17:49 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-06  6:10 [dpdk-dev] IXGBE throughput loss with 4+ cores Saber Rezvani
2018-09-06 17:48 ` Wiles, Keith
  -- strict thread matches above, loose matches on Subject: below --
2018-08-28 13:04 Saber Rezvani
2018-08-28 16:01 ` Stephen Hemminger
2018-08-28 17:05   ` Saber Rezvani
2018-08-28 19:09     ` Wiles, Keith
2018-08-28 19:16       ` Saber Rezvani
2018-08-28 21:09         ` Wiles, Keith
2018-08-29 17:19           ` Saber Rezvani
2018-08-29 18:52             ` Wiles, Keith
2018-08-30  4:08               ` Saber Rezvani
