* [dpdk-dev] rte_sched library performance question
@ 2017-02-16 15:13 Zoltan Kiss
  2017-02-16 19:08 ` Dumitrescu, Cristian
  0 siblings, 1 reply; 3+ messages in thread

From: Zoltan Kiss @ 2017-02-16 15:13 UTC (permalink / raw)
To: dev

Hi,

I'm experimenting a little bit with the scheduler library, and I got some
performance numbers which seem to be worse than what I expected. I'm
sending 64-byte packets on a 10G interface to a separate thread, and my
simple test program (based on the qos_sched example) does the following:

while (1) {
	uint16_t ret = rte_ring_sc_dequeue_burst(it.ring,
			(void **)flushbatch, FLUSH_SIZE);
	rte_mbuf **t = flushbatch;

	if (!ret) {
		/* This call is necessary to make sure the TX completed
		 * mbufs are returned to the pool even if there is
		 * nothing to transmit */
		rte_eth_tx_burst(it.portid, lcore, t, 0);
		continue;
	}
	rte_sched_port_enqueue(it.port, flushbatch, ret);
	ret = rte_sched_port_dequeue(it.port, flushbatch, FLUSH_SIZE);
	while (ret) {
		uint16_t n = rte_eth_tx_burst(it.portid, lcore, t, ret);
		/* we cannot drop the packets, so re-send; update the
		 * number of packets still to be sent */
		ret -= n;
		t = &t[n];
	}
}

I run this on a separate thread, with another one doing rx and feeding
the packets to the ring. When I comment out the enqueue and dequeue part
in the code (reducing it to simple l2fwd), I can forward the entire
~14 Mpps traffic, whilst with the scheduler enabled I can only reach
~5.4 Mpps at best. I've tried with a single pipe and with 4k pipes (used
rand() to distribute randomly between pipes; everything else (class etc.)
was set to 0); it didn't make a difference. Is this expected? I'm running
this on a Xeon E5-2630 0 @ 2.30GHz.

I've used the following configuration:

; port configuration

[port]
frame overhead = 24
number of subports per port = 1
number of pipes per subport = 1024
queue sizes = 64 64 64 64

; Subport configuration

[subport 0]
tb rate = 1250000000; Bytes per second
tb size = 1000000000; Bytes
tc 0 rate = 1250000000; Bytes per second
tc 1 rate = 1250000000; Bytes per second
tc 2 rate = 1250000000; Bytes per second
tc 3 rate = 1250000000; Bytes per second
tc period = 10; Milliseconds
tc oversubscription period = 1000; Milliseconds

pipe 0-1023 = 0; These pipes are configured with pipe profile 0

; Pipe configuration

[pipe profile 0]
tb rate = 1250000000; Bytes per second
tb size = 1000000000; Bytes

tc 0 rate = 1250000000; Bytes per second
tc 1 rate = 1250000000; Bytes per second
tc 2 rate = 1250000000; Bytes per second
tc 3 rate = 1250000000; Bytes per second
tc period = 10; Milliseconds

tc 0 oversubscription weight = 1
tc 1 oversubscription weight = 1
tc 2 oversubscription weight = 1
tc 3 oversubscription weight = 1

tc 0 wrr weights = 1 1 1 1
tc 1 wrr weights = 1 1 1 1
tc 2 wrr weights = 1 1 1 1
tc 3 wrr weights = 1 1 1 1

Regards,

Zoltan

^ permalink raw reply	[flat|nested] 3+ messages in thread
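For reference, a minimal sketch (not part of the original thread) of how a
profile like the one above maps onto the rte_sched configuration API of that
DPDK generation (4 traffic classes with 4 WRR queues each); the values mirror
the profile, while identifiers such as setup_sched_port() and the port name
are illustrative:

#include <rte_sched.h>

static struct rte_sched_pipe_params pipe_profiles[] = {{
	.tb_rate = 1250000000, .tb_size = 1000000000,	/* bytes/s, bytes */
	.tc_rate = {1250000000, 1250000000, 1250000000, 1250000000},
	.tc_period = 10,				/* milliseconds */
	.wrr_weights = {1, 1, 1, 1,  1, 1, 1, 1,
			1, 1, 1, 1,  1, 1, 1, 1},	/* 4 TCs x 4 queues */
}};

static struct rte_sched_subport_params subport_params = {
	.tb_rate = 1250000000, .tb_size = 1000000000,
	.tc_rate = {1250000000, 1250000000, 1250000000, 1250000000},
	.tc_period = 10,
};

static struct rte_sched_port_params port_params = {
	.name = "sched_port_0",		/* illustrative name */
	.socket = 0,
	.rate = 1250000000,		/* 10 GbE line rate in bytes/s */
	.mtu = 1522,
	.frame_overhead = 24,
	.n_subports_per_port = 1,
	.n_pipes_per_subport = 1024,
	.qsize = {64, 64, 64, 64},
	.pipe_profiles = pipe_profiles,
	.n_pipe_profiles = 1,
};

static struct rte_sched_port *
setup_sched_port(void)
{
	struct rte_sched_port *port = rte_sched_port_config(&port_params);
	uint32_t pipe;

	if (port == NULL)
		return NULL;
	if (rte_sched_subport_config(port, 0, &subport_params) != 0)
		return NULL;
	/* attach every pipe of subport 0 to pipe profile 0 */
	for (pipe = 0; pipe < port_params.n_pipes_per_subport; pipe++)
		if (rte_sched_pipe_config(port, 0, pipe, 0) != 0)
			return NULL;
	return port;
}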
* Re: [dpdk-dev] rte_sched library performance question
  2017-02-16 15:13 [dpdk-dev] rte_sched library performance question Zoltan Kiss
@ 2017-02-16 19:08 ` Dumitrescu, Cristian
  2017-02-24 21:09   ` Zoltan Kiss
  0 siblings, 1 reply; 3+ messages in thread

From: Dumitrescu, Cristian @ 2017-02-16 19:08 UTC (permalink / raw)
To: Zoltan Kiss, dev

Hi Zoltan,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zoltan Kiss
> Sent: Thursday, February 16, 2017 3:14 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] rte_sched library performance question
>
> Hi,
>
> I'm experimenting a little bit with the scheduler library, and I got some
> performance numbers which seem to be worse than what I expected. I'm
> sending 64-byte packets on a 10G interface to a separate thread, and my
> simple test program (based on the qos_sched example) does the following:
>
> while (1) {
> 	uint16_t ret = rte_ring_sc_dequeue_burst(it.ring,
> 			(void **)flushbatch, FLUSH_SIZE);
> 	rte_mbuf **t = flushbatch;
>
> 	if (!ret) {
> 		/* This call is necessary to make sure the TX completed
> 		 * mbufs are returned to the pool even if there is
> 		 * nothing to transmit */
> 		rte_eth_tx_burst(it.portid, lcore, t, 0);
> 		continue;
> 	}
> 	rte_sched_port_enqueue(it.port, flushbatch, ret);
> 	ret = rte_sched_port_dequeue(it.port, flushbatch, FLUSH_SIZE);

Looks to me like the scheduler dequeue burst size is equal to the enqueue
burst size of FLUSH_SIZE, right? In this case, you are always dequeuing
the exact packets that you just enqueued, and the scheduler dequeue needs
to work really hard to find exactly those FLUSH_SIZE queues that each
have a single packet at this point.

This is why the enqueue burst size should be bigger than the dequeue
burst size. Basically, you add some water into the reservoir up to a
reasonable fill level before you start pouring it into your glass if you
want to fill the glass quickly.

Typical values used:
- for vector PMD: (enqueue = 32, dequeue = 24), (32, 28), (32, 16), etc.
- for scalar PMD: (64, 48), (64, 32), ... We used (256, 248) for VPP.

> 	while (ret) {
> 		uint16_t n = rte_eth_tx_burst(it.portid, lcore, t, ret);
> 		/* we cannot drop the packets, so re-send; update the
> 		 * number of packets still to be sent */
> 		ret -= n;
> 		t = &t[n];
> 	}
> }
>
> I run this on a separate thread, with another one doing rx and feeding
> the packets to the ring. When I comment out the enqueue and dequeue part
> in the code (reducing it to simple l2fwd), I can forward the entire
> ~14 Mpps traffic, whilst with the scheduler enabled I can only reach
> ~5.4 Mpps at best. I've tried with a single pipe and with 4k pipes (used
> rand() to distribute randomly between pipes; everything else (class etc.)
> was set to 0); it didn't make a difference. Is this expected? I'm running
> this on a Xeon E5-2630 0 @ 2.30GHz.
>
> I've used the following configuration:
>
> ; port configuration
>
> [port]
> frame overhead = 24
> number of subports per port = 1
> number of pipes per subport = 1024
> queue sizes = 64 64 64 64
>
> ; Subport configuration
>
> [subport 0]
> tb rate = 1250000000; Bytes per second
> tb size = 1000000000; Bytes
> tc 0 rate = 1250000000; Bytes per second
> tc 1 rate = 1250000000; Bytes per second
> tc 2 rate = 1250000000; Bytes per second
> tc 3 rate = 1250000000; Bytes per second
> tc period = 10; Milliseconds
> tc oversubscription period = 1000; Milliseconds
>
> pipe 0-1023 = 0; These pipes are configured with pipe profile 0
>
> ; Pipe configuration
>
> [pipe profile 0]
> tb rate = 1250000000; Bytes per second
> tb size = 1000000000; Bytes
>
> tc 0 rate = 1250000000; Bytes per second
> tc 1 rate = 1250000000; Bytes per second
> tc 2 rate = 1250000000; Bytes per second
> tc 3 rate = 1250000000; Bytes per second
> tc period = 10; Milliseconds
>
> tc 0 oversubscription weight = 1
> tc 1 oversubscription weight = 1
> tc 2 oversubscription weight = 1
> tc 3 oversubscription weight = 1
>
> tc 0 wrr weights = 1 1 1 1
> tc 1 wrr weights = 1 1 1 1
> tc 2 wrr weights = 1 1 1 1
> tc 3 wrr weights = 1 1 1 1
>
> Regards,
>
> Zoltan

Regards,
Cristian

^ permalink raw reply	[flat|nested] 3+ messages in thread
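To make the burst-size advice concrete, here is a sketch (not from the
thread) of the original loop restructured so that the scheduler enqueue
burst is larger than the dequeue burst; ENQ_SIZE and DEQ_SIZE are
illustrative picks from the scalar-PMD range quoted above, and the
it/lcore context is carried over from Zoltan's snippet:

#define ENQ_SIZE 64	/* burst fed into the scheduler */
#define DEQ_SIZE 32	/* smaller burst drained from it */

struct rte_mbuf *flushbatch[ENQ_SIZE];

while (1) {
	uint16_t nb_rx = rte_ring_sc_dequeue_burst(it.ring,
			(void **)flushbatch, ENQ_SIZE);
	struct rte_mbuf **t = flushbatch;
	uint16_t nb_tx;

	if (nb_rx)
		rte_sched_port_enqueue(it.port, flushbatch, nb_rx);

	/* Always dequeue, even when nothing new arrived: because
	 * ENQ_SIZE > DEQ_SIZE, a backlog builds up inside the scheduler,
	 * so the dequeue can serve the next DEQ_SIZE packets from a full
	 * reservoir instead of hunting for exactly the queues that were
	 * just filled. */
	nb_tx = rte_sched_port_dequeue(it.port, flushbatch, DEQ_SIZE);
	if (!nb_tx) {
		/* flush TX completions so mbufs return to the pool */
		rte_eth_tx_burst(it.portid, lcore, t, 0);
		continue;
	}
	while (nb_tx) {
		uint16_t n = rte_eth_tx_burst(it.portid, lcore, t, nb_tx);
		nb_tx -= n;	/* cannot drop, so retry the remainder */
		t = &t[n];
	}
}

The gap between ENQ_SIZE and DEQ_SIZE is exactly the reservoir fill level
in the analogy above.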
* Re: [dpdk-dev] rte_sched library performance question
  2017-02-16 19:08 ` Dumitrescu, Cristian
@ 2017-02-24 21:09   ` Zoltan Kiss
  0 siblings, 0 replies; 3+ messages in thread

From: Zoltan Kiss @ 2017-02-24 21:09 UTC (permalink / raw)
To: Dumitrescu, Cristian, dev

On 16/02/17 20:08, Dumitrescu, Cristian wrote:
> Hi Zoltan,
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Thursday, February 16, 2017 3:14 PM
>> To: dev@dpdk.org
>> Subject: [dpdk-dev] rte_sched library performance question
>>
>> Hi,
>>
>> I'm experimenting a little bit with the scheduler library, and I got some
>> performance numbers which seem to be worse than what I expected. I'm
>> sending 64-byte packets on a 10G interface to a separate thread, and my
>> simple test program (based on the qos_sched example) does the following:
>>
>> while (1) {
>> 	uint16_t ret = rte_ring_sc_dequeue_burst(it.ring,
>> 			(void **)flushbatch, FLUSH_SIZE);
>> 	rte_mbuf **t = flushbatch;
>>
>> 	if (!ret) {
>> 		/* This call is necessary to make sure the TX completed
>> 		 * mbufs are returned to the pool even if there is
>> 		 * nothing to transmit */
>> 		rte_eth_tx_burst(it.portid, lcore, t, 0);
>> 		continue;
>> 	}
>> 	rte_sched_port_enqueue(it.port, flushbatch, ret);
>> 	ret = rte_sched_port_dequeue(it.port, flushbatch, FLUSH_SIZE);
> Looks to me like the scheduler dequeue burst size is equal to the enqueue
> burst size of FLUSH_SIZE, right? In this case, you are always dequeuing
> the exact packets that you just enqueued, and the scheduler dequeue needs
> to work really hard to find exactly those FLUSH_SIZE queues that each
> have a single packet at this point.
>
> This is why the enqueue burst size should be bigger than the dequeue
> burst size. Basically, you add some water into the reservoir up to a
> reasonable fill level before you start pouring it into your glass if you
> want to fill the glass quickly.
>
> Typical values used:
> - for vector PMD: (enqueue = 32, dequeue = 24), (32, 28), (32, 16), etc.
> - for scalar PMD: (64, 48), (64, 32), ... We used (256, 248) for VPP.

Thanks, it helped my case too. Btw, it would be good to link this
document somewhere in the DPDK docs, as it contains a lot of good
information about the scheduler:

https://networkbuilders.intel.com/docs/Network_Builders_RA_NFV_QoS_Aug2014.pdf

>
>> 	while (ret) {
>> 		uint16_t n = rte_eth_tx_burst(it.portid, lcore, t, ret);
>> 		/* we cannot drop the packets, so re-send; update the
>> 		 * number of packets still to be sent */
>> 		ret -= n;
>> 		t = &t[n];
>> 	}
>> }
>>
>> I run this on a separate thread, with another one doing rx and feeding
>> the packets to the ring. When I comment out the enqueue and dequeue part
>> in the code (reducing it to simple l2fwd), I can forward the entire
>> ~14 Mpps traffic, whilst with the scheduler enabled I can only reach
>> ~5.4 Mpps at best. I've tried with a single pipe and with 4k pipes (used
>> rand() to distribute randomly between pipes; everything else (class etc.)
>> was set to 0); it didn't make a difference. Is this expected? I'm running
>> this on a Xeon E5-2630 0 @ 2.30GHz.
>>
>> I've used the following configuration:
>>
>> ; port configuration
>>
>> [port]
>> frame overhead = 24
>> number of subports per port = 1
>> number of pipes per subport = 1024
>> queue sizes = 64 64 64 64
>>
>> ; Subport configuration
>>
>> [subport 0]
>> tb rate = 1250000000; Bytes per second
>> tb size = 1000000000; Bytes
>> tc 0 rate = 1250000000; Bytes per second
>> tc 1 rate = 1250000000; Bytes per second
>> tc 2 rate = 1250000000; Bytes per second
>> tc 3 rate = 1250000000; Bytes per second
>> tc period = 10; Milliseconds
>> tc oversubscription period = 1000; Milliseconds
>>
>> pipe 0-1023 = 0; These pipes are configured with pipe profile 0
>>
>> ; Pipe configuration
>>
>> [pipe profile 0]
>> tb rate = 1250000000; Bytes per second
>> tb size = 1000000000; Bytes
>>
>> tc 0 rate = 1250000000; Bytes per second
>> tc 1 rate = 1250000000; Bytes per second
>> tc 2 rate = 1250000000; Bytes per second
>> tc 3 rate = 1250000000; Bytes per second
>> tc period = 10; Milliseconds
>>
>> tc 0 oversubscription weight = 1
>> tc 1 oversubscription weight = 1
>> tc 2 oversubscription weight = 1
>> tc 3 oversubscription weight = 1
>>
>> tc 0 wrr weights = 1 1 1 1
>> tc 1 wrr weights = 1 1 1 1
>> tc 2 wrr weights = 1 1 1 1
>> tc 3 wrr weights = 1 1 1 1
>>
>> Regards,
>>
>> Zoltan
> Regards,
> Cristian

^ permalink raw reply	[flat|nested] 3+ messages in thread
end of thread, other threads:[~2017-02-24 21:09 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-16 15:13 [dpdk-dev] rte_sched library performance question Zoltan Kiss
2017-02-16 19:08 ` Dumitrescu, Cristian
2017-02-24 21:09   ` Zoltan Kiss