* [dpdk-users] Query on handling packets @ 2018-11-08 8:24 Harsh Patel 2018-11-08 8:56 ` Wiles, Keith 0 siblings, 1 reply; 43+ messages in thread
From: Harsh Patel @ 2018-11-08 8:24 UTC (permalink / raw) To: users

Hi,
We are working on a project where we are trying to integrate DPDK with another piece of software. We are able to pass packets from the other environment to the DPDK environment one at a time, whereas DPDK sends and receives packets in bursts. We want to know whether DPDK provides any functionality to gather single incoming packets into a burst sent on the NIC and, similarly, to hand a burst of packets read from the NIC to the other environment one at a time.

Thanks and regards
Harsh Patel, Hrishikesh Hiraskar
NITK Surathkal

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-08 8:24 [dpdk-users] Query on handling packets Harsh Patel @ 2018-11-08 8:56 ` Wiles, Keith 2018-11-08 16:58 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2018-11-08 8:56 UTC (permalink / raw) To: Harsh Patel; +Cc: users > On Nov 8, 2018, at 8:24 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hi, > We are working on a project where we are trying to integrate DPDK with > another software. We are able to obtain packets from the other environment > to DPDK environment in one-by-one fashion. On the other hand DPDK allows to > send/receive burst of data packets. We want to know if there is any > functionality in DPDK to achieve this conversion of single incoming packet > to a burst of packets sent on NIC and similarly, conversion of burst read > packets from NIC to send it to other environment sequentially? Search in the docs or lib/librte_ethdev directory on rte_eth_tx_buffer_init, rte_eth_tx_buffer, ... > Thanks and regards > Harsh Patel, Hrishikesh Hiraskar > NITK Surathkal Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-08 8:56 ` Wiles, Keith @ 2018-11-08 16:58 ` Harsh Patel 2018-11-08 17:43 ` Wiles, Keith 0 siblings, 1 reply; 43+ messages in thread
From: Harsh Patel @ 2018-11-08 16:58 UTC (permalink / raw) To: keith.wiles; +Cc: users

Thanks for your insight on the topic. Transmission is working with the functions you mentioned. We tried to search for some similar functions for handling incoming packets but could not find anything. Can you help us on that as well?

Regards,
Harsh and Hrishikesh.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-08 16:58 ` Harsh Patel @ 2018-11-08 17:43 ` Wiles, Keith 2018-11-09 10:09 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread
From: Wiles, Keith @ 2018-11-08 17:43 UTC (permalink / raw) To: Harsh Patel; +Cc: users

On Nov 8, 2018, at 4:58 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote:
> Thanks for your insight on the topic. Transmission is working with the functions you mentioned. We tried to search for some similar functions for handling incoming packets but could not find anything. Can you help us on that as well?

I do not know of a DPDK API set for the RX side. But in the DAPI (DPDK API) PoC I was working on and presented at the DPDK Summit last Sept., I did create an RX-side version. The issue is that it is a bit tangled up in the DAPI PoC.

The basic concept is that a call to receive a single packet does an rx_burst of N packets, keeping them in an mbuf list. The code would spin waiting for mbufs to arrive, or return quickly if a flag was set. When it did find RX mbufs it would return a single mbuf and keep the list of mbufs for later requests; once the list is empty it would do another rx_burst call.

Sorry, this is a really quick note on how it works. If you need more details we can talk more later.

Regards,
Keith

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-08 17:43 ` Wiles, Keith @ 2018-11-09 10:09 ` Harsh Patel 2018-11-09 21:26 ` Wiles, Keith 2018-11-10 6:17 ` Wiles, Keith 0 siblings, 2 replies; 43+ messages in thread
From: Harsh Patel @ 2018-11-09 10:09 UTC (permalink / raw) To: keith.wiles; +Cc: users

We have implemented the logic for Tx/Rx as you suggested. We compared the throughput obtained against another version of the same application that uses Linux raw sockets. Unfortunately, the throughput of our DPDK application is lower by a good margin. Is there any way we can optimize our implementation, or anything we are missing?

Thanks and regards
Harsh & Hrishikesh

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-09 10:09 ` Harsh Patel @ 2018-11-09 21:26 ` Wiles, Keith 2018-11-10 6:17 ` Wiles, Keith 1 sibling, 0 replies; 43+ messages in thread
From: Wiles, Keith @ 2018-11-09 21:26 UTC (permalink / raw) To: Harsh Patel; +Cc: users

Sent from my iPhone

On Nov 9, 2018, at 5:09 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote:
> We have implemented the logic for Tx/Rx as you suggested. We compared the obtained throughput with another version of the same application that uses Linux raw sockets. Unfortunately, the throughput we receive in our DPDK application is less by a good margin. Is there any way we can optimize our implementation, or anything that we are missing?

The PoC code I was developing for DAPI did not have any performance issues; it ran just as fast in my limited testing. I converted the l3fwd code and, as I remember, saw 10G 64-byte wire rate using pktgen to generate the traffic. Not sure why you would see a big performance drop, but I do not know your application or code.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-09 10:09 ` Harsh Patel 2018-11-09 21:26 ` Wiles, Keith @ 2018-11-10 6:17 ` Wiles, Keith 2018-11-11 19:45 ` Harsh Patel 1 sibling, 1 reply; 43+ messages in thread
From: Wiles, Keith @ 2018-11-10 6:17 UTC (permalink / raw) To: Harsh Patel; +Cc: users

Please make sure to send your emails in plain-text format. The Mac mail program loves to use rich-text format if the original email uses it, even though I have told it to only send plain text :-(

> On Nov 9, 2018, at 4:09 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote:
>
> We have implemented the logic for Tx/Rx as you suggested. We compared the obtained throughput with another version of the same application that uses Linux raw sockets. Unfortunately, the throughput we receive in our DPDK application is less by a good margin. Is there any way we can optimize our implementation, or anything that we are missing?

The PoC code I was developing for DAPI did not have any performance issues; it ran just as fast in my limited testing. I converted the l3fwd code and, as I remember, saw 10G 64-byte wire rate using pktgen to generate the traffic. Not sure why you would see a big performance drop, but I do not know your application or code.

Regards,
Keith

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-10 6:17 ` Wiles, Keith @ 2018-11-11 19:45 ` Harsh Patel 2018-11-13 2:25 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread
From: Harsh Patel @ 2018-11-11 19:45 UTC (permalink / raw) To: keith.wiles; +Cc: users

Thanks a lot for all the support. We are reviewing our work now and will contact you once we have checked it completely from our side.

Regards,
Harsh and Hrishikesh

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-11 19:45 ` Harsh Patel @ 2018-11-13 2:25 ` Harsh Patel 2018-11-13 13:47 ` Wiles, Keith 0 siblings, 1 reply; 43+ messages in thread
From: Harsh Patel @ 2018-11-13 2:25 UTC (permalink / raw) To: keith.wiles; +Cc: users

Hello,
It would be really helpful if you could provide us a link (for both Tx and Rx) to the project you mentioned earlier, where you worked on a similar problem, if possible.

Thanks and Regards,
Harsh & Hrishikesh.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-13 2:25 ` Harsh Patel @ 2018-11-13 13:47 ` Wiles, Keith 2018-11-14 13:54 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread
From: Wiles, Keith @ 2018-11-13 13:47 UTC (permalink / raw) To: Harsh Patel; +Cc: users

> On Nov 12, 2018, at 8:25 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote:
>
> Hello,
> It would be really helpful if you could provide us a link (for both Tx and Rx) to the project you mentioned earlier, where you worked on a similar problem, if possible.

At this time I cannot provide a link. I will try and see what I can do, but do not hold your breath; it could be a while, as we have to go through a lot of legal stuff. Try the VTune tool from Intel for x86 systems, if you can get a copy for your platform, as it can tell you a lot about the code and where the performance issues are located. If you are not running Intel x86 then my code may not work for you; I do not remember if you told me which platform you are on.

Regards,
Keith

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-13 13:47 ` Wiles, Keith @ 2018-11-14 13:54 ` Harsh Patel 2018-11-14 15:02 ` Wiles, Keith ` (2 more replies) 0 siblings, 3 replies; 43+ messages in thread
From: Harsh Patel @ 2018-11-14 13:54 UTC (permalink / raw) To: keith.wiles; +Cc: users

Hello,
This is a link to the complete source code of our project: https://github.com/ns-3-dpdk-integration/ns-3-dpdk
For a description of the project, see: https://ns-3-dpdk-integration.github.io/
Once you go through it, you will have a basic understanding of the project. Installation instructions are provided on the github.io page.
In the code mentioned above, the master branch contains the implementation of the logic using rte_rings, which we mentioned at the very beginning of this discussion. A branch named "newrxtx" contains the implementation following the logic you provided. We would like you to take a look at the code in the newrxtx branch (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/tree/newrxtx).
In this branch, go to the ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/ directory. Here we have implemented the DpdkNetDevice model, which provides the interaction between ns-3 and DPDK. Please take a look at our Read function (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L626) and Write function (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L576). These contain the logic you suggested. Can you go through them and suggest changes, or point out mistakes in our code?
If you need any help or have any doubt, ping us.
Thanks and Regards, Harsh & Hrishikesh On Tue, 13 Nov 2018 at 19:17, Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Nov 12, 2018, at 8:25 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > Hello, > > It would be really helpful if you can provide us a link (for both Tx and > Rx) to the project you mentioned earlier where you worked on a similar > problem, if possible. > > > > At this time I can not provide a link. I will try and see what I can do, > but do not hold your breath it could be awhile as we have to go thru a lot > of legal stuff. If you can try vtune tool from Intel for x86 systems if you > can get a copy for your platform as it can tell you a lot about the code > and where the performance issues are located. If you are not running Intel > x86 then my code may not work for you, I do not remember if you told me > which platform. > > > > Thanks and Regards, > > Harsh & Hrishikesh. > > > > On Mon, 12 Nov 2018 at 01:15, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > Thanks a lot for all the support. We are looking into our work as of now > and will contact you once we are done checking it completely from our side. > Thanks for the help. > > > > Regards, > > Harsh and Hrishikesh > > > > On Sat, 10 Nov 2018 at 11:47, Wiles, Keith <keith.wiles@intel.com> > wrote: > > Please make sure to send your emails in plain text format. The Mac mail > program loves to use rich-text format is the original email use it and I > have told it not only send plain text :-( > > > > > On Nov 9, 2018, at 4:09 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > > > We have implemented the logic for Tx/Rx as you suggested. We compared > the obtained throughput with another version of same application that uses > Linux raw sockets. > > > Unfortunately, the throughput we receive in our DPDK application is > less by a good margin. Is this any way we can optimize our implementation > or anything that we are missing? 
> > > > > > > The PoC code I was developing for DAPI I did not have any performance of > issues it run just as fast with my limited testing. I converted the l3fwd > code and I saw 10G 64byte wire rate as I remember using pktgen to generate > the traffic. > > > > Not sure why you would see a big performance drop, but I do not know > your application or code. > > > > > Thanks and regards > > > Harsh & Hrishikesh > > > > > > On Thu, 8 Nov 2018 at 23:14, Wiles, Keith <keith.wiles@intel.com> > wrote: > > > > > > > > >> On Nov 8, 2018, at 4:58 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > >> > > >> Thanks > > >> for your insight on the topic. Transmission is working with the > functions you mentioned. We tried to search for some similar functions for > handling incoming packets but could not find anything. Can you help us on > that as well? > > >> > > > > > > I do not know if a DPDK API set for RX side. But in the DAPI (DPDK > API) PoC I was working on and presented at the DPDK Summit last Sept. In > the PoC I did create a RX side version. The issues it has a bit of tangled > up in the DAPI PoC. > > > > > > The basic concept is a call to RX a single packet does a rx_burst of N > number of packets keeping then in a mbuf list. The code would spin waiting > for mbufs to arrive or return quickly if a flag was set. When it did find > RX mbufs it would just return the single mbuf and keep the list of mbufs > for later requests until the list is empty then do another rx_burst call. > > > > > > Sorry this is a really quick note on how it works. If you need more > details we can talk more later. > > >> > > >> Regards, > > >> Harsh > > >> and Hrishikesh. 
> > >> > > >> > > >> On Thu, 8 Nov 2018 at 14:26, Wiles, Keith <keith.wiles@intel.com> > wrote: > > >> > > >> > > >> > On Nov 8, 2018, at 8:24 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > >> > > > >> > Hi, > > >> > We are working on a project where we are trying to integrate DPDK > with > > >> > another software. We are able to obtain packets from the other > environment > > >> > to DPDK environment in one-by-one fashion. On the other hand DPDK > allows to > > >> > send/receive burst of data packets. We want to know if there is any > > >> > functionality in DPDK to achieve this conversion of single incoming > packet > > >> > to a burst of packets sent on NIC and similarly, conversion of > burst read > > >> > packets from NIC to send it to other environment sequentially? > > >> > > >> > > >> Search in the docs or lib/librte_ethdev directory on > rte_eth_tx_buffer_init, rte_eth_tx_buffer, ... > > >> > > >> > > >> > > >> > Thanks and regards > > >> > Harsh Patel, Hrishikesh Hiraskar > > >> > NITK Surathkal > > >> > > >> Regards, > > >> Keith > > >> > > > > > > Regards, > > > Keith > > > > > > > Regards, > > Keith > > > > Regards, > Keith > > ^ permalink raw reply [flat|nested] 43+ messages in thread
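The RX-side approach Keith describes in the message above (one "read a single packet" call backed by an rx_burst, with the remaining mbufs cached for later requests) can be sketched as follows. This is an illustrative reconstruction, not Keith's DAPI code; rx_burst() below is a demo stub standing in for rte_eth_rx_burst(), and in real code the cached items would be struct rte_mbuf pointers.

```c
#include <assert.h>
#include <stddef.h>

#define BURST_SIZE 32

/* Cache of packets returned by the most recent burst read. */
static void *rx_cache[BURST_SIZE];
static unsigned rx_count = 0;   /* packets delivered by the last burst */
static unsigned rx_next  = 0;   /* index of the next cached packet     */

/* Demo stub standing in for rte_eth_rx_burst(): delivers three fake
 * "packets" on the first call, then nothing. */
static int fake_pkts[3];
static int drained = 0;
static unsigned rx_burst(void **pkts, unsigned n)
{
    unsigned i, avail = drained ? 0 : 3;
    for (i = 0; i < avail && i < n; i++)
        pkts[i] = &fake_pkts[i];
    drained = 1;
    return i;
}

/* Read one packet: refill the cache with a burst only when it is
 * empty, otherwise hand out the next cached packet. */
static void *read_one_packet(void)
{
    if (rx_next == rx_count) {
        rx_count = rx_burst(rx_cache, BURST_SIZE);
        rx_next = 0;
        if (rx_count == 0)
            return NULL;        /* nothing has arrived yet */
    }
    return rx_cache[rx_next++];
}
```

A real version would either spin in read_one_packet() until mbufs arrive or return immediately when a non-blocking flag is set, exactly as described in the message.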
* Re: [dpdk-users] Query on handling packets 2018-11-14 13:54 ` Harsh Patel @ 2018-11-14 15:02 ` Wiles, Keith 2018-11-14 15:04 ` Wiles, Keith 2018-11-14 15:15 ` Wiles, Keith 2 siblings, 0 replies; 43+ messages in thread From: Wiles, Keith @ 2018-11-14 15:02 UTC (permalink / raw) To: Harsh Patel; +Cc: users On Nov 14, 2018, at 7:54 AM, Harsh Patel <thadodaharsh10@gmail.com<mailto:thadodaharsh10@gmail.com>> wrote: Hello, This is a link to the complete source code of our project :- https://github.com/ns-3-dpdk-integration/ns-3-dpdk For the description of the project, look through this :- https://ns-3-dpdk-integration.github.io/ Once you go through it, you will have a basic understanding of the project. Installation instructions link are provided in the github.io<http://github.io/> page. In the code we mentioned above, the master branch contains the implementation of the logic using rte_rings which we mentioned at the very beginning of the discussion. There is a branch named "newrxtx" which contains the implementation according to the logic you provided. We would like you to take a look at the code in newrxtx branch. (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/tree/newrxtx) In the code in this branch, go to ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/ directory. Here we have implemented the DpdkNetDevice model. This model contains the code which implements the whole model providing interaction between ns-3 and DPDK. We would like you take a look at our Read function (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L626) and Write function (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L576). These contains the logic you suggested. I looked at the read and write routines briefly. 
The one thing that jumped out at me is that you copy the packet from an internal data buffer into the mbuf, or from the mbuf into the data buffer. You should try your hardest to remove these memcpy calls from the data path, as they will kill your performance. If you have to use memcpy, I would look at the rte_memcpy() routine, as it is highly optimized for DPDK. Even using DPDK's rte_memcpy() you will still see a big performance hit. I did not look at where the buffer came from, but maybe you could allocate a pktmbuf pool (as you did) and, when your main code asks for a buffer, grab an mbuf, point at the start of its data area, and return that pointer instead. Then, when you get to the write or read routine, you find the start of the mbuf header based on the buffer address, or from some meta data attached to the buffer. Then you can call the rte_eth_tx_buffer() routine with that mbuf pointer. On the TX side the mbuf is freed by the driver (though it could sit on the TX done queue for a while), so just make sure you have enough buffers. On the read side you likewise need to find the place the buffer is allocated, allocate an mbuf, and save the mbuf pointer in the meta data of the buffer (if you have meta data per buffer); then you can free the mbuf at some point after you have processed the data buffer. I hope that is clear; I have meetings I must attend. Can you go through this and suggest us some changes or find some mistake in our code? If you need any help or have any doubt, ping us. Thanks and Regards, Harsh & Hrishikesh On Tue, 13 Nov 2018 at 19:17, Wiles, Keith <keith.wiles@intel.com<mailto:keith.wiles@intel.com>> wrote: > On Nov 12, 2018, at 8:25 PM, Harsh Patel <thadodaharsh10@gmail.com<mailto:thadodaharsh10@gmail.com>> wrote: > > Hello, > It would be really helpful if you can provide us a link (for both Tx and Rx) to the project you mentioned earlier where you worked on a similar problem, if possible. > At this time I can not provide a link. 
I will try and see what I can do, but do not hold your breath it could be awhile as we have to go thru a lot of legal stuff. If you can try vtune tool from Intel for x86 systems if you can get a copy for your platform as it can tell you a lot about the code and where the performance issues are located. If you are not running Intel x86 then my code may not work for you, I do not remember if you told me which platform. > Thanks and Regards, > Harsh & Hrishikesh. > > On Mon, 12 Nov 2018 at 01:15, Harsh Patel <thadodaharsh10@gmail.com<mailto:thadodaharsh10@gmail.com>> wrote: > Thanks a lot for all the support. We are looking into our work as of now and will contact you once we are done checking it completely from our side. Thanks for the help. > > Regards, > Harsh and Hrishikesh > > On Sat, 10 Nov 2018 at 11:47, Wiles, Keith <keith.wiles@intel.com<mailto:keith.wiles@intel.com>> wrote: > Please make sure to send your emails in plain text format. The Mac mail program loves to use rich-text format is the original email use it and I have told it not only send plain text :-( > > > On Nov 9, 2018, at 4:09 AM, Harsh Patel <thadodaharsh10@gmail.com<mailto:thadodaharsh10@gmail.com>> wrote: > > > > We have implemented the logic for Tx/Rx as you suggested. We compared the obtained throughput with another version of same application that uses Linux raw sockets. > > Unfortunately, the throughput we receive in our DPDK application is less by a good margin. Is this any way we can optimize our implementation or anything that we are missing? > > > > The PoC code I was developing for DAPI I did not have any performance of issues it run just as fast with my limited testing. I converted the l3fwd code and I saw 10G 64byte wire rate as I remember using pktgen to generate the traffic. > > Not sure why you would see a big performance drop, but I do not know your application or code. 
> > > Thanks and regards > > Harsh & Hrishikesh > > > > On Thu, 8 Nov 2018 at 23:14, Wiles, Keith <keith.wiles@intel.com<mailto:keith.wiles@intel.com>> wrote: > > > > > >> On Nov 8, 2018, at 4:58 PM, Harsh Patel <thadodaharsh10@gmail.com<mailto:thadodaharsh10@gmail.com>> wrote: > >> > >> Thanks > >> for your insight on the topic. Transmission is working with the functions you mentioned. We tried to search for some similar functions for handling incoming packets but could not find anything. Can you help us on that as well? > >> > > > > I do not know if a DPDK API set for RX side. But in the DAPI (DPDK API) PoC I was working on and presented at the DPDK Summit last Sept. In the PoC I did create a RX side version. The issues it has a bit of tangled up in the DAPI PoC. > > > > The basic concept is a call to RX a single packet does a rx_burst of N number of packets keeping then in a mbuf list. The code would spin waiting for mbufs to arrive or return quickly if a flag was set. When it did find RX mbufs it would just return the single mbuf and keep the list of mbufs for later requests until the list is empty then do another rx_burst call. > > > > Sorry this is a really quick note on how it works. If you need more details we can talk more later. > >> > >> Regards, > >> Harsh > >> and Hrishikesh. > >> > >> > >> On Thu, 8 Nov 2018 at 14:26, Wiles, Keith <keith.wiles@intel.com<mailto:keith.wiles@intel.com>> wrote: > >> > >> > >> > On Nov 8, 2018, at 8:24 AM, Harsh Patel <thadodaharsh10@gmail.com<mailto:thadodaharsh10@gmail.com>> wrote: > >> > > >> > Hi, > >> > We are working on a project where we are trying to integrate DPDK with > >> > another software. We are able to obtain packets from the other environment > >> > to DPDK environment in one-by-one fashion. On the other hand DPDK allows to > >> > send/receive burst of data packets. 
We want to know if there is any > >> > functionality in DPDK to achieve this conversion of single incoming packet > >> > to a burst of packets sent on NIC and similarly, conversion of burst read > >> > packets from NIC to send it to other environment sequentially? > >> > >> > >> Search in the docs or lib/librte_ethdev directory on rte_eth_tx_buffer_init, rte_eth_tx_buffer, ... > >> > >> > >> > >> > Thanks and regards > >> > Harsh Patel, Hrishikesh Hiraskar > >> > NITK Surathkal > >> > >> Regards, > >> Keith > >> > > > > Regards, > > Keith > > > > Regards, > Keith > Regards, Keith Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
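Keith's zero-copy suggestion above (hand the application a pointer into the mbuf's data area, then recover the mbuf header from that pointer via per-buffer meta data when writing or freeing) can be sketched like this. Everything here is a hypothetical stand-in: struct mock_mbuf stands in for struct rte_mbuf, and buf_from_mbuf()/mbuf_from_buf() stand in for rte_pktmbuf_mtod() plus the meta-data lookup. Real code would also account for the mbuf's private area and configured headroom.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rte_mbuf: a header word followed
 * by the data room. In real DPDK code the mbuf comes from
 * rte_pktmbuf_alloc() and the payload pointer from rte_pktmbuf_mtod(). */
struct mock_mbuf {
    void *userdata;   /* stands in for the mbuf header fields */
    char  buf[2048];  /* data room (headroom + payload)       */
};

/* Hand out a payload pointer, stashing a back-pointer to the mbuf in
 * the headroom just before the payload ("meta data per buffer").
 * Assumes headroom >= sizeof(void *) and pointer alignment; the
 * default DPDK headroom of 128 bytes satisfies both. */
static char *buf_from_mbuf(struct mock_mbuf *m, size_t headroom)
{
    char *data = m->buf + headroom;
    ((struct mock_mbuf **)(void *)data)[-1] = m;
    return data;
}

/* Recover the mbuf header from the bare payload pointer, so the write
 * path can pass the mbuf to rte_eth_tx_buffer() with no copy at all. */
static struct mock_mbuf *mbuf_from_buf(char *data)
{
    return ((struct mock_mbuf **)(void *)data)[-1];
}
```

With this in place, the Write routine never needs memcpy: the payload the caller filled in already lives inside the mbuf that gets transmitted.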
* Re: [dpdk-users] Query on handling packets 2018-11-14 13:54 ` Harsh Patel 2018-11-14 15:02 ` Wiles, Keith @ 2018-11-14 15:04 ` Wiles, Keith 2018-11-14 15:15 ` Wiles, Keith 2 siblings, 0 replies; 43+ messages in thread From: Wiles, Keith @ 2018-11-14 15:04 UTC (permalink / raw) To: Harsh Patel; +Cc: users Sorry, did not send plain text email again. > On Nov 14, 2018, at 7:54 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hello, > This is a link to the complete source code of our project :- https://github.com/ns-3-dpdk-integration/ns-3-dpdk > For the description of the project, look through this :- https://ns-3-dpdk-integration.github.io/ > Once you go through it, you will have a basic understanding of the project. > Installation instructions link are provided in the github.io page. > > In the code we mentioned above, the master branch contains the implementation of the logic using rte_rings which we mentioned at the very beginning of the discussion. There is a branch named "newrxtx" which contains the implementation according to the logic you provided. > > We would like you to take a look at the code in newrxtx branch. (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/tree/newrxtx) > In the code in this branch, go to ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/ directory. Here we have implemented the DpdkNetDevice model. This model contains the code which implements the whole model providing interaction between ns-3 and DPDK. We would like you take a look at our Read function (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L626) and Write function (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L576). These contains the logic you suggested. > I looked at the read and write routines briefly. 
The one thing that jumped out at me is that you copy the packet from an internal data buffer into the mbuf, or from the mbuf into the data buffer. You should try your hardest to remove these memcpy calls from the data path, as they will kill your performance. If you have to use memcpy, I would look at the rte_memcpy() routine, as it is highly optimized for DPDK. Even using DPDK's rte_memcpy() you will still see a big performance hit. I did not look at where the buffer came from, but maybe you could allocate a pktmbuf pool (as you did) and, when your main code asks for a buffer, grab an mbuf, point at the start of its data area, and return that pointer instead. Then, when you get to the write or read routine, you find the start of the mbuf header based on the buffer address, or from some meta data attached to the buffer. Then you can call the rte_eth_tx_buffer() routine with that mbuf pointer. On the TX side the mbuf is freed by the driver (though it could sit on the TX done queue for a while), so just make sure you have enough buffers. On the read side you likewise need to find the place the buffer is allocated, allocate an mbuf, and save the mbuf pointer in the meta data of the buffer (if you have meta data per buffer); then you can free the mbuf at some point after you have processed the data buffer. I hope that is clear; I have meetings I must attend. > Can you go through this and suggest us some changes or find some mistake in our code? If you need any help or have any doubt, ping us. > > Thanks and Regards, > Harsh & Hrishikesh > > On Tue, 13 Nov 2018 at 19:17, Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Nov 12, 2018, at 8:25 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > Hello, > > It would be really helpful if you can provide us a link (for both Tx and Rx) to the project you mentioned earlier where you worked on a similar problem, if possible. > > > > At this time I can not provide a link. 
I will try and see what I can do, but do not hold your breath it could be awhile as we have to go thru a lot of legal stuff. If you can try vtune tool from Intel for x86 systems if you can get a copy for your platform as it can tell you a lot about the code and where the performance issues are located. If you are not running Intel x86 then my code may not work for you, I do not remember if you told me which platform. > > > > Thanks and Regards, > > Harsh & Hrishikesh. > > > > On Mon, 12 Nov 2018 at 01:15, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Thanks a lot for all the support. We are looking into our work as of now and will contact you once we are done checking it completely from our side. Thanks for the help. > > > > Regards, > > Harsh and Hrishikesh > > > > On Sat, 10 Nov 2018 at 11:47, Wiles, Keith <keith.wiles@intel.com> wrote: > > Please make sure to send your emails in plain text format. The Mac mail program loves to use rich-text format is the original email use it and I have told it not only send plain text :-( > > > > > On Nov 9, 2018, at 4:09 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > > > We have implemented the logic for Tx/Rx as you suggested. We compared the obtained throughput with another version of same application that uses Linux raw sockets. > > > Unfortunately, the throughput we receive in our DPDK application is less by a good margin. Is this any way we can optimize our implementation or anything that we are missing? > > > > > > > The PoC code I was developing for DAPI I did not have any performance of issues it run just as fast with my limited testing. I converted the l3fwd code and I saw 10G 64byte wire rate as I remember using pktgen to generate the traffic. > > > > Not sure why you would see a big performance drop, but I do not know your application or code. 
> > > > > Thanks and regards > > > Harsh & Hrishikesh > > > > > > On Thu, 8 Nov 2018 at 23:14, Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > > > >> On Nov 8, 2018, at 4:58 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > >> > > >> Thanks > > >> for your insight on the topic. Transmission is working with the functions you mentioned. We tried to search for some similar functions for handling incoming packets but could not find anything. Can you help us on that as well? > > >> > > > > > > I do not know if a DPDK API set for RX side. But in the DAPI (DPDK API) PoC I was working on and presented at the DPDK Summit last Sept. In the PoC I did create a RX side version. The issues it has a bit of tangled up in the DAPI PoC. > > > > > > The basic concept is a call to RX a single packet does a rx_burst of N number of packets keeping then in a mbuf list. The code would spin waiting for mbufs to arrive or return quickly if a flag was set. When it did find RX mbufs it would just return the single mbuf and keep the list of mbufs for later requests until the list is empty then do another rx_burst call. > > > > > > Sorry this is a really quick note on how it works. If you need more details we can talk more later. > > >> > > >> Regards, > > >> Harsh > > >> and Hrishikesh. > > >> > > >> > > >> On Thu, 8 Nov 2018 at 14:26, Wiles, Keith <keith.wiles@intel.com> wrote: > > >> > > >> > > >> > On Nov 8, 2018, at 8:24 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > >> > > > >> > Hi, > > >> > We are working on a project where we are trying to integrate DPDK with > > >> > another software. We are able to obtain packets from the other environment > > >> > to DPDK environment in one-by-one fashion. On the other hand DPDK allows to > > >> > send/receive burst of data packets. 
We want to know if there is any > > >> > functionality in DPDK to achieve this conversion of single incoming packet > > >> > to a burst of packets sent on NIC and similarly, conversion of burst read > > >> > packets from NIC to send it to other environment sequentially? > > >> > > >> > > >> Search in the docs or lib/librte_ethdev directory on rte_eth_tx_buffer_init, rte_eth_tx_buffer, ... > > >> > > >> > > >> > > >> > Thanks and regards > > >> > Harsh Patel, Hrishikesh Hiraskar > > >> > NITK Surathkal > > >> > > >> Regards, > > >> Keith > > >> > > > > > > Regards, > > > Keith > > > > > > > Regards, > > Keith > > > > Regards, > Keith > Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
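The rte_eth_tx_buffer()/rte_eth_tx_buffer_flush() pair that Keith points to throughout the thread implements exactly the single-packet-to-burst conversion asked about: each single-packet send is queued locally, and the queue is handed to the NIC as one burst when it fills (or on an explicit flush). A minimal stand-alone sketch of that pattern, with tx_burst() as a demo stub for rte_eth_tx_burst(); the real API keeps its state in a struct rte_eth_dev_tx_buffer initialized via rte_eth_tx_buffer_init().

```c
#include <assert.h>
#include <stddef.h>

#define TX_BUF_SIZE 4   /* small for illustration; 32 is more typical */

static void *tx_buf[TX_BUF_SIZE];
static unsigned tx_len = 0;

/* Demo stub for rte_eth_tx_burst(): count what would hit the wire. */
static unsigned wire_pkts = 0, wire_bursts = 0;
static unsigned tx_burst(void **pkts, unsigned n)
{
    (void)pkts;
    wire_pkts += n;
    wire_bursts++;
    return n;   /* pretend the NIC accepted every packet */
}

/* Like rte_eth_tx_buffer_flush(): send whatever is queued, if anything. */
static void tx_flush(void)
{
    if (tx_len > 0) {
        tx_burst(tx_buf, tx_len);
        tx_len = 0;
    }
}

/* Like rte_eth_tx_buffer(): queue one packet, burst when full. */
static void tx_one(void *pkt)
{
    tx_buf[tx_len++] = pkt;
    if (tx_len == TX_BUF_SIZE)
        tx_flush();
}
```

A periodic tx_flush() (the real code calls rte_eth_tx_buffer_flush() on a timer or per loop iteration) keeps the last few packets from sitting in the buffer indefinitely under light load.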
* Re: [dpdk-users] Query on handling packets 2018-11-14 13:54 ` Harsh Patel 2018-11-14 15:02 ` Wiles, Keith 2018-11-14 15:04 ` Wiles, Keith @ 2018-11-14 15:15 ` Wiles, Keith 2018-11-17 10:22 ` Harsh Patel 2 siblings, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2018-11-14 15:15 UTC (permalink / raw) To: Harsh Patel; +Cc: users > On Nov 14, 2018, at 7:54 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hello, > This is a link to the complete source code of our project :- https://github.com/ns-3-dpdk-integration/ns-3-dpdk > For the description of the project, look through this :- https://ns-3-dpdk-integration.github.io/ > Once you go through it, you will have a basic understanding of the project. > Installation instructions link are provided in the github.io page. > > In the code we mentioned above, the master branch contains the implementation of the logic using rte_rings which we mentioned at the very beginning of the discussion. There is a branch named "newrxtx" which contains the implementation according to the logic you provided. > > We would like you to take a look at the code in newrxtx branch. (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/tree/newrxtx) > In the code in this branch, go to ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/ directory. Here we have implemented the DpdkNetDevice model. This model contains the code which implements the whole model providing interaction between ns-3 and DPDK. We would like you take a look at our Read function (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L626) and Write function (https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L576). These contains the logic you suggested. A couple of points for performance with DPDK. 
- Never use memcpy in the data path unless it is absolutely required, and always try to avoid copying all of the data. In some cases you may want to use memcpy or rte_memcpy() to replace only a small amount of data, or to grab a copy of some small amount of data.
- Never use malloc in the data path; that is, never call malloc on every packet. Use a list of buffers allocated up front if you need buffers of some type.
- DPDK mempools are highly tuned; use them for fixed-size buffers if you can.

I believe the DPDK docs include a performance white paper or some information about optimizing packet processing in DPDK. If you have not read it, you may want to do so. > > Can you go through this and suggest us some changes or find some mistake in our code? If you need any help or have any doubt, ping us. > > Thanks and Regards, > Harsh & Hrishikesh > > On Tue, 13 Nov 2018 at 19:17, Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Nov 12, 2018, at 8:25 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > Hello, > > It would be really helpful if you can provide us a link (for both Tx and Rx) to the project you mentioned earlier where you worked on a similar problem, if possible. > > > > At this time I can not provide a link. I will try and see what I can do, but do not hold your breath it could be awhile as we have to go thru a lot of legal stuff. If you can try vtune tool from Intel for x86 systems if you can get a copy for your platform as it can tell you a lot about the code and where the performance issues are located. If you are not running Intel x86 then my code may not work for you, I do not remember if you told me which platform. > > > > Thanks and Regards, > > Harsh & Hrishikesh. > > > > On Mon, 12 Nov 2018 at 01:15, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Thanks a lot for all the support. We are looking into our work as of now and will contact you once we are done checking it completely from our side. Thanks for the help. 
> > > > Regards, > > Harsh and Hrishikesh > > > > On Sat, 10 Nov 2018 at 11:47, Wiles, Keith <keith.wiles@intel.com> wrote: > > Please make sure to send your emails in plain text format. The Mac mail program loves to use rich-text format is the original email use it and I have told it not only send plain text :-( > > > > > On Nov 9, 2018, at 4:09 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > > > We have implemented the logic for Tx/Rx as you suggested. We compared the obtained throughput with another version of same application that uses Linux raw sockets. > > > Unfortunately, the throughput we receive in our DPDK application is less by a good margin. Is this any way we can optimize our implementation or anything that we are missing? > > > > > > > The PoC code I was developing for DAPI I did not have any performance of issues it run just as fast with my limited testing. I converted the l3fwd code and I saw 10G 64byte wire rate as I remember using pktgen to generate the traffic. > > > > Not sure why you would see a big performance drop, but I do not know your application or code. > > > > > Thanks and regards > > > Harsh & Hrishikesh > > > > > > On Thu, 8 Nov 2018 at 23:14, Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > > > >> On Nov 8, 2018, at 4:58 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > >> > > >> Thanks > > >> for your insight on the topic. Transmission is working with the functions you mentioned. We tried to search for some similar functions for handling incoming packets but could not find anything. Can you help us on that as well? > > >> > > > > > > I do not know if a DPDK API set for RX side. But in the DAPI (DPDK API) PoC I was working on and presented at the DPDK Summit last Sept. In the PoC I did create a RX side version. The issues it has a bit of tangled up in the DAPI PoC. > > > > > > The basic concept is a call to RX a single packet does a rx_burst of N number of packets keeping then in a mbuf list. 
The code would spin waiting for mbufs to arrive or return quickly if a flag was set. When it did find RX mbufs it would just return the single mbuf and keep the list of mbufs for later requests until the list is empty then do another rx_burst call. > > > > > > Sorry this is a really quick note on how it works. If you need more details we can talk more later. > > >> > > >> Regards, > > >> Harsh > > >> and Hrishikesh. > > >> > > >> > > >> On Thu, 8 Nov 2018 at 14:26, Wiles, Keith <keith.wiles@intel.com> wrote: > > >> > > >> > > >> > On Nov 8, 2018, at 8:24 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > >> > > > >> > Hi, > > >> > We are working on a project where we are trying to integrate DPDK with > > >> > another software. We are able to obtain packets from the other environment > > >> > to DPDK environment in one-by-one fashion. On the other hand DPDK allows to > > >> > send/receive burst of data packets. We want to know if there is any > > >> > functionality in DPDK to achieve this conversion of single incoming packet > > >> > to a burst of packets sent on NIC and similarly, conversion of burst read > > >> > packets from NIC to send it to other environment sequentially? > > >> > > >> > > >> Search in the docs or lib/librte_ethdev directory on rte_eth_tx_buffer_init, rte_eth_tx_buffer, ... > > >> > > >> > > >> > > >> > Thanks and regards > > >> > Harsh Patel, Hrishikesh Hiraskar > > >> > NITK Surathkal > > >> > > >> Regards, > > >> Keith > > >> > > > > > > Regards, > > > Keith > > > > > > > Regards, > > Keith > > > > Regards, > Keith > Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
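Keith's "never malloc in the data path" point can be illustrated with a simple free list filled once at startup: all allocation happens before packets flow, and per-packet get/put is O(1). A DPDK mempool (e.g. one created with rte_pktmbuf_pool_create()) is the production version of this idea, with per-lcore caches and cache-line tuning on top. The pool and buffer sizes below are arbitrary illustration values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_BUFS 64
#define BUF_SIZE  2048

static void *free_list[POOL_BUFS];
static int free_top = 0;

/* Fill the free list once at startup: these are the only mallocs. */
static int pool_init(void)
{
    for (int i = 0; i < POOL_BUFS; i++) {
        free_list[i] = malloc(BUF_SIZE);
        if (free_list[i] == NULL)
            return -1;
    }
    free_top = POOL_BUFS;
    return 0;
}

/* Data-path allocation: pop a pre-allocated buffer, no malloc. */
static void *buf_get(void)
{
    return free_top > 0 ? free_list[--free_top] : NULL;
}

/* Data-path free: push the buffer back for reuse. */
static void buf_put(void *b)
{
    free_list[free_top++] = b;
}
```

Returning NULL when the pool is empty (rather than falling back to malloc) also gives natural backpressure, which matters under load.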
* Re: [dpdk-users] Query on handling packets 2018-11-14 15:15 ` Wiles, Keith @ 2018-11-17 10:22 ` Harsh Patel 2018-11-17 22:05 ` Kyle Larose 0 siblings, 1 reply; 43+ messages in thread From: Harsh Patel @ 2018-11-17 10:22 UTC (permalink / raw) To: keith.wiles; +Cc: users Hello, Thanks a lot for going through the code and providing us with so much information. We removed all the memcpy/malloc calls from the data path as you suggested; here are the links to the Read/Write parts of the code. READ - https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L638 WRITE - https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L600 After removing these, we see a performance gain, but it is still not as good as the raw socket. We then ran some tests, and the resulting graphs are attached to this mail. The graphs show TCP & UDP flows of various bandwidths in both the raw-socket and ns-3-DPDK scenarios. For some reason, we can see a bottleneck in both TCP and UDP; it will be clear when you look at the graphs. Do you know what could be the reason for this? Or can you look at the code and see what is going wrong? Thanks for your help. Regards, Harsh & Hrishikesh. On Wed, 14 Nov 2018 at 20:45, Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Nov 14, 2018, at 7:54 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > Hello, > > This is a link to the complete source code of our project :- > https://github.com/ns-3-dpdk-integration/ns-3-dpdk > > For the description of the project, look through this :- > https://ns-3-dpdk-integration.github.io/ > > Once you go through it, you will have a basic understanding of the > project. > > Installation instructions link are provided in the github.io page. 
> > > > In the code we mentioned above, the master branch contains the > implementation of the logic using rte_rings which we mentioned at the very > beginning of the discussion. There is a branch named "newrxtx" which > contains the implementation according to the logic you provided. > > > > We would like you to take a look at the code in newrxtx branch. ( > https://github.com/ns-3-dpdk-integration/ns-3-dpdk/tree/newrxtx) > > In the code in this branch, go to > ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/ directory. Here we > have implemented the DpdkNetDevice model. This model contains the code > which implements the whole model providing interaction between ns-3 and > DPDK. We would like you take a look at our Read function ( > https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L626) > and Write function ( > https://github.com/ns-3-dpdk-integration/ns-3-dpdk/blob/newrxtx/ns-allinone-3.28.1/ns-3.28.1/src/fd-net-device/model/dpdk-net-device.cc#L576). > These contains the logic you suggested. > > A couple of points for performance with DPDK. > - Never use memcpy in the data path unless it is absolutely require and > always try to avoid copying all of the data. In some cases you may want to > use memcpy or rte_memcpy to only replace a small amount of data or to grab > a copy of some small amount of data. > - Never use malloc in the data path, meaning never call malloc on every > packet use a list of buffers allocated up front if you need buffers of some > time. > - DPDK mempools are highly tuned and if you can use them for fixed size > buffers. > > I believe in the DPDK docs is a performance white paper or some > information about optimizing packet process in DPDK. If you have not read > it you may want to do so. > > > > > Can you go through this and suggest us some changes or find some mistake > in our code? If you need any help or have any doubt, ping us. 
> > > > Thanks and Regards, > > Harsh & Hrishikesh > > > > On Tue, 13 Nov 2018 at 19:17, Wiles, Keith <keith.wiles@intel.com> > wrote: > > > > > > > On Nov 12, 2018, at 8:25 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > > > Hello, > > > It would be really helpful if you can provide us a link (for both Tx > and Rx) to the project you mentioned earlier where you worked on a similar > problem, if possible. > > > > > > > At this time I can not provide a link. I will try and see what I can do, > but do not hold your breath it could be awhile as we have to go thru a lot > of legal stuff. If you can try vtune tool from Intel for x86 systems if you > can get a copy for your platform as it can tell you a lot about the code > and where the performance issues are located. If you are not running Intel > x86 then my code may not work for you, I do not remember if you told me > which platform. > > > > > > > Thanks and Regards, > > > Harsh & Hrishikesh. > > > > > > On Mon, 12 Nov 2018 at 01:15, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > Thanks a lot for all the support. We are looking into our work as of > now and will contact you once we are done checking it completely from our > side. Thanks for the help. > > > > > > Regards, > > > Harsh and Hrishikesh > > > > > > On Sat, 10 Nov 2018 at 11:47, Wiles, Keith <keith.wiles@intel.com> > wrote: > > > Please make sure to send your emails in plain text format. The Mac > mail program loves to use rich-text format is the original email use it and > I have told it not only send plain text :-( > > > > > > > On Nov 9, 2018, at 4:09 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > > > > > We have implemented the logic for Tx/Rx as you suggested. We > compared the obtained throughput with another version of same application > that uses Linux raw sockets. > > > > Unfortunately, the throughput we receive in our DPDK application is > less by a good margin. 
Is there any way we can optimize our implementation > or anything that we are missing? > > > > > > > > > The PoC code I was developing for DAPI did not have any performance > issues; it ran just as fast in my limited testing. I converted the > l3fwd code and saw 10G 64-byte wire rate, as I remember, using pktgen to > generate the traffic. > > > > > > Not sure why you would see a big performance drop, but I do not know > your application or code. > > > > Thanks and regards > > > > Harsh & Hrishikesh > > > > > > > > On Thu, 8 Nov 2018 at 23:14, Wiles, Keith <keith.wiles@intel.com> > wrote: > > > > > > > > > > > >> On Nov 8, 2018, at 4:58 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > >> > > > >> Thanks > > > >> for your insight on the topic. Transmission is working with the > functions you mentioned. We tried to search for some similar functions for > handling incoming packets but could not find anything. Can you help us on > that as well? > > > >> > > > > > > > > I do not know of a DPDK API set for the RX side. But in the DAPI (DPDK > API) PoC I was working on and presented at the DPDK Summit last Sept., > I did create an RX side version. The issue is it is a bit tangled > up in the DAPI PoC. > > > > > > > > The basic concept is that a call to RX a single packet does an rx_burst of > N packets, keeping them in an mbuf list. The code would spin > waiting for mbufs to arrive, or return quickly if a flag was set. When it > did find RX mbufs it would just return the single mbuf and keep the list of > mbufs for later requests until the list is empty, then do another rx_burst > call. > > > > > > > > Sorry this is a really quick note on how it works. If you need more > details we can talk more later. > > > >> > > > >> Regards, > > > >> Harsh > > > >> and Hrishikesh. 
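The RX scheme Keith describes (one rx_burst fills a local cache, single-packet reads drain it until it is empty) can be sketched without DPDK. Here `mock_rx_burst` is a stand-in for `rte_eth_rx_burst`, and all names are hypothetical:

```c
#include <stddef.h>

#define MAX_PKT_BURST 32

/* Local cache of packets pulled in one burst. */
struct rx_cache {
    void *pkts[MAX_PKT_BURST];
    int   length; /* packets still waiting in the cache */
    int   next;   /* index of the next packet to hand out */
};

/* Stand-in for rte_eth_rx_burst(): fills the array, returns the count. */
static int mock_rx_burst(void **pkts, int max)
{
    static int data[MAX_PKT_BURST];
    for (int i = 0; i < max; i++)
        pkts[i] = &data[i];
    return max;
}

/* Return one packet at a time; refill from a burst only when empty. */
static void *read_one(struct rx_cache *c)
{
    if (c->length == 0) {
        c->next = 0;
        c->length = mock_rx_burst(c->pkts, MAX_PKT_BURST);
        if (c->length == 0)
            return NULL; /* nothing arrived; caller may retry or poll */
    }
    c->length--;
    return c->pkts[c->next++];
}
```

A single-packet read API on top of a burst API is just this cache-and-drain loop; the burst cost is amortized over up to MAX_PKT_BURST calls.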
> > > >> > > > >> > > > >> On Thu, 8 Nov 2018 at 14:26, Wiles, Keith <keith.wiles@intel.com> > wrote: > > > >> > > > >> > > > >> > On Nov 8, 2018, at 8:24 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > >> > > > > >> > Hi, > > > >> > We are working on a project where we are trying to integrate DPDK > with > > > >> > another software. We are able to obtain packets from the other > environment > > > >> > to DPDK environment in one-by-one fashion. On the other hand DPDK > allows to > > > >> > send/receive burst of data packets. We want to know if there is > any > > > >> > functionality in DPDK to achieve this conversion of single > incoming packet > > > >> > to a burst of packets sent on NIC and similarly, conversion of > burst read > > > >> > packets from NIC to send it to other environment sequentially? > > > >> > > > >> > > > >> Search in the docs or lib/librte_ethdev directory on > rte_eth_tx_buffer_init, rte_eth_tx_buffer, ... > > > >> > > > >> > > > >> > > > >> > Thanks and regards > > > >> > Harsh Patel, Hrishikesh Hiraskar > > > >> > NITK Surathkal > > > >> > > > >> Regards, > > > >> Keith > > > >> > > > > > > > > Regards, > > > > Keith > > > > > > > > > > Regards, > > > Keith > > > > > > > Regards, > > Keith > > > > Regards, > Keith > > -------------- next part -------------- A non-text attachment was scrubbed... Name: UDP Throughput Comparison.png Type: image/png Size: 9863 bytes Desc: not available URL: <http://mails.dpdk.org/archives/users/attachments/20181117/40072f9b/attachment.png> -------------- next part -------------- A non-text attachment was scrubbed... Name: TCP PPS Comparison.png Type: image/png Size: 11408 bytes Desc: not available URL: <http://mails.dpdk.org/archives/users/attachments/20181117/40072f9b/attachment-0001.png> -------------- next part -------------- A non-text attachment was scrubbed... 
Name: TCP Throughput Comparison.png Type: image/png Size: 12833 bytes Desc: not available URL: <http://mails.dpdk.org/archives/users/attachments/20181117/40072f9b/attachment-0002.png> -------------- next part -------------- A non-text attachment was scrubbed... Name: UDP PPS Comparison.png Type: image/png Size: 11645 bytes Desc: not available URL: <http://mails.dpdk.org/archives/users/attachments/20181117/40072f9b/attachment-0003.png> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-17 10:22 ` Harsh Patel @ 2018-11-17 22:05 ` Kyle Larose 2018-11-19 13:49 ` Wiles, Keith 0 siblings, 1 reply; 43+ messages in thread From: Kyle Larose @ 2018-11-17 22:05 UTC (permalink / raw) To: thadodaharsh10; +Cc: keith.wiles, users On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hello, > Thanks a lot for going through the code and providing us with so much > information. > We removed all the memcpy/malloc from the data path as you suggested and ... > After removing this, we are able to see a performance gain but not as good > as raw socket. > You're using an unordered_map to map your buffer pointers back to the mbufs. While it may not do a memcpy all the time, it will likely end up doing a malloc arbitrarily when you insert or remove entries from the map. If it needs to resize the table, it'll be even worse. You may want to consider using librte_hash: https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even better, see if you can design the system to avoid needing to do a lookup like this. Can you return a handle with the mbuf pointer and the data together? You're also using floating point math where it's unnecessary (the timing check). Just multiply the numerator by 1000000 prior to doing the division. I doubt you'll overflow a uint64_t with that. It's not as efficient as integer math, though I'm not sure offhand it'd cause a major perf problem. One final thing: using a raw socket, the kernel will take over transmitting and receiving to the NIC itself. That means it is free to use multiple CPUs for the rx and tx. I notice that you only have one rx/tx queue, meaning at most one CPU can send and receive packets. When running your performance test with the raw socket, you may want to see how busy the system is doing packet sends and receives. 
Is it using less, but when combined with your main application's usage, the overall system is still using more than one? ^ permalink raw reply [flat|nested] 43+ messages in thread
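Kyle's integer-math suggestion above (scale the numerator before dividing, instead of converting to floating point) looks like this in isolation; the function names are hypothetical:

```c
#include <stdint.h>

/* Instead of comparing (double)cycles / hz against a fractional second,
 * work entirely in integer microseconds: multiply first, then divide.
 * At a few GHz, cycles * 1000000 stays far below UINT64_MAX. */
static uint64_t cycles_to_us(uint64_t cycles, uint64_t hz)
{
    return (cycles * 1000000ULL) / hz; /* numerator scaled before division */
}

static int timeout_expired(uint64_t elapsed_cycles, uint64_t hz,
                           uint64_t timeout_us)
{
    return cycles_to_us(elapsed_cycles, hz) >= timeout_us;
}
```

The comparison is exact to the microsecond, avoids int-to-float conversions in the hot path, and cannot overflow for realistic cycle counts (overflow would require elapsed intervals on the order of hours at multi-GHz rates).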
* Re: [dpdk-users] Query on handling packets 2018-11-17 22:05 ` Kyle Larose @ 2018-11-19 13:49 ` Wiles, Keith 2018-11-22 15:54 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2018-11-19 13:49 UTC (permalink / raw) To: Kyle Larose; +Cc: Harsh Patel, users > On Nov 17, 2018, at 4:05 PM, Kyle Larose <eomereadig@gmail.com> wrote: > > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel <thadodaharsh10@gmail.com> wrote: >> >> Hello, >> Thanks a lot for going through the code and providing us with so much >> information. >> We removed all the memcpy/malloc from the data path as you suggested and > ... >> After removing this, we are able to see a performance gain but not as good >> as raw socket. >> > > You're using an unordered_map to map your buffer pointers back to the > mbufs. While it may not do a memcpy all the time, It will likely end > up doing a malloc arbitrarily when you insert or remove entries from > the map. If it needs to resize the table, it'll be even worse. You may > want to consider using librte_hash: > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even better, > see if you can design the system to avoid needing to do a lookup like > this. Can you return a handle with the mbuf pointer and the data > together? > > You're also using floating point math where it's unnecessary (the > timing check). Just multiply the numerator by 1000000 prior to doing > the division. I doubt you'll overflow a uint64_t with that. It's not > as efficient as integer math, though I'm not sure offhand it'd cause a > major perf problem. > > One final thing: using a raw socket, the kernel will take over > transmitting and receiving to the NIC itself. that means it is free to > use multiple CPUs for the rx and tx. I notice that you only have one > rx/tx queue, meaning at most one CPU can send and receive packets. > When running your performance test with the raw socket, you may want > to see how busy the system is doing packet sends and receives. 
Is it > using more than one CPU's worth of processing? Is it using less, but > when combined with your main application's usage, the overall system > is still using more than one?

Along with Kyle's points, I would remove all floating point math and use the rte_rdtsc() function to work in cycles. Using something like:

uint64_t cur_tsc, next_tsc, timo = (rte_get_timer_hz() / 16); /* One 16th of a second; use 2/4/8/16/32 power-of-two numbers to make the divide simple */

cur_tsc = rte_rdtsc();

next_tsc = cur_tsc + timo; /* next_tsc is the next time to flush */

while(1) {
	cur_tsc = rte_rdtsc();
	if (cur_tsc >= next_tsc) {
		flush();
		next_tsc += timo;
	}
	/* Do other stuff */
}

For the m_bufPktMap I would use the rte_hash, or do not use a hash at all by grabbing the buffer address and subtracting the header size:
mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM);

DpdkNetDevice::Write(uint8_t *buffer, size_t length)
{
	struct rte_mbuf *pkt;
	uint64_t cur_tsc;

	pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer, sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM);

	/* No need to test pkt, but buffer may be tested to make sure it is not null before the math above */

	pkt->pkt_len = length;
	pkt->data_len = length;

	rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt);

	cur_tsc = rte_rdtsc();

	/* next_tsc is a private variable */
	if (cur_tsc >= next_tsc) {
		rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer); /* hardcoded the queue id, should be fixed */
		next_tsc = cur_tsc + timo; /* timo is a fixed number of cycles to wait */
	}
	return length;
}

DpdkNetDevice::Read()
{
	struct rte_mbuf *pkt;

	if (m_rxBuffer->length == 0) {
		m_rxBuffer->next = 0;
		m_rxBuffer->length = rte_eth_rx_burst(m_portId, 0, m_rxBuffer->pkts, MAX_PKT_BURST);

		if (m_rxBuffer->length == 0)
			return std::make_pair(NULL, -1);
	}

	pkt = m_rxBuffer->pkts[m_rxBuffer->next++];

	/* do not use rte_pktmbuf_read() as it does a copy for the complete packet */

	return std::make_pair(rte_pktmbuf_mtod(pkt, char *), pkt->pkt_len);
}

void
DpdkNetDevice::FreeBuf(uint8_t *buf)
{
	struct rte_mbuf *pkt;

	if (!buf)
		return;
	pkt = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM);

	rte_pktmbuf_free(pkt);
}

When your code is done with the buffer, convert the buffer address back to an rte_mbuf pointer and call rte_pktmbuf_free(pkt); this should eliminate the copy and the floating point code. Converting my C code to C++: priceless :-)

Hopefully the buffer address passed is the original buffer address and has not been adjusted.

Regards,
Keith

^ permalink raw reply [flat|nested] 43+ messages in thread
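The address arithmetic above works because each payload buffer sits at a fixed offset past its mbuf header, so subtracting that offset recovers the mbuf with no hash lookup at all. A DPDK-free sketch of the same trick; the struct layout is illustrative, not the real rte_mbuf:

```c
#include <stddef.h>
#include <stdint.h>

#define HEADROOM 128 /* stand-in for RTE_PKTMBUF_HEADROOM */

/* Toy mbuf: metadata header, then headroom, then the packet data.
 * The data area is always at the same fixed offset from the header. */
struct toy_mbuf {
    uint32_t pkt_len;
    uint32_t data_len;
    uint8_t  room[HEADROOM];
    uint8_t  data[2048];
};

/* Hand the caller a raw data pointer... */
static uint8_t *mbuf_to_buf(struct toy_mbuf *m)
{
    return m->data;
}

/* ...and recover the mbuf later by subtracting the fixed offset, the
 * same idea as RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + headroom). */
static struct toy_mbuf *buf_to_mbuf(uint8_t *buf)
{
    return (struct toy_mbuf *)(buf - offsetof(struct toy_mbuf, data));
}
```

The round trip is pure pointer arithmetic, so the per-packet map (and its allocations) disappears entirely; the caveat, as noted above, is that the pointer handed back must be the original, unadjusted data address.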
* Re: [dpdk-users] Query on handling packets 2018-11-19 13:49 ` Wiles, Keith @ 2018-11-22 15:54 ` Harsh Patel 2018-11-24 15:43 ` Wiles, Keith 2018-11-24 16:01 ` Wiles, Keith 0 siblings, 2 replies; 43+ messages in thread From: Harsh Patel @ 2018-11-22 15:54 UTC (permalink / raw) To: Wiles, Keith; +Cc: Kyle Larose, users Hi Thank you so much for the reply and for the solution. We used the given code. We were amazed by the pointer arithmetic you used and got to learn something new. But we are still underperforming; the same bottleneck of ~2.5 Mbps is seen. We also checked whether the raw socket was using any extra (logical) cores compared to DPDK. We found that the raw socket version has 2 logical threads running on 2 logical CPUs, whereas the DPDK version has 6 logical threads on 2 logical CPUs. We also ran the 6 threads on 4 logical CPUs and still see the same bottleneck. We have updated our code (you can use the same links from the previous mail). It would be helpful if you could help us find what causes the bottleneck. Thanks and Regards, Harsh and Hrishikesh On Mon, Nov 19, 2018, 19:19 Wiles, Keith <keith.wiles@intel.com> wrote: > [earlier message quoted in full; trimmed] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-22 15:54 ` Harsh Patel @ 2018-11-24 15:43 ` Wiles, Keith 2018-11-24 15:48 ` Wiles, Keith 2018-11-24 16:01 ` Wiles, Keith 1 sibling, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2018-11-24 15:43 UTC (permalink / raw) To: Harsh Patel; +Cc: Kyle Larose, users > On Nov 22, 2018, at 9:54 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hi > > Thank you so much for the reply and for the solution. > > We used the given code. We were amazed by the pointer arithmetic you used, got to learn something new. > > But we are still underperforming. The same bottleneck of ~2.5 Mbps is seen. Make sure the cores you are using are on the same NUMA node or socket where the PCI devices are located, if you have two CPUs or sockets in your system. The cpu_layout.py script will help you understand the layout of the cores and/or lcores in the system. On my machine the PCI bus is connected to socket 1 and not socket 0; this means I have to use lcores only on socket 1. Some systems have two PCI buses, one on each socket. Accessing data from one NUMA zone or socket to another can affect performance and should be avoided. HTH > > We also checked if the raw socket was using any extra (logical) cores than the DPDK. We found that raw socket has 2 logical threads running on 2 logical CPUs. Whereas, the DPDK version has 6 logical threads on 2 logical CPUs. We also ran the 6 threads on 4 logical CPUs, still we see the same bottleneck. > > We have updated our code (you can use the same links from previous mail). It would be helpful if you could help us in finding what causes the bottleneck. 
> [earlier messages quoted in full; trimmed] Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-24 15:43 ` Wiles, Keith @ 2018-11-24 15:48 ` Wiles, Keith 0 siblings, 0 replies; 43+ messages in thread From: Wiles, Keith @ 2018-11-24 15:48 UTC (permalink / raw) To: Harsh Patel; +Cc: Kyle Larose, users > On Nov 24, 2018, at 9:43 AM, Wiles, Keith <keith.wiles@intel.com> wrote: > > > >> On Nov 22, 2018, at 9:54 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: >> >> Hi >> >> Thank you so much for the reply and for the solution. >> >> We used the given code. We were amazed by the pointer arithmetic you used, got to learn something new. >> >> But still we are under performing.The same bottleneck of ~2.5Mbps is seen. > > Make sure the cores you are using are on the same NUMA or socket the PCI devices are located. > > If you have two CPUs or sockets in your system. The cpu_layout.py script will help you understand the layout of the cores and/or lcores in the system. > > On my machine the PCI bus is connected to socket 1 and not socket 0, this means I have to use lcores only on socket 1. Some systems have two PCI buses one on each socket. Accessing data from one NUMA zone or socket to another can effect performance and should be avoided. > > HTH >> >> We also checked if the raw socket was using any extra (logical) cores than the DPDK. We found that raw socket has 2 logical threads running on 2 logical CPUs. Whereas, the DPDK version has 6 logical threads on 2 logical CPUs. We also ran the 6 threads on 4 logical CPUs, still we see the same bottleneck. Not sure what you are trying to tell me here, but a picture could help me a lot. >> >> We have updated our code (you can use the same links from previous mail). It would be helpful if you could help us in finding what causes the bottleneck. 
>> [earlier messages quoted in full; trimmed] Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-22 15:54 ` Harsh Patel 2018-11-24 15:43 ` Wiles, Keith @ 2018-11-24 16:01 ` Wiles, Keith 2018-11-25 4:35 ` Stephen Hemminger 1 sibling, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2018-11-24 16:01 UTC (permalink / raw) To: Harsh Patel; +Cc: Kyle Larose, users > On Nov 22, 2018, at 9:54 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hi > > Thank you so much for the reply and for the solution. > > We used the given code. We were amazed by the pointer arithmetic you used, got to learn something new. > > But we are still underperforming. The same bottleneck of ~2.5 Mbps is seen. > > We also checked if the raw socket was using any extra (logical) cores than the DPDK. We found that raw socket has 2 logical threads running on 2 logical CPUs. Whereas, the DPDK version has 6 logical threads on 2 logical CPUs. We also ran the 6 threads on 4 logical CPUs, still we see the same bottleneck. > > We have updated our code (you can use the same links from previous mail). It would be helpful if you could help us in finding what causes the bottleneck. I looked at the code for a few seconds and noticed your TX_TIMEOUT is a macro that calls (rte_get_timer_hz()/2014). Just to be safe, I would not call rte_get_timer_hz() each time, but grab the value once, store the hz locally, and use that variable instead. My guess is this will not improve performance much, and I would have to look at the code of that routine to see if it buys you anything to store the value locally. If getting the hz is just a simple read of a variable then good, but you should still use a local variable within the object to hold the (rte_get_timer_hz()/2048) result instead of doing the call and divide each time. 
> > Thanks and Regards, > Harsh and Hrishikesh > > > On Mon, Nov 19, 2018, 19:19 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Nov 17, 2018, at 4:05 PM, Kyle Larose <eomereadig@gmail.com> wrote: > > > > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel <thadodaharsh10@gmail.com> wrote: > >> > >> Hello, > >> Thanks a lot for going through the code and providing us with so much > >> information. > >> We removed all the memcpy/malloc from the data path as you suggested and > > ... > >> After removing this, we are able to see a performance gain but not as good > >> as raw socket. > >> > > > > You're using an unordered_map to map your buffer pointers back to the > > mbufs. While it may not do a memcpy all the time, It will likely end > > up doing a malloc arbitrarily when you insert or remove entries from > > the map. If it needs to resize the table, it'll be even worse. You may > > want to consider using librte_hash: > > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even better, > > see if you can design the system to avoid needing to do a lookup like > > this. Can you return a handle with the mbuf pointer and the data > > together? > > > > You're also using floating point math where it's unnecessary (the > > timing check). Just multiply the numerator by 1000000 prior to doing > > the division. I doubt you'll overflow a uint64_t with that. It's not > > as efficient as integer math, though I'm not sure offhand it'd cause a > > major perf problem. > > > > One final thing: using a raw socket, the kernel will take over > > transmitting and receiving to the NIC itself. that means it is free to > > use multiple CPUs for the rx and tx. I notice that you only have one > > rx/tx queue, meaning at most one CPU can send and receive packets. > > When running your performance test with the raw socket, you may want > > to see how busy the system is doing packet sends and receives. Is it > > using more than one CPU's worth of processing? 
Is it using less, but > > when combined with your main application's usage, the overall system > > is still using more than one? > > Along the same lines, I would remove all floating point math and use the rte_rdtsc() function to work in cycles. Using something like: > > uint64_t cur_tsc, next_tsc, timo = (rte_get_timer_hz() / 16); /* One 16th of a second; use 2/4/8/16/32 power-of-two numbers to make the math a simple divide */ > > cur_tsc = rte_rdtsc(); > > next_tsc = cur_tsc + timo; /* Now next_tsc is the next time to flush */ > > while(1) { > cur_tsc = rte_rdtsc(); > if (cur_tsc >= next_tsc) { > flush(); > next_tsc += timo; > } > /* Do other stuff */ > } > > For the m_bufPktMap I would use the rte_hash, or do not use a hash at all by grabbing the buffer address and subtracting the offset: > mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM); > > > DpdkNetDevice::Write(uint8_t *buffer, size_t length) > { > struct rte_mbuf *pkt; > uint64_t cur_tsc; > > pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer, sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM); > > /* No need to test pkt, but buffer may be tested above the math to make sure it is not NULL */ > > pkt->pkt_len = length; > pkt->data_len = length; > > rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt); > > cur_tsc = rte_rdtsc(); > > /* next_tsc is a private variable */ > if (cur_tsc >= next_tsc) { > rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer); /* hardcoded the queue id, should be fixed */ > next_tsc = cur_tsc + timo; /* timo is a fixed number of cycles to wait */ > } > return length; > } > > DpdkNetDevice::Read() > { > struct rte_mbuf *pkt; > > if (m_rxBuffer->length == 0) { > m_rxBuffer->next = 0; > m_rxBuffer->length = rte_eth_rx_burst(m_portId, 0, m_rxBuffer->pkts, MAX_PKT_BURST); > > if (m_rxBuffer->length == 0) > return std::make_pair(NULL, -1); > } > > pkt = m_rxBuffer->pkts[m_rxBuffer->next++]; > > /* do not use rte_pktmbuf_read() as it does a copy for the complete packet */ > > 
return std::make_pair(rte_pktmbuf_mtod(pkt, char *), pkt->pkt_len); > } > > void > DpdkNetDevice::FreeBuf(uint8_t *buf) > { > struct rte_mbuf *pkt; > > if (!buf) > return; > pkt = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM); > > rte_pktmbuf_free(pkt); > } > > When your code is done with the buffer, convert the buffer address back to a rte_mbuf pointer and call rte_pktmbuf_free(pkt); This should eliminate the copy and the floating point code. Converting my C code to C++: priceless :-) > > Hopefully the buffer address passed is the original buffer address and has not been adjusted. > > > Regards, > Keith > Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
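The trick above works because a pktmbuf mempool lays each object out contiguously: the rte_mbuf metadata, a fixed headroom, then the packet data. The mbuf can therefore be recovered from a data-buffer address by subtracting a constant offset, with no hash map at all. A DPDK-free sketch of that invariant (the mock_mbuf layout and sizes are illustrative, not DPDK's real structures):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock of the layout: metadata header, fixed headroom, then data.
 * Keith's code subtracts sizeof(struct rte_mbuf) plus the headroom;
 * offsetof() expresses the same constant here and absorbs padding. */
#define MOCK_HEADROOM 128

struct mock_mbuf {
    uint32_t pkt_len;
    uint32_t data_len;
    uint8_t  headroom[MOCK_HEADROOM];
    uint8_t  data[2048];
};

/* What the application hands around: a pointer into the data area. */
static uint8_t *mbuf_data(struct mock_mbuf *m) { return m->data; }

/* The inverse: subtract the fixed offset to get the mbuf back,
 * with no lookup table and no per-packet allocation. */
static struct mock_mbuf *mbuf_from_data(uint8_t *buf)
{
    return (struct mock_mbuf *)(buf - offsetof(struct mock_mbuf, data));
}
```

As Keith warns at the end of his mail, this only holds if the address handed back for freeing is the unmodified data address that was handed out.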
* Re: [dpdk-users] Query on handling packets 2018-11-24 16:01 ` Wiles, Keith @ 2018-11-25 4:35 ` Stephen Hemminger 2018-11-30 9:02 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread From: Stephen Hemminger @ 2018-11-25 4:35 UTC (permalink / raw) To: Wiles, Keith; +Cc: Harsh Patel, Kyle Larose, users On Sat, 24 Nov 2018 16:01:04 +0000 "Wiles, Keith" <keith.wiles@intel.com> wrote: > > On Nov 22, 2018, at 9:54 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > Hi > > > > Thank you so much for the reply and for the solution. > > > > We used the given code. We were amazed by the pointer arithmetic you used, got to learn something new. > > > > But still we are under performing.The same bottleneck of ~2.5Mbps is seen. > > > > We also checked if the raw socket was using any extra (logical) cores than the DPDK. We found that raw socket has 2 logical threads running on 2 logical CPUs. Whereas, the DPDK version has 6 logical threads on 2 logical CPUs. We also ran the 6 threads on 4 logical CPUs, still we see the same bottleneck. > > > > We have updated our code (you can use the same links from previous mail). It would be helpful if you could help us in finding what causes the bottleneck. > > I looked at the code for a few seconds and noticed your TX_TIMEOUT is macro that calls (rte_get_timer_hz()/2014) just to be safe I would not call rte_get_timer_hz() time, but grab the value and store the hz locally and use that variable instead. This will not improve performance is my guess and I would have to look at the code the that routine to see if it buys you anything to store the value locally. If the getting hz is just a simple read of a variable then good, but still you should should a local variable within the object to hold the (rte_get_timer_hz()/2048) instead of doing the call and divide each time. 
> > > > > Thanks and Regards, > > Harsh and Hrishikesh > > > > > > On Mon, Nov 19, 2018, 19:19 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > > On Nov 17, 2018, at 4:05 PM, Kyle Larose <eomereadig@gmail.com> wrote: > > > > > > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel <thadodaharsh10@gmail.com> wrote: > > >> > > >> Hello, > > >> Thanks a lot for going through the code and providing us with so much > > >> information. > > >> We removed all the memcpy/malloc from the data path as you suggested and > > > ... > > >> After removing this, we are able to see a performance gain but not as good > > >> as raw socket. > > >> > > > > > > You're using an unordered_map to map your buffer pointers back to the > > > mbufs. While it may not do a memcpy all the time, It will likely end > > > up doing a malloc arbitrarily when you insert or remove entries from > > > the map. If it needs to resize the table, it'll be even worse. You may > > > want to consider using librte_hash: > > > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even better, > > > see if you can design the system to avoid needing to do a lookup like > > > this. Can you return a handle with the mbuf pointer and the data > > > together? > > > > > > You're also using floating point math where it's unnecessary (the > > > timing check). Just multiply the numerator by 1000000 prior to doing > > > the division. I doubt you'll overflow a uint64_t with that. It's not > > > as efficient as integer math, though I'm not sure offhand it'd cause a > > > major perf problem. > > > > > > One final thing: using a raw socket, the kernel will take over > > > transmitting and receiving to the NIC itself. that means it is free to > > > use multiple CPUs for the rx and tx. I notice that you only have one > > > rx/tx queue, meaning at most one CPU can send and receive packets. > > > When running your performance test with the raw socket, you may want > > > to see how busy the system is doing packet sends and receives. 
Is it > > > using more than one CPU's worth of processing? Is it using less, but > > > when combined with your main application's usage, the overall system > > > is still using more than one? > > > > Along with the floating point math, I would remove all floating point math and use the rte_rdtsc() function to use cycles. Using something like: > > > > uint64_t cur_tsc, next_tsc, timo = (rte_timer_get_hz() / 16); /* One 16th of a second use 2/4/8/16/32 power of two numbers to make the math simple divide */ > > > > cur_tsc = rte_rdtsc(); > > > > next_tsc = cur_tsc + timo; /* Now next_tsc the next time to flush */ > > > > while(1) { > > cur_tsc = rte_rdtsc(); > > if (cur_tsc >= next_tsc) { > > flush(); > > next_tsc += timo; > > } > > /* Do other stuff */ > > } > > > > For the m_bufPktMap I would use the rte_hash or do not use a hash at all by grabbing the buffer address and subtract the > > mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM); > > > > > > DpdkNetDevice:Write(uint8_t *buffer, size_t length) > > { > > struct rte_mbuf *pkt; > > uint64_t cur_tsc; > > > > pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer, sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM); > > > > /* No need to test pkt, but buffer maybe tested to make sure it is not null above the math above */ > > > > pkt->pk_len = length; > > pkt->data_len = length; > > > > rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt); > > > > cur_tsc = rte_rdtsc(); > > > > /* next_tsc is a private variable */ > > if (cur_tsc >= next_tsc) { > > rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer); /* hardcoded the queue id, should be fixed */ > > next_tsc = cur_tsc + timo; /* timo is a fixed number of cycles to wait */ > > } > > return length; > > } > > > > DpdkNetDevice::Read() > > { > > struct rte_mbuf *pkt; > > > > if (m_rxBuffer->length == 0) { > > m_rxBuffer->next = 0; > > m_rxBuffer->length = rte_eth_rx_burst(m_portId, 0, m_rxBuffer->pmts, MAX_PKT_BURST); > > > > if (m_rxBuffer->length == 0) > > 
return std::make_pair(NULL, -1); > > } > > > > pkt = m_rxBuffer->pkts[m_rxBuffer->next++]; > > > > /* do not use rte_pktmbuf_read() as it does a copy for the complete packet */ > > > > return std:make_pair(rte_pktmbuf_mtod(pkt, char *), pkt->pkt_len); > > } > > > > void > > DpdkNetDevice::FreeBuf(uint8_t *buf) > > { > > struct rte_mbuf *pkt; > > > > if (!buf) > > return; > > pkt = (struct rte_mbuf *)RTE_PKT_SUB(buf, sizeof(rte_mbuf) + RTE_MAX_HEADROOM); > > > > rte_pktmbuf_free(pkt); > > } > > > > When your code is done with the buffer, then convert the buffer address back to a rte_mbuf pointer and call rte_pktmbuf_free(pkt); This should eliminate the copy and floating point code. Converting my C code to C++ priceless :-) > > > > Hopefully the buffer address passed is the original buffer address and has not be adjusted. > > > > > > Regards, > > Keith > > > > Regards, > Keith > Also rdtsc causes cpu to stop doing any look ahead, so there is a heisenberg effect. Adding more rdtsc will hurt performance. It also looks like your code is not doing bursting correctly. What if multiple packets arrive in one rx_burst? ^ permalink raw reply [flat|nested] 43+ messages in thread
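Stephen's last question is the one Keith's Read() addresses with m_rxBuffer: a single-packet read API has to cache a whole burst and drain it across calls, or every packet past the first in a burst is lost. A standalone sketch of that drain pattern (mock_rx_burst() is a stand-in for rte_eth_rx_burst(); all names here are hypothetical):

```c
#include <assert.h>

#define MAX_PKT_BURST 32

/* Mock rx that delivers a fixed burst of four packet ids once,
 * standing in for rte_eth_rx_burst(). The call counter shows how
 * often the NIC would actually have been polled. */
static int avail = 4;
static int pkt_ids[4] = {10, 11, 12, 13};
static unsigned burst_calls;

static int mock_rx_burst(int *pkts, int max)
{
    burst_calls++;
    int n = avail < max ? avail : max;
    for (int i = 0; i < n; i++)
        pkts[i] = pkt_ids[i];
    avail = 0;  /* nothing more queued in this toy model */
    return n;
}

/* Cache one burst and hand packets out one at a time, so a
 * single-packet read API never drops the rest of a burst. */
struct rx_cache {
    int pkts[MAX_PKT_BURST];
    int length;  /* packets received in the cached burst */
    int next;    /* index of the next packet to hand out */
};

static int cache_read(struct rx_cache *c, int *out)
{
    if (c->next >= c->length) {  /* cache drained: poll again */
        c->next = 0;
        c->length = mock_rx_burst(c->pkts, MAX_PKT_BURST);
        if (c->length == 0)
            return 0;            /* nothing available */
    }
    *out = c->pkts[c->next++];
    return 1;
}
```

Every packet of the burst is served from the cache before the NIC is polled again, which is exactly the property Stephen is asking about.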
* Re: [dpdk-users] Query on handling packets 2018-11-25 4:35 ` Stephen Hemminger @ 2018-11-30 9:02 ` Harsh Patel 2018-11-30 10:24 ` Harsh Patel 2018-11-30 15:54 ` Wiles, Keith 0 siblings, 2 replies; 43+ messages in thread From: Harsh Patel @ 2018-11-30 9:02 UTC (permalink / raw) To: stephen; +Cc: Wiles, Keith, Kyle Larose, users Hello, Sorry for the long delay, we were busy with some exams. *1) About the NUMA sockets* This is the result of the command you mentioned :- ====================================================================== Core and Socket Information (as reported by '/sys/devices/system/cpu') ====================================================================== cores = [0, 1, 2, 3] sockets = [0] Socket 0 -------- Core 0 [0] Core 1 [1] Core 2 [2] Core 3 [3] We don't know much about this and would like your input on what else needs to be checked or what we need to do. *2) The part where you asked for a graph* We used `ps` to analyse which CPU cores are being utilized. The raw socket version had two logical threads which used cores 0 and 1. The DPDK version had 6 logical threads, which also used cores 0 and 1. This is the case for which we showed you the results. As the previous case had 2 cores and was not giving desired results, we tried to give more cores to see if the DPDK in ns-3 code can achieve the desired throughput and pps. (We thought giving more cores might improve the performance.) For this new case, we provided 4 total cores using EAL arguments, upon which, it used cores 0-3. And still we got the same results as the one sent earlier. We think this means that the bottleneck is a different problem unrelated to the number of cores as of now. (This whole section is an answer to the question in the last paragraph raised by Kyle, to which Keith asked for a graph) *3) About updating the TX_TIMEOUT and storing rte_get_timer_hz()* We have not tried this and will try it by today and will send you the status after that in some time. 
*4) For the suggestion by Stephen* We are not clear on what you suggested and it would be nice if you elaborate your suggestion. Thanks and Regards, Harsh and Hrishikesh PS :- We are done with our exams and would be working now on this regularly. On Sun, 25 Nov 2018 at 10:05, Stephen Hemminger <stephen@networkplumber.org> wrote: > On Sat, 24 Nov 2018 16:01:04 +0000 > "Wiles, Keith" <keith.wiles@intel.com> wrote: > > > > On Nov 22, 2018, at 9:54 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > > > Hi > > > > > > Thank you so much for the reply and for the solution. > > > > > > We used the given code. We were amazed by the pointer arithmetic you > used, got to learn something new. > > > > > > But still we are under performing.The same bottleneck of ~2.5Mbps is > seen. > > > > > > We also checked if the raw socket was using any extra (logical) cores > than the DPDK. We found that raw socket has 2 logical threads running on 2 > logical CPUs. Whereas, the DPDK version has 6 logical threads on 2 logical > CPUs. We also ran the 6 threads on 4 logical CPUs, still we see the same > bottleneck. > > > > > > We have updated our code (you can use the same links from previous > mail). It would be helpful if you could help us in finding what causes the > bottleneck. > > > > I looked at the code for a few seconds and noticed your TX_TIMEOUT is > macro that calls (rte_get_timer_hz()/2014) just to be safe I would not call > rte_get_timer_hz() time, but grab the value and store the hz locally and > use that variable instead. This will not improve performance is my guess > and I would have to look at the code the that routine to see if it buys you > anything to store the value locally. If the getting hz is just a simple > read of a variable then good, but still you should should a local variable > within the object to hold the (rte_get_timer_hz()/2048) instead of doing > the call and divide each time. 
> > > > > > > > Thanks and Regards, > > > Harsh and Hrishikesh > > > > > > > > On Mon, Nov 19, 2018, 19:19 Wiles, Keith <keith.wiles@intel.com> > wrote: > > > > > > > > > > On Nov 17, 2018, at 4:05 PM, Kyle Larose <eomereadig@gmail.com> > wrote: > > > > > > > > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel < > thadodaharsh10@gmail.com> wrote: > > > >> > > > >> Hello, > > > >> Thanks a lot for going through the code and providing us with so > much > > > >> information. > > > >> We removed all the memcpy/malloc from the data path as you > suggested and > > > > ... > > > >> After removing this, we are able to see a performance gain but not > as good > > > >> as raw socket. > > > >> > > > > > > > > You're using an unordered_map to map your buffer pointers back to the > > > > mbufs. While it may not do a memcpy all the time, It will likely end > > > > up doing a malloc arbitrarily when you insert or remove entries from > > > > the map. If it needs to resize the table, it'll be even worse. You > may > > > > want to consider using librte_hash: > > > > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even better, > > > > see if you can design the system to avoid needing to do a lookup like > > > > this. Can you return a handle with the mbuf pointer and the data > > > > together? > > > > > > > > You're also using floating point math where it's unnecessary (the > > > > timing check). Just multiply the numerator by 1000000 prior to doing > > > > the division. I doubt you'll overflow a uint64_t with that. It's not > > > > as efficient as integer math, though I'm not sure offhand it'd cause > a > > > > major perf problem. > > > > > > > > One final thing: using a raw socket, the kernel will take over > > > > transmitting and receiving to the NIC itself. that means it is free > to > > > > use multiple CPUs for the rx and tx. I notice that you only have one > > > > rx/tx queue, meaning at most one CPU can send and receive packets. 
> > > > When running your performance test with the raw socket, you may want > > > > to see how busy the system is doing packet sends and receives. Is it > > > > using more than one CPU's worth of processing? Is it using less, but > > > > when combined with your main application's usage, the overall system > > > > is still using more than one? > > > > > > Along with the floating point math, I would remove all floating point > math and use the rte_rdtsc() function to use cycles. Using something like: > > > > > > uint64_t cur_tsc, next_tsc, timo = (rte_timer_get_hz() / 16); /* One > 16th of a second use 2/4/8/16/32 power of two numbers to make the math > simple divide */ > > > > > > cur_tsc = rte_rdtsc(); > > > > > > next_tsc = cur_tsc + timo; /* Now next_tsc the next time to flush */ > > > > > > while(1) { > > > cur_tsc = rte_rdtsc(); > > > if (cur_tsc >= next_tsc) { > > > flush(); > > > next_tsc += timo; > > > } > > > /* Do other stuff */ > > > } > > > > > > For the m_bufPktMap I would use the rte_hash or do not use a hash at > all by grabbing the buffer address and subtract the > > > mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + > RTE_MAX_HEADROOM); > > > > > > > > > DpdkNetDevice:Write(uint8_t *buffer, size_t length) > > > { > > > struct rte_mbuf *pkt; > > > uint64_t cur_tsc; > > > > > > pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer, sizeof(struct > rte_mbuf) + RTE_MAX_HEADROOM); > > > > > > /* No need to test pkt, but buffer maybe tested to make sure > it is not null above the math above */ > > > > > > pkt->pk_len = length; > > > pkt->data_len = length; > > > > > > rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt); > > > > > > cur_tsc = rte_rdtsc(); > > > > > > /* next_tsc is a private variable */ > > > if (cur_tsc >= next_tsc) { > > > rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer); > /* hardcoded the queue id, should be fixed */ > > > next_tsc = cur_tsc + timo; /* timo is a fixed number > of cycles to wait */ > > > } > > > return length; > 
> > } > > > > > > DpdkNetDevice::Read() > > > { > > > struct rte_mbuf *pkt; > > > > > > if (m_rxBuffer->length == 0) { > > > m_rxBuffer->next = 0; > > > m_rxBuffer->length = rte_eth_rx_burst(m_portId, 0, > m_rxBuffer->pmts, MAX_PKT_BURST); > > > > > > if (m_rxBuffer->length == 0) > > > return std::make_pair(NULL, -1); > > > } > > > > > > pkt = m_rxBuffer->pkts[m_rxBuffer->next++]; > > > > > > /* do not use rte_pktmbuf_read() as it does a copy for the > complete packet */ > > > > > > return std:make_pair(rte_pktmbuf_mtod(pkt, char *), > pkt->pkt_len); > > > } > > > > > > void > > > DpdkNetDevice::FreeBuf(uint8_t *buf) > > > { > > > struct rte_mbuf *pkt; > > > > > > if (!buf) > > > return; > > > pkt = (struct rte_mbuf *)RTE_PKT_SUB(buf, sizeof(rte_mbuf) + > RTE_MAX_HEADROOM); > > > > > > rte_pktmbuf_free(pkt); > > > } > > > > > > When your code is done with the buffer, then convert the buffer > address back to a rte_mbuf pointer and call rte_pktmbuf_free(pkt); This > should eliminate the copy and floating point code. Converting my C code to > C++ priceless :-) > > > > > > Hopefully the buffer address passed is the original buffer address and > has not be adjusted. > > > > > > > > > Regards, > > > Keith > > > > > > > Regards, > > Keith > > > > Also rdtsc causes cpu to stop doing any look ahead, so there is a > heisenberg effect. > Adding more rdtsc will hurt performance. It also looks like your code is > not doing bursting correctly. > What if multiple packets arrive in one rx_burst? > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-30 9:02 ` Harsh Patel @ 2018-11-30 10:24 ` Harsh Patel 2018-11-30 15:54 ` Wiles, Keith 1 sibling, 0 replies; 43+ messages in thread From: Harsh Patel @ 2018-11-30 10:24 UTC (permalink / raw) To: stephen; +Cc: Wiles, Keith, Kyle Larose, users Hello, *About updating TX_TIMEOUT* We edited the code to call the rte_get_timer_hz() function only once (you can check the code) and we are observing no performance gain. This confirms what you said about it not improving the performance. Thanks, Harsh & Hrishikesh On Fri, 30 Nov 2018 at 14:32, Harsh Patel <thadodaharsh10@gmail.com> wrote: > Hello, > Sorry for the long delay, we were busy with some exams. > > > *1) About the NUMA sockets*This is the result of the command you > mentioned :- > ====================================================================== > Core and Socket Information (as reported by '/sys/devices/system/cpu') > ====================================================================== > > cores = [0, 1, 2, 3] > sockets = [0] > > Socket 0 > -------- > Core 0 [0] > Core 1 [1] > Core 2 [2] > Core 3 [3] > > We don't know much about this and would like your input on what else to be > checked or what do we need to do. > > > *2) The part where you asked for a graph *We used `ps` to analyse which > CPU cores are being utilized. > The raw socket version had two logical threads which used cores 0 and 1. > The DPDK version had 6 logical threads, which also used cores 0 and 1. > This is the case for which we showed you the results. > As the previous case had 2 cores and was not giving desired results, we > tried to give more cores to see if the DPDK in ns-3 code can achieve the > desired throughput and pps. (We thought giving more cores might improve the > performance.) > For this new case, we provided 4 total cores using EAL arguments, upon > which, it used cores 0-3. And still we got the same results as the one sent > earlier. 
> We think this means that the bottleneck is a different problem unrelated > to number of cores as of now. (This whole section is an answer to the > question in the last paragraph raised by Kyle to which Keith asked for a > graph) > > *3) About updating the TX_TIMEOUT and storing rte_get_timer_hz() * > We have not tried this and will try it by today and will send you the > status after that in some time. > > *4) For the suggestion by Stephen* > We are not clear on what you suggested and it would be nice if you > elaborate your suggestion. > > Thanks and Regards, > Harsh and Hrishikesh > > PS :- We are done with our exams and would be working now on this > regularly. > > On Sun, 25 Nov 2018 at 10:05, Stephen Hemminger < > stephen@networkplumber.org> wrote: > >> On Sat, 24 Nov 2018 16:01:04 +0000 >> "Wiles, Keith" <keith.wiles@intel.com> wrote: >> >> > > On Nov 22, 2018, at 9:54 AM, Harsh Patel <thadodaharsh10@gmail.com> >> wrote: >> > > >> > > Hi >> > > >> > > Thank you so much for the reply and for the solution. >> > > >> > > We used the given code. We were amazed by the pointer arithmetic you >> used, got to learn something new. >> > > >> > > But still we are under performing.The same bottleneck of ~2.5Mbps is >> seen. >> > > >> > > We also checked if the raw socket was using any extra (logical) cores >> than the DPDK. We found that raw socket has 2 logical threads running on 2 >> logical CPUs. Whereas, the DPDK version has 6 logical threads on 2 logical >> CPUs. We also ran the 6 threads on 4 logical CPUs, still we see the same >> bottleneck. >> > > >> > > We have updated our code (you can use the same links from previous >> mail). It would be helpful if you could help us in finding what causes the >> bottleneck. 
>> > >> > I looked at the code for a few seconds and noticed your TX_TIMEOUT is >> macro that calls (rte_get_timer_hz()/2014) just to be safe I would not call >> rte_get_timer_hz() time, but grab the value and store the hz locally and >> use that variable instead. This will not improve performance is my guess >> and I would have to look at the code the that routine to see if it buys you >> anything to store the value locally. If the getting hz is just a simple >> read of a variable then good, but still you should should a local variable >> within the object to hold the (rte_get_timer_hz()/2048) instead of doing >> the call and divide each time. >> > >> > > >> > > Thanks and Regards, >> > > Harsh and Hrishikesh >> > > > > > > >> > > On Mon, Nov 19, 2018, 19:19 Wiles, Keith <keith.wiles@intel.com> >> wrote: >> > > >> > > >> > > > On Nov 17, 2018, at 4:05 PM, Kyle Larose <eomereadig@gmail.com> >> wrote: >> > > > >> > > > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel < >> thadodaharsh10@gmail.com> wrote: >> > > >> >> > > >> Hello, >> > > >> Thanks a lot for going through the code and providing us with so >> much >> > > >> information. >> > > >> We removed all the memcpy/malloc from the data path as you >> suggested and >> > > > ... >> > > >> After removing this, we are able to see a performance gain but not >> as good >> > > >> as raw socket. >> > > >> >> > > > >> > > > You're using an unordered_map to map your buffer pointers back to >> the >> > > > mbufs. While it may not do a memcpy all the time, It will likely end >> > > > up doing a malloc arbitrarily when you insert or remove entries from >> > > > the map. If it needs to resize the table, it'll be even worse. You >> may >> > > > want to consider using librte_hash: >> > > > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even >> better, >> > > > see if you can design the system to avoid needing to do a lookup >> like >> > > > this. Can you return a handle with the mbuf pointer and the data >> > > > together? 
>> > > > >> > > > You're also using floating point math where it's unnecessary (the >> > > > timing check). Just multiply the numerator by 1000000 prior to doing >> > > > the division. I doubt you'll overflow a uint64_t with that. It's not >> > > > as efficient as integer math, though I'm not sure offhand it'd >> cause a >> > > > major perf problem. >> > > > >> > > > One final thing: using a raw socket, the kernel will take over >> > > > transmitting and receiving to the NIC itself. that means it is free >> to >> > > > use multiple CPUs for the rx and tx. I notice that you only have one >> > > > rx/tx queue, meaning at most one CPU can send and receive packets. >> > > > When running your performance test with the raw socket, you may want >> > > > to see how busy the system is doing packet sends and receives. Is it >> > > > using more than one CPU's worth of processing? Is it using less, but >> > > > when combined with your main application's usage, the overall system >> > > > is still using more than one? >> > > >> > > Along with the floating point math, I would remove all floating point >> math and use the rte_rdtsc() function to use cycles. 
Using something like: >> > > >> > > uint64_t cur_tsc, next_tsc, timo = (rte_timer_get_hz() / 16); /* >> One 16th of a second use 2/4/8/16/32 power of two numbers to make the math >> simple divide */ >> > > >> > > cur_tsc = rte_rdtsc(); >> > > >> > > next_tsc = cur_tsc + timo; /* Now next_tsc the next time to flush */ >> > > >> > > while(1) { >> > > cur_tsc = rte_rdtsc(); >> > > if (cur_tsc >= next_tsc) { >> > > flush(); >> > > next_tsc += timo; >> > > } >> > > /* Do other stuff */ >> > > } >> > > >> > > For the m_bufPktMap I would use the rte_hash or do not use a hash at >> all by grabbing the buffer address and subtract the >> > > mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + >> RTE_MAX_HEADROOM); >> > > >> > > >> > > DpdkNetDevice:Write(uint8_t *buffer, size_t length) >> > > { >> > > struct rte_mbuf *pkt; >> > > uint64_t cur_tsc; >> > > >> > > pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer, sizeof(struct >> rte_mbuf) + RTE_MAX_HEADROOM); >> > > >> > > /* No need to test pkt, but buffer maybe tested to make sure >> it is not null above the math above */ >> > > >> > > pkt->pk_len = length; >> > > pkt->data_len = length; >> > > >> > > rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt); >> > > >> > > cur_tsc = rte_rdtsc(); >> > > >> > > /* next_tsc is a private variable */ >> > > if (cur_tsc >= next_tsc) { >> > > rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer); >> /* hardcoded the queue id, should be fixed */ >> > > next_tsc = cur_tsc + timo; /* timo is a fixed number >> of cycles to wait */ >> > > } >> > > return length; >> > > } >> > > >> > > DpdkNetDevice::Read() >> > > { >> > > struct rte_mbuf *pkt; >> > > >> > > if (m_rxBuffer->length == 0) { >> > > m_rxBuffer->next = 0; >> > > m_rxBuffer->length = rte_eth_rx_burst(m_portId, 0, >> m_rxBuffer->pmts, MAX_PKT_BURST); >> > > >> > > if (m_rxBuffer->length == 0) >> > > return std::make_pair(NULL, -1); >> > > } >> > > >> > > pkt = m_rxBuffer->pkts[m_rxBuffer->next++]; >> > > >> > > /* do not use 
rte_pktmbuf_read() as it does a copy for the >> complete packet */ >> > > >> > > return std:make_pair(rte_pktmbuf_mtod(pkt, char *), >> pkt->pkt_len); >> > > } >> > > >> > > void >> > > DpdkNetDevice::FreeBuf(uint8_t *buf) >> > > { >> > > struct rte_mbuf *pkt; >> > > >> > > if (!buf) >> > > return; >> > > pkt = (struct rte_mbuf *)RTE_PKT_SUB(buf, sizeof(rte_mbuf) + >> RTE_MAX_HEADROOM); >> > > >> > > rte_pktmbuf_free(pkt); >> > > } >> > > >> > > When your code is done with the buffer, then convert the buffer >> address back to a rte_mbuf pointer and call rte_pktmbuf_free(pkt); This >> should eliminate the copy and floating point code. Converting my C code to >> C++ priceless :-) >> > > >> > > Hopefully the buffer address passed is the original buffer address >> and has not be adjusted. >> > > >> > > >> > > Regards, >> > > Keith >> > > >> > >> > Regards, >> > Keith >> > >> >> Also rdtsc causes cpu to stop doing any look ahead, so there is a >> heisenberg effect. >> Adding more rdtsc will hurt performance. It also looks like your code is >> not doing bursting correctly. >> What if multiple packets arrive in one rx_burst? >> > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-30 9:02 ` Harsh Patel 2018-11-30 10:24 ` Harsh Patel @ 2018-11-30 15:54 ` Wiles, Keith 2018-12-03 9:37 ` Harsh Patel 1 sibling, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2018-11-30 15:54 UTC (permalink / raw) To: Harsh Patel; +Cc: Stephen Hemminger, Kyle Larose, users > On Nov 30, 2018, at 3:02 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hello, > Sorry for the long delay, we were busy with some exams. > > 1) About the NUMA sockets > This is the result of the command you mentioned :- > ====================================================================== > Core and Socket Information (as reported by '/sys/devices/system/cpu') > ====================================================================== > > cores = [0, 1, 2, 3] > sockets = [0] > > Socket 0 > -------- > Core 0 [0] > Core 1 [1] > Core 2 [2] > Core 3 [3] > > We don't know much about this and would like your input on what else to be checked or what do we need to do. > > 2) The part where you asked for a graph > We used `ps` to analyse which CPU cores are being utilized. > The raw socket version had two logical threads which used cores 0 and 1. > The DPDK version had 6 logical threads, which also used cores 0 and 1. This is the case for which we showed you the results. > As the previous case had 2 cores and was not giving desired results, we tried to give more cores to see if the DPDK in ns-3 code can achieve the desired throughput and pps. (We thought giving more cores might improve the performance.) > For this new case, we provided 4 total cores using EAL arguments, upon which, it used cores 0-3. And still we got the same results as the one sent earlier. > We think this means that the bottleneck is a different problem unrelated to number of cores as of now. 
(This whole section is an answer to the question in the last paragraph raised by Kyle to which Keith asked for a graph) In the CPU output above you are running a four-core system with no hyper-threads. This means you only have four cores and four threads in DPDK terms. Using 6 logical threads will not improve performance in the DPDK case. DPDK normally uses a single thread per core. You can have more than one pthread per core, but more than one thread per core requires the software to switch threads, and context switching is not a performance win in most cases. Not sure how your system is set up; a picture could help. I will be traveling all next week and responses will be slow. > > 3) About updating the TX_TIMEOUT and storing rte_get_timer_hz() > We have not tried this and will try it by today and will send you the status after that in some time. > > 4) For the suggestion by Stephen > We are not clear on what you suggested and it would be nice if you elaborate your suggestion. > > Thanks and Regards, > Harsh and Hrishikesh > > PS :- We are done with our exams and would be working now on this regularly. > > On Sun, 25 Nov 2018 at 10:05, Stephen Hemminger <stephen@networkplumber.org> wrote: > On Sat, 24 Nov 2018 16:01:04 +0000 > "Wiles, Keith" <keith.wiles@intel.com> wrote: > > > > On Nov 22, 2018, at 9:54 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > > > Hi > > > > > > Thank you so much for the reply and for the solution. > > > > > > We used the given code. We were amazed by the pointer arithmetic you used, got to learn something new. > > > > > > But still we are under performing.The same bottleneck of ~2.5Mbps is seen. > > > > > > We also checked if the raw socket was using any extra (logical) cores than the DPDK. We found that raw socket has 2 logical threads running on 2 logical CPUs. Whereas, the DPDK version has 6 logical threads on 2 logical CPUs. We also ran the 6 threads on 4 logical CPUs, still we see the same bottleneck. 
> > > > > > We have updated our code (you can use the same links from the previous mail). It would be helpful if you could help us find what causes the bottleneck. > > > > I looked at the code for a few seconds and noticed your TX_TIMEOUT is a macro that calls (rte_get_timer_hz()/2048). Just to be safe I would not call rte_get_timer_hz() each time, but grab the value once, store the hz locally and use that variable instead. My guess is this will not improve performance, and I would have to look at the code of that routine to see if it buys you anything to store the value locally. If getting the hz is just a simple read of a variable then good, but you should still use a local variable within the object to hold the (rte_get_timer_hz()/2048) instead of doing the call and divide each time. > > > > > > > > Thanks and Regards, > > > Harsh and Hrishikesh > > > > > > > > > On Mon, Nov 19, 2018, 19:19 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > > > > > On Nov 17, 2018, at 4:05 PM, Kyle Larose <eomereadig@gmail.com> wrote: > > > > > > > > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > >> > > > >> Hello, > > > >> Thanks a lot for going through the code and providing us with so much > > > >> information. > > > >> We removed all the memcpy/malloc from the data path as you suggested and > > > > ... > > > >> After removing this, we are able to see a performance gain but not as good > > > >> as raw socket. > > > >> > > > > > > > > You're using an unordered_map to map your buffer pointers back to the > > > > mbufs. While it may not do a memcpy all the time, it will likely end > > > > up doing a malloc arbitrarily when you insert or remove entries from > > > > the map. If it needs to resize the table, it'll be even worse. You may > > > > want to consider using librte_hash: > > > > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even better, > > > > see if you can design the system to avoid needing to do a lookup like > > > > this. 
Can you return a handle with the mbuf pointer and the data > > > > together? > > > > > > > > You're also using floating point math where it's unnecessary (the > > > > timing check). Just multiply the numerator by 1000000 prior to doing > > > > the division. I doubt you'll overflow a uint64_t with that. Floating point is not > > > > as efficient as integer math, though I'm not sure offhand it'd cause a > > > > major perf problem. > > > > > > > > One final thing: using a raw socket, the kernel will take over > > > > transmitting and receiving to the NIC itself. That means it is free to > > > > use multiple CPUs for the rx and tx. I notice that you only have one > > > > rx/tx queue, meaning at most one CPU can send and receive packets. > > > > When running your performance test with the raw socket, you may want > > > > to see how busy the system is doing packet sends and receives. Is it > > > > using more than one CPU's worth of processing? Is it using less, but > > > > when combined with your main application's usage, the overall system > > > > is still using more than one? > > > > > > Along with removing the floating point math, I would use the rte_rdtsc() function to work in cycles. 
Using something like: > > > > > > uint64_t cur_tsc, next_tsc, timo = (rte_get_timer_hz() / 16); /* One sixteenth of a second; use powers of two (2/4/8/16/32) to keep the divide simple */ > > > > > > cur_tsc = rte_rdtsc(); > > > > > > next_tsc = cur_tsc + timo; /* Now next_tsc is the next time to flush */ > > > > > > while(1) { > > > cur_tsc = rte_rdtsc(); > > > if (cur_tsc >= next_tsc) { > > > flush(); > > > next_tsc += timo; > > > } > > > /* Do other stuff */ > > > } > > > > > > For the m_bufPktMap I would use the rte_hash, or do not use a hash at all by grabbing the buffer address and subtracting: > > > mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM); > > > > > > > > > DpdkNetDevice::Write(uint8_t *buffer, size_t length) > > > { > > > struct rte_mbuf *pkt; > > > uint64_t cur_tsc; > > > > > > pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer, sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM); > > > > > > /* No need to test pkt, but buffer may be tested above the math to make sure it is not null */ > > > > > > pkt->pkt_len = length; > > > pkt->data_len = length; > > > > > > rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt); > > > > > > cur_tsc = rte_rdtsc(); > > > > > > /* next_tsc is a private variable */ > > > if (cur_tsc >= next_tsc) { > > > rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer); /* hardcoded the queue id, should be fixed */ > > > next_tsc = cur_tsc + timo; /* timo is a fixed number of cycles to wait */ > > > } > > > return length; > > > } > > > > > > DpdkNetDevice::Read() > > > { > > > struct rte_mbuf *pkt; > > > > > > if (m_rxBuffer->length == 0) { > > > m_rxBuffer->next = 0; > > > m_rxBuffer->length = rte_eth_rx_burst(m_portId, 0, m_rxBuffer->pkts, MAX_PKT_BURST); > > > > > > if (m_rxBuffer->length == 0) > > > return std::make_pair(NULL, -1); > > > } > > > > > > pkt = m_rxBuffer->pkts[m_rxBuffer->next++]; > > > > > > /* do not use rte_pktmbuf_read() as it does a copy for the complete packet */ > > > > > > return std::make_pair(rte_pktmbuf_mtod(pkt, char *), pkt->pkt_len); > > > } > > > > > > void > > > DpdkNetDevice::FreeBuf(uint8_t *buf) > > > { > > > struct rte_mbuf *pkt; > > > > > > if (!buf) > > > return; > > > pkt = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM); > > > > > > rte_pktmbuf_free(pkt); > > > } > > > > > > When your code is done with the buffer, convert the buffer address back to an rte_mbuf pointer and call rte_pktmbuf_free(pkt); This should eliminate the copy and the floating point code. Converting my C code to C++: priceless :-) > > > > > > Hopefully the buffer address passed is the original buffer address and has not been adjusted. > > > > > > > > > Regards, > > > Keith > > > > > > > Regards, > > Keith > > > > Also rdtsc causes the CPU to stop doing any look-ahead, so there is a Heisenberg effect. > Adding more rdtsc calls will hurt performance. It also looks like your code is not doing bursting correctly. > What if multiple packets arrive in one rx_burst? Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-11-30 15:54 ` Wiles, Keith @ 2018-12-03 9:37 ` Harsh Patel 2018-12-14 17:41 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread From: Harsh Patel @ 2018-12-03 9:37 UTC (permalink / raw) To: Wiles, Keith; +Cc: stephen, Kyle Larose, users Hello, The data mentioned in the previous mails are observations, and the numbers of threads mentioned are what the system is creating, not given by us to the system. I'm not sure how to explain this by a picture, but I will provide a text explanation. First, we ran the Linux kernel code which uses raw sockets and we gave it 2 cores. That example used 2 threads on its own. Secondly, we ran our DPDK in ns-3 code and gave it the same number of cores, i.e. 2 cores. That example spawned 6 threads on its own. (Note:- These are observations) All of the above statistics were provided to answer the question of whether the two simulations might have been given different numbers of cores, and maybe that was the reason for the performance bottleneck. Clearly they are both using the same no. of cores (2) and the results are what I have sent earlier. (Raw socket ~ 10 Mbps and DPDK ~ 2.5 Mbps) Now we thought that we might give more cores to the DPDK in ns-3 code, which might improve its performance. This is where we gave 4 cores to our DPDK in ns-3 code, which still spawned the same 6 threads. And it gave the same results as 2 cores for DPDK in ns-3. This was the observation. From this, we assume that the number of cores is not a reason for the lower performance. Since this is not the problem, we need to look somewhere else. So, the problem due to which we are getting less performance and a bottleneck around 2.5 Mbps is somewhere else, and we need to figure that out. Ask again if not clear. If clear, we need to see where the problem is; can you help in finding the reason why this is happening? 
Thanks & Regards, Harsh & Hrishikesh On Fri, 30 Nov 2018 at 21:24, Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Nov 30, 2018, at 3:02 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > Hello, > > Sorry for the long delay, we were busy with some exams. > > > > 1) About the NUMA sockets > > This is the result of the command you mentioned :- > > ====================================================================== > > Core and Socket Information (as reported by '/sys/devices/system/cpu') > > ====================================================================== > > > > cores = [0, 1, 2, 3] > > sockets = [0] > > > > Socket 0 > > -------- > > Core 0 [0] > > Core 1 [1] > > Core 2 [2] > > Core 3 [3] > > > > We don't know much about this and would like your input on what else to > be checked or what do we need to do. > > > > 2) The part where you asked for a graph > > We used `ps` to analyse which CPU cores are being utilized. > > The raw socket version had two logical threads which used cores 0 and 1. > > The DPDK version had 6 logical threads, which also used cores 0 and 1. > This is the case for which we showed you the results. > > As the previous case had 2 cores and was not giving desired results, we > tried to give more cores to see if the DPDK in ns-3 code can achieve the > desired throughput and pps. (We thought giving more cores might improve the > performance.) > > For this new case, we provided 4 total cores using EAL arguments, upon > which, it used cores 0-3. And still we got the same results as the one sent > earlier. > > We think this means that the bottleneck is a different problem unrelated > to number of cores as of now. (This whole section is an answer to the > question in the last paragraph raised by Kyle to which Keith asked for a > graph) > > In the CPU output above you are running a four core system with no > hyper-threads. This means you only have four core and four threads in the > terms of DPDK. 
Using 6 logical threads will not improve performance in the > DPDK case. DPDK normally uses a single thread per core. You can have more > than one pthread per core, but having more than one thread per code > requires the software to switch threads. Having context switch is not a > good performance win in most cases. > > Not sure how your system is setup and a picture could help. > > I will be traveling all next week and responses will be slow. > > > > > 3) About updating the TX_TIMEOUT and storing rte_get_timer_hz() > > We have not tried this and will try it by today and will send you the > status after that in some time. > > > > 4) For the suggestion by Stephen > > We are not clear on what you suggested and it would be nice if you > elaborate your suggestion. > > > > Thanks and Regards, > > Harsh and Hrishikesh > > > > PS :- We are done with our exams and would be working now on this > regularly. > > > > On Sun, 25 Nov 2018 at 10:05, Stephen Hemminger < > stephen@networkplumber.org> wrote: > > On Sat, 24 Nov 2018 16:01:04 +0000 > > "Wiles, Keith" <keith.wiles@intel.com> wrote: > > > > > > On Nov 22, 2018, at 9:54 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > > > > > Hi > > > > > > > > Thank you so much for the reply and for the solution. > > > > > > > > We used the given code. We were amazed by the pointer arithmetic you > used, got to learn something new. > > > > > > > > But still we are under performing.The same bottleneck of ~2.5Mbps is > seen. > > > > > > > > We also checked if the raw socket was using any extra (logical) > cores than the DPDK. We found that raw socket has 2 logical threads running > on 2 logical CPUs. Whereas, the DPDK version has 6 logical threads on 2 > logical CPUs. We also ran the 6 threads on 4 logical CPUs, still we see the > same bottleneck. > > > > > > > > We have updated our code (you can use the same links from previous > mail). It would be helpful if you could help us in finding what causes the > bottleneck. 
> > > > > > I looked at the code for a few seconds and noticed your TX_TIMEOUT is > macro that calls (rte_get_timer_hz()/2014) just to be safe I would not call > rte_get_timer_hz() time, but grab the value and store the hz locally and > use that variable instead. This will not improve performance is my guess > and I would have to look at the code the that routine to see if it buys you > anything to store the value locally. If the getting hz is just a simple > read of a variable then good, but still you should should a local variable > within the object to hold the (rte_get_timer_hz()/2048) instead of doing > the call and divide each time. > > > > > > > > > > > Thanks and Regards, > > > > Harsh and Hrishikesh > > > > > > > > > > > > On Mon, Nov 19, 2018, 19:19 Wiles, Keith <keith.wiles@intel.com> > wrote: > > > > > > > > > > > > > On Nov 17, 2018, at 4:05 PM, Kyle Larose <eomereadig@gmail.com> > wrote: > > > > > > > > > > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel < > thadodaharsh10@gmail.com> wrote: > > > > >> > > > > >> Hello, > > > > >> Thanks a lot for going through the code and providing us with so > much > > > > >> information. > > > > >> We removed all the memcpy/malloc from the data path as you > suggested and > > > > > ... > > > > >> After removing this, we are able to see a performance gain but > not as good > > > > >> as raw socket. > > > > >> > > > > > > > > > > You're using an unordered_map to map your buffer pointers back to > the > > > > > mbufs. While it may not do a memcpy all the time, It will likely > end > > > > > up doing a malloc arbitrarily when you insert or remove entries > from > > > > > the map. If it needs to resize the table, it'll be even worse. You > may > > > > > want to consider using librte_hash: > > > > > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even > better, > > > > > see if you can design the system to avoid needing to do a lookup > like > > > > > this. 
Can you return a handle with the mbuf pointer and the data > > > > > together? > > > > > > > > > > You're also using floating point math where it's unnecessary (the > > > > > timing check). Just multiply the numerator by 1000000 prior to > doing > > > > > the division. I doubt you'll overflow a uint64_t with that. It's > not > > > > > as efficient as integer math, though I'm not sure offhand it'd > cause a > > > > > major perf problem. > > > > > > > > > > One final thing: using a raw socket, the kernel will take over > > > > > transmitting and receiving to the NIC itself. that means it is > free to > > > > > use multiple CPUs for the rx and tx. I notice that you only have > one > > > > > rx/tx queue, meaning at most one CPU can send and receive packets. > > > > > When running your performance test with the raw socket, you may > want > > > > > to see how busy the system is doing packet sends and receives. Is > it > > > > > using more than one CPU's worth of processing? Is it using less, > but > > > > > when combined with your main application's usage, the overall > system > > > > > is still using more than one? > > > > > > > > Along with the floating point math, I would remove all floating > point math and use the rte_rdtsc() function to use cycles. 
Using something > like: > > > > > > > > uint64_t cur_tsc, next_tsc, timo = (rte_timer_get_hz() / 16); /* > One 16th of a second use 2/4/8/16/32 power of two numbers to make the math > simple divide */ > > > > > > > > cur_tsc = rte_rdtsc(); > > > > > > > > next_tsc = cur_tsc + timo; /* Now next_tsc the next time to flush */ > > > > > > > > while(1) { > > > > cur_tsc = rte_rdtsc(); > > > > if (cur_tsc >= next_tsc) { > > > > flush(); > > > > next_tsc += timo; > > > > } > > > > /* Do other stuff */ > > > > } > > > > > > > > For the m_bufPktMap I would use the rte_hash or do not use a hash at > all by grabbing the buffer address and subtract the > > > > mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + > RTE_MAX_HEADROOM); > > > > > > > > > > > > DpdkNetDevice:Write(uint8_t *buffer, size_t length) > > > > { > > > > struct rte_mbuf *pkt; > > > > uint64_t cur_tsc; > > > > > > > > pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer, sizeof(struct > rte_mbuf) + RTE_MAX_HEADROOM); > > > > > > > > /* No need to test pkt, but buffer maybe tested to make sure > it is not null above the math above */ > > > > > > > > pkt->pk_len = length; > > > > pkt->data_len = length; > > > > > > > > rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt); > > > > > > > > cur_tsc = rte_rdtsc(); > > > > > > > > /* next_tsc is a private variable */ > > > > if (cur_tsc >= next_tsc) { > > > > rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer); > /* hardcoded the queue id, should be fixed */ > > > > next_tsc = cur_tsc + timo; /* timo is a fixed number > of cycles to wait */ > > > > } > > > > return length; > > > > } > > > > > > > > DpdkNetDevice::Read() > > > > { > > > > struct rte_mbuf *pkt; > > > > > > > > if (m_rxBuffer->length == 0) { > > > > m_rxBuffer->next = 0; > > > > m_rxBuffer->length = rte_eth_rx_burst(m_portId, 0, > m_rxBuffer->pmts, MAX_PKT_BURST); > > > > > > > > if (m_rxBuffer->length == 0) > > > > return std::make_pair(NULL, -1); > > > > } > > > > > > > > pkt = 
m_rxBuffer->pkts[m_rxBuffer->next++]; > > > > > > > > /* do not use rte_pktmbuf_read() as it does a copy for the > complete packet */ > > > > > > > > return std:make_pair(rte_pktmbuf_mtod(pkt, char *), > pkt->pkt_len); > > > > } > > > > > > > > void > > > > DpdkNetDevice::FreeBuf(uint8_t *buf) > > > > { > > > > struct rte_mbuf *pkt; > > > > > > > > if (!buf) > > > > return; > > > > pkt = (struct rte_mbuf *)RTE_PKT_SUB(buf, sizeof(rte_mbuf) + > RTE_MAX_HEADROOM); > > > > > > > > rte_pktmbuf_free(pkt); > > > > } > > > > > > > > When your code is done with the buffer, then convert the buffer > address back to a rte_mbuf pointer and call rte_pktmbuf_free(pkt); This > should eliminate the copy and floating point code. Converting my C code to > C++ priceless :-) > > > > > > > > Hopefully the buffer address passed is the original buffer address > and has not be adjusted. > > > > > > > > > > > > Regards, > > > > Keith > > > > > > > > > > Regards, > > > Keith > > > > > > > Also rdtsc causes cpu to stop doing any look ahead, so there is a > heisenberg effect. > > Adding more rdtsc will hurt performance. It also looks like your code > is not doing bursting correctly. > > What if multiple packets arrive in one rx_burst? > > Regards, > Keith > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-12-03 9:37 ` Harsh Patel @ 2018-12-14 17:41 ` Harsh Patel 2018-12-14 18:06 ` Wiles, Keith 0 siblings, 1 reply; 43+ messages in thread From: Harsh Patel @ 2018-12-14 17:41 UTC (permalink / raw) To: Wiles, Keith; +Cc: stephen, Kyle Larose, users Hello, It has been a big break since our last message. We want to inform you that we have tried a few things, and we will show some results which we think might be relevant for the progress. We thought that there might be some relation between the burst size and throughput, so we took a 10 Mbps flow and a 20 Mbps flow and changed the burst size through 1, 2, 4, 8, 16, 32 and so on up to 256, which is the size of the mbuf pool. We found that the throughput we get for all of these flows is in the range of 8.5-9.0 Mbps, which is the bottleneck for the wireless environment. Secondly, we modified the value of the variable in the equation used to calculate TX_TIMEOUT, where we used rte_get_timer_hz()/2048, and we changed 2048 to the values 16, 32, 64, ..., 16384. We are not able to see any difference in the performance. We had been trying a lot of things and thought this might be something that had some effect; we now guess it doesn't. Also, we mentioned earlier that we replaced the code to use pointer arithmetic and allocated a memory pool for the Tx/Rx intermediate buffers to convert the single packet flow to a burst and vice versa. In that code, we allocated a single memory pool which was used by both the Tx buffer and the Rx buffer. We thought this might have some effect, so we implemented a version with 2 separate memory pools, 1 for Tx and 1 for Rx. But again, in this case we are not able to see any difference in the performance. The modified code for the experiments is not available on the repository for which we gave a link earlier. That code just contains some tweaks which are not that important; you can ask for it in case you need it. 
Also, the main code, which is working and up to date, is on the repository, and you can have a look at it. We wanted to inform you about this and would like to hear from you on what else we can do to find out where the problem is. It would be really helpful if you could point out the mistake or problem in the code, or give an idea as to what might be creating this problem. We thank you for your time. Regards, Harsh and Hrishikesh On Mon, 3 Dec 2018 at 15:07, Harsh Patel <thadodaharsh10@gmail.com> wrote: > Hello, > The data mentioned in the previous mails are observations and the number > of threads mentioned are what the system is creating and not given by us to > the system. I'm not sure how to explain this by a picture but I will > provide a text explanation. > > First, we ran the Linux kernel code which uses raw sockets and we gave 2 > cores. That example used 2 threads on its own. > Secondly, we ran our DPDK in ns-3 code and we the same number of cores > i.e. 2 cores. That example spawned 6 threads on its own. > (Note:- These are observations) > All of the above statistics were provided to answer the question if both > the simulations might be given different number of cores and may be that > was the reason of the performance bottleneck. Clearly they are both using > same no. of cores (2) and the results are what I have sent earlier. (Raw > socket ~ 10 Mbps and DPDK ~ 2.5 Mbps) > > Now we thought that we might give more cores to DPDK in ns-3 code, which > might improve its performance. > This is where we gave 4 cores to our DPDK in ns-3 code which still spawned > the same 6 threads. And it gave the same results as 2 cores for DPDK in > ns-3. > This was the observation. > > From this, we assume that the number of cores is not a reason for the less > performance. This is not a problem we need to look somewhere else. 
> So, the problem due to which we are getting less performance and a > bottleneck around 2.5Mbps is somewhere else and we need to figure that out. > > Ask again if not clear. If clear, we need to see where the problem is and > can you help in finding the reason why this happennig? > > Thanks & Regards, > Harsh & Hrishikesh > > > On Fri, 30 Nov 2018 at 21:24, Wiles, Keith <keith.wiles@intel.com> wrote: > >> >> >> > On Nov 30, 2018, at 3:02 AM, Harsh Patel <thadodaharsh10@gmail.com> >> wrote: >> > >> > Hello, >> > Sorry for the long delay, we were busy with some exams. >> > >> > 1) About the NUMA sockets >> > This is the result of the command you mentioned :- >> > ====================================================================== >> > Core and Socket Information (as reported by '/sys/devices/system/cpu') >> > ====================================================================== >> > >> > cores = [0, 1, 2, 3] >> > sockets = [0] >> > >> > Socket 0 >> > -------- >> > Core 0 [0] >> > Core 1 [1] >> > Core 2 [2] >> > Core 3 [3] >> > >> > We don't know much about this and would like your input on what else to >> be checked or what do we need to do. >> > >> > 2) The part where you asked for a graph >> > We used `ps` to analyse which CPU cores are being utilized. >> > The raw socket version had two logical threads which used cores 0 and 1. >> > The DPDK version had 6 logical threads, which also used cores 0 and 1. >> This is the case for which we showed you the results. >> > As the previous case had 2 cores and was not giving desired results, we >> tried to give more cores to see if the DPDK in ns-3 code can achieve the >> desired throughput and pps. (We thought giving more cores might improve the >> performance.) >> > For this new case, we provided 4 total cores using EAL arguments, upon >> which, it used cores 0-3. And still we got the same results as the one sent >> earlier. 
>> > We think this means that the bottleneck is a different problem >> unrelated to number of cores as of now. (This whole section is an answer to >> the question in the last paragraph raised by Kyle to which Keith asked for >> a graph) >> >> In the CPU output above you are running a four core system with no >> hyper-threads. This means you only have four core and four threads in the >> terms of DPDK. Using 6 logical threads will not improve performance in the >> DPDK case. DPDK normally uses a single thread per core. You can have more >> than one pthread per core, but having more than one thread per code >> requires the software to switch threads. Having context switch is not a >> good performance win in most cases. >> >> Not sure how your system is setup and a picture could help. >> >> I will be traveling all next week and responses will be slow. >> >> > >> > 3) About updating the TX_TIMEOUT and storing rte_get_timer_hz() >> > We have not tried this and will try it by today and will send you the >> status after that in some time. >> > >> > 4) For the suggestion by Stephen >> > We are not clear on what you suggested and it would be nice if you >> elaborate your suggestion. >> > >> > Thanks and Regards, >> > Harsh and Hrishikesh >> > >> > PS :- We are done with our exams and would be working now on this >> regularly. >> > >> > On Sun, 25 Nov 2018 at 10:05, Stephen Hemminger < >> stephen@networkplumber.org> wrote: >> > On Sat, 24 Nov 2018 16:01:04 +0000 >> > "Wiles, Keith" <keith.wiles@intel.com> wrote: >> > >> > > > On Nov 22, 2018, at 9:54 AM, Harsh Patel <thadodaharsh10@gmail.com> >> wrote: >> > > > >> > > > Hi >> > > > >> > > > Thank you so much for the reply and for the solution. >> > > > >> > > > We used the given code. We were amazed by the pointer arithmetic >> you used, got to learn something new. >> > > > >> > > > But still we are under performing.The same bottleneck of ~2.5Mbps >> is seen. 
>> > > > >> > > > We also checked if the raw socket was using any extra (logical) >> cores than the DPDK. We found that raw socket has 2 logical threads running >> on 2 logical CPUs. Whereas, the DPDK version has 6 logical threads on 2 >> logical CPUs. We also ran the 6 threads on 4 logical CPUs, still we see the >> same bottleneck. >> > > > >> > > > We have updated our code (you can use the same links from previous >> mail). It would be helpful if you could help us in finding what causes the >> bottleneck. >> > > >> > > I looked at the code for a few seconds and noticed your TX_TIMEOUT is >> macro that calls (rte_get_timer_hz()/2014) just to be safe I would not call >> rte_get_timer_hz() time, but grab the value and store the hz locally and >> use that variable instead. This will not improve performance is my guess >> and I would have to look at the code the that routine to see if it buys you >> anything to store the value locally. If the getting hz is just a simple >> read of a variable then good, but still you should should a local variable >> within the object to hold the (rte_get_timer_hz()/2048) instead of doing >> the call and divide each time. >> > > >> > > > >> > > > Thanks and Regards, >> > > > Harsh and Hrishikesh >> > > > >> > > > >> > > > On Mon, Nov 19, 2018, 19:19 Wiles, Keith <keith.wiles@intel.com> >> wrote: >> > > > >> > > > >> > > > > On Nov 17, 2018, at 4:05 PM, Kyle Larose <eomereadig@gmail.com> >> wrote: >> > > > > >> > > > > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel < >> thadodaharsh10@gmail.com> wrote: >> > > > >> >> > > > >> Hello, >> > > > >> Thanks a lot for going through the code and providing us with so >> much >> > > > >> information. >> > > > >> We removed all the memcpy/malloc from the data path as you >> suggested and >> > > > > ... >> > > > >> After removing this, we are able to see a performance gain but >> not as good >> > > > >> as raw socket. 
>> > > > >> >> > > > > >> > > > > You're using an unordered_map to map your buffer pointers back to >> the >> > > > > mbufs. While it may not do a memcpy all the time, It will likely >> end >> > > > > up doing a malloc arbitrarily when you insert or remove entries >> from >> > > > > the map. If it needs to resize the table, it'll be even worse. >> You may >> > > > > want to consider using librte_hash: >> > > > > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even >> better, >> > > > > see if you can design the system to avoid needing to do a lookup >> like >> > > > > this. Can you return a handle with the mbuf pointer and the data >> > > > > together? >> > > > > >> > > > > You're also using floating point math where it's unnecessary (the >> > > > > timing check). Just multiply the numerator by 1000000 prior to >> doing >> > > > > the division. I doubt you'll overflow a uint64_t with that. It's >> not >> > > > > as efficient as integer math, though I'm not sure offhand it'd >> cause a >> > > > > major perf problem. >> > > > > >> > > > > One final thing: using a raw socket, the kernel will take over >> > > > > transmitting and receiving to the NIC itself. that means it is >> free to >> > > > > use multiple CPUs for the rx and tx. I notice that you only have >> one >> > > > > rx/tx queue, meaning at most one CPU can send and receive packets. >> > > > > When running your performance test with the raw socket, you may >> want >> > > > > to see how busy the system is doing packet sends and receives. Is >> it >> > > > > using more than one CPU's worth of processing? Is it using less, >> but >> > > > > when combined with your main application's usage, the overall >> system >> > > > > is still using more than one? >> > > > >> > > > Along with the floating point math, I would remove all floating >> point math and use the rte_rdtsc() function to use cycles. 
Using something >> like: >> > > > >> > > > uint64_t cur_tsc, next_tsc, timo = (rte_timer_get_hz() / 16); /* >> One 16th of a second use 2/4/8/16/32 power of two numbers to make the math >> simple divide */ >> > > > >> > > > cur_tsc = rte_rdtsc(); >> > > > >> > > > next_tsc = cur_tsc + timo; /* Now next_tsc the next time to flush */ >> > > > >> > > > while(1) { >> > > > cur_tsc = rte_rdtsc(); >> > > > if (cur_tsc >= next_tsc) { >> > > > flush(); >> > > > next_tsc += timo; >> > > > } >> > > > /* Do other stuff */ >> > > > } >> > > > >> > > > For the m_bufPktMap I would use the rte_hash or do not use a hash >> at all by grabbing the buffer address and subtract the >> > > > mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) >> + RTE_MAX_HEADROOM); >> > > > >> > > > >> > > > DpdkNetDevice:Write(uint8_t *buffer, size_t length) >> > > > { >> > > > struct rte_mbuf *pkt; >> > > > uint64_t cur_tsc; >> > > > >> > > > pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer, sizeof(struct >> rte_mbuf) + RTE_MAX_HEADROOM); >> > > > >> > > > /* No need to test pkt, but buffer maybe tested to make >> sure it is not null above the math above */ >> > > > >> > > > pkt->pk_len = length; >> > > > pkt->data_len = length; >> > > > >> > > > rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt); >> > > > >> > > > cur_tsc = rte_rdtsc(); >> > > > >> > > > /* next_tsc is a private variable */ >> > > > if (cur_tsc >= next_tsc) { >> > > > rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer); >> /* hardcoded the queue id, should be fixed */ >> > > > next_tsc = cur_tsc + timo; /* timo is a fixed >> number of cycles to wait */ >> > > > } >> > > > return length; >> > > > } >> > > > >> > > > DpdkNetDevice::Read() >> > > > { >> > > > struct rte_mbuf *pkt; >> > > > >> > > > if (m_rxBuffer->length == 0) { >> > > > m_rxBuffer->next = 0; >> > > > m_rxBuffer->length = rte_eth_rx_burst(m_portId, 0, >> m_rxBuffer->pmts, MAX_PKT_BURST); >> > > > >> > > > if (m_rxBuffer->length == 0) >> > > > return 
std::make_pair(NULL, -1); >> > > > } >> > > > >> > > > pkt = m_rxBuffer->pkts[m_rxBuffer->next++]; >> > > > >> > > > /* do not use rte_pktmbuf_read() as it does a copy for the >> complete packet */ >> > > > >> > > > return std:make_pair(rte_pktmbuf_mtod(pkt, char *), >> pkt->pkt_len); >> > > > } >> > > > >> > > > void >> > > > DpdkNetDevice::FreeBuf(uint8_t *buf) >> > > > { >> > > > struct rte_mbuf *pkt; >> > > > >> > > > if (!buf) >> > > > return; >> > > > pkt = (struct rte_mbuf *)RTE_PKT_SUB(buf, sizeof(rte_mbuf) >> + RTE_MAX_HEADROOM); >> > > > >> > > > rte_pktmbuf_free(pkt); >> > > > } >> > > > >> > > > When your code is done with the buffer, then convert the buffer >> address back to a rte_mbuf pointer and call rte_pktmbuf_free(pkt); This >> should eliminate the copy and floating point code. Converting my C code to >> C++ priceless :-) >> > > > >> > > > Hopefully the buffer address passed is the original buffer address >> and has not be adjusted. >> > > > >> > > > >> > > > Regards, >> > > > Keith >> > > > >> > > >> > > Regards, >> > > Keith >> > > >> > >> > Also rdtsc causes cpu to stop doing any look ahead, so there is a >> heisenberg effect. >> > Adding more rdtsc will hurt performance. It also looks like your code >> is not doing bursting correctly. >> > What if multiple packets arrive in one rx_burst? >> >> Regards, >> Keith >> >> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-12-14 17:41 ` Harsh Patel @ 2018-12-14 18:06 ` Wiles, Keith [not found] ` <CAA0iYrHyLtO3XLXMq-aeVhgJhns0+ErfuhEeDSNDi4cFVBcZmw@mail.gmail.com> 0 siblings, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2018-12-14 18:06 UTC (permalink / raw) To: Harsh Patel; +Cc: Stephen Hemminger, Kyle Larose, users
> On Dec 14, 2018, at 11:41 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote:
>
> Hello,
> It has been a long break since our last message. We want to inform you that we have tried a few things, and we will show some results which we think might be relevant to the progress.
>
> We thought that there might be some relation between the burst size and throughput, so we took a 10Mbps flow and a 20Mbps flow and changed the burst size to 1,2,4,8,16,32 and so on up to 256, which is the size of the mbuf pool. We found that the throughput we get for all of these flows is in the range of 8.5-9.0 Mbps, which is the bottleneck for the wireless environment.
>
> Secondly, we modified the value of the variable in the equation to calculate TX_TIMEOUT, where we used rte_get_timer_hz()/2048, and we changed 2048 to the values 16,32,64,...,16384. We are not able to see any difference in the performance. We had been trying a lot of things and thought maybe this was something that had some effect. We now guess it doesn't.
>
> Also, as mentioned, we replaced the code to use pointer arithmetic and allocated a memory pool for the Tx/Rx intermediate buffers to convert the single-packet flow to a burst and vice versa. In this code, we allocated a single memory pool which was used by both the Tx buffer and the Rx buffer. We thought this might have some effect, so we implemented a version with 2 separate memory pools, 1 for Tx and 1 for Rx. But again, in this case we are not able to see any difference in the performance.
>
> The modified code for the experiments is not available on the repository for which we gave a link earlier.
That code just contains some tweaks which are not that important; you can ask for it if needed. Also, the main code is there on the repository, working and up to date, which you can have a look at.
>
> We wanted to inform you about this and would like to hear from you on what else we can do to find out where the problem is. It would be really helpful if you can point out the mistake or problem in the code, or give an idea as to what might be creating this problem.
>
> We thank you for your time.

Well I do not know why you get that level of performance. I assume you are building your code with -O3 optimization. It must be something else, as we know that DPDK performs well, but it could be the C++ code or ???? Can you try vTune, or do you have access to that type of tool, to analyze your complete application? This seems to be the only direction to go now: find a tool to measure the performance of the code to locate the bottlenecks.

>
> Regards,
> Harsh and Hrishikesh
>
> On Mon, 3 Dec 2018 at 15:07, Harsh Patel <thadodaharsh10@gmail.com> wrote:
> Hello,
> The data mentioned in the previous mails are observations, and the number of threads mentioned is what the system is creating, not something we give to the system. I'm not sure how to explain this with a picture, but I will provide a text explanation.
>
> First, we ran the Linux kernel code which uses raw sockets, and we gave it 2 cores. That example used 2 threads on its own.
> Secondly, we ran our DPDK in ns-3 code and gave it the same number of cores, i.e. 2 cores. That example spawned 6 threads on its own.
> (Note:- These are observations)
> All of the above statistics were provided to answer the question of whether the two simulations might have been given different numbers of cores, and maybe that was the reason for the performance bottleneck. Clearly they are both using the same no. of cores (2), and the results are what I have sent earlier.
(Raw socket ~ 10 Mbps and DPDK ~ 2.5 Mbps)
>
> Now we thought that we might give more cores to the DPDK in ns-3 code, which might improve its performance.
> This is where we gave 4 cores to our DPDK in ns-3 code, which still spawned the same 6 threads. And it gave the same results as 2 cores for DPDK in ns-3.
> This was the observation.
>
> From this, we assume that the number of cores is not the reason for the lower performance; the problem is elsewhere and we need to look somewhere else.
> So, the problem due to which we are getting lower performance and a bottleneck around 2.5Mbps is somewhere else, and we need to figure that out.
>
> Ask again if not clear. If clear, we need to see where the problem is; can you help in finding the reason why this is happening?
>
> Thanks & Regards,
> Harsh & Hrishikesh
>
> On Fri, 30 Nov 2018 at 21:24, Wiles, Keith <keith.wiles@intel.com> wrote:
>
> > On Nov 30, 2018, at 3:02 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote:
> >
> > Hello,
> > Sorry for the long delay, we were busy with some exams.
> >
> > 1) About the NUMA sockets
> > This is the result of the command you mentioned :-
> > ======================================================================
> > Core and Socket Information (as reported by '/sys/devices/system/cpu')
> > ======================================================================
> >
> > cores = [0, 1, 2, 3]
> > sockets = [0]
> >
> > Socket 0
> > --------
> > Core 0 [0]
> > Core 1 [1]
> > Core 2 [2]
> > Core 3 [3]
> >
> > We don't know much about this and would like your input on what else should be checked or what we need to do.
> >
> > 2) The part where you asked for a graph
> > We used `ps` to analyse which CPU cores are being utilized.
> > The raw socket version had two logical threads which used cores 0 and 1.
> > The DPDK version had 6 logical threads, which also used cores 0 and 1. This is the case for which we showed you the results.
> > As the previous case had 2 cores and was not giving the desired results, we tried to give more cores to see if the DPDK in ns-3 code could achieve the desired throughput and pps. (We thought giving more cores might improve the performance.)
> > For this new case, we provided 4 total cores using EAL arguments, upon which it used cores 0-3. And still we got the same results as the ones sent earlier.
> > We think this means that the bottleneck is a different problem, unrelated to the number of cores as of now. (This whole section is an answer to the question in the last paragraph raised by Kyle, to which Keith asked for a graph)

In the CPU output above you are running a four-core system with no hyper-threads. This means you only have four cores and four threads in DPDK terms. Using 6 logical threads will not improve performance in the DPDK case. DPDK normally uses a single thread per core. You can have more than one pthread per core, but having more than one thread per core requires the software to switch threads. Context switching is not a performance win in most cases.

Not sure how your system is set up, and a picture could help.

I will be traveling all next week and responses will be slow.

> >
> > 3) About updating the TX_TIMEOUT and storing rte_get_timer_hz()
> > We have not tried this and will try it by today, and will send you the status after that in some time.
> >
> > 4) For the suggestion by Stephen
> > We are not clear on what you suggested and it would be nice if you could elaborate on your suggestion.
> >
> > Thanks and Regards,
> > Harsh and Hrishikesh
> >
> > PS :- We are done with our exams and will be working on this regularly now.
> > On Sun, 25 Nov 2018 at 10:05, Stephen Hemminger <stephen@networkplumber.org> wrote:
> > On Sat, 24 Nov 2018 16:01:04 +0000
> > "Wiles, Keith" <keith.wiles@intel.com> wrote:
> >
> > > > On Nov 22, 2018, at 9:54 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote:
> > > >
> > > > Hi
> > > >
> > > > Thank you so much for the reply and for the solution.
> > > >
> > > > We used the given code. We were amazed by the pointer arithmetic you used; got to learn something new.
> > > >
> > > > But still we are underperforming. The same bottleneck of ~2.5Mbps is seen.
> > > >
> > > > We also checked if the raw socket was using any extra (logical) cores compared to the DPDK. We found that the raw socket has 2 logical threads running on 2 logical CPUs, whereas the DPDK version has 6 logical threads on 2 logical CPUs. We also ran the 6 threads on 4 logical CPUs; still we see the same bottleneck.
> > > >
> > > > We have updated our code (you can use the same links from the previous mail). It would be helpful if you could help us in finding what causes the bottleneck.
> > >
> > > I looked at the code for a few seconds and noticed your TX_TIMEOUT is a macro that calls (rte_get_timer_hz()/2048). Just to be safe, I would not call rte_get_timer_hz() each time, but grab the value and store the hz locally and use that variable instead. My guess is this will not improve performance, and I would have to look at the code of that routine to see if it buys you anything to store the value locally. If getting hz is just a simple read of a variable then good, but you should still use a local variable within the object to hold the (rte_get_timer_hz()/2048) instead of doing the call and divide each time.
> > > > > > > > > > > Thanks and Regards, > > > > Harsh and Hrishikesh > > > > > > > > > > > > On Mon, Nov 19, 2018, 19:19 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > > > > > > > > On Nov 17, 2018, at 4:05 PM, Kyle Larose <eomereadig@gmail.com> wrote: > > > > > > > > > > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > >> > > > > >> Hello, > > > > >> Thanks a lot for going through the code and providing us with so much > > > > >> information. > > > > >> We removed all the memcpy/malloc from the data path as you suggested and > > > > > ... > > > > >> After removing this, we are able to see a performance gain but not as good > > > > >> as raw socket. > > > > >> > > > > > > > > > > You're using an unordered_map to map your buffer pointers back to the > > > > > mbufs. While it may not do a memcpy all the time, It will likely end > > > > > up doing a malloc arbitrarily when you insert or remove entries from > > > > > the map. If it needs to resize the table, it'll be even worse. You may > > > > > want to consider using librte_hash: > > > > > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even better, > > > > > see if you can design the system to avoid needing to do a lookup like > > > > > this. Can you return a handle with the mbuf pointer and the data > > > > > together? > > > > > > > > > > You're also using floating point math where it's unnecessary (the > > > > > timing check). Just multiply the numerator by 1000000 prior to doing > > > > > the division. I doubt you'll overflow a uint64_t with that. It's not > > > > > as efficient as integer math, though I'm not sure offhand it'd cause a > > > > > major perf problem. > > > > > > > > > > One final thing: using a raw socket, the kernel will take over > > > > > transmitting and receiving to the NIC itself. that means it is free to > > > > > use multiple CPUs for the rx and tx. 
I notice that you only have one rx/tx queue, meaning at most one CPU can send and receive packets.
> > > > > When running your performance test with the raw socket, you may want to see how busy the system is doing packet sends and receives. Is it using more than one CPU's worth of processing? Is it using less, but when combined with your main application's usage, the overall system is still using more than one?
> > > >
> > > > Along with the floating point math, I would remove all floating point math and use the rte_rdtsc() function to use cycles. Using something like:
> > > >
> > > > uint64_t cur_tsc, next_tsc, timo = (rte_get_timer_hz() / 16); /* One 16th of a second; use 2/4/8/16/32 power-of-two numbers to make the math a simple divide */
> > > >
> > > > cur_tsc = rte_rdtsc();
> > > >
> > > > next_tsc = cur_tsc + timo; /* Now next_tsc is the next time to flush */
> > > >
> > > > while(1) {
> > > >     cur_tsc = rte_rdtsc();
> > > >     if (cur_tsc >= next_tsc) {
> > > >         flush();
> > > >         next_tsc += timo;
> > > >     }
> > > >     /* Do other stuff */
> > > > }
> > > >
> > > > For the m_bufPktMap I would use the rte_hash, or do not use a hash at all by grabbing the buffer address and subtracting:
> > > > mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM);
> > > >
> > > >
> > > > DpdkNetDevice::Write(uint8_t *buffer, size_t length)
> > > > {
> > > >     struct rte_mbuf *pkt;
> > > >     uint64_t cur_tsc;
> > > >
> > > >     pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer, sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM);
> > > >
> > > >     /* No need to test pkt, but buffer may be tested above the math to make sure it is not null */
> > > >
> > > >     pkt->pkt_len = length;
> > > >     pkt->data_len = length;
> > > >
> > > >     rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt);
> > > >
> > > >     cur_tsc = rte_rdtsc();
> > > >
> > > >     /* next_tsc is a private variable */
> > > >     if (cur_tsc >= next_tsc) {
rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer); /* hardcoded the queue id, should be fixed */
> > > >         next_tsc = cur_tsc + timo; /* timo is a fixed number of cycles to wait */
> > > >     }
> > > >     return length;
> > > > }
> > > >
> > > > DpdkNetDevice::Read()
> > > > {
> > > >     struct rte_mbuf *pkt;
> > > >
> > > >     if (m_rxBuffer->length == 0) {
> > > >         m_rxBuffer->next = 0;
> > > >         m_rxBuffer->length = rte_eth_rx_burst(m_portId, 0, m_rxBuffer->pkts, MAX_PKT_BURST);
> > > >
> > > >         if (m_rxBuffer->length == 0)
> > > >             return std::make_pair(NULL, -1);
> > > >     }
> > > >
> > > >     pkt = m_rxBuffer->pkts[m_rxBuffer->next++];
> > > >
> > > >     /* do not use rte_pktmbuf_read() as it does a copy of the complete packet */
> > > >
> > > >     return std::make_pair(rte_pktmbuf_mtod(pkt, char *), pkt->pkt_len);
> > > > }
> > > >
> > > > void
> > > > DpdkNetDevice::FreeBuf(uint8_t *buf)
> > > > {
> > > >     struct rte_mbuf *pkt;
> > > >
> > > >     if (!buf)
> > > >         return;
> > > >     pkt = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM);
> > > >
> > > >     rte_pktmbuf_free(pkt);
> > > > }
> > > >
> > > > When your code is done with the buffer, convert the buffer address back to an rte_mbuf pointer and call rte_pktmbuf_free(pkt); this should eliminate the copy and the floating point code. Converting my C code to C++: priceless :-)
> > > >
> > > > Hopefully the buffer address passed is the original buffer address and has not been adjusted.
> > > >
> > > >
> > > > Regards,
> > > > Keith
> > > >
> > >
> > > Regards,
> > > Keith
> > >
> >
> > Also rdtsc causes the cpu to stop doing any look ahead, so there is a Heisenberg effect.
> > Adding more rdtsc calls will hurt performance. It also looks like your code is not doing bursting correctly.
> > What if multiple packets arrive in one rx_burst?
>
> Regards,
> Keith
>
Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
[parent not found: <CAA0iYrHyLtO3XLXMq-aeVhgJhns0+ErfuhEeDSNDi4cFVBcZmw@mail.gmail.com>]
* Re: [dpdk-users] Query on handling packets [not found] ` <CAA0iYrHyLtO3XLXMq-aeVhgJhns0+ErfuhEeDSNDi4cFVBcZmw@mail.gmail.com> @ 2018-12-30 0:19 ` Wiles, Keith 2018-12-30 0:30 ` Wiles, Keith 1 sibling, 0 replies; 43+ messages in thread From: Wiles, Keith @ 2018-12-30 0:19 UTC (permalink / raw) To: Harsh Patel; +Cc: Stephen Hemminger, Kyle Larose, users > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hello, > As suggested, we tried profiling the application using Intel VTune Amplifier. We aren't sure how to use these results, so we are attaching them to this email. > > The things we understood were 'Top Hotspots' and 'Effective CPU utilization'. Following are some of our understandings: > > Top Hotspots > > Function Module CPU Time > rte_delay_us_block librte_eal.so.6.1 15.042s > eth_em_recv_pkts librte_pmd_e1000.so 9.544s > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so 3.522s > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so 2.470s > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 2.456s > [Others] 6.656s > > We knew about other methods except `rte_delay_us_block`. So we investigated the callers of this method: > > Callers Effective Time Spin Time Overhead Time Effective Time Spin Time Overhead Time Wait Time: Total Wait Time: Self > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec 0usec > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s 0usec 0usec > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec 0usec > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec 0usec > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec 0usec > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% 0.064s 0usec 0usec > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s 0usec 0usec > > We lack sufficient knowledge to investigate more than this. > > Effective CPU utilization > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of 4 logical CPUs). We thought this is less. 
So we compared this with the raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 logical CPUs), and even then it is performing way better.
>
> It would be helpful if you could give us insights on how to use these results or point us to some resources to do so.

I tracked rte_delay_us_block down to the SendFrom() function calling the IsLinkUp() function; it appears that routine is called on every SendFrom() call, which for the e1000 must be a very expensive call. So rework your code to not call IsLinkUp() except every so often. I believe you can enable the link status interrupt in DPDK to take an interrupt on link status change, which would be better than calling this routine. How you do that I am not sure, but it should be in the docs someplace. For now I would remove the IsLinkUp() call and just assume the link is up after you check it the first time in the Setup function.

>
> Thank you
>
> Regards
> Harsh & Hrishikesh
>
Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets [not found] ` <CAA0iYrHyLtO3XLXMq-aeVhgJhns0+ErfuhEeDSNDi4cFVBcZmw@mail.gmail.com> 2018-12-30 0:19 ` Wiles, Keith @ 2018-12-30 0:30 ` Wiles, Keith 2019-01-03 18:12 ` Harsh Patel 1 sibling, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2018-12-30 0:30 UTC (permalink / raw) To: Harsh Patel; +Cc: Stephen Hemminger, Kyle Larose, users > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hello, > As suggested, we tried profiling the application using Intel VTune Amplifier. We aren't sure how to use these results, so we are attaching them to this email. > > The things we understood were 'Top Hotspots' and 'Effective CPU utilization'. Following are some of our understandings: > > Top Hotspots > > Function Module CPU Time > rte_delay_us_block librte_eal.so.6.1 15.042s > eth_em_recv_pkts librte_pmd_e1000.so 9.544s > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so 3.522s > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so 2.470s > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 2.456s > [Others] 6.656s > > We knew about other methods except `rte_delay_us_block`. So we investigated the callers of this method: > > Callers Effective Time Spin Time Overhead Time Effective Time Spin Time Overhead Time Wait Time: Total Wait Time: Self > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec 0usec > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s 0usec 0usec > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec 0usec > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec 0usec > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec 0usec > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% 0.064s 0usec 0usec > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s 0usec 0usec > > We lack sufficient knowledge to investigate more than this. > > Effective CPU utilization > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of 4 logical CPUs). We thought this is less. 
So we compared this with the raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 logical CPUs), and even then it is performing way better. > > It would be helpful if you give us insights on how to use these results or point us to some resources to do so. > > Thank you > BTW, I was able to build ns3 with DPDK 18.11 it required a couple changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. I did have a problem including rte_mbuf.h file into your code. It appears the g++ compiler did not like referencing the struct rte_mbuf_sched inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union as a hack I moved the struct outside of the rte_mbuf structure and replaced the struct in the union with ’struct rte_mbuf_sched sched;', but I am guessing you are missing some compiler options in your build system as DPDK builds just fine without that hack. The next place was the rxmode and the txq_flags. The rxmode structure has changed and I commented out the inits in ns3 and then commented out the txq_flags init code as these are now the defaults. Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2018-12-30 0:30 ` Wiles, Keith @ 2019-01-03 18:12 ` Harsh Patel 2019-01-03 22:43 ` Wiles, Keith 0 siblings, 1 reply; 43+ messages in thread From: Harsh Patel @ 2019-01-03 18:12 UTC (permalink / raw) To: Wiles, Keith; +Cc: Stephen Hemminger, Kyle Larose, users
Hi
We applied your suggestion of removing the `IsLinkUp()` call. But the performance is even worse. We could only get around 340kbits/s.

The Top Hotspots are:

Function                          Module                              CPU Time
eth_em_recv_pkts                  librte_pmd_e1000.so                 15.106s
rte_delay_us_block                librte_eal.so.6.1                   7.372s
ns3::DpdkNetDevice::Read          libns3.28.1-fd-net-device-debug.so  5.080s
rte_eth_rx_burst                  libns3.28.1-fd-net-device-debug.so  3.558s
ns3::DpdkNetDeviceReader::DoRead  libns3.28.1-fd-net-device-debug.so  3.364s
[Others]                                                              4.760s

Upon checking the callers of `rte_delay_us_block`, we got to know that most of the time (92%) spent in this function is during initialization. This does not waste our processing time during communication. So, it's a good start to our optimization.

Callers                               CPU Time: Total  CPU Time: Self
rte_delay_us_block                    100.0%           7.372s
e1000_enable_ulp_lpt_lp               92.3%            6.804s
e1000_write_phy_reg_mdic              1.8%             0.136s
e1000_reset_hw_ich8lan                1.7%             0.128s
e1000_read_phy_reg_mdic               1.4%             0.104s
eth_em_link_update                    1.4%             0.100s
e1000_get_cfg_done_generic            0.7%             0.052s
e1000_post_phy_reset_ich8lan.part.18  0.7%             0.048s

Effective CPU Utilization: 21.4% (0.856 out of 4)

Here is the link to the vTune profiling results. https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5
Thank you
Regards
On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> wrote:
>
> > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote:
> >
> > Hello,
> > As suggested, we tried profiling the application using Intel VTune Amplifier. We aren't sure how to use these results, so we are attaching them to this email.
> >
> > The things we understood were 'Top Hotspots' and 'Effective CPU utilization'.
Following are some of our understandings: > > > > Top Hotspots > > > > Function Module CPU Time > > rte_delay_us_block librte_eal.so.6.1 15.042s > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so > 3.522s > > ns3::DpdkNetDeviceReader::DoRead > libns3.28.1-fd-net-device-debug.so 2.470s > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 2.456s > > [Others] 6.656s > > > > We knew about other methods except `rte_delay_us_block`. So we > investigated the callers of this method: > > > > Callers Effective Time Spin Time Overhead Time Effective Time > Spin Time Overhead Time Wait Time: Total Wait Time: Self > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec 0usec > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s 0usec > 0usec > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec 0usec > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec 0usec > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec 0usec > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% 0.064s > 0usec 0usec > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s 0usec > 0usec > > > > We lack sufficient knowledge to investigate more than this. > > > > Effective CPU utilization > > > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of 4 > logical CPUs). We thought this is less. So we compared this with the > raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 > logical CPUs), and even then it is performing way better. > > > > It would be helpful if you give us insights on how to use these results > or point us to some resources to do so. > > > > Thank you > > > > BTW, I was able to build ns3 with DPDK 18.11 it required a couple changes > in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. > > I did have a problem including rte_mbuf.h file into your code. It appears > the g++ compiler did not like referencing the struct rte_mbuf_sched inside > the rte_mbuf structure. 
The rte_mbuf_sched was inside the big union as a > hack I moved the struct outside of the rte_mbuf structure and replaced the > struct in the union with ’struct rte_mbuf_sched sched;', but I am guessing > you are missing some compiler options in your build system as DPDK builds > just fine without that hack. > > The next place was the rxmode and the txq_flags. The rxmode structure has > changed and I commented out the inits in ns3 and then commented out the > txq_flags init code as these are now the defaults. > > Regards, > Keith > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2019-01-03 18:12 ` Harsh Patel @ 2019-01-03 22:43 ` Wiles, Keith 2019-01-04 5:57 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2019-01-03 22:43 UTC (permalink / raw) To: Harsh Patel; +Cc: Stephen Hemminger, Kyle Larose, users
> On Jan 3, 2019, at 12:12 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote:
>
> Hi
>
> We applied your suggestion of removing the `IsLinkUp()` call. But the performance is even worse. We could only get around 340kbits/s.
>
> The Top Hotspots are:
>
> Function Module CPU Time
> eth_em_recv_pkts librte_pmd_e1000.so 15.106s
> rte_delay_us_block librte_eal.so.6.1 7.372s
> ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so 5.080s
> rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 3.558s
> ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so 3.364s
> [Others] 4.760s

Performance reduced by removing that link status check; that is weird.

>
> Upon checking the callers of `rte_delay_us_block`, we got to know that most of the time (92%) spent in this function is during initialization.
> This does not waste our processing time during communication. So, it's a good start to our optimization.
>
> Callers CPU Time: Total CPU Time: Self
> rte_delay_us_block 100.0% 7.372s
> e1000_enable_ulp_lpt_lp 92.3% 6.804s
> e1000_write_phy_reg_mdic 1.8% 0.136s
> e1000_reset_hw_ich8lan 1.7% 0.128s
> e1000_read_phy_reg_mdic 1.4% 0.104s
> eth_em_link_update 1.4% 0.100s
> e1000_get_cfg_done_generic 0.7% 0.052s
> e1000_post_phy_reset_ich8lan.part.18 0.7% 0.048s

I guess you are having vTune start your application and that is why you have init time items in your log. I normally start my application and then attach vTune to the application. One of the options in the configuration of vTune for that project is to attach to the application. Maybe it would help here.

Looking at the data you provided, it was ok.
The problem is it would not load the source files as I did not have the same build or executable. I tried to build the code, but it failed to build and I did not go further. I guess I would need to see the full source tree and the executable you used to really look at the problem. I have limited time, but I can try if you like. > > > Effective CPU Utilization: 21.4% (0.856 out of 4) > > Here is the link to vtune profiling results. https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5 > > Thank you > > Regards > > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > Hello, > > As suggested, we tried profiling the application using Intel VTune Amplifier. We aren't sure how to use these results, so we are attaching them to this email. > > > > The things we understood were 'Top Hotspots' and 'Effective CPU utilization'. Following are some of our understandings: > > > > Top Hotspots > > > > Function Module CPU Time > > rte_delay_us_block librte_eal.so.6.1 15.042s > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so 3.522s > > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so 2.470s > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 2.456s > > [Others] 6.656s > > > > We knew about other methods except `rte_delay_us_block`. 
So we investigated the callers of this method: > > > > Callers Effective Time Spin Time Overhead Time Effective Time Spin Time Overhead Time Wait Time: Total Wait Time: Self > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec 0usec > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s 0usec 0usec > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec 0usec > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec 0usec > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec 0usec > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% 0.064s 0usec 0usec > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s 0usec 0usec > > > > We lack sufficient knowledge to investigate more than this. > > > > Effective CPU utilization > > > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of 4 logical CPUs). We thought this is less. So we compared this with the raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 logical CPUs), and even then it is performing way better. > > > > It would be helpful if you give us insights on how to use these results or point us to some resources to do so. > > > > Thank you > > > > BTW, I was able to build ns3 with DPDK 18.11 it required a couple changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. > > I did have a problem including rte_mbuf.h file into your code. It appears the g++ compiler did not like referencing the struct rte_mbuf_sched inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union as a hack I moved the struct outside of the rte_mbuf structure and replaced the struct in the union with ’struct rte_mbuf_sched sched;', but I am guessing you are missing some compiler options in your build system as DPDK builds just fine without that hack. > > The next place was the rxmode and the txq_flags. The rxmode structure has changed and I commented out the inits in ns3 and then commented out the txq_flags init code as these are now the defaults. 
> > Regards, > Keith > Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2019-01-03 22:43 ` Wiles, Keith @ 2019-01-04 5:57 ` Harsh Patel 2019-01-16 13:55 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread From: Harsh Patel @ 2019-01-04 5:57 UTC (permalink / raw) To: Wiles, Keith; +Cc: Stephen Hemminger, Kyle Larose, users Yes that would be helpful. It'd be ok for now to use the same dpdk version to overcome the build issues. We will look into updating the code for latest versions once we get past this problem. Thank you very much. Regards, Harsh & Hrishikesh On Fri, Jan 4, 2019, 04:13 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Jan 3, 2019, at 12:12 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > Hi > > > > We applied your suggestion of removing the `IsLinkUp()` call. But the > performace is even worse. We could only get around 340kbits/s. > > > > The Top Hotspots are: > > > > Function Module CPU Time > > eth_em_recv_pkts librte_pmd_e1000.so 15.106s > > rte_delay_us_block librte_eal.so.6.1 7.372s > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so 5.080s > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 3.558s > > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so > 3.364s > > [Others] 4.760s > > Performance reduced by removing that link status check, that is weird. > > > > Upon checking the callers of `rte_delay_us_block`, we got to know that > most of the time (92%) spent in this function is during initialization. > > This does not waste our processing time during communication. So, it's a > good start to our optimization. 
> > > > Callers CPU Time: Total CPU Time: Self > > rte_delay_us_block 100.0% 7.372s > > e1000_enable_ulp_lpt_lp 92.3% 6.804s > > e1000_write_phy_reg_mdic 1.8% 0.136s > > e1000_reset_hw_ich8lan 1.7% 0.128s > > e1000_read_phy_reg_mdic 1.4% 0.104s > > eth_em_link_update 1.4% 0.100s > > e1000_get_cfg_done_generic 0.7% 0.052s > > e1000_post_phy_reset_ich8lan.part.18 0.7% 0.048s > > I guess you are having vTune start your application and that is why you > have init-time items in your log. I normally start my application and then > attach vTune to the application. One of the options in the configuration of > vTune for that project is to attach to the application. Maybe it would help > here. > > Looking at the data you provided, it was OK. The problem is it would not > load the source files, as I did not have the same build or executable. I > tried to build the code, but it failed to build and I did not go further. I > guess I would need to see the full source tree and the executable you used > to really look at the problem. I have limited time, but I can try if you > like. > > > > > > Effective CPU Utilization: 21.4% (0.856 out of 4) > > > > Here is the link to the vTune profiling results. > https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5 > > > > Thank you > > > > Regards > > > > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > > > Hello, > > > As suggested, we tried profiling the application using Intel VTune > Amplifier. We aren't sure how to use these results, so we are attaching > them to this email. > > > > > > The things we understood were 'Top Hotspots' and 'Effective CPU > utilization'. 
Following are some of our understandings: > > > > > > Top Hotspots > > > > > > Function Module CPU Time > > > rte_delay_us_block librte_eal.so.6.1 15.042s > > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s > > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so > 3.522s > > > ns3::DpdkNetDeviceReader::DoRead > libns3.28.1-fd-net-device-debug.so 2.470s > > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 2.456s > > > [Others] 6.656s > > > > > > We knew about other methods except `rte_delay_us_block`. So we > investigated the callers of this method: > > > > > > Callers Effective Time Spin Time Overhead Time Effective > Time Spin Time Overhead Time Wait Time: Total Wait Time: > Self > > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec 0usec > > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s 0usec > 0usec > > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec 0usec > > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec 0usec > > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec 0usec > > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% > 0.064s 0usec 0usec > > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s 0usec > 0usec > > > > > > We lack sufficient knowledge to investigate more than this. > > > > > > Effective CPU utilization > > > > > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of 4 > logical CPUs). We thought this is less. So we compared this with the > raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 > logical CPUs), and even then it is performing way better. > > > > > > It would be helpful if you give us insights on how to use these > results or point us to some resources to do so. > > > > > > Thank you > > > > > > > BTW, I was able to build ns3 with DPDK 18.11 it required a couple > changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. > > > > I did have a problem including rte_mbuf.h file into your code. 
It > appears the g++ compiler did not like referencing the struct rte_mbuf_sched > inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union > as a hack I moved the struct outside of the rte_mbuf structure and replaced > the struct in the union with ’struct rte_mbuf_sched sched;', but I am > guessing you are missing some compiler options in your build system as DPDK > builds just fine without that hack. > > > > The next place was the rxmode and the txq_flags. The rxmode structure > has changed and I commented out the inits in ns3 and then commented out the > txq_flags init code as these are now the defaults. > > > > Regards, > > Keith > > > > Regards, > Keith > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2019-01-04 5:57 ` Harsh Patel @ 2019-01-16 13:55 ` Harsh Patel 2019-01-30 23:36 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread From: Harsh Patel @ 2019-01-16 13:55 UTC (permalink / raw) To: Wiles, Keith; +Cc: Stephen Hemminger, Kyle Larose, users Hi We were able to optimise the DPDK version. There were a couple of things we needed to do. We were using a tx timeout of 1s/2048, which we found to be too small. Then we increased the timeout, but we were getting a lot of retransmissions. So we removed the timeout and sent each packet as soon as we got it. This increased the throughput. Then we used the DPDK feature to launch a function on a core, and gave a dedicated core to Rx. This increased the throughput further. The code is working really well for low bandwidth (<~50Mbps) and is outperforming the raw socket version. But for high bandwidth, we are getting packet length mismatches for some reason. We are investigating it. We really thank you for the suggestions you gave and for your patience over the last couple of months. Thank you Regards, Harsh & Hrishikesh On Fri, Jan 4, 2019, 11:27 Harsh Patel <thadodaharsh10@gmail.com> wrote: > Yes that would be helpful. > It'd be ok for now to use the same dpdk version to overcome the build > issues. > We will look into updating the code for latest versions once we get past > this problem. > > Thank you very much. > > Regards, > Harsh & Hrishikesh > > On Fri, Jan 4, 2019, 04:13 Wiles, Keith <keith.wiles@intel.com> wrote: > >> >> >> > On Jan 3, 2019, at 12:12 PM, Harsh Patel <thadodaharsh10@gmail.com> >> wrote: >> > >> > Hi >> > >> > We applied your suggestion of removing the `IsLinkUp()` call. But the >> performance is even worse. We could only get around 340kbits/s. 
>> > >> > The Top Hotspots are: >> > >> > Function Module CPU Time >> > eth_em_recv_pkts librte_pmd_e1000.so 15.106s >> > rte_delay_us_block librte_eal.so.6.1 7.372s >> > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so >> 5.080s >> > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 3.558s >> > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so >> 3.364s >> > [Others] 4.760s >> >> Performance reduced by removing that link status check, that is weird. >> > >> > Upon checking the callers of `rte_delay_us_block`, we got to know that >> most of the time (92%) spent in this function is during initialization. >> > This does not waste our processing time during communication. So, it's >> a good start to our optimization. >> > >> > Callers CPU Time: Total CPU Time: Self >> > rte_delay_us_block 100.0% 7.372s >> > e1000_enable_ulp_lpt_lp 92.3% 6.804s >> > e1000_write_phy_reg_mdic 1.8% 0.136s >> > e1000_reset_hw_ich8lan 1.7% 0.128s >> > e1000_read_phy_reg_mdic 1.4% 0.104s >> > eth_em_link_update 1.4% 0.100s >> > e1000_get_cfg_done_generic 0.7% 0.052s >> > e1000_post_phy_reset_ich8lan.part.18 0.7% 0.048s >> >> I guess you are having vTune start your application and that is why you >> have init time items in your log. I normally start my application and then >> attach vtune to the application. One of the options in configuration of >> vtune for that project is to attach to the application. Maybe it would help >> hear. >> >> Looking at the data you provided it was ok. The problem is it would not >> load the source files as I did not have the same build or executable. I >> tried to build the code, but it failed to build and I did not go further. I >> guess I would need to see the full source tree and the executable you used >> to really look at the problem. I have limited time, but I can try if you >> like. >> > >> > >> > Effective CPU Utilization: 21.4% (0.856 out of 4) >> > >> > Here is the link to vtune profiling results. 
>> https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5 >> > >> > Thank you >> > >> > Regards >> > >> > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> wrote: >> > >> > >> > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> >> wrote: >> > > >> > > Hello, >> > > As suggested, we tried profiling the application using Intel VTune >> Amplifier. We aren't sure how to use these results, so we are attaching >> them to this email. >> > > >> > > The things we understood were 'Top Hotspots' and 'Effective CPU >> utilization'. Following are some of our understandings: >> > > >> > > Top Hotspots >> > > >> > > Function Module CPU Time >> > > rte_delay_us_block librte_eal.so.6.1 15.042s >> > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s >> > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so >> 3.522s >> > > ns3::DpdkNetDeviceReader::DoRead >> libns3.28.1-fd-net-device-debug.so 2.470s >> > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so >> 2.456s >> > > [Others] 6.656s >> > > >> > > We knew about other methods except `rte_delay_us_block`. So we >> investigated the callers of this method: >> > > >> > > Callers Effective Time Spin Time Overhead Time Effective >> Time Spin Time Overhead Time Wait Time: Total Wait Time: >> Self >> > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec 0usec >> > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s >> 0usec 0usec >> > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec 0usec >> > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec 0usec >> > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec 0usec >> > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% >> 0.064s 0usec 0usec >> > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s >> 0usec 0usec >> > > >> > > We lack sufficient knowledge to investigate more than this. >> > > >> > > Effective CPU utilization >> > > >> > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of >> 4 logical CPUs). 
We thought this is less. So we compared this with the >> raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 >> logical CPUs), and even then it is performing way better. >> > > >> > > It would be helpful if you give us insights on how to use these >> results or point us to some resources to do so. >> > > >> > > Thank you >> > > >> > >> > BTW, I was able to build ns3 with DPDK 18.11 it required a couple >> changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. >> > >> > I did have a problem including rte_mbuf.h file into your code. It >> appears the g++ compiler did not like referencing the struct rte_mbuf_sched >> inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union >> as a hack I moved the struct outside of the rte_mbuf structure and replaced >> the struct in the union with ’struct rte_mbuf_sched sched;', but I am >> guessing you are missing some compiler options in your build system as DPDK >> builds just fine without that hack. >> > >> > The next place was the rxmode and the txq_flags. The rxmode structure >> has changed and I commented out the inits in ns3 and then commented out the >> txq_flags init code as these are now the defaults. >> > >> > Regards, >> > Keith >> > >> >> Regards, >> Keith >> >> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2019-01-16 13:55 ` Harsh Patel @ 2019-01-30 23:36 ` Harsh Patel 2019-01-31 16:58 ` Wiles, Keith 0 siblings, 1 reply; 43+ messages in thread From: Harsh Patel @ 2019-01-30 23:36 UTC (permalink / raw) To: Wiles, Keith; +Cc: Stephen Hemminger, Kyle Larose, users Hello, This mail is to inform you that the integration of DPDK is working with ns-3 on a basic level. The model is running. For UDP traffic we are getting throughput the same as or better than the raw socket version (around 100 Mbps). But unfortunately for TCP, there are burst packet losses due to which the throughput is drastically affected after a point. The bandwidth of the link used was 100 Mbps. We have obtained cwnd and ssthresh graphs which show that once the flow leaves Slow Start, there are so many packet losses that the congestion window & the slow start threshold are not able to go above 4-5 packets. We have attached the graphs with this mail. We would like to know if there is any reason for this, or how we can fix it. Thanks & Regards Harsh & Hrishikesh On Wed, 16 Jan 2019 at 19:25, Harsh Patel <thadodaharsh10@gmail.com> wrote: > Hi > > We were able to optimise the DPDK version. There were a couple of things we > needed to do. > > We were using a tx timeout of 1s/2048, which we found to be too small. > Then we increased the timeout, but we were getting a lot of retransmissions. > > So we removed the timeout and sent each packet as soon as we got it. > This increased the throughput. > > Then we used the DPDK feature to launch a function on a core, and gave a > dedicated core to Rx. This increased the throughput further. > > The code is working really well for low bandwidth (<~50Mbps) and is > outperforming the raw socket version. > But for high bandwidth, we are getting packet length mismatches for some > reason. We are investigating it. > > We really thank you for the suggestions you gave and for your patience > over the last couple of months. 
> > Thank you > > Regards, > Harsh & Hrishikesh > > On Fri, Jan 4, 2019, 11:27 Harsh Patel <thadodaharsh10@gmail.com> wrote: > >> Yes that would be helpful. >> It'd be ok for now to use the same dpdk version to overcome the build >> issues. >> We will look into updating the code for latest versions once we get past >> this problem. >> >> Thank you very much. >> >> Regards, >> Harsh & Hrishikesh >> >> On Fri, Jan 4, 2019, 04:13 Wiles, Keith <keith.wiles@intel.com> wrote: >> >>> >>> >>> > On Jan 3, 2019, at 12:12 PM, Harsh Patel <thadodaharsh10@gmail.com> >>> wrote: >>> > >>> > Hi >>> > >>> > We applied your suggestion of removing the `IsLinkUp()` call. But the >>> performace is even worse. We could only get around 340kbits/s. >>> > >>> > The Top Hotspots are: >>> > >>> > Function Module CPU Time >>> > eth_em_recv_pkts librte_pmd_e1000.so 15.106s >>> > rte_delay_us_block librte_eal.so.6.1 7.372s >>> > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so >>> 5.080s >>> > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 3.558s >>> > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so >>> 3.364s >>> > [Others] 4.760s >>> >>> Performance reduced by removing that link status check, that is weird. >>> > >>> > Upon checking the callers of `rte_delay_us_block`, we got to know that >>> most of the time (92%) spent in this function is during initialization. >>> > This does not waste our processing time during communication. So, it's >>> a good start to our optimization. 
>>> > >>> > Callers CPU Time: Total CPU Time: Self >>> > rte_delay_us_block 100.0% 7.372s >>> > e1000_enable_ulp_lpt_lp 92.3% 6.804s >>> > e1000_write_phy_reg_mdic 1.8% 0.136s >>> > e1000_reset_hw_ich8lan 1.7% 0.128s >>> > e1000_read_phy_reg_mdic 1.4% 0.104s >>> > eth_em_link_update 1.4% 0.100s >>> > e1000_get_cfg_done_generic 0.7% 0.052s >>> > e1000_post_phy_reset_ich8lan.part.18 0.7% 0.048s >>> >>> I guess you are having vTune start your application and that is why you >>> have init time items in your log. I normally start my application and then >>> attach vtune to the application. One of the options in configuration of >>> vtune for that project is to attach to the application. Maybe it would help >>> hear. >>> >>> Looking at the data you provided it was ok. The problem is it would not >>> load the source files as I did not have the same build or executable. I >>> tried to build the code, but it failed to build and I did not go further. I >>> guess I would need to see the full source tree and the executable you used >>> to really look at the problem. I have limited time, but I can try if you >>> like. >>> > >>> > >>> > Effective CPU Utilization: 21.4% (0.856 out of 4) >>> > >>> > Here is the link to vtune profiling results. >>> https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5 >>> > >>> > Thank you >>> > >>> > Regards >>> > >>> > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> >>> wrote: >>> > >>> > >>> > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> >>> wrote: >>> > > >>> > > Hello, >>> > > As suggested, we tried profiling the application using Intel VTune >>> Amplifier. We aren't sure how to use these results, so we are attaching >>> them to this email. >>> > > >>> > > The things we understood were 'Top Hotspots' and 'Effective CPU >>> utilization'. 
Following are some of our understandings: >>> > > >>> > > Top Hotspots >>> > > >>> > > Function Module CPU Time >>> > > rte_delay_us_block librte_eal.so.6.1 15.042s >>> > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s >>> > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so >>> 3.522s >>> > > ns3::DpdkNetDeviceReader::DoRead >>> libns3.28.1-fd-net-device-debug.so 2.470s >>> > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so >>> 2.456s >>> > > [Others] 6.656s >>> > > >>> > > We knew about other methods except `rte_delay_us_block`. So we >>> investigated the callers of this method: >>> > > >>> > > Callers Effective Time Spin Time Overhead Time Effective >>> Time Spin Time Overhead Time Wait Time: Total Wait Time: >>> Self >>> > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec 0usec >>> > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s >>> 0usec 0usec >>> > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec 0usec >>> > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec 0usec >>> > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec 0usec >>> > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% >>> 0.064s 0usec 0usec >>> > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s >>> 0usec 0usec >>> > > >>> > > We lack sufficient knowledge to investigate more than this. >>> > > >>> > > Effective CPU utilization >>> > > >>> > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of >>> 4 logical CPUs). We thought this is less. So we compared this with the >>> raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 >>> logical CPUs), and even then it is performing way better. >>> > > >>> > > It would be helpful if you give us insights on how to use these >>> results or point us to some resources to do so. >>> > > >>> > > Thank you >>> > > >>> > >>> > BTW, I was able to build ns3 with DPDK 18.11 it required a couple >>> changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. 
>>> > >>> > I did have a problem including rte_mbuf.h file into your code. It >>> appears the g++ compiler did not like referencing the struct rte_mbuf_sched >>> inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union >>> as a hack I moved the struct outside of the rte_mbuf structure and replaced >>> the struct in the union with ’struct rte_mbuf_sched sched;', but I am >>> guessing you are missing some compiler options in your build system as DPDK >>> builds just fine without that hack. >>> > >>> > The next place was the rxmode and the txq_flags. The rxmode structure >>> has changed and I commented out the inits in ns3 and then commented out the >>> txq_flags init code as these are now the defaults. >>> > >>> > Regards, >>> > Keith >>> > >>> >>> Regards, >>> Keith >>> >>> -------------- next part -------------- A non-text attachment was scrubbed... Name: Ssthresh.png Type: image/png Size: 29575 bytes Desc: not available URL: <http://mails.dpdk.org/archives/users/attachments/20190131/6e8ada4f/attachment.png> -------------- next part -------------- A non-text attachment was scrubbed... Name: Cwnd.png Type: image/png Size: 33933 bytes Desc: not available URL: <http://mails.dpdk.org/archives/users/attachments/20190131/6e8ada4f/attachment-0001.png> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2019-01-30 23:36 ` Harsh Patel @ 2019-01-31 16:58 ` Wiles, Keith 2019-02-05 6:37 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2019-01-31 16:58 UTC (permalink / raw) To: Harsh Patel; +Cc: Stephen Hemminger, Kyle Larose, users Sent from my iPhone On Jan 30, 2019, at 5:36 PM, Harsh Patel <thadodaharsh10@gmail.com<mailto:thadodaharsh10@gmail.com>> wrote: Hello, This mail is to inform you that the integration of DPDK is working with ns-3 on a basic level. The model is running. For UDP traffic we are getting throughput the same as or better than the raw socket version (around 100 Mbps). But unfortunately for TCP, there are burst packet losses due to which the throughput is drastically affected after a point. The bandwidth of the link used was 100 Mbps. We have obtained cwnd and ssthresh graphs which show that once the flow leaves Slow Start, there are so many packet losses that the congestion window & the slow start threshold are not able to go above 4-5 packets. Can you determine where the packets are being dropped? We have attached the graphs with this mail. I do not see the graphs attached but that's OK. We would like to know if there is any reason for this, or how we can fix it. I think we have to find out where the packets are being dropped; that is the only explanation for the case you are referring to. Thanks & Regards Harsh & Hrishikesh On Wed, 16 Jan 2019 at 19:25, Harsh Patel <thadodaharsh10@gmail.com<mailto:thadodaharsh10@gmail.com>> wrote: Hi We were able to optimise the DPDK version. There were a couple of things we needed to do. We were using a tx timeout of 1s/2048, which we found to be too small. Then we increased the timeout, but we were getting a lot of retransmissions. So we removed the timeout and sent each packet as soon as we got it. This increased the throughput. Then we used the DPDK feature to launch a function on a core, and gave a dedicated core to Rx. 
This increased the throughput further. The code is working really well for low bandwidth (<~50Mbps) and is outperforming raw socket version. But for high bandwidth, we are getting packet length mismatches for some reason. We are investigating it. We really thank you for the suggestions given by you and also for keeping the patience for last couple of months. Thank you Regards, Harsh & Hrishikesh On Fri, Jan 4, 2019, 11:27 Harsh Patel <thadodaharsh10@gmail.com<mailto:thadodaharsh10@gmail.com>> wrote: Yes that would be helpful. It'd be ok for now to use the same dpdk version to overcome the build issues. We will look into updating the code for latest versions once we get past this problem. Thank you very much. Regards, Harsh & Hrishikesh On Fri, Jan 4, 2019, 04:13 Wiles, Keith <keith.wiles@intel.com<mailto:keith.wiles@intel.com>> wrote: > On Jan 3, 2019, at 12:12 PM, Harsh Patel <thadodaharsh10@gmail.com<mailto:thadodaharsh10@gmail.com>> wrote: > > Hi > > We applied your suggestion of removing the `IsLinkUp()` call. But the performace is even worse. We could only get around 340kbits/s. > > The Top Hotspots are: > > Function Module CPU Time > eth_em_recv_pkts librte_pmd_e1000.so 15.106s > rte_delay_us_block librte_eal.so.6.1 7.372s > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so<http://libns3.28.1-fd-net-device-debug.so> 5.080s > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so<http://libns3.28.1-fd-net-device-debug.so> 3.558s > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so<http://libns3.28.1-fd-net-device-debug.so> 3.364s > [Others] 4.760s Performance reduced by removing that link status check, that is weird. > > Upon checking the callers of `rte_delay_us_block`, we got to know that most of the time (92%) spent in this function is during initialization. > This does not waste our processing time during communication. So, it's a good start to our optimization. 
> > Callers CPU Time: Total CPU Time: Self > rte_delay_us_block 100.0% 7.372s > e1000_enable_ulp_lpt_lp 92.3% 6.804s > e1000_write_phy_reg_mdic 1.8% 0.136s > e1000_reset_hw_ich8lan 1.7% 0.128s > e1000_read_phy_reg_mdic 1.4% 0.104s > eth_em_link_update 1.4% 0.100s > e1000_get_cfg_done_generic 0.7% 0.052s > e1000_post_phy_reset_ich8lan.part.18 0.7% 0.048s I guess you are having vTune start your application and that is why you have init time items in your log. I normally start my application and then attach vtune to the application. One of the options in configuration of vtune for that project is to attach to the application. Maybe it would help hear. Looking at the data you provided it was ok. The problem is it would not load the source files as I did not have the same build or executable. I tried to build the code, but it failed to build and I did not go further. I guess I would need to see the full source tree and the executable you used to really look at the problem. I have limited time, but I can try if you like. > > > Effective CPU Utilization: 21.4% (0.856 out of 4) > > Here is the link to vtune profiling results. https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5 > > Thank you > > Regards > > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com<mailto:keith.wiles@intel.com>> wrote: > > > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com<mailto:thadodaharsh10@gmail.com>> wrote: > > > > Hello, > > As suggested, we tried profiling the application using Intel VTune Amplifier. We aren't sure how to use these results, so we are attaching them to this email. > > > > The things we understood were 'Top Hotspots' and 'Effective CPU utilization'. 
Following are some of our understandings: > > > > Top Hotspots > > > > Function Module CPU Time > > rte_delay_us_block librte_eal.so.6.1 15.042s > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so<http://libns3.28.1-fd-net-device-debug.so> 3.522s > > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so<http://libns3.28.1-fd-net-device-debug.so> 2.470s > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so<http://libns3.28.1-fd-net-device-debug.so> 2.456s > > [Others] 6.656s > > > > We knew about other methods except `rte_delay_us_block`. So we investigated the callers of this method: > > > > Callers Effective Time Spin Time Overhead Time Effective Time Spin Time Overhead Time Wait Time: Total Wait Time: Self > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec 0usec > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s 0usec 0usec > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec 0usec > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec 0usec > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec 0usec > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% 0.064s 0usec 0usec > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s 0usec 0usec > > > > We lack sufficient knowledge to investigate more than this. > > > > Effective CPU utilization > > > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of 4 logical CPUs). We thought this is less. So we compared this with the raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 logical CPUs), and even then it is performing way better. > > > > It would be helpful if you give us insights on how to use these results or point us to some resources to do so. > > > > Thank you > > > > BTW, I was able to build ns3 with DPDK 18.11 it required a couple changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. > > I did have a problem including rte_mbuf.h file into your code. 
It appears the g++ compiler did not like referencing the struct rte_mbuf_sched inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union as a hack I moved the struct outside of the rte_mbuf structure and replaced the struct in the union with ’struct rte_mbuf_sched sched;', but I am guessing you are missing some compiler options in your build system as DPDK builds just fine without that hack. > > The next place was the rxmode and the txq_flags. The rxmode structure has changed and I commented out the inits in ns3 and then commented out the txq_flags init code as these are now the defaults. > > Regards, > Keith > Regards, Keith <Ssthresh.png> <Cwnd.png> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2019-01-31 16:58 ` Wiles, Keith @ 2019-02-05 6:37 ` Harsh Patel 2019-02-05 13:03 ` Wiles, Keith 0 siblings, 1 reply; 43+ messages in thread From: Harsh Patel @ 2019-02-05 6:37 UTC (permalink / raw) To: Wiles, Keith; +Cc: Stephen Hemminger, Kyle Larose, users Hi, We would like to inform you that our code is working as expected and we are able to obtain 95-98 Mbps data rate for a 100Mbps application rate. We are now working on the testing of the code. Thanks a lot, especially to Keith for all the help you provided. We have 2 main queries :- 1) We wanted to calculate Backlog at the NIC Tx Descriptors but were not able to find anything in the documentation. Can you help us in how to calculate the backlog? 2) We searched on how to use Byte Queue Limit (BQL) on the NIC queue but couldn't find anything like that in DPDK. Does DPDK support BQL? If so, can you help us on how to use it for our project? Thanks & Regards Harsh & Hrishikesh On Thu, 31 Jan 2019 at 22:28, Wiles, Keith <keith.wiles@intel.com> wrote: > > > Sent from my iPhone > > On Jan 30, 2019, at 5:36 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hello, > > This mail is to inform you that the integration of DPDK is working with > ns-3 on a basic level. The model is running. > For UDP traffic we are getting throughput same or better than raw socket. > (Around 100Mbps) > But unfortunately for TCP, there are burst packet losses due to which the > throughput is drastically affected after some point of time. The bandwidth > of the link used was 100Mbps. > We have obtained cwnd and ssthresh graphs which show that once the flow > gets out from Slow Start mode, there are so many packet losses that the > congestion window & the slow start threshold is not able to go above 4-5 > packets. > > > Can you determine where the packets are being dropped? > > We have attached the graphs with this mail. > > > I do not see the graphs attached but that’s OK. 
> > We would like to know if there is any reason to this or how can we fix > this. > > > I think we have to find out where the packets are being dropped this is > the only reason for the case to your referring to. > > > Thanks & Regards > Harsh & Hrishikesh > > On Wed, 16 Jan 2019 at 19:25, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > >> Hi >> >> We were able to optimise the DPDK version. There were couple of things we >> needed to do. >> >> We were using tx timeout as 1s/2048, which we found out to be very less. >> Then we increased the timeout, but we were getting lot of retransmissions. >> >> So we removed the timeout and sent single packet as soon as we get it. >> This increased the throughput. >> >> Then we used DPDK feature to launch function on core, and gave a >> dedicated core for Rx. This increased the throughput further. >> >> The code is working really well for low bandwidth (<~50Mbps) and is >> outperforming raw socket version. >> But for high bandwidth, we are getting packet length mismatches for some >> reason. We are investigating it. >> >> We really thank you for the suggestions given by you and also for keeping >> the patience for last couple of months. >> >> Thank you >> >> Regards, >> Harsh & Hrishikesh >> >> On Fri, Jan 4, 2019, 11:27 Harsh Patel <thadodaharsh10@gmail.com> wrote: >> >>> Yes that would be helpful. >>> It'd be ok for now to use the same dpdk version to overcome the build >>> issues. >>> We will look into updating the code for latest versions once we get past >>> this problem. >>> >>> Thank you very much. >>> >>> Regards, >>> Harsh & Hrishikesh >>> >>> On Fri, Jan 4, 2019, 04:13 Wiles, Keith <keith.wiles@intel.com> wrote: >>> >>>> >>>> >>>> > On Jan 3, 2019, at 12:12 PM, Harsh Patel <thadodaharsh10@gmail.com> >>>> wrote: >>>> > >>>> > Hi >>>> > >>>> > We applied your suggestion of removing the `IsLinkUp()` call. But the >>>> performace is even worse. We could only get around 340kbits/s. 
>>>> > >>>> > The Top Hotspots are: >>>> > >>>> > Function Module CPU Time >>>> > eth_em_recv_pkts librte_pmd_e1000.so 15.106s >>>> > rte_delay_us_block librte_eal.so.6.1 7.372s >>>> > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so >>>> 5.080s >>>> > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 3.558s >>>> > ns3::DpdkNetDeviceReader::DoRead >>>> libns3.28.1-fd-net-device-debug.so 3.364s >>>> > [Others] 4.760s >>>> >>>> Performance reduced by removing that link status check, that is weird. >>>> > >>>> > Upon checking the callers of `rte_delay_us_block`, we got to know >>>> that most of the time (92%) spent in this function is during initialization. >>>> > This does not waste our processing time during communication. So, >>>> it's a good start to our optimization. >>>> > >>>> > Callers CPU Time: Total CPU Time: Self >>>> > rte_delay_us_block 100.0% 7.372s >>>> > e1000_enable_ulp_lpt_lp 92.3% 6.804s >>>> > e1000_write_phy_reg_mdic 1.8% 0.136s >>>> > e1000_reset_hw_ich8lan 1.7% 0.128s >>>> > e1000_read_phy_reg_mdic 1.4% 0.104s >>>> > eth_em_link_update 1.4% 0.100s >>>> > e1000_get_cfg_done_generic 0.7% 0.052s >>>> > e1000_post_phy_reset_ich8lan.part.18 0.7% 0.048s >>>> >>>> I guess you are having vTune start your application and that is why you >>>> have init time items in your log. I normally start my application and then >>>> attach vtune to the application. One of the options in configuration of >>>> vtune for that project is to attach to the application. Maybe it would help >>>> hear. >>>> >>>> Looking at the data you provided it was ok. The problem is it would not >>>> load the source files as I did not have the same build or executable. I >>>> tried to build the code, but it failed to build and I did not go further. I >>>> guess I would need to see the full source tree and the executable you used >>>> to really look at the problem. I have limited time, but I can try if you >>>> like. 
>>>> > >>>> > >>>> > Effective CPU Utilization: 21.4% (0.856 out of 4) >>>> > >>>> > Here is the link to vtune profiling results. >>>> https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5 >>>> > >>>> > Thank you >>>> > >>>> > Regards >>>> > >>>> > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> >>>> wrote: >>>> > >>>> > >>>> > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> >>>> wrote: >>>> > > >>>> > > Hello, >>>> > > As suggested, we tried profiling the application using Intel VTune >>>> Amplifier. We aren't sure how to use these results, so we are attaching >>>> them to this email. >>>> > > >>>> > > The things we understood were 'Top Hotspots' and 'Effective CPU >>>> utilization'. Following are some of our understandings: >>>> > > >>>> > > Top Hotspots >>>> > > >>>> > > Function Module CPU Time >>>> > > rte_delay_us_block librte_eal.so.6.1 15.042s >>>> > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s >>>> > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so >>>> 3.522s >>>> > > ns3::DpdkNetDeviceReader::DoRead >>>> libns3.28.1-fd-net-device-debug.so 2.470s >>>> > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so >>>> 2.456s >>>> > > [Others] 6.656s >>>> > > >>>> > > We knew about other methods except `rte_delay_us_block`. 
So we >>>> investigated the callers of this method: >>>> > > >>>> > > Callers Effective Time Spin Time Overhead Time Effective >>>> Time Spin Time Overhead Time Wait Time: Total Wait Time: >>>> Self >>>> > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec >>>> 0usec >>>> > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s >>>> 0usec 0usec >>>> > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec >>>> 0usec >>>> > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec >>>> 0usec >>>> > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec >>>> 0usec >>>> > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% >>>> 0.064s 0usec 0usec >>>> > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s >>>> 0usec 0usec >>>> > > >>>> > > We lack sufficient knowledge to investigate more than this. >>>> > > >>>> > > Effective CPU utilization >>>> > > >>>> > > Interestingly, the effective CPU utilization was 20.8% (0.832 out >>>> of 4 logical CPUs). We thought this is less. So we compared this with the >>>> raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 >>>> logical CPUs), and even then it is performing way better. >>>> > > >>>> > > It would be helpful if you give us insights on how to use these >>>> results or point us to some resources to do so. >>>> > > >>>> > > Thank you >>>> > > >>>> > >>>> > BTW, I was able to build ns3 with DPDK 18.11 it required a couple >>>> changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. >>>> > >>>> > I did have a problem including rte_mbuf.h file into your code. It >>>> appears the g++ compiler did not like referencing the struct rte_mbuf_sched >>>> inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union >>>> as a hack I moved the struct outside of the rte_mbuf structure and replaced >>>> the struct in the union with ’struct rte_mbuf_sched sched;', but I am >>>> guessing you are missing some compiler options in your build system as DPDK >>>> builds just fine without that hack. 
>>>> > >>>> > The next place was the rxmode and the txq_flags. The rxmode structure >>>> has changed and I commented out the inits in ns3 and then commented out the >>>> txq_flags init code as these are now the defaults. >>>> > >>>> > Regards, >>>> > Keith >>>> > >>>> >>>> Regards, >>>> Keith >>>> >>>> <Ssthresh.png> > > <Cwnd.png> > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2019-02-05 6:37 ` Harsh Patel @ 2019-02-05 13:03 ` Wiles, Keith 2019-02-05 14:00 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2019-02-05 13:03 UTC (permalink / raw) To: Harsh Patel; +Cc: Stephen Hemminger, Kyle Larose, users > On Feb 5, 2019, at 12:37 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hi, > > We would like to inform you that our code is working as expected and we are able to obtain 95-98 Mbps data rate for a 100Mbps application rate. We are now working on the testing of the code. Thanks a lot, especially to Keith for all the help you provided. > > We have 2 main queries :- > 1) We wanted to calculate Backlog at the NIC Tx Descriptors but were not able to find anything in the documentation. Can you help us in how to calculate the backlog? > 2) We searched on how to use Byte Queue Limit (BQL) on the NIC queue but couldn't find anything like that in DPDK. Does DPDK support BQL? If so, can you help us on how to use it for our project? what was the last set of problems if I may ask? > > Thanks & Regards > Harsh & Hrishikesh > > On Thu, 31 Jan 2019 at 22:28, Wiles, Keith <keith.wiles@intel.com> wrote: > > > Sent from my iPhone > > On Jan 30, 2019, at 5:36 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > >> Hello, >> >> This mail is to inform you that the integration of DPDK is working with ns-3 on a basic level. The model is running. >> For UDP traffic we are getting throughput same or better than raw socket. (Around 100Mbps) >> But unfortunately for TCP, there are burst packet losses due to which the throughput is drastically affected after some point of time. The bandwidth of the link used was 100Mbps. >> We have obtained cwnd and ssthresh graphs which show that once the flow gets out from Slow Start mode, there are so many packet losses that the congestion window & the slow start threshold is not able to go above 4-5 packets. 
> > Can you determine where the packets are being dropped? >> We have attached the graphs with this mail. >> > > I do not see the graphs attached but that’s OK. >> We would like to know if there is any reason to this or how can we fix this. > > I think we have to find out where the packets are being dropped this is the only reason for the case to your referring to. >> >> Thanks & Regards >> Harsh & Hrishikesh >> >> On Wed, 16 Jan 2019 at 19:25, Harsh Patel <thadodaharsh10@gmail.com> wrote: >> Hi >> >> We were able to optimise the DPDK version. There were couple of things we needed to do. >> >> We were using tx timeout as 1s/2048, which we found out to be very less. Then we increased the timeout, but we were getting lot of retransmissions. >> >> So we removed the timeout and sent single packet as soon as we get it. This increased the throughput. >> >> Then we used DPDK feature to launch function on core, and gave a dedicated core for Rx. This increased the throughput further. >> >> The code is working really well for low bandwidth (<~50Mbps) and is outperforming raw socket version. >> But for high bandwidth, we are getting packet length mismatches for some reason. We are investigating it. >> >> We really thank you for the suggestions given by you and also for keeping the patience for last couple of months. >> >> Thank you >> >> Regards, >> Harsh & Hrishikesh >> >> On Fri, Jan 4, 2019, 11:27 Harsh Patel <thadodaharsh10@gmail.com> wrote: >> Yes that would be helpful. >> It'd be ok for now to use the same dpdk version to overcome the build issues. >> We will look into updating the code for latest versions once we get past this problem. >> >> Thank you very much. >> >> Regards, >> Harsh & Hrishikesh >> >> On Fri, Jan 4, 2019, 04:13 Wiles, Keith <keith.wiles@intel.com> wrote: >> >> >> > On Jan 3, 2019, at 12:12 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: >> > >> > Hi >> > >> > We applied your suggestion of removing the `IsLinkUp()` call. 
But the performace is even worse. We could only get around 340kbits/s. >> > >> > The Top Hotspots are: >> > >> > Function Module CPU Time >> > eth_em_recv_pkts librte_pmd_e1000.so 15.106s >> > rte_delay_us_block librte_eal.so.6.1 7.372s >> > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so 5.080s >> > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 3.558s >> > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so 3.364s >> > [Others] 4.760s >> >> Performance reduced by removing that link status check, that is weird. >> > >> > Upon checking the callers of `rte_delay_us_block`, we got to know that most of the time (92%) spent in this function is during initialization. >> > This does not waste our processing time during communication. So, it's a good start to our optimization. >> > >> > Callers CPU Time: Total CPU Time: Self >> > rte_delay_us_block 100.0% 7.372s >> > e1000_enable_ulp_lpt_lp 92.3% 6.804s >> > e1000_write_phy_reg_mdic 1.8% 0.136s >> > e1000_reset_hw_ich8lan 1.7% 0.128s >> > e1000_read_phy_reg_mdic 1.4% 0.104s >> > eth_em_link_update 1.4% 0.100s >> > e1000_get_cfg_done_generic 0.7% 0.052s >> > e1000_post_phy_reset_ich8lan.part.18 0.7% 0.048s >> >> I guess you are having vTune start your application and that is why you have init time items in your log. I normally start my application and then attach vtune to the application. One of the options in configuration of vtune for that project is to attach to the application. Maybe it would help hear. >> >> Looking at the data you provided it was ok. The problem is it would not load the source files as I did not have the same build or executable. I tried to build the code, but it failed to build and I did not go further. I guess I would need to see the full source tree and the executable you used to really look at the problem. I have limited time, but I can try if you like. 
>> > >> > >> > Effective CPU Utilization: 21.4% (0.856 out of 4) >> > >> > Here is the link to vtune profiling results. https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5 >> > >> > Thank you >> > >> > Regards >> > >> > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> wrote: >> > >> > >> > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: >> > > >> > > Hello, >> > > As suggested, we tried profiling the application using Intel VTune Amplifier. We aren't sure how to use these results, so we are attaching them to this email. >> > > >> > > The things we understood were 'Top Hotspots' and 'Effective CPU utilization'. Following are some of our understandings: >> > > >> > > Top Hotspots >> > > >> > > Function Module CPU Time >> > > rte_delay_us_block librte_eal.so.6.1 15.042s >> > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s >> > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so 3.522s >> > > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so 2.470s >> > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 2.456s >> > > [Others] 6.656s >> > > >> > > We knew about other methods except `rte_delay_us_block`. So we investigated the callers of this method: >> > > >> > > Callers Effective Time Spin Time Overhead Time Effective Time Spin Time Overhead Time Wait Time: Total Wait Time: Self >> > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec 0usec >> > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s 0usec 0usec >> > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec 0usec >> > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec 0usec >> > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec 0usec >> > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% 0.064s 0usec 0usec >> > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s 0usec 0usec >> > > >> > > We lack sufficient knowledge to investigate more than this. 
>> > > >> > > Effective CPU utilization >> > > >> > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of 4 logical CPUs). We thought this is less. So we compared this with the raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 logical CPUs), and even then it is performing way better. >> > > >> > > It would be helpful if you give us insights on how to use these results or point us to some resources to do so. >> > > >> > > Thank you >> > > >> > >> > BTW, I was able to build ns3 with DPDK 18.11 it required a couple changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. >> > >> > I did have a problem including rte_mbuf.h file into your code. It appears the g++ compiler did not like referencing the struct rte_mbuf_sched inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union as a hack I moved the struct outside of the rte_mbuf structure and replaced the struct in the union with ’struct rte_mbuf_sched sched;', but I am guessing you are missing some compiler options in your build system as DPDK builds just fine without that hack. >> > >> > The next place was the rxmode and the txq_flags. The rxmode structure has changed and I commented out the inits in ns3 and then commented out the txq_flags init code as these are now the defaults. >> > >> > Regards, >> > Keith >> > >> >> Regards, >> Keith >> >> <Ssthresh.png> >> <Cwnd.png> Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2019-02-05 13:03 ` Wiles, Keith @ 2019-02-05 14:00 ` Harsh Patel 2019-02-05 14:12 ` Wiles, Keith 0 siblings, 1 reply; 43+ messages in thread From: Harsh Patel @ 2019-02-05 14:00 UTC (permalink / raw) To: Wiles, Keith; +Cc: Stephen Hemminger, Kyle Larose, users Hi, One of the mistakes was the following: ns-3 frees the packet buffer as soon as it writes to the socket, so we thought we should do the same. But DPDK, while writing, places the packet buffer on the tx descriptor ring and performs the transmission afterwards on its own. We were freeing early, so sometimes packets were lost, i.e. freed before transmission. Another thing was that, as you suggested earlier, we compiled the whole of ns-3 in optimized mode. That improved the performance. These 2 things combined got us the desired results. Regards, Harsh & Hrishikesh On Tue, Feb 5, 2019, 18:33 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Feb 5, 2019, at 12:37 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > Hi, > > > > We would like to inform you that our code is working as expected and we > are able to obtain 95-98 Mbps data rate for a 100Mbps application rate. We > are now working on the testing of the code. Thanks a lot, especially to > Keith for all the help you provided. > > > > We have 2 main queries :- > > 1) We wanted to calculate Backlog at the NIC Tx Descriptors but were not > able to find anything in the documentation. Can you help us in how to > calculate the backlog? > > 2) We searched on how to use Byte Queue Limit (BQL) on the NIC queue but > couldn't find anything like that in DPDK. Does DPDK support BQL? If so, can > you help us on how to use it for our project? > > what was the last set of problems if I may ask?
> > > > Thanks & Regards > > Harsh & Hrishikesh > > > > On Thu, 31 Jan 2019 at 22:28, Wiles, Keith <keith.wiles@intel.com> > wrote: > > > > > > Sent from my iPhone > > > > On Jan 30, 2019, at 5:36 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > >> Hello, > >> > >> This mail is to inform you that the integration of DPDK is working with > ns-3 on a basic level. The model is running. > >> For UDP traffic we are getting throughput same or better than raw > socket. (Around 100Mbps) > >> But unfortunately for TCP, there are burst packet losses due to which > the throughput is drastically affected after some point of time. The > bandwidth of the link used was 100Mbps. > >> We have obtained cwnd and ssthresh graphs which show that once the flow > gets out from Slow Start mode, there are so many packet losses that the > congestion window & the slow start threshold is not able to go above 4-5 > packets. > > > > Can you determine where the packets are being dropped? > >> We have attached the graphs with this mail. > >> > > > > I do not see the graphs attached but that’s OK. > >> We would like to know if there is any reason to this or how can we fix > this. > > > > I think we have to find out where the packets are being dropped this is > the only reason for the case to your referring to. > >> > >> Thanks & Regards > >> Harsh & Hrishikesh > >> > >> On Wed, 16 Jan 2019 at 19:25, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > >> Hi > >> > >> We were able to optimise the DPDK version. There were couple of things > we needed to do. > >> > >> We were using tx timeout as 1s/2048, which we found out to be very > less. Then we increased the timeout, but we were getting lot of > retransmissions. > >> > >> So we removed the timeout and sent single packet as soon as we get it. > This increased the throughput. > >> > >> Then we used DPDK feature to launch function on core, and gave a > dedicated core for Rx. This increased the throughput further. 
> >> > >> The code is working really well for low bandwidth (<~50Mbps) and is > outperforming raw socket version. > >> But for high bandwidth, we are getting packet length mismatches for > some reason. We are investigating it. > >> > >> We really thank you for the suggestions given by you and also for > keeping the patience for last couple of months. > >> > >> Thank you > >> > >> Regards, > >> Harsh & Hrishikesh > >> > >> On Fri, Jan 4, 2019, 11:27 Harsh Patel <thadodaharsh10@gmail.com> > wrote: > >> Yes that would be helpful. > >> It'd be ok for now to use the same dpdk version to overcome the build > issues. > >> We will look into updating the code for latest versions once we get > past this problem. > >> > >> Thank you very much. > >> > >> Regards, > >> Harsh & Hrishikesh > >> > >> On Fri, Jan 4, 2019, 04:13 Wiles, Keith <keith.wiles@intel.com> wrote: > >> > >> > >> > On Jan 3, 2019, at 12:12 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > >> > > >> > Hi > >> > > >> > We applied your suggestion of removing the `IsLinkUp()` call. But the > performace is even worse. We could only get around 340kbits/s. > >> > > >> > The Top Hotspots are: > >> > > >> > Function Module CPU Time > >> > eth_em_recv_pkts librte_pmd_e1000.so 15.106s > >> > rte_delay_us_block librte_eal.so.6.1 7.372s > >> > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so > 5.080s > >> > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 3.558s > >> > ns3::DpdkNetDeviceReader::DoRead > libns3.28.1-fd-net-device-debug.so 3.364s > >> > [Others] 4.760s > >> > >> Performance reduced by removing that link status check, that is weird. > >> > > >> > Upon checking the callers of `rte_delay_us_block`, we got to know > that most of the time (92%) spent in this function is during initialization. > >> > This does not waste our processing time during communication. So, > it's a good start to our optimization. 
> >> > > >> > Callers CPU Time: Total CPU Time: Self > >> > rte_delay_us_block 100.0% 7.372s > >> > e1000_enable_ulp_lpt_lp 92.3% 6.804s > >> > e1000_write_phy_reg_mdic 1.8% 0.136s > >> > e1000_reset_hw_ich8lan 1.7% 0.128s > >> > e1000_read_phy_reg_mdic 1.4% 0.104s > >> > eth_em_link_update 1.4% 0.100s > >> > e1000_get_cfg_done_generic 0.7% 0.052s > >> > e1000_post_phy_reset_ich8lan.part.18 0.7% 0.048s > >> > >> I guess you are having vTune start your application and that is why you > have init time items in your log. I normally start my application and then > attach vtune to the application. One of the options in configuration of > vtune for that project is to attach to the application. Maybe it would help > hear. > >> > >> Looking at the data you provided it was ok. The problem is it would not > load the source files as I did not have the same build or executable. I > tried to build the code, but it failed to build and I did not go further. I > guess I would need to see the full source tree and the executable you used > to really look at the problem. I have limited time, but I can try if you > like. > >> > > >> > > >> > Effective CPU Utilization: 21.4% (0.856 out of 4) > >> > > >> > Here is the link to vtune profiling results. > https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5 > >> > > >> > Thank you > >> > > >> > Regards > >> > > >> > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> > wrote: > >> > > >> > > >> > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > >> > > > >> > > Hello, > >> > > As suggested, we tried profiling the application using Intel VTune > Amplifier. We aren't sure how to use these results, so we are attaching > them to this email. > >> > > > >> > > The things we understood were 'Top Hotspots' and 'Effective CPU > utilization'. 
Following are some of our understandings: > >> > > > >> > > Top Hotspots > >> > > > >> > > Function Module CPU Time > >> > > rte_delay_us_block librte_eal.so.6.1 15.042s > >> > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s > >> > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so > 3.522s > >> > > ns3::DpdkNetDeviceReader::DoRead > libns3.28.1-fd-net-device-debug.so 2.470s > >> > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so > 2.456s > >> > > [Others] 6.656s > >> > > > >> > > We knew about other methods except `rte_delay_us_block`. So we > investigated the callers of this method: > >> > > > >> > > Callers Effective Time Spin Time Overhead Time Effective > Time Spin Time Overhead Time Wait Time: Total Wait Time: > Self > >> > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec > 0usec > >> > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s > 0usec 0usec > >> > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec > 0usec > >> > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec > 0usec > >> > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec > 0usec > >> > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% > 0.064s 0usec 0usec > >> > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s > 0usec 0usec > >> > > > >> > > We lack sufficient knowledge to investigate more than this. > >> > > > >> > > Effective CPU utilization > >> > > > >> > > Interestingly, the effective CPU utilization was 20.8% (0.832 out > of 4 logical CPUs). We thought this is less. So we compared this with the > raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 > logical CPUs), and even then it is performing way better. > >> > > > >> > > It would be helpful if you give us insights on how to use these > results or point us to some resources to do so. > >> > > > >> > > Thank you > >> > > > >> > > >> > BTW, I was able to build ns3 with DPDK 18.11 it required a couple > changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. 
> >> > > >> > I did have a problem including rte_mbuf.h file into your code. It > appears the g++ compiler did not like referencing the struct rte_mbuf_sched > inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union > as a hack I moved the struct outside of the rte_mbuf structure and replaced > the struct in the union with ’struct rte_mbuf_sched sched;', but I am > guessing you are missing some compiler options in your build system as DPDK > builds just fine without that hack. > >> > > >> > The next place was the rxmode and the txq_flags. The rxmode structure > has changed and I commented out the inits in ns3 and then commented out the > txq_flags init code as these are now the defaults. > >> > > >> > Regards, > >> > Keith > >> > > >> > >> Regards, > >> Keith > >> > >> <Ssthresh.png> > >> <Cwnd.png> > > Regards, > Keith > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [dpdk-users] Query on handling packets 2019-02-05 14:00 ` Harsh Patel @ 2019-02-05 14:12 ` Wiles, Keith 2019-02-05 14:22 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2019-02-05 14:12 UTC (permalink / raw) To: Harsh Patel; +Cc: Stephen Hemminger, Kyle Larose, users > On Feb 5, 2019, at 8:00 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Hi, > One of the mistake was as following. ns-3 frees the packet buffer just as it writes to the socket and thus we thought that we should also do the same. But dpdk while writing places the packet buffer to the tx descriptor ring and perform the transmission after that on its own. And we were freeing early so sometimes the packets were lost i.e. freed before transmission. > > Another thing was that as you suggested earlier we compiled the whole ns-3 in optimized mode. That improved the performance. > > These 2 things combined got us the desired results. Excellent thanks > > Regards, > Harsh & Hrishikesh > > On Tue, Feb 5, 2019, 18:33 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Feb 5, 2019, at 12:37 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > Hi, > > > > We would like to inform you that our code is working as expected and we are able to obtain 95-98 Mbps data rate for a 100Mbps application rate. We are now working on the testing of the code. Thanks a lot, especially to Keith for all the help you provided. > > > > We have 2 main queries :- > > 1) We wanted to calculate Backlog at the NIC Tx Descriptors but were not able to find anything in the documentation. Can you help us in how to calculate the backlog? > > 2) We searched on how to use Byte Queue Limit (BQL) on the NIC queue but couldn't find anything like that in DPDK. Does DPDK support BQL? If so, can you help us on how to use it for our project? > > what was the last set of problems if I may ask? 
> > > > Thanks & Regards > > Harsh & Hrishikesh > > > > On Thu, 31 Jan 2019 at 22:28, Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > Sent from my iPhone > > > > On Jan 30, 2019, at 5:36 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > >> Hello, > >> > >> This mail is to inform you that the integration of DPDK is working with ns-3 on a basic level. The model is running. > >> For UDP traffic we are getting throughput same or better than raw socket. (Around 100Mbps) > >> But unfortunately for TCP, there are burst packet losses due to which the throughput is drastically affected after some point of time. The bandwidth of the link used was 100Mbps. > >> We have obtained cwnd and ssthresh graphs which show that once the flow gets out from Slow Start mode, there are so many packet losses that the congestion window & the slow start threshold is not able to go above 4-5 packets. > > > > Can you determine where the packets are being dropped? > >> We have attached the graphs with this mail. > >> > > > > I do not see the graphs attached but that’s OK. > >> We would like to know if there is any reason to this or how can we fix this. > > > > I think we have to find out where the packets are being dropped this is the only reason for the case to your referring to. > >> > >> Thanks & Regards > >> Harsh & Hrishikesh > >> > >> On Wed, 16 Jan 2019 at 19:25, Harsh Patel <thadodaharsh10@gmail.com> wrote: > >> Hi > >> > >> We were able to optimise the DPDK version. There were couple of things we needed to do. > >> > >> We were using tx timeout as 1s/2048, which we found out to be very less. Then we increased the timeout, but we were getting lot of retransmissions. > >> > >> So we removed the timeout and sent single packet as soon as we get it. This increased the throughput. > >> > >> Then we used DPDK feature to launch function on core, and gave a dedicated core for Rx. This increased the throughput further. 
> >> > >> The code is working really well for low bandwidth (<~50Mbps) and is outperforming raw socket version. > >> But for high bandwidth, we are getting packet length mismatches for some reason. We are investigating it. > >> > >> We really thank you for the suggestions given by you and also for keeping the patience for last couple of months. > >> > >> Thank you > >> > >> Regards, > >> Harsh & Hrishikesh > >> > >> On Fri, Jan 4, 2019, 11:27 Harsh Patel <thadodaharsh10@gmail.com> wrote: > >> Yes that would be helpful. > >> It'd be ok for now to use the same dpdk version to overcome the build issues. > >> We will look into updating the code for latest versions once we get past this problem. > >> > >> Thank you very much. > >> > >> Regards, > >> Harsh & Hrishikesh > >> > >> On Fri, Jan 4, 2019, 04:13 Wiles, Keith <keith.wiles@intel.com> wrote: > >> > >> > >> > On Jan 3, 2019, at 12:12 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > >> > > >> > Hi > >> > > >> > We applied your suggestion of removing the `IsLinkUp()` call. But the performace is even worse. We could only get around 340kbits/s. > >> > > >> > The Top Hotspots are: > >> > > >> > Function Module CPU Time > >> > eth_em_recv_pkts librte_pmd_e1000.so 15.106s > >> > rte_delay_us_block librte_eal.so.6.1 7.372s > >> > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so 5.080s > >> > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 3.558s > >> > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so 3.364s > >> > [Others] 4.760s > >> > >> Performance reduced by removing that link status check, that is weird. > >> > > >> > Upon checking the callers of `rte_delay_us_block`, we got to know that most of the time (92%) spent in this function is during initialization. > >> > This does not waste our processing time during communication. So, it's a good start to our optimization. 
> >> > > >> > Callers CPU Time: Total CPU Time: Self > >> > rte_delay_us_block 100.0% 7.372s > >> > e1000_enable_ulp_lpt_lp 92.3% 6.804s > >> > e1000_write_phy_reg_mdic 1.8% 0.136s > >> > e1000_reset_hw_ich8lan 1.7% 0.128s > >> > e1000_read_phy_reg_mdic 1.4% 0.104s > >> > eth_em_link_update 1.4% 0.100s > >> > e1000_get_cfg_done_generic 0.7% 0.052s > >> > e1000_post_phy_reset_ich8lan.part.18 0.7% 0.048s > >> > >> I guess you are having vTune start your application and that is why you have init time items in your log. I normally start my application and then attach vtune to the application. One of the options in configuration of vtune for that project is to attach to the application. Maybe it would help hear. > >> > >> Looking at the data you provided it was ok. The problem is it would not load the source files as I did not have the same build or executable. I tried to build the code, but it failed to build and I did not go further. I guess I would need to see the full source tree and the executable you used to really look at the problem. I have limited time, but I can try if you like. > >> > > >> > > >> > Effective CPU Utilization: 21.4% (0.856 out of 4) > >> > > >> > Here is the link to vtune profiling results. https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5 > >> > > >> > Thank you > >> > > >> > Regards > >> > > >> > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> wrote: > >> > > >> > > >> > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > >> > > > >> > > Hello, > >> > > As suggested, we tried profiling the application using Intel VTune Amplifier. We aren't sure how to use these results, so we are attaching them to this email. > >> > > > >> > > The things we understood were 'Top Hotspots' and 'Effective CPU utilization'. 
Following are some of our understandings: > >> > > > >> > > Top Hotspots > >> > > > >> > > Function Module CPU Time > >> > > rte_delay_us_block librte_eal.so.6.1 15.042s > >> > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s > >> > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so 3.522s > >> > > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so 2.470s > >> > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 2.456s > >> > > [Others] 6.656s > >> > > > >> > > We knew about other methods except `rte_delay_us_block`. So we investigated the callers of this method: > >> > > > >> > > Callers Effective Time Spin Time Overhead Time Effective Time Spin Time Overhead Time Wait Time: Total Wait Time: Self > >> > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec 0usec > >> > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s 0usec 0usec > >> > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec 0usec > >> > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec 0usec > >> > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec 0usec > >> > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% 0.064s 0usec 0usec > >> > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s 0usec 0usec > >> > > > >> > > We lack sufficient knowledge to investigate more than this. > >> > > > >> > > Effective CPU utilization > >> > > > >> > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of 4 logical CPUs). We thought this is less. So we compared this with the raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 logical CPUs), and even then it is performing way better. > >> > > > >> > > It would be helpful if you give us insights on how to use these results or point us to some resources to do so. > >> > > > >> > > Thank you > >> > > > >> > > >> > BTW, I was able to build ns3 with DPDK 18.11 it required a couple changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. 
> >> > > >> > I did have a problem including rte_mbuf.h file into your code. It appears the g++ compiler did not like referencing the struct rte_mbuf_sched inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union as a hack I moved the struct outside of the rte_mbuf structure and replaced the struct in the union with ’struct rte_mbuf_sched sched;', but I am guessing you are missing some compiler options in your build system as DPDK builds just fine without that hack. > >> > > >> > The next place was the rxmode and the txq_flags. The rxmode structure has changed and I commented out the inits in ns3 and then commented out the txq_flags init code as these are now the defaults. > >> > > >> > Regards, > >> > Keith > >> > > >> > >> Regards, > >> Keith > >> > >> <Ssthresh.png> > >> <Cwnd.png> > > Regards, > Keith > Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
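[Editor's note] Keith's port notes above (rxmode bitfields changed, txq_flags init removed) can be sketched as a minimal DPDK 18.11 single-queue port bring-up. This is an illustrative sketch, not a patch from the thread: `PORT_ID` and `NB_DESC` are made-up names, error handling is abbreviated, and it needs the DPDK 18.11 SDK to build.

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define PORT_ID 0      /* illustrative */
#define NB_DESC 1024   /* illustrative */

static int
port_init(struct rte_mempool *pool)
{
	/* In 18.11 a zeroed rte_eth_conf picks up sane defaults,
	 * replacing the old per-field rxmode initialization. */
	struct rte_eth_conf conf = { 0 };
	int ret;

	ret = rte_eth_dev_configure(PORT_ID, 1 /* rx queues */,
				    1 /* tx queues */, &conf);
	if (ret < 0)
		return ret;

	ret = rte_eth_rx_queue_setup(PORT_ID, 0, NB_DESC,
				     rte_eth_dev_socket_id(PORT_ID),
				     NULL, pool);
	if (ret < 0)
		return ret;

	/* Passing a NULL txconf uses the driver defaults, which is
	 * what commenting out the txq_flags init amounts to. */
	ret = rte_eth_tx_queue_setup(PORT_ID, 0, NB_DESC,
				     rte_eth_dev_socket_id(PORT_ID),
				     NULL);
	if (ret < 0)
		return ret;

	return rte_eth_dev_start(PORT_ID);
}
```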
* Re: [dpdk-users] Query on handling packets 2019-02-05 14:12 ` Wiles, Keith @ 2019-02-05 14:22 ` Harsh Patel 2019-02-05 14:27 ` Wiles, Keith 0 siblings, 1 reply; 43+ messages in thread From: Harsh Patel @ 2019-02-05 14:22 UTC (permalink / raw) To: Wiles, Keith; +Cc: Stephen Hemminger, Kyle Larose, users Can you help us with those questions we asked you? We need them as parameters for our testing. Thanks, Harsh & Hrishikesh On Tue, Feb 5, 2019, 19:42 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Feb 5, 2019, at 8:00 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > Hi, > > One of the mistake was as following. ns-3 frees the packet buffer just > as it writes to the socket and thus we thought that we should also do the > same. But dpdk while writing places the packet buffer to the tx descriptor > ring and perform the transmission after that on its own. And we were > freeing early so sometimes the packets were lost i.e. freed before > transmission. > > > > Another thing was that as you suggested earlier we compiled the whole > ns-3 in optimized mode. That improved the performance. > > > > These 2 things combined got us the desired results. > > Excellent thanks > > > > Regards, > > Harsh & Hrishikesh > > > > On Tue, Feb 5, 2019, 18:33 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > > On Feb 5, 2019, at 12:37 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > > > Hi, > > > > > > We would like to inform you that our code is working as expected and > we are able to obtain 95-98 Mbps data rate for a 100Mbps application rate. > We are now working on the testing of the code. Thanks a lot, especially to > Keith for all the help you provided. > > > > > > We have 2 main queries :- > > > 1) We wanted to calculate Backlog at the NIC Tx Descriptors but were > not able to find anything in the documentation. Can you help us in how to > calculate the backlog? 
> > > 2) We searched on how to use Byte Queue Limit (BQL) on the NIC queue > but couldn't find anything like that in DPDK. Does DPDK support BQL? If so, > can you help us on how to use it for our project? > > > > what was the last set of problems if I may ask? > > > > > > Thanks & Regards > > > Harsh & Hrishikesh > > > > > > On Thu, 31 Jan 2019 at 22:28, Wiles, Keith <keith.wiles@intel.com> > wrote: > > > > > > > > > Sent from my iPhone > > > > > > On Jan 30, 2019, at 5:36 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > > >> Hello, > > >> > > >> This mail is to inform you that the integration of DPDK is working > with ns-3 on a basic level. The model is running. > > >> For UDP traffic we are getting throughput same or better than raw > socket. (Around 100Mbps) > > >> But unfortunately for TCP, there are burst packet losses due to which > the throughput is drastically affected after some point of time. The > bandwidth of the link used was 100Mbps. > > >> We have obtained cwnd and ssthresh graphs which show that once the > flow gets out from Slow Start mode, there are so many packet losses that > the congestion window & the slow start threshold is not able to go above > 4-5 packets. > > > > > > Can you determine where the packets are being dropped? > > >> We have attached the graphs with this mail. > > >> > > > > > > I do not see the graphs attached but that’s OK. > > >> We would like to know if there is any reason to this or how can we > fix this. > > > > > > I think we have to find out where the packets are being dropped this > is the only reason for the case to your referring to. > > >> > > >> Thanks & Regards > > >> Harsh & Hrishikesh > > >> > > >> On Wed, 16 Jan 2019 at 19:25, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > >> Hi > > >> > > >> We were able to optimise the DPDK version. There were couple of > things we needed to do. > > >> > > >> We were using tx timeout as 1s/2048, which we found out to be very > less. 
Then we increased the timeout, but we were getting lot of > retransmissions. > > >> > > >> So we removed the timeout and sent single packet as soon as we get > it. This increased the throughput. > > >> > > >> Then we used DPDK feature to launch function on core, and gave a > dedicated core for Rx. This increased the throughput further. > > >> > > >> The code is working really well for low bandwidth (<~50Mbps) and is > outperforming raw socket version. > > >> But for high bandwidth, we are getting packet length mismatches for > some reason. We are investigating it. > > >> > > >> We really thank you for the suggestions given by you and also for > keeping the patience for last couple of months. > > >> > > >> Thank you > > >> > > >> Regards, > > >> Harsh & Hrishikesh > > >> > > >> On Fri, Jan 4, 2019, 11:27 Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > >> Yes that would be helpful. > > >> It'd be ok for now to use the same dpdk version to overcome the build > issues. > > >> We will look into updating the code for latest versions once we get > past this problem. > > >> > > >> Thank you very much. > > >> > > >> Regards, > > >> Harsh & Hrishikesh > > >> > > >> On Fri, Jan 4, 2019, 04:13 Wiles, Keith <keith.wiles@intel.com> > wrote: > > >> > > >> > > >> > On Jan 3, 2019, at 12:12 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > >> > > > >> > Hi > > >> > > > >> > We applied your suggestion of removing the `IsLinkUp()` call. But > the performace is even worse. We could only get around 340kbits/s. 
> > >> > > > >> > The Top Hotspots are: > > >> > > > >> > Function Module CPU Time > > >> > eth_em_recv_pkts librte_pmd_e1000.so 15.106s > > >> > rte_delay_us_block librte_eal.so.6.1 7.372s > > >> > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so > 5.080s > > >> > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 3.558s > > >> > ns3::DpdkNetDeviceReader::DoRead > libns3.28.1-fd-net-device-debug.so 3.364s > > >> > [Others] 4.760s > > >> > > >> Performance reduced by removing that link status check, that is weird. > > >> > > > >> > Upon checking the callers of `rte_delay_us_block`, we got to know > that most of the time (92%) spent in this function is during initialization. > > >> > This does not waste our processing time during communication. So, > it's a good start to our optimization. > > >> > > > >> > Callers CPU Time: Total CPU Time: Self > > >> > rte_delay_us_block 100.0% 7.372s > > >> > e1000_enable_ulp_lpt_lp 92.3% 6.804s > > >> > e1000_write_phy_reg_mdic 1.8% 0.136s > > >> > e1000_reset_hw_ich8lan 1.7% 0.128s > > >> > e1000_read_phy_reg_mdic 1.4% 0.104s > > >> > eth_em_link_update 1.4% 0.100s > > >> > e1000_get_cfg_done_generic 0.7% 0.052s > > >> > e1000_post_phy_reset_ich8lan.part.18 0.7% 0.048s > > >> > > >> I guess you are having vTune start your application and that is why > you have init time items in your log. I normally start my application and > then attach vtune to the application. One of the options in configuration > of vtune for that project is to attach to the application. Maybe it would > help hear. > > >> > > >> Looking at the data you provided it was ok. The problem is it would > not load the source files as I did not have the same build or executable. I > tried to build the code, but it failed to build and I did not go further. I > guess I would need to see the full source tree and the executable you used > to really look at the problem. I have limited time, but I can try if you > like. 
> > >> > > > >> > > > >> > Effective CPU Utilization: 21.4% (0.856 out of 4) > > >> > > > >> > Here is the link to vtune profiling results. > https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5 > > >> > > > >> > Thank you > > >> > > > >> > Regards > > >> > > > >> > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> > wrote: > > >> > > > >> > > > >> > > On Dec 29, 2018, at 4:03 PM, Harsh Patel < > thadodaharsh10@gmail.com> wrote: > > >> > > > > >> > > Hello, > > >> > > As suggested, we tried profiling the application using Intel > VTune Amplifier. We aren't sure how to use these results, so we are > attaching them to this email. > > >> > > > > >> > > The things we understood were 'Top Hotspots' and 'Effective CPU > utilization'. Following are some of our understandings: > > >> > > > > >> > > Top Hotspots > > >> > > > > >> > > Function Module CPU Time > > >> > > rte_delay_us_block librte_eal.so.6.1 15.042s > > >> > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s > > >> > > ns3::DpdkNetDevice::Read > libns3.28.1-fd-net-device-debug.so 3.522s > > >> > > ns3::DpdkNetDeviceReader::DoRead > libns3.28.1-fd-net-device-debug.so 2.470s > > >> > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so > 2.456s > > >> > > [Others] 6.656s > > >> > > > > >> > > We knew about other methods except `rte_delay_us_block`. 
So we > investigated the callers of this method: > > >> > > > > >> > > Callers Effective Time Spin Time Overhead Time Effective > Time Spin Time Overhead Time Wait Time: Total Wait Time: > Self > > >> > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec > 0usec > > >> > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s > 0usec 0usec > > >> > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec > 0usec > > >> > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec > 0usec > > >> > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec > 0usec > > >> > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% > 0.064s 0usec 0usec > > >> > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s > 0usec 0usec > > >> > > > > >> > > We lack sufficient knowledge to investigate more than this. > > >> > > > > >> > > Effective CPU utilization > > >> > > > > >> > > Interestingly, the effective CPU utilization was 20.8% (0.832 out > of 4 logical CPUs). We thought this is less. So we compared this with the > raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 > logical CPUs), and even then it is performing way better. > > >> > > > > >> > > It would be helpful if you give us insights on how to use these > results or point us to some resources to do so. > > >> > > > > >> > > Thank you > > >> > > > > >> > > > >> > BTW, I was able to build ns3 with DPDK 18.11 it required a couple > changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. > > >> > > > >> > I did have a problem including rte_mbuf.h file into your code. It > appears the g++ compiler did not like referencing the struct rte_mbuf_sched > inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union > as a hack I moved the struct outside of the rte_mbuf structure and replaced > the struct in the union with ’struct rte_mbuf_sched sched;', but I am > guessing you are missing some compiler options in your build system as DPDK > builds just fine without that hack. 
> > >> > > > >> > The next place was the rxmode and the txq_flags. The rxmode > structure has changed and I commented out the inits in ns3 and then > commented out the txq_flags init code as these are now the defaults. > > >> > > > >> > Regards, > > >> > Keith > > >> > > > >> > > >> Regards, > > >> Keith > > >> > > >> <Ssthresh.png> > > >> <Cwnd.png> > > > > Regards, > > Keith > > > > Regards, > Keith > > ^ permalink raw reply [flat|nested] 43+ messages in thread
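[Editor's note] The early-free bug Harsh describes above ("freed before transmission") comes down to mbuf ownership: `rte_eth_tx_burst()` takes ownership of the mbufs it accepts, and the PMD frees them itself once the hardware has transmitted them. The caller must free only the packets the burst call did NOT accept. A minimal sketch (port/queue ids are illustrative, DPDK SDK required to build):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void
send_one(uint16_t port_id, uint16_t queue_id, struct rte_mbuf *pkt)
{
	uint16_t sent = rte_eth_tx_burst(port_id, queue_id, &pkt, 1);

	/* Freeing pkt here when sent == 1 would reproduce the bug from
	 * the thread: the mbuf is still sitting in the TX descriptor
	 * ring and gets reclaimed by the PMD after transmission. */
	if (sent == 0)
		rte_pktmbuf_free(pkt);	/* not enqueued; caller still owns it */
}
```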
* Re: [dpdk-users] Query on handling packets 2019-02-05 14:22 ` Harsh Patel @ 2019-02-05 14:27 ` Wiles, Keith 2019-02-05 14:33 ` Harsh Patel 0 siblings, 1 reply; 43+ messages in thread From: Wiles, Keith @ 2019-02-05 14:27 UTC (permalink / raw) To: Harsh Patel; +Cc: Stephen Hemminger, Kyle Larose, users > On Feb 5, 2019, at 8:22 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > Can you help us with those questions we asked you? We need them as parameters for our testing. i would love to but i do not know much about what you are asking, sorry. i hope someone else steps in, maybe the pmd maintainer could help. look in the maintainers file and message him directly. > > Thanks, > Harsh & Hrishikesh > > On Tue, Feb 5, 2019, 19:42 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Feb 5, 2019, at 8:00 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > Hi, > > One of the mistake was as following. ns-3 frees the packet buffer just as it writes to the socket and thus we thought that we should also do the same. But dpdk while writing places the packet buffer to the tx descriptor ring and perform the transmission after that on its own. And we were freeing early so sometimes the packets were lost i.e. freed before transmission. > > > > Another thing was that as you suggested earlier we compiled the whole ns-3 in optimized mode. That improved the performance. > > > > These 2 things combined got us the desired results. > > Excellent thanks > > > > Regards, > > Harsh & Hrishikesh > > > > On Tue, Feb 5, 2019, 18:33 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > > On Feb 5, 2019, at 12:37 AM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > > > Hi, > > > > > > We would like to inform you that our code is working as expected and we are able to obtain 95-98 Mbps data rate for a 100Mbps application rate. We are now working on the testing of the code. Thanks a lot, especially to Keith for all the help you provided. 
> > > > > > We have 2 main queries :- > > > 1) We wanted to calculate Backlog at the NIC Tx Descriptors but were not able to find anything in the documentation. Can you help us in how to calculate the backlog? > > > 2) We searched on how to use Byte Queue Limit (BQL) on the NIC queue but couldn't find anything like that in DPDK. Does DPDK support BQL? If so, can you help us on how to use it for our project? > > > > what was the last set of problems if I may ask? > > > > > > Thanks & Regards > > > Harsh & Hrishikesh > > > > > > On Thu, 31 Jan 2019 at 22:28, Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > > > > Sent from my iPhone > > > > > > On Jan 30, 2019, at 5:36 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > > > > >> Hello, > > >> > > >> This mail is to inform you that the integration of DPDK is working with ns-3 on a basic level. The model is running. > > >> For UDP traffic we are getting throughput same or better than raw socket. (Around 100Mbps) > > >> But unfortunately for TCP, there are burst packet losses due to which the throughput is drastically affected after some point of time. The bandwidth of the link used was 100Mbps. > > >> We have obtained cwnd and ssthresh graphs which show that once the flow gets out from Slow Start mode, there are so many packet losses that the congestion window & the slow start threshold is not able to go above 4-5 packets. > > > > > > Can you determine where the packets are being dropped? > > >> We have attached the graphs with this mail. > > >> > > > > > > I do not see the graphs attached but that’s OK. > > >> We would like to know if there is any reason to this or how can we fix this. > > > > > > I think we have to find out where the packets are being dropped this is the only reason for the case to your referring to. 
> > >> > > >> Thanks & Regards > > >> Harsh & Hrishikesh > > >> > > >> On Wed, 16 Jan 2019 at 19:25, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > >> Hi > > >> > > >> We were able to optimise the DPDK version. There were couple of things we needed to do. > > >> > > >> We were using tx timeout as 1s/2048, which we found out to be very less. Then we increased the timeout, but we were getting lot of retransmissions. > > >> > > >> So we removed the timeout and sent single packet as soon as we get it. This increased the throughput. > > >> > > >> Then we used DPDK feature to launch function on core, and gave a dedicated core for Rx. This increased the throughput further. > > >> > > >> The code is working really well for low bandwidth (<~50Mbps) and is outperforming raw socket version. > > >> But for high bandwidth, we are getting packet length mismatches for some reason. We are investigating it. > > >> > > >> We really thank you for the suggestions given by you and also for keeping the patience for last couple of months. > > >> > > >> Thank you > > >> > > >> Regards, > > >> Harsh & Hrishikesh > > >> > > >> On Fri, Jan 4, 2019, 11:27 Harsh Patel <thadodaharsh10@gmail.com> wrote: > > >> Yes that would be helpful. > > >> It'd be ok for now to use the same dpdk version to overcome the build issues. > > >> We will look into updating the code for latest versions once we get past this problem. > > >> > > >> Thank you very much. > > >> > > >> Regards, > > >> Harsh & Hrishikesh > > >> > > >> On Fri, Jan 4, 2019, 04:13 Wiles, Keith <keith.wiles@intel.com> wrote: > > >> > > >> > > >> > On Jan 3, 2019, at 12:12 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > >> > > > >> > Hi > > >> > > > >> > We applied your suggestion of removing the `IsLinkUp()` call. But the performace is even worse. We could only get around 340kbits/s. 
> > >> > > > >> > The Top Hotspots are: > > >> > > > >> > Function Module CPU Time > > >> > eth_em_recv_pkts librte_pmd_e1000.so 15.106s > > >> > rte_delay_us_block librte_eal.so.6.1 7.372s > > >> > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so 5.080s > > >> > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 3.558s > > >> > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so 3.364s > > >> > [Others] 4.760s > > >> > > >> Performance reduced by removing that link status check, that is weird. > > >> > > > >> > Upon checking the callers of `rte_delay_us_block`, we got to know that most of the time (92%) spent in this function is during initialization. > > >> > This does not waste our processing time during communication. So, it's a good start to our optimization. > > >> > > > >> > Callers CPU Time: Total CPU Time: Self > > >> > rte_delay_us_block 100.0% 7.372s > > >> > e1000_enable_ulp_lpt_lp 92.3% 6.804s > > >> > e1000_write_phy_reg_mdic 1.8% 0.136s > > >> > e1000_reset_hw_ich8lan 1.7% 0.128s > > >> > e1000_read_phy_reg_mdic 1.4% 0.104s > > >> > eth_em_link_update 1.4% 0.100s > > >> > e1000_get_cfg_done_generic 0.7% 0.052s > > >> > e1000_post_phy_reset_ich8lan.part.18 0.7% 0.048s > > >> > > >> I guess you are having vTune start your application and that is why you have init time items in your log. I normally start my application and then attach vtune to the application. One of the options in configuration of vtune for that project is to attach to the application. Maybe it would help hear. > > >> > > >> Looking at the data you provided it was ok. The problem is it would not load the source files as I did not have the same build or executable. I tried to build the code, but it failed to build and I did not go further. I guess I would need to see the full source tree and the executable you used to really look at the problem. I have limited time, but I can try if you like. 
> > >> > > > >> > > > >> > Effective CPU Utilization: 21.4% (0.856 out of 4) > > >> > > > >> > Here is the link to vtune profiling results. https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5 > > >> > > > >> > Thank you > > >> > > > >> > Regards > > >> > > > >> > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> wrote: > > >> > > > >> > > > >> > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote: > > >> > > > > >> > > Hello, > > >> > > As suggested, we tried profiling the application using Intel VTune Amplifier. We aren't sure how to use these results, so we are attaching them to this email. > > >> > > > > >> > > The things we understood were 'Top Hotspots' and 'Effective CPU utilization'. Following are some of our understandings: > > >> > > > > >> > > Top Hotspots > > >> > > > > >> > > Function Module CPU Time > > >> > > rte_delay_us_block librte_eal.so.6.1 15.042s > > >> > > eth_em_recv_pkts librte_pmd_e1000.so 9.544s > > >> > > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so 3.522s > > >> > > ns3::DpdkNetDeviceReader::DoRead libns3.28.1-fd-net-device-debug.so 2.470s > > >> > > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 2.456s > > >> > > [Others] 6.656s > > >> > > > > >> > > We knew about other methods except `rte_delay_us_block`. 
So we investigated the callers of this method: > > >> > > > > >> > > Callers Effective Time Spin Time Overhead Time Effective Time Spin Time Overhead Time Wait Time: Total Wait Time: Self > > >> > > e1000_enable_ulp_lpt_lp 45.6% 0.0% 0.0% 6.860s 0usec 0usec > > >> > > e1000_write_phy_reg_mdic 32.7% 0.0% 0.0% 4.916s 0usec 0usec > > >> > > e1000_read_phy_reg_mdic 19.4% 0.0% 0.0% 2.922s 0usec 0usec > > >> > > e1000_reset_hw_ich8lan 1.0% 0.0% 0.0% 0.143s 0usec 0usec > > >> > > eth_em_link_update 0.7% 0.0% 0.0% 0.100s 0usec 0usec > > >> > > e1000_post_phy_reset_ich8lan.part.18 0.4% 0.0% 0.0% 0.064s 0usec 0usec > > >> > > e1000_get_cfg_done_generic 0.2% 0.0% 0.0% 0.037s 0usec 0usec > > >> > > > > >> > > We lack sufficient knowledge to investigate more than this. > > >> > > > > >> > > Effective CPU utilization > > >> > > > > >> > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of 4 logical CPUs). We thought this is less. So we compared this with the raw-socket version of the code, which was even less, 8.0% (0.318 out of 4 logical CPUs), and even then it is performing way better. > > >> > > > > >> > > It would be helpful if you give us insights on how to use these results or point us to some resources to do so. > > >> > > > > >> > > Thank you > > >> > > > > >> > > > >> > BTW, I was able to build ns3 with DPDK 18.11 it required a couple changes in the DPDK init code in ns3 plus one hack in rte_mbuf.h file. > > >> > > > >> > I did have a problem including rte_mbuf.h file into your code. It appears the g++ compiler did not like referencing the struct rte_mbuf_sched inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union as a hack I moved the struct outside of the rte_mbuf structure and replaced the struct in the union with ’struct rte_mbuf_sched sched;', but I am guessing you are missing some compiler options in your build system as DPDK builds just fine without that hack. 
> > >> > > > >> > The next place was the rxmode and the txq_flags. The rxmode structure has changed and I commented out the inits in ns3 and then commented out the txq_flags init code as these are now the defaults. > > >> > > > >> > Regards, > > >> > Keith > > >> > > > >> > > >> Regards, > > >> Keith > > >> > > >> <Ssthresh.png> > > >> <Cwnd.png> > > > > Regards, > > Keith > > > > Regards, > Keith > Regards, Keith ^ permalink raw reply [flat|nested] 43+ messages in thread
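[Editor's note] The "launch function on core, dedicated core for Rx" optimization quoted above maps onto DPDK's `rte_eal_remote_launch()`. A hedged sketch of the shape of that change: `rx_loop`, `BURST_SIZE`, and the free-instead-of-process body are illustrative (a real loop would hand packets to a consumer, e.g. through an `rte_ring`), and it assumes EAL and the port are already initialized.

```c
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32	/* illustrative */

static volatile int running = 1;

static int
rx_loop(void *arg)
{
	uint16_t port_id = *(uint16_t *)arg;
	struct rte_mbuf *pkts[BURST_SIZE];

	while (running) {
		uint16_t n = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);
		for (uint16_t i = 0; i < n; i++)
			rte_pktmbuf_free(pkts[i]); /* placeholder for real work */
	}
	return 0;
}

/* Called from the main lcore after EAL and port init; rx_lcore_id is a
 * worker lcore reserved for receive polling. */
static void
start_rx(uint16_t *port_id, unsigned int rx_lcore_id)
{
	rte_eal_remote_launch(rx_loop, port_id, rx_lcore_id);
}
```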
* Re: [dpdk-users] Query on handling packets 2019-02-05 14:27 ` Wiles, Keith @ 2019-02-05 14:33 ` Harsh Patel 0 siblings, 0 replies; 43+ messages in thread From: Harsh Patel @ 2019-02-05 14:33 UTC (permalink / raw) To: Wiles, Keith; +Cc: Stephen Hemminger, Kyle Larose, users Cool. Thanks a lot. We'll do that. On Tue, Feb 5, 2019, 19:57 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > On Feb 5, 2019, at 8:22 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > Can you help us with those questions we asked you? We need them as > parameters for our testing. > > i would love to but i do not know much about what you are asking, sorry. > > i hope someone else steps in, maybe the pmd maintainer could help. look in > the maintainers file and message him directly. > > > > Thanks, > > Harsh & Hrishikesh > > > > On Tue, Feb 5, 2019, 19:42 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > > On Feb 5, 2019, at 8:00 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > > > Hi, > > > One of the mistake was as following. ns-3 frees the packet buffer just > as it writes to the socket and thus we thought that we should also do the > same. But dpdk while writing places the packet buffer to the tx descriptor > ring and perform the transmission after that on its own. And we were > freeing early so sometimes the packets were lost i.e. freed before > transmission. > > > > > > Another thing was that as you suggested earlier we compiled the whole > ns-3 in optimized mode. That improved the performance. > > > > > > These 2 things combined got us the desired results. 
> > > > Excellent thanks > > > > > > Regards, > > > Harsh & Hrishikesh > > > > > > On Tue, Feb 5, 2019, 18:33 Wiles, Keith <keith.wiles@intel.com> wrote: > > > > > > > > > > On Feb 5, 2019, at 12:37 AM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > > > > > Hi, > > > > > > > > We would like to inform you that our code is working as expected and > we are able to obtain 95-98 Mbps data rate for a 100Mbps application rate. > We are now working on the testing of the code. Thanks a lot, especially to > Keith for all the help you provided. > > > > > > > > We have 2 main queries :- > > > > 1) We wanted to calculate Backlog at the NIC Tx Descriptors but were > not able to find anything in the documentation. Can you help us in how to > calculate the backlog? > > > > 2) We searched on how to use Byte Queue Limit (BQL) on the NIC queue > but couldn't find anything like that in DPDK. Does DPDK support BQL? If so, > can you help us on how to use it for our project? > > > > > > what was the last set of problems if I may ask? > > > > > > > > Thanks & Regards > > > > Harsh & Hrishikesh > > > > > > > > On Thu, 31 Jan 2019 at 22:28, Wiles, Keith <keith.wiles@intel.com> > wrote: > > > > > > > > > > > > Sent from my iPhone > > > > > > > > On Jan 30, 2019, at 5:36 PM, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > > > > > >> Hello, > > > >> > > > >> This mail is to inform you that the integration of DPDK is working > with ns-3 on a basic level. The model is running. > > > >> For UDP traffic we are getting throughput same or better than raw > socket. (Around 100Mbps) > > > >> But unfortunately for TCP, there are burst packet losses due to > which the throughput is drastically affected after some point of time. The > bandwidth of the link used was 100Mbps. 
> > > >> We have obtained cwnd and ssthresh graphs which show that once the > flow gets out from Slow Start mode, there are so many packet losses that > the congestion window & the slow start threshold is not able to go above > 4-5 packets. > > > > > > > > Can you determine where the packets are being dropped? > > > >> We have attached the graphs with this mail. > > > >> > > > > > > > > I do not see the graphs attached but that’s OK. > > > >> We would like to know if there is any reason to this or how can we > fix this. > > > > > > > > I think we have to find out where the packets are being dropped this > is the only reason for the case to your referring to. > > > >> > > > >> Thanks & Regards > > > >> Harsh & Hrishikesh > > > >> > > > >> On Wed, 16 Jan 2019 at 19:25, Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > >> Hi > > > >> > > > >> We were able to optimise the DPDK version. There were couple of > things we needed to do. > > > >> > > > >> We were using tx timeout as 1s/2048, which we found out to be very > less. Then we increased the timeout, but we were getting lot of > retransmissions. > > > >> > > > >> So we removed the timeout and sent single packet as soon as we get > it. This increased the throughput. > > > >> > > > >> Then we used DPDK feature to launch function on core, and gave a > dedicated core for Rx. This increased the throughput further. > > > >> > > > >> The code is working really well for low bandwidth (<~50Mbps) and is > outperforming raw socket version. > > > >> But for high bandwidth, we are getting packet length mismatches for > some reason. We are investigating it. > > > >> > > > >> We really thank you for the suggestions given by you and also for > keeping the patience for last couple of months. > > > >> > > > >> Thank you > > > >> > > > >> Regards, > > > >> Harsh & Hrishikesh > > > >> > > > >> On Fri, Jan 4, 2019, 11:27 Harsh Patel <thadodaharsh10@gmail.com> > wrote: > > > >> Yes that would be helpful. 
> > > >> It'd be ok for now to use the same dpdk version to overcome the > build issues. > > > >> We will look into updating the code for latest versions once we get > past this problem. > > > >> > > > >> Thank you very much. > > > >> > > > >> Regards, > > > >> Harsh & Hrishikesh > > > >> > > > >> On Fri, Jan 4, 2019, 04:13 Wiles, Keith <keith.wiles@intel.com> > wrote: > > > >> > > > >> > > > >> > On Jan 3, 2019, at 12:12 PM, Harsh Patel < > thadodaharsh10@gmail.com> wrote: > > > >> > > > > >> > Hi > > > >> > > > > >> > We applied your suggestion of removing the `IsLinkUp()` call. But > the performace is even worse. We could only get around 340kbits/s. > > > >> > > > > >> > The Top Hotspots are: > > > >> > > > > >> > Function Module CPU Time > > > >> > eth_em_recv_pkts librte_pmd_e1000.so 15.106s > > > >> > rte_delay_us_block librte_eal.so.6.1 7.372s > > > >> > ns3::DpdkNetDevice::Read libns3.28.1-fd-net-device-debug.so > 5.080s > > > >> > rte_eth_rx_burst libns3.28.1-fd-net-device-debug.so 3.558s > > > >> > ns3::DpdkNetDeviceReader::DoRead > libns3.28.1-fd-net-device-debug.so 3.364s > > > >> > [Others] 4.760s > > > >> > > > >> Performance reduced by removing that link status check, that is > weird. > > > >> > > > > >> > Upon checking the callers of `rte_delay_us_block`, we got to know > that most of the time (92%) spent in this function is during initialization. > > > >> > This does not waste our processing time during communication. So, > it's a good start to our optimization. 
> > > >> >
> > > >> > Callers                               CPU Time: Total  CPU Time: Self
> > > >> > rte_delay_us_block                    100.0%           7.372s
> > > >> > e1000_enable_ulp_lpt_lp                92.3%           6.804s
> > > >> > e1000_write_phy_reg_mdic                1.8%           0.136s
> > > >> > e1000_reset_hw_ich8lan                  1.7%           0.128s
> > > >> > e1000_read_phy_reg_mdic                 1.4%           0.104s
> > > >> > eth_em_link_update                      1.4%           0.100s
> > > >> > e1000_get_cfg_done_generic              0.7%           0.052s
> > > >> > e1000_post_phy_reset_ich8lan.part.18    0.7%           0.048s
> > > >>
> > > >> I guess you are having VTune start your application, and that is why you have init-time items in your log. I normally start my application and then attach VTune to it; one of the options in the VTune project configuration is to attach to a running application. Maybe that would help here.
> > > >>
> > > >> Looking at the data you provided, it was OK. The problem is it would not load the source files, as I did not have the same build or executable. I tried to build the code, but it failed to build and I did not go further. I guess I would need the full source tree and the executable you used to really look at the problem. I have limited time, but I can try if you like.
> > > >> >
> > > >> > Effective CPU Utilization: 21.4% (0.856 out of 4)
> > > >> >
> > > >> > Here is the link to the VTune profiling results: https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5
> > > >> >
> > > >> > Thank you
> > > >> >
> > > >> > Regards
> > > >> >
> > > >> > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <keith.wiles@intel.com> wrote:
> > > >> >
> > > >> > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <thadodaharsh10@gmail.com> wrote:
> > > >> > >
> > > >> > > Hello,
> > > >> > > As suggested, we tried profiling the application using Intel VTune Amplifier. We aren't sure how to use these results, so we are attaching them to this email.
> > > >> > >
> > > >> > > The things we understood were 'Top Hotspots' and 'Effective CPU Utilization'.
> > > >> > > Following are some of our understandings:
> > > >> > >
> > > >> > > Top Hotspots
> > > >> > >
> > > >> > > Function                          Module                              CPU Time
> > > >> > > rte_delay_us_block                librte_eal.so.6.1                   15.042s
> > > >> > > eth_em_recv_pkts                  librte_pmd_e1000.so                  9.544s
> > > >> > > ns3::DpdkNetDevice::Read          libns3.28.1-fd-net-device-debug.so   3.522s
> > > >> > > ns3::DpdkNetDeviceReader::DoRead  libns3.28.1-fd-net-device-debug.so   2.470s
> > > >> > > rte_eth_rx_burst                  libns3.28.1-fd-net-device-debug.so   2.456s
> > > >> > > [Others]                                                               6.656s
> > > >> > >
> > > >> > > We knew about the other methods, but not `rte_delay_us_block`. So we investigated the callers of this method:
> > > >> > >
> > > >> > > Callers                               Effective Time  Spin Time  Overhead Time  Effective Time  Spin Time  Overhead Time  Wait Time: Total  Wait Time: Self
> > > >> > > e1000_enable_ulp_lpt_lp               45.6%           0.0%       0.0%           6.860s          0usec      0usec
> > > >> > > e1000_write_phy_reg_mdic              32.7%           0.0%       0.0%           4.916s          0usec      0usec
> > > >> > > e1000_read_phy_reg_mdic               19.4%           0.0%       0.0%           2.922s          0usec      0usec
> > > >> > > e1000_reset_hw_ich8lan                 1.0%           0.0%       0.0%           0.143s          0usec      0usec
> > > >> > > eth_em_link_update                     0.7%           0.0%       0.0%           0.100s          0usec      0usec
> > > >> > > e1000_post_phy_reset_ich8lan.part.18   0.4%           0.0%       0.0%           0.064s          0usec      0usec
> > > >> > > e1000_get_cfg_done_generic             0.2%           0.0%       0.0%           0.037s          0usec      0usec
> > > >> > >
> > > >> > > We lack sufficient knowledge to investigate further than this.
> > > >> > >
> > > >> > > Effective CPU Utilization
> > > >> > >
> > > >> > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of 4 logical CPUs). We thought this was low, so we compared it with the raw-socket version of the code, whose utilization was even lower, 8.0% (0.318 out of 4 logical CPUs), and yet it performs much better.
> > > >> > >
> > > >> > > It would be helpful if you could give us some insights on how to use these results, or point us to some resources on doing so.
> > > >> > >
> > > >> > > Thank you
> > > >> >
> > > >> > BTW, I was able to build ns3 with DPDK 18.11. It required a couple of changes in the DPDK init code in ns3, plus one hack in the rte_mbuf.h file.
> > > >> >
> > > >> > I did have a problem including the rte_mbuf.h file in your code. It appears the g++ compiler did not like referencing struct rte_mbuf_sched inside the rte_mbuf structure. The rte_mbuf_sched was inside the big union; as a hack I moved the struct outside of the rte_mbuf structure and replaced the struct in the union with 'struct rte_mbuf_sched sched;'. But I am guessing you are missing some compiler options in your build system, as DPDK builds just fine without that hack.
> > > >> >
> > > >> > The next place was the rxmode and the txq_flags. The rxmode structure has changed, so I commented out the inits in ns3 and then commented out the txq_flags init code, as these are now the defaults.
> > > >> >
> > > >> > Regards,
> > > >> > Keith
> > > >>
> > > >> Regards,
> > > >> Keith
> > > >>
> > > >> <Ssthresh.png>
> > > >> <Cwnd.png>
> > >
> > > Regards,
> > > Keith
> >
> > Regards,
> > Keith
>
> Regards,
> Keith

^ permalink raw reply	[flat|nested] 43+ messages in thread
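The optimisations Harsh describes earlier in the thread (dropping the Tx buffering timeout in favour of sending each packet immediately, and pinning the Rx poll loop to a dedicated lcore launched via DPDK's remote-launch API) can be sketched in plain DPDK terms as below. This is a minimal illustration, not the actual ns-3 code; PORT_ID, BURST_SZ, and the function names are assumptions.

```c
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define PORT_ID  0   /* hypothetical: the single NIC port in use */
#define BURST_SZ 64  /* hypothetical Rx burst size */

/* Tx path: no buffering timeout; transmit each packet as soon as the
 * other environment hands it over, freeing it if the NIC queue is full. */
static void tx_one(struct rte_mbuf *pkt)
{
    if (rte_eth_tx_burst(PORT_ID, 0, &pkt, 1) == 0)
        rte_pktmbuf_free(pkt); /* Tx ring full: drop rather than stall */
}

/* Rx path: a polling loop that runs forever on its own lcore and hands
 * received packets to the other environment one by one. */
static int rx_core_loop(void *arg)
{
    struct rte_mbuf *pkts[BURST_SZ];
    (void)arg;
    for (;;) {
        uint16_t n = rte_eth_rx_burst(PORT_ID, 0, pkts, BURST_SZ);
        for (uint16_t i = 0; i < n; i++) {
            /* ... deliver pkts[i] sequentially to the other environment ... */
            rte_pktmbuf_free(pkts[i]);
        }
    }
    return 0; /* never reached */
}

/* From the main lcore, after rte_eal_init() and port setup, the Rx loop
 * would be pinned to a dedicated worker core with something like:
 *     rte_eal_remote_launch(rx_core_loop, NULL, rx_lcore_id);
 */
```

In the real device, tx_one would be invoked from the ns-3 write path; it is shown here only to contrast the immediate-send approach with the earlier timeout-based buffering.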
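Keith's advice about attaching VTune to an already-running process (so that one-time init work such as the e1000_enable_ulp_lpt_lp delays stays out of the steady-state profile) corresponds roughly to the following command line. The binary name is hypothetical, exact option spelling varies by VTune release, and older releases use amplxe-cl rather than vtune.

```shell
# Start the application on its own first.
./ns3-dpdk-sim &
APP_PID=$!

# Let initialization finish, then attach the hotspots collector to the
# running process instead of launching the application under VTune.
sleep 30
vtune -collect hotspots -target-pid "$APP_PID"
```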
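The rxmode/txq_flags change Keith mentions for DPDK 18.11 amounts to port configuration along the following lines: the old rxmode bit-fields and the per-queue txq_flags are gone, and leaving the offloads masks at zero gives the defaults. This is a sketch against the 18.11 API, not the actual ns-3 init code.

```c
#include <rte_ethdev.h>
#include <rte_ether.h>

/* DPDK 18.11-style port configuration: offloads are expressed through
 * the rxmode.offloads/txmode.offloads bitmasks, and txq_flags in
 * struct rte_eth_txconf no longer exists. */
static const struct rte_eth_conf port_conf = {
    .rxmode = {
        .max_rx_pkt_len = ETHER_MAX_LEN,
        /* .offloads = 0  => default Rx behaviour; no old bit-field inits */
    },
    .txmode = {
        /* .offloads = 0  => defaults; the old txq_flags init is dropped */
    },
};

/* Applied as:  rte_eth_dev_configure(port_id, 1, 1, &port_conf);  */
```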
end of thread, other threads:[~2019-02-05 14:33 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-08  8:24 [dpdk-users] Query on handling packets Harsh Patel
2018-11-08  8:56 ` Wiles, Keith
2018-11-08 16:58 ` Harsh Patel
2018-11-08 17:43 ` Wiles, Keith
2018-11-09 10:09 ` Harsh Patel
2018-11-09 21:26 ` Wiles, Keith
2018-11-10  6:17 ` Wiles, Keith
2018-11-11 19:45 ` Harsh Patel
2018-11-13  2:25 ` Harsh Patel
2018-11-13 13:47 ` Wiles, Keith
2018-11-14 13:54 ` Harsh Patel
2018-11-14 15:02 ` Wiles, Keith
2018-11-14 15:04 ` Wiles, Keith
2018-11-14 15:15 ` Wiles, Keith
2018-11-17 10:22 ` Harsh Patel
2018-11-17 22:05 ` Kyle Larose
2018-11-19 13:49 ` Wiles, Keith
2018-11-22 15:54 ` Harsh Patel
2018-11-24 15:43 ` Wiles, Keith
2018-11-24 15:48 ` Wiles, Keith
2018-11-24 16:01 ` Wiles, Keith
2018-11-25  4:35 ` Stephen Hemminger
2018-11-30  9:02 ` Harsh Patel
2018-11-30 10:24 ` Harsh Patel
2018-11-30 15:54 ` Wiles, Keith
2018-12-03  9:37 ` Harsh Patel
2018-12-14 17:41 ` Harsh Patel
2018-12-14 18:06 ` Wiles, Keith
[not found] ` <CAA0iYrHyLtO3XLXMq-aeVhgJhns0+ErfuhEeDSNDi4cFVBcZmw@mail.gmail.com>
2018-12-30  0:19 ` Wiles, Keith
2018-12-30  0:30 ` Wiles, Keith
2019-01-03 18:12 ` Harsh Patel
2019-01-03 22:43 ` Wiles, Keith
2019-01-04  5:57 ` Harsh Patel
2019-01-16 13:55 ` Harsh Patel
2019-01-30 23:36 ` Harsh Patel
2019-01-31 16:58 ` Wiles, Keith
2019-02-05  6:37 ` Harsh Patel
2019-02-05 13:03 ` Wiles, Keith
2019-02-05 14:00 ` Harsh Patel
2019-02-05 14:12 ` Wiles, Keith
2019-02-05 14:22 ` Harsh Patel
2019-02-05 14:27 ` Wiles, Keith
2019-02-05 14:33 ` Harsh Patel