DPDK patches and discussions
* [dpdk-dev] IXGBE RX packet loss with 5+ cores
@ 2015-10-13  2:57 Sanford, Robert
  2015-10-13  5:18 ` Stephen Hemminger
  0 siblings, 1 reply; 8+ messages in thread
From: Sanford, Robert @ 2015-10-13  2:57 UTC (permalink / raw)
  To: dev, cunming.liang, konstantin.ananyev

I'm hoping that someone (perhaps at Intel) can help us understand
an IXGBE RX packet loss issue we're able to reproduce with testpmd.

We run testpmd with various numbers of cores. We offer line-rate
traffic (~14.88 Mpps) to one ethernet port, and forward all received
packets via the second port.

When we configure 1, 2, 3, or 4 cores (per port, with the same number of RX
queues per port), there is no RX packet loss. When we configure 5 or
more cores, we observe the following packet loss (approximate):
 5 cores - 3% loss
 6 cores - 7% loss
 7 cores - 11% loss
 8 cores - 15% loss
 9 cores - 18% loss

All of the "lost" packets are accounted for in the device's Rx Missed
Packets Count register (RXMPC[0]). Quoting the datasheet:
 "Packets are missed when the receive FIFO has insufficient space to
 store the incoming packet. This might be caused due to insufficient
 buffers allocated, or because there is insufficient bandwidth on the
 IO bus."

RXMPC, and our use of the rx_descriptor_done API to verify that we don't
run out of mbufs (discussed below), lead us to theorize that packet
loss occurs because the device is unable to DMA all packets from its
internal packet buffer (512 KB, reported by register RXPBSIZE[0])
before overrun.

Questions
=========
1. The 82599 device supports up to 128 queues. Why do we see trouble
with as few as 5 queues? What could limit the system (and one port
controlled by 5+ cores) from receiving at line-rate without loss?

2. As far as we can tell, the RX path only touches the device
registers when it updates a Receive Descriptor Tail register (RDT[n]),
roughly every rx_free_thresh packets. Is there a big difference
between one core doing this and N cores doing it 1/N as often?

3. Do CPU reads/writes from/to device registers have a higher priority
than device reads/writes from/to memory? Could the former transactions
(CPU <-> device) significantly impede the latter (device <-> RAM)?
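
For reference, the tail-update pattern we refer to in question 2 looks
roughly like this (a simplified sketch based on our reading of the ixgbe
PMD; field and macro names follow the driver loosely and are illustrative):

  /* Sketch: the only RX-path device-register access is one posted MMIO
   * write of the RDT tail register, roughly every rx_free_thresh packets. */
  static inline void
  rx_update_tail(struct igb_rx_queue *rxq, uint16_t rx_id, uint16_t nb_rx)
  {
          uint16_t nb_hold = rxq->nb_rx_hold + nb_rx;

          if (nb_hold > rxq->rx_free_thresh) {
                  /* point the tail at the last refilled descriptor */
                  rx_id = (rx_id == 0) ? (rxq->nb_rx_desc - 1) : (rx_id - 1);
                  IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
                  nb_hold = 0;
          }
          rxq->nb_rx_hold = nb_hold;
  }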

Thanks in advance for any help you can provide.



Testpmd Command Line
====================
Here is an example of how we run testpmd:

# socket 0 lcores: 0-7, 16-23
N_QUEUES=5
N_CORES=10

./testpmd -c 0x003e013e -n 2 \
 --pci-whitelist "01:00.0" --pci-whitelist "01:00.1" \
 --master-lcore 8 -- \
 --interactive --portmask=0x3 --numa --socket-num=0 --auto-start \
 --coremask=0x003e003e \
 --rxd=4096 --txd=4096 --rxfreet=512 --txfreet=512 \
 --burst=128 --mbcache=256 \
 --nb-cores=$N_CORES --rxq=$N_QUEUES --txq=$N_QUEUES


Test machines
=============
* We performed most testing on a system with two E5-2640 v3
(Haswell 2.6 GHz 8 cores) CPUs, 64 GB 1866 MHz RAM, TYAN S7076 mobo.
* We obtained similar results on a system with two E5-2698 v3
(Haswell 2.3 GHz 16 cores) CPUs, 64 GB 2133 MHz RAM, Dell R730.
* DPDK 2.1.0, Linux 2.6.32-504.23.4

Intel 10GbE adapters
====================
All ethernet adapters are 82599_SFP_SF2, vendor 8086, device 154D,
svendor 8086, sdevice 7B11.


Other Details and Ideas we tried
================================
* Make sure that all cores, memory, and ethernet ports in use are on
the same NUMA socket.

* Modify testpmd to insert CPU delays in the forwarding loop, to
target some average number of RX packets that we reap per rx_pkt_burst
(e.g., 75% of burst).

* We configured the RSS redirection table such that all packets go to
one RX queue (see the first sketch after this list). In this case, there
was NO packet loss (with any number of RX cores), as the ethernet and
core activity is very similar to using only one RX core.

* When rx_pkt_burst returns a full burst, look at the subsequent RX
descriptors, using a binary search of calls to rx_descriptor_done (see
the second sketch after this list), to see whether the RX desc array is
close to running out of new buffers. The answer was: No, none of the RX
queues had more than 100 additional packets "done" (when testing with
5+ cores).

* Increase testpmd config params, e.g., --rxd, --rxfreet, --burst,
--mbcache, etc. These result in very small improvements, i.e., slight
reduction of packet loss.
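
Sketches of the RETA and rx_descriptor_done ideas above (illustrative
only; error handling omitted, and the helper names are ours). First,
steering the whole redirection table to queue 0 - the 82599 RETA has
128 entries:

  #include <string.h>
  #include <rte_ethdev.h>

  static int
  rss_all_to_queue0(uint8_t port_id)
  {
          /* 128 entries / RTE_RETA_GROUP_SIZE (64) = 2 groups */
          struct rte_eth_rss_reta_entry64 reta[2];

          memset(reta, 0, sizeof(reta));    /* entry 0 -> RX queue 0 */
          reta[0].mask = reta[1].mask = UINT64_MAX;
          return rte_eth_dev_rss_reta_update(port_id, reta, 128);
  }

Second, the probe of subsequent descriptors, assuming they complete in
order so the "done" region is contiguous:

  /* Binary search for the highest descriptor offset (relative to the
   * next packet to be received) already marked "done" by the NIC. */
  static uint16_t
  count_done(uint8_t port, uint16_t queue, uint16_t max_probe)
  {
          uint16_t lo = 0, hi = max_probe;

          while (lo < hi) {
                  uint16_t mid = (lo + hi + 1) / 2;
                  if (rte_eth_rx_descriptor_done(port, queue, mid) == 1)
                          lo = mid;       /* done at mid: look higher */
                  else
                          hi = mid - 1;   /* not done: look lower */
          }
          return lo;
  }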


Other Observations
==================
* Some IXGBE RX/TX code paths do not follow (my interpretation of) the
documented semantics of the rx/tx packet burst APIs. For example,
invoke rx_pkt_burst with nb_pkts=64, and it returns 32, even when more
RX packets are available, because the code path is optimized to handle
a burst of 32. The same thing may be true in the tx_pkt_burst code
path.

To allow us to run testpmd with --burst greater than 32, we worked
around these limitations by wrapping the calls to rx_pkt_burst and
tx_pkt_burst with do-whiles that continue while rx/tx burst returns
32 and we have not yet satisfied the desired burst count (see the
sketch at the end of this item).

The point here is that IXGBE's rx/tx packet burst API behavior is
misleading! The application developer should not need to know that
certain drivers or driver paths do not always complete an entire
burst, even though they could have.
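
The RX-side wrapper looks roughly like this (a sketch; the TX side is
analogous with rte_eth_tx_burst):

  /* Keep calling the burst API while it returns a full internal burst
   * (32) and the caller still wants more packets. */
  static uint16_t
  rx_burst_loop(uint8_t port, uint16_t queue,
                struct rte_mbuf **pkts, uint16_t nb_pkts)
  {
          uint16_t nb_rx = 0, n;

          do {
                  n = rte_eth_rx_burst(port, queue,
                                       pkts + nb_rx, nb_pkts - nb_rx);
                  nb_rx += n;
          } while (n == 32 && nb_rx < nb_pkts);

          return nb_rx;
  }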

* We naïvely believed that if a run-to-completion model uses too
many cycles per packet, we could just spread it over more cores.
If there is some inherent limitation to the number of cores that
together can receive line-rate with no loss, then we obviously need
to change the s/w architecture, e.g., have i/o cores distribute to
worker cores.

* A similar problem was discussed here:
http://dpdk.org/ml/archives/dev/2014-January/001098.html



--
Regards,
Robert Sanford


* Re: [dpdk-dev] IXGBE RX packet loss with 5+ cores
  2015-10-13  2:57 [dpdk-dev] IXGBE RX packet loss with 5+ cores Sanford, Robert
@ 2015-10-13  5:18 ` Stephen Hemminger
  2015-10-13 13:59   ` Bruce Richardson
  0 siblings, 1 reply; 8+ messages in thread
From: Stephen Hemminger @ 2015-10-13  5:18 UTC (permalink / raw)
  To: Sanford, Robert; +Cc: dev

On Tue, 13 Oct 2015 02:57:46 +0000
"Sanford, Robert" <rsanford@akamai.com> wrote:

> I'm hoping that someone (perhaps at Intel) can help us understand
> an IXGBE RX packet loss issue we're able to reproduce with testpmd.
> 
> We run testpmd with various numbers of cores. We offer line-rate
> traffic (~14.88 Mpps) to one ethernet port, and forward all received
> packets via the second port.
> 
> When we configure 1, 2, 3, or 4 cores (per port, with the same number of RX
> queues per port), there is no RX packet loss. When we configure 5 or
> more cores, we observe the following packet loss (approximate):
>  5 cores - 3% loss
>  6 cores - 7% loss
>  7 cores - 11% loss
>  8 cores - 15% loss
>  9 cores - 18% loss
> 
> All of the "lost" packets are accounted for in the device's Rx Missed
> Packets Count register (RXMPC[0]). Quoting the datasheet:
>  "Packets are missed when the receive FIFO has insufficient space to
>  store the incoming packet. This might be caused due to insufficient
>  buffers allocated, or because there is insufficient bandwidth on the
>  IO bus."
> 
> RXMPC, and our use of the rx_descriptor_done API to verify that we don't
> run out of mbufs (discussed below), lead us to theorize that packet
> loss occurs because the device is unable to DMA all packets from its
> internal packet buffer (512 KB, reported by register RXPBSIZE[0])
> before overrun.
> 
> Questions
> =========
> 1. The 82599 device supports up to 128 queues. Why do we see trouble
> with as few as 5 queues? What could limit the system (and one port
> controlled by 5+ cores) from receiving at line-rate without loss?
> 
> 2. As far as we can tell, the RX path only touches the device
> registers when it updates a Receive Descriptor Tail register (RDT[n]),
> roughly every rx_free_thresh packets. Is there a big difference
> between one core doing this and N cores doing it 1/N as often?
> 
> 3. Do CPU reads/writes from/to device registers have a higher priority
> than device reads/writes from/to memory? Could the former transactions
> (CPU <-> device) significantly impede the latter (device <-> RAM)?
> 
> Thanks in advance for any help you can provide.

As you add cores, there is more traffic on the PCI bus from each core
polling. There is a fixed number of PCI bus transactions per second possible.
Each core is increasing the number of useless (empty) transactions.
Why do you think adding more cores will help?


* Re: [dpdk-dev] IXGBE RX packet loss with 5+ cores
  2015-10-13  5:18 ` Stephen Hemminger
@ 2015-10-13 13:59   ` Bruce Richardson
  2015-10-13 14:47     ` Sanford, Robert
  0 siblings, 1 reply; 8+ messages in thread
From: Bruce Richardson @ 2015-10-13 13:59 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Mon, Oct 12, 2015 at 10:18:30PM -0700, Stephen Hemminger wrote:
> On Tue, 13 Oct 2015 02:57:46 +0000
> "Sanford, Robert" <rsanford@akamai.com> wrote:
> 
> > I'm hoping that someone (perhaps at Intel) can help us understand
> > an IXGBE RX packet loss issue we're able to reproduce with testpmd.
> > 
> > We run testpmd with various numbers of cores. We offer line-rate
> > traffic (~14.88 Mpps) to one ethernet port, and forward all received
> > packets via the second port.
> > 
> > When we configure 1, 2, 3, or 4 cores (per port, with the same number of RX
> > queues per port), there is no RX packet loss. When we configure 5 or
> > more cores, we observe the following packet loss (approximate):
> >  5 cores - 3% loss
> >  6 cores - 7% loss
> >  7 cores - 11% loss
> >  8 cores - 15% loss
> >  9 cores - 18% loss
> > 
> > All of the "lost" packets are accounted for in the device's Rx Missed
> > Packets Count register (RXMPC[0]). Quoting the datasheet:
> >  "Packets are missed when the receive FIFO has insufficient space to
> >  store the incoming packet. This might be caused due to insufficient
> >  buffers allocated, or because there is insufficient bandwidth on the
> >  IO bus."
> > 
> > RXMPC, and our use of the rx_descriptor_done API to verify that we don't
> > run out of mbufs (discussed below), lead us to theorize that packet
> > loss occurs because the device is unable to DMA all packets from its
> > internal packet buffer (512 KB, reported by register RXPBSIZE[0])
> > before overrun.
> > 
> > Questions
> > =========
> > 1. The 82599 device supports up to 128 queues. Why do we see trouble
> > with as few as 5 queues? What could limit the system (and one port
> > controlled by 5+ cores) from receiving at line-rate without loss?
> > 
> > 2. As far as we can tell, the RX path only touches the device
> > registers when it updates a Receive Descriptor Tail register (RDT[n]),
> > roughly every rx_free_thresh packets. Is there a big difference
> > between one core doing this and N cores doing it 1/N as often?
> > 
> > 3. Do CPU reads/writes from/to device registers have a higher priority
> > than device reads/writes from/to memory? Could the former transactions
> > (CPU <-> device) significantly impede the latter (device <-> RAM)?
> > 
> > Thanks in advance for any help you can provide.
> 
> As you add cores, there is more traffic on the PCI bus from each core
> polling. There is a fixed number of PCI bus transactions per second possible.
> Each core is increasing the number of useless (empty) transactions.
> Why do you think adding more cores will help?
>
The polling for packets by the core should not be using PCI bandwidth directly,
as the ixgbe driver (and other drivers) check for the DD bit being set on the
descriptor in memory/cache. However, using an increased number of queues can
use PCI bandwidth in other ways, for instance, with more queues you reduce the
amount of descriptor coalescing that can be done by the NICs, so that instead of
having a single transaction of 4 descriptors to one queue, the NIC may instead
have to do 4 transactions each writing 1 descriptor to 4 different queues. This
is possibly why sending all traffic to a single queue works ok - the polling on
the other queues is still being done, but has little effect.
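
To illustrate - an empty poll is just a read of the next descriptor's
status word in host memory, along these lines (simplified sketch, not
the exact driver code):

  /* The descriptor ring is in host memory, so polling an empty queue
   * only reads the DD (descriptor done) bit from memory/cache; no PCI
   * transaction is generated by the poll itself. */
  volatile union ixgbe_adv_rx_desc *rxdp = &rx_ring[rx_id];

  if (!(rxdp->wb.upper.status_error & IXGBE_RXDADV_STAT_DD))
          return 0;   /* nothing new - no device-register access */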

Regards,
/Bruce


* Re: [dpdk-dev] IXGBE RX packet loss with 5+ cores
  2015-10-13 13:59   ` Bruce Richardson
@ 2015-10-13 14:47     ` Sanford, Robert
  2015-10-13 15:34       ` Venkatesan, Venky
  2015-10-13 20:24       ` Alexander Duyck
  0 siblings, 2 replies; 8+ messages in thread
From: Sanford, Robert @ 2015-10-13 14:47 UTC (permalink / raw)
  To: Bruce Richardson, Stephen Hemminger, dev


>>> [Robert:]
>>> 1. The 82599 device supports up to 128 queues. Why do we see trouble
>>> with as few as 5 queues? What could limit the system (and one port
>>> controlled by 5+ cores) from receiving at line-rate without loss?
>>>
>>> 2. As far as we can tell, the RX path only touches the device
>>> registers when it updates a Receive Descriptor Tail register (RDT[n]),
>>> roughly every rx_free_thresh packets. Is there a big difference
>>> between one core doing this and N cores doing it 1/N as often?

>>[Stephen:]
>>As you add cores, there is more traffic on the PCI bus from each core
>>polling. There is a fixed number of PCI bus transactions per second
>>possible.
>>Each core is increasing the number of useless (empty) transactions.

>[Bruce:]
>The polling for packets by the core should not be using PCI bandwidth
>directly,
>as the ixgbe driver (and other drivers) check for the DD bit being set on
>the
>descriptor in memory/cache.

I was preparing to reply with the same point.

>>[Stephen:] Why do you think adding more cores will help?

We're using run-to-completion and sometimes spend too many cycles per pkt.
We realize that we need to move to an io+workers model, but wanted a better
understanding of the dynamics involved here.



>[Bruce:] However, using an increased number of queues can
>use PCI bandwidth in other ways, for instance, with more queues you
>reduce the
>amount of descriptor coalescing that can be done by the NICs, so that
>instead of
>having a single transaction of 4 descriptors to one queue, the NIC may
>instead
>have to do 4 transactions each writing 1 descriptor to 4 different
>queues. This
>is possibly why sending all traffic to a single queue works ok - the
>polling on
>the other queues is still being done, but has little effect.

Brilliant! This idea did not occur to me.



--
Thanks guys,
Robert


* Re: [dpdk-dev] IXGBE RX packet loss with 5+ cores
  2015-10-13 14:47     ` Sanford, Robert
@ 2015-10-13 15:34       ` Venkatesan, Venky
  2018-11-01  6:42         ` Saber Rezvani
  2015-10-13 20:24       ` Alexander Duyck
  1 sibling, 1 reply; 8+ messages in thread
From: Venkatesan, Venky @ 2015-10-13 15:34 UTC (permalink / raw)
  To: dev



On 10/13/2015 7:47 AM, Sanford, Robert wrote:
>>>> [Robert:]
>>>> 1. The 82599 device supports up to 128 queues. Why do we see trouble
>>>> with as few as 5 queues? What could limit the system (and one port
>>>> controlled by 5+ cores) from receiving at line-rate without loss?
>>>>
>>>> 2. As far as we can tell, the RX path only touches the device
>>>> registers when it updates a Receive Descriptor Tail register (RDT[n]),
>>>> roughly every rx_free_thresh packets. Is there a big difference
>>>> between one core doing this and N cores doing it 1/N as often?
>>> [Stephen:]
>>> As you add cores, there is more traffic on the PCI bus from each core
>>> polling. There is a fixed number of PCI bus transactions per second
>>> possible.
>>> Each core is increasing the number of useless (empty) transactions.
>> [Bruce:]
>> The polling for packets by the core should not be using PCI bandwidth
>> directly,
>> as the ixgbe driver (and other drivers) check for the DD bit being set on
>> the
>> descriptor in memory/cache.
> I was preparing to reply with the same point.
>
>>> [Stephen:] Why do you think adding more cores will help?
> We're using run-to-completion and sometimes spend too many cycles per pkt.
> We realize that we need to move to an io+workers model, but wanted a better
> understanding of the dynamics involved here.
>
>> [Bruce:] However, using an increased number of queues can
>> use PCI bandwidth in other ways, for instance, with more queues you
>> reduce the
>> amount of descriptor coalescing that can be done by the NICs, so that
>> instead of
>> having a single transaction of 4 descriptors to one queue, the NIC may
>> instead
>> have to do 4 transactions each writing 1 descriptor to 4 different
>> queues. This
>> is possibly why sending all traffic to a single queue works ok - the
>> polling on
>> the other queues is still being done, but has little effect.
> Brilliant! This idea did not occur to me.
To add a little more detail - this ends up being both a bandwidth and a 
transaction bottleneck. Not only do you add an increased transaction 
count, you also add a huge amount of bandwidth overhead (each 16-byte 
descriptor is preceded by a PCI-E TLP which is about the same size). So 
what ends up happening in the case where the incoming packets are 
bifurcated to different queues (1 per queue) is that you have 2x the 
number of transactions (1 for the packet and one for the descriptor) and 
then we essentially double the bandwidth used because you now have the 
TLP overhead per descriptor write.
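
As a back-of-the-envelope illustration (assuming ~24 bytes of TLP
header/framing per posted write - the exact figure depends on the PCI-E
configuration):

  coalesced:  1 write x (24 + 4*16) =  88 bytes for 4 descriptors (~22 B/desc)
  bifurcated: 4 writes x (24 + 16)  = 160 bytes for 4 descriptors (~40 B/desc)

So the descriptor write-back uses roughly twice the wire bytes per
descriptor, and takes four transactions where one would have sufficed.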

There is a second issue that also pops up when coalescing breaks down - 
testpmd in iofwd mode essentially transmits the number of packets it 
receives (i.e. Rx (n) -> Tx (n)). This means that the transmit side also 
suffers from writing one descriptor at a time for output (i.e. when the 
NIC pulls a descriptor cache line to transmit, it finds 1 valid 
descriptor). When a second descriptor is transmitted on the same cache 
line, the NIC will again pull the line and find only one valid 
descriptor. That is another 2x increase in transaction count as well as 
PCI-E TLP overhead.

The third hit actually comes from the transmit side when transmitting 
one packet at a time. The last part of the transmit process is an MMIO 
write to the tail pointer. This is a costly operation in terms of cycles 
(since it is an un-cacheable memory operation), and it again adds heavy 
PCI-E overhead (a TLP for a 4-byte write) and an increased transaction 
count on PCI-E.

Hope that explains all the touch-points behind the drop-off in 
performance you see.
>
>
>
> --
> Thanks guys,
> Robert
>


* Re: [dpdk-dev] IXGBE RX packet loss with 5+ cores
  2015-10-13 14:47     ` Sanford, Robert
  2015-10-13 15:34       ` Venkatesan, Venky
@ 2015-10-13 20:24       ` Alexander Duyck
  2015-10-14  9:29         ` Bruce Richardson
  1 sibling, 1 reply; 8+ messages in thread
From: Alexander Duyck @ 2015-10-13 20:24 UTC (permalink / raw)
  To: Sanford, Robert, Bruce Richardson, Stephen Hemminger, dev

On 10/13/2015 07:47 AM, Sanford, Robert wrote:
>>>> [Robert:]
>>>> 1. The 82599 device supports up to 128 queues. Why do we see trouble
>>>> with as few as 5 queues? What could limit the system (and one port
>>>> controlled by 5+ cores) from receiving at line-rate without loss?
>>>>
>>>> 2. As far as we can tell, the RX path only touches the device
>>>> registers when it updates a Receive Descriptor Tail register (RDT[n]),
>>>> roughly every rx_free_thresh packets. Is there a big difference
>>>> between one core doing this and N cores doing it 1/N as often?
>>> [Stephen:]
>>> As you add cores, there is more traffic on the PCI bus from each core
>>> polling. There is a fixed number of PCI bus transactions per second
>>> possible.
>>> Each core is increasing the number of useless (empty) transactions.
>> [Bruce:]
>> The polling for packets by the core should not be using PCI bandwidth
>> directly,
>> as the ixgbe driver (and other drivers) check for the DD bit being set on
>> the
>> descriptor in memory/cache.
> I was preparing to reply with the same point.
>
>>> [Stephen:] Why do you think adding more cores will help?
> We're using run-to-completion and sometimes spend too many cycles per pkt.
> We realize that we need to move to an io+workers model, but wanted a better
> understanding of the dynamics involved here.
>
>
>
>> [Bruce:] However, using an increased number of queues can
>> use PCI bandwidth in other ways, for instance, with more queues you
>> reduce the
>> amount of descriptor coalescing that can be done by the NICs, so that
>> instead of
>> having a single transaction of 4 descriptors to one queue, the NIC may
>> instead
>> have to do 4 transactions each writing 1 descriptor to 4 different
>> queues. This
>> is possibly why sending all traffic to a single queue works ok - the
>> polling on
>> the other queues is still being done, but has little effect.
> Brilliant! This idea did not occur to me.

You can actually make the throughput regression disappear by altering 
the traffic pattern you are testing with.  In the past I have found that 
sending traffic in bursts where 4 frames belong to the same queue before 
moving to the next one essentially eliminated the dropped packets due to 
PCIe bandwidth limitations.  The trick is you need to have the Rx 
descriptor processing work in batches so that you can get multiple 
descriptors processed for each PCIe read/write.

- Alex


* Re: [dpdk-dev] IXGBE RX packet loss with 5+ cores
  2015-10-13 20:24       ` Alexander Duyck
@ 2015-10-14  9:29         ` Bruce Richardson
  0 siblings, 0 replies; 8+ messages in thread
From: Bruce Richardson @ 2015-10-14  9:29 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: dev

On Tue, Oct 13, 2015 at 01:24:22PM -0700, Alexander Duyck wrote:
> On 10/13/2015 07:47 AM, Sanford, Robert wrote:
> >>>>[Robert:]
> >>>>1. The 82599 device supports up to 128 queues. Why do we see trouble
> >>>>with as few as 5 queues? What could limit the system (and one port
> >>>>controlled by 5+ cores) from receiving at line-rate without loss?
> >>>>
> >>>>2. As far as we can tell, the RX path only touches the device
> >>>>registers when it updates a Receive Descriptor Tail register (RDT[n]),
> >>>>roughly every rx_free_thresh packets. Is there a big difference
> >>>>between one core doing this and N cores doing it 1/N as often?
> >>>[Stephen:]
> >>>As you add cores, there is more traffic on the PCI bus from each core
> >>>polling. There is a fixed number of PCI bus transactions per second
> >>>possible.
> >>>Each core is increasing the number of useless (empty) transactions.
> >>[Bruce:]
> >>The polling for packets by the core should not be using PCI bandwidth
> >>directly,
> >>as the ixgbe driver (and other drivers) check for the DD bit being set on
> >>the
> >>descriptor in memory/cache.
> >I was preparing to reply with the same point.
> >
> >>>[Stephen:] Why do you think adding more cores will help?
> >We're using run-to-completion and sometimes spend too many cycles per pkt.
> >We realize that we need to move to an io+workers model, but wanted a better
> >understanding of the dynamics involved here.
> >
> >
> >
> >>[Bruce:] However, using an increased number of queues can
> >>use PCI bandwidth in other ways, for instance, with more queues you
> >>reduce the
> >>amount of descriptor coalescing that can be done by the NICs, so that
> >>instead of
> >>having a single transaction of 4 descriptors to one queue, the NIC may
> >>instead
> >>have to do 4 transactions each writing 1 descriptor to 4 different
> >>queues. This
> >>is possibly why sending all traffic to a single queue works ok - the
> >>polling on
> >>the other queues is still being done, but has little effect.
> >Brilliant! This idea did not occur to me.
> 
> You can actually make the throughput regression disappear by altering the
> traffic pattern you are testing with.  In the past I have found that sending
> traffic in bursts where 4 frames belong to the same queue before moving to
> the next one essentially eliminated the dropped packets due to PCIe
> bandwidth limitations.  The trick is you need to have the Rx descriptor
> processing work in batches so that you can get multiple descriptors
> processed for each PCIe read/write.
>
Yep, that's one test we used to prove the effect on descriptor coalescing, and
it does work a treat! Unfortunately, I think controlling real-world input
traffic that way could be, ... em ... challenging? :-)

/Bruce


* Re: [dpdk-dev] IXGBE RX packet loss with 5+ cores
  2015-10-13 15:34       ` Venkatesan, Venky
@ 2018-11-01  6:42         ` Saber Rezvani
  0 siblings, 0 replies; 8+ messages in thread
From: Saber Rezvani @ 2018-11-01  6:42 UTC (permalink / raw)
  To: dev

Regarding the problem discussed in this thread -->

http://mails.dpdk.org/archives/dev/2015-October/024966.html


Venkatesan, Venky gave the following reason:

To add a little more detail - this ends up being both a bandwidth and a
transaction bottleneck. Not only do you add an increased transaction
count, you also add a huge amount of bandwidth overhead (each 16-byte
descriptor is preceded by a PCI-E TLP which is about the same size). So
what ends up happening in the case where the incoming packets are
bifurcated to different queues (1 per queue) is that you have 2x the
number of transactions (1 for the packet and one for the descriptor) and
then we essentially double the bandwidth used because you now have the
TLP overhead per descriptor write.


But I couldn't figure out why we have a bandwidth and a transaction bottleneck.


Can anyone help me?


Best regards,

Saber


Thread overview: 8 messages
2015-10-13  2:57 [dpdk-dev] IXGBE RX packet loss with 5+ cores Sanford, Robert
2015-10-13  5:18 ` Stephen Hemminger
2015-10-13 13:59   ` Bruce Richardson
2015-10-13 14:47     ` Sanford, Robert
2015-10-13 15:34       ` Venkatesan, Venky
2018-11-01  6:42         ` Saber Rezvani
2015-10-13 20:24       ` Alexander Duyck
2015-10-14  9:29         ` Bruce Richardson
