From: "Sanford, Robert"
To: "dev@dpdk.org", "cunming.liang@intel.com", "konstantin.ananyev@intel.com"
Date: Tue, 13 Oct 2015 02:57:46 +0000
Subject: [dpdk-dev] IXGBE RX packet loss with 5+ cores

I'm hoping that someone (perhaps at Intel) can help us understand an
IXGBE RX packet loss issue we're able to reproduce with testpmd.

We run testpmd with various numbers of cores. We offer line-rate traffic
(~14.88 Mpps) to one ethernet port, and forward all received packets via
the second port. When we configure 1, 2, 3, or 4 cores (per port, with
the same number of RX queues per port), there is no RX packet loss. When
we configure 5 or more cores, we observe the following packet loss
(approximate):

5 cores - 3% loss
6 cores - 7% loss
7 cores - 11% loss
8 cores - 15% loss
9 cores - 18% loss

All of the "lost" packets are accounted for in the device's Rx Missed
Packets Count register (RXMPC[0]). Quoting the datasheet: "Packets are
missed when the receive FIFO has insufficient space to store the
incoming packet. This might be caused due to insufficient buffers
allocated, or because there is insufficient bandwidth on the IO bus."
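For reference, below is a minimal sketch (not part of our test setup;
the function name and the once-per-second polling loop are illustrative)
of how the missed-packet count can be watched from a DPDK 2.1
application: rte_eth_stats_get() fills in the imissed field, which the
ixgbe driver accumulates from the per-packet-buffer RXMPC registers.

/*
 * Illustrative only: watch the missed-packet counter via the generic
 * ethdev stats API.  For ixgbe, stats.imissed is built from RXMPC[n].
 */
#include <stdio.h>
#include <inttypes.h>

#include <rte_ethdev.h>
#include <rte_cycles.h>

static void
watch_rx_missed(uint8_t port_id)
{
    struct rte_eth_stats stats;
    uint64_t prev = 0;

    for (;;) {
        rte_eth_stats_get(port_id, &stats);
        if (stats.imissed != prev) {
            printf("port %u: imissed (RXMPC) grew by %" PRIu64 "\n",
                   (unsigned)port_id, stats.imissed - prev);
            prev = stats.imissed;
        }
        rte_delay_ms(1000);  /* sample once per second */
    }
}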
RXMPC, and our use of API rx_descriptor_done to verify that we don't run
out of mbufs (discussed below), lead us to theorize that packet loss
occurs because the device is unable to DMA all packets from its internal
packet buffer (512 KB, reported by register RXPBSIZE[0]) before overrun.

Questions
=========
1. The 82599 device supports up to 128 queues. Why do we see trouble
with as few as 5 queues? What could limit the system (and one port
controlled by 5+ cores) from receiving at line-rate without loss?

2. As far as we can tell, the RX path only touches the device registers
when it updates a Receive Descriptor Tail register (RDT[n]), roughly
every rx_free_thresh packets. Is there a big difference between one core
doing this and N cores doing it 1/N as often?

3. Do CPU reads/writes from/to device registers have a higher priority
than device reads/writes from/to memory? Could the former transactions
(CPU <-> device) significantly impede the latter (device <-> RAM)?

Thanks in advance for any help you can provide.

Testpmd Command Line
====================
Here is an example of how we run testpmd:

# socket 0 lcores: 0-7, 16-23
N_QUEUES=5
N_CORES=10

./testpmd -c 0x003e013e -n 2 \
  --pci-whitelist "01:00.0" --pci-whitelist "01:00.1" \
  --master-lcore 8 -- \
  --interactive --portmask=0x3 --numa --socket-num=0 --auto-start \
  --coremask=0x003e003e \
  --rxd=4096 --txd=4096 --rxfreet=512 --txfreet=512 \
  --burst=128 --mbcache=256 \
  --nb-cores=$N_CORES --rxq=$N_QUEUES --txq=$N_QUEUES

Test Machines
=============
* We performed most testing on a system with two E5-2640 v3 (Haswell
2.6 GHz 8 cores) CPUs, 64 GB 1866 MHz RAM, TYAN S7076 mobo.
* We obtained similar results on a system with two E5-2698 v3 (Haswell
2.3 GHz 16 cores) CPUs, 64 GB 2133 MHz RAM, Dell R730.
* DPDK 2.1.0, Linux 2.6.32-504.23.4

Intel 10GbE Adapters
====================
All ethernet adapters are 82599_SFP_SF2, vendor 8086, device 154D,
svendor 8086, sdevice 7B11.

Other Details and Ideas We Tried
================================
* Make sure that all cores, memory, and ethernet ports in use are on the
same NUMA socket.
* Modify testpmd to insert CPU delays in the forwarding loop, to target
some average number of RX packets that we reap per rx_pkt_burst (e.g.,
75% of burst).
* We configured the RSS redirection table such that all packets go to
one RX queue. In this case, there was NO packet loss (with any number of
RX cores), as the ethernet and core activity is very similar to using
only one RX core.
* When rx_pkt_burst returns a full burst, look at the subsequent RX
descriptors, using a binary search of calls to rx_descriptor_done, to
see whether the RX descriptor array is close to running out of new
buffers. The answer was: No, none of the RX queues has more than 100
additional packets "done" (when testing with 5+ cores). A rough sketch
of this probe appears after this list.
* Increase testpmd config params, e.g., --rxd, --rxfreet, --burst,
--mbcache, etc. These result in very small improvements, i.e., slight
reduction of packet loss.
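Here is that descriptor probe in rough form (the helper name and exact
structure are illustrative, not the code we actually added to testpmd).
It relies on rte_eth_rx_descriptor_done() taking an offset relative to
the next descriptor the driver will return, so the "done" descriptors
form a contiguous prefix that can be binary-searched.

/*
 * Illustrative sketch: estimate how many further descriptors are
 * already "done" (packets waiting in the ring) after a full burst.
 */
#include <rte_ethdev.h>

static uint16_t
rx_done_backlog(uint8_t port_id, uint16_t queue_id, uint16_t nb_rxd)
{
    uint16_t lo, hi;

    /* Offset 0 is the next packet rx_pkt_burst would return. */
    if (rte_eth_rx_descriptor_done(port_id, queue_id, 0) != 1)
        return 0;

    lo = 0;            /* highest offset known to be done */
    hi = nb_rxd - 1;   /* highest offset worth probing    */
    while (lo < hi) {
        uint16_t mid = lo + (hi - lo + 1) / 2;

        if (rte_eth_rx_descriptor_done(port_id, queue_id, mid) == 1)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo + 1;     /* descriptors already done, i.e. packets waiting */
}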
Other Observations
==================
* Some IXGBE RX/TX code paths do not follow (my interpretation of) the
documented semantics of the RX/TX packet burst APIs. For example, invoke
rx_pkt_burst with nb_pkts=64, and it returns 32, even when more RX
packets are available, because the code path is optimized to handle a
burst of 32. The same thing may be true in the tx_pkt_burst code path.
To allow us to run testpmd with --burst greater than 32, we worked
around these limitations by wrapping the calls to rx_pkt_burst and
tx_pkt_burst with do-whiles that continue while the RX/TX burst returns
32 and we have not yet satisfied the desired burst count (see the sketch
at the end of this message). The point here is that IXGBE's RX/TX packet
burst API behavior is misleading! The application developer should not
need to know that certain drivers or driver paths do not always complete
an entire burst, even though they could have.
* We naïvely believed that if a run-to-completion model uses too many
cycles per packet, we could just spread it over more cores. If there is
some inherent limitation to the number of cores that together can
receive line-rate with no loss, then we obviously need to change the s/w
architecture, e.g., have I/O cores distribute to worker cores.
* A similar problem was discussed here:
http://dpdk.org/ml/archives/dev/2014-January/001098.html

--
Regards, Robert Sanford
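P.S. For anyone wanting to try the burst workaround mentioned under
"Other Observations", this is roughly what the RX side of the do-while
wrapper looks like (the helper name and details are illustrative, not
the exact testpmd patch); the TX side is analogous around
rte_eth_tx_burst().

/*
 * Illustrative sketch: keep calling rte_eth_rx_burst() while it returns
 * exactly 32 (the cap of the optimized ixgbe paths) and the caller still
 * wants more packets.
 */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static uint16_t
rx_burst_full(uint8_t port_id, uint16_t queue_id,
              struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    uint16_t total = 0;
    uint16_t n;

    do {
        n = rte_eth_rx_burst(port_id, queue_id,
                             pkts + total, nb_pkts - total);
        total += n;
    } while (n == 32 && total < nb_pkts);

    return total;
}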