From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <71CBA720-633D-4CFE-805C-606DAAEDD356@intel.com> <3C60E59D-36AD-4382-8CC3-89D4EEB0140D@intel.com>
<76959924-D9DB-4C58-BB05-E33107AD98AC@intel.com> <485F0372-7486-473B-ACDA-F42A2D86EF03@intel.com> <34E92C48-A90C-472C-A915-AAA4A6B5CDE8@intel.com> <20181124203541.4aa9bbf2@xeon-e3>
In-Reply-To: <20181124203541.4aa9bbf2@xeon-e3>
From: Harsh Patel
Date: Fri, 30 Nov 2018 14:32:11 +0530
To: stephen@networkplumber.org
Cc: "Wiles, Keith", Kyle Larose, users@dpdk.org
Subject: Re: [dpdk-users] Query on handling packets
List-Id: DPDK usage discussions

Hello,

Sorry for the long delay; we were busy with some exams.

*1) About the NUMA sockets*

This is the result of the command you mentioned:

======================================================================
Core and Socket Information (as reported by '/sys/devices/system/cpu')
======================================================================

cores = [0, 1, 2, 3]
sockets = [0]

       Socket 0
       --------
Core 0 [0]
Core 1 [1]
Core 2 [2]
Core 3 [3]

We don't know much about this and would like your input on what else should be checked or what we need to do.

*2) The part where you asked for a graph*

We used `ps` to analyse which CPU cores are being utilized. The raw socket version had two logical threads, which used cores 0 and 1. The DPDK version had 6 logical threads, which also used cores 0 and 1. This is the case for which we showed you the results. As the previous case had 2 cores and was not giving the desired results, we tried giving more cores to see if the DPDK-in-ns-3 code could achieve the desired throughput and pps. (We thought giving more cores might improve the performance.) For this new case, we provided 4 total cores using EAL arguments, upon which it used cores 0-3. And still we got the same results as the ones sent earlier.
We think this means that the bottleneck is a different problem, unrelated to the number of cores for now. (This whole section is an answer to the question raised by Kyle in his last paragraph, for which Keith asked for a graph.)

*3) About updating the TX_TIMEOUT and storing rte_get_timer_hz()*

We have not tried this yet; we will try it today and send you the status shortly afterwards.

*4) For the suggestion by Stephen*

We are not clear on what you suggested, and it would be nice if you could elaborate on it.

Thanks and Regards,
Harsh and Hrishikesh

PS: We are done with our exams and will now be working on this regularly.

On Sun, 25 Nov 2018 at 10:05, Stephen Hemminger wrote:

> On Sat, 24 Nov 2018 16:01:04 +0000
> "Wiles, Keith" wrote:
>
> > > On Nov 22, 2018, at 9:54 AM, Harsh Patel wrote:
> > >
> > > Hi
> > >
> > > Thank you so much for the reply and for the solution.
> > >
> > > We used the given code. We were amazed by the pointer arithmetic you used, and got to learn something new.
> > >
> > > But we are still underperforming. The same bottleneck of ~2.5 Mbps is seen.
> > >
> > > We also checked whether the raw socket was using any extra (logical) cores compared to DPDK. We found that the raw socket has 2 logical threads running on 2 logical CPUs, whereas the DPDK version has 6 logical threads on 2 logical CPUs. We also ran the 6 threads on 4 logical CPUs, and still we see the same bottleneck.
> > >
> > > We have updated our code (you can use the same links from the previous mail). It would be helpful if you could help us find what causes the bottleneck.
> >
> > I looked at the code for a few seconds and noticed your TX_TIMEOUT is a macro that calls (rte_get_timer_hz()/2048). Just to be safe I would not call rte_get_timer_hz() each time, but grab the value once, store the hz locally, and use that variable instead.
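The caching Keith describes here can be sketched in plain C. This is an illustrative stand-alone sketch, not the actual ns-3 code: `stub_get_timer_hz()` stands in for DPDK's `rte_get_timer_hz()`, the 2.4 GHz value is made up, and `struct tx_state` is a hypothetical holder for the cached value.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for rte_get_timer_hz(); on real hardware this would return
 * the timer/TSC frequency (2.4 GHz here is an illustrative value). */
static uint64_t stub_get_timer_hz(void)
{
    return 2400000000ULL;
}

/* Macro form: re-evaluates the call and the divide on every use. */
#define TX_TIMEOUT (stub_get_timer_hz() / 2048)

/* Cached form: evaluate once at init time and reuse the stored cycles. */
struct tx_state {
    uint64_t timeout_cycles;
};

static void tx_state_init(struct tx_state *s)
{
    s->timeout_cycles = stub_get_timer_hz() / 2048;
}
```

The macro and the cached field compute the same number; the difference is only how often the call and divide happen on the data path.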
> > This will not improve performance, is my guess, and I would have to look at the code for that routine to see if it buys you anything to store the value locally. If getting the hz is just a simple read of a variable then good, but you should still use a local variable within the object to hold the (rte_get_timer_hz()/2048) instead of doing the call and the divide each time.
> >
> > > Thanks and Regards,
> > > Harsh and Hrishikesh
> > >
> > > On Mon, Nov 19, 2018, 19:19 Wiles, Keith wrote:
> > >
> > > > On Nov 17, 2018, at 4:05 PM, Kyle Larose wrote:
> > > >
> > > > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel <thadodaharsh10@gmail.com> wrote:
> > > >>
> > > >> Hello,
> > > >> Thanks a lot for going through the code and providing us with so much information.
> > > >> We removed all the memcpy/malloc from the data path as you suggested and
> > > > ...
> > > >> After removing this, we are able to see a performance gain, but not as good as the raw socket.
> > > >>
> > > >
> > > > You're using an unordered_map to map your buffer pointers back to the mbufs. While it may not do a memcpy all the time, it will likely end up doing a malloc arbitrarily when you insert or remove entries from the map. If it needs to resize the table, it'll be even worse. You may want to consider using librte_hash (https://doc.dpdk.org/api/rte__hash_8h.html) instead. Or, even better, see if you can design the system to avoid needing to do a lookup like this. Can you return a handle with the mbuf pointer and the data together?
> > > >
> > > > You're also using floating point math where it's unnecessary (the timing check). Just multiply the numerator by 1000000 prior to doing the division. I doubt you'll overflow a uint64_t with that. It's not as efficient as integer math, though I'm not sure offhand it'd cause a major perf problem.
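Kyle's integer-math point can be sketched as follows: scale the numerator by 1000000 before dividing, so the whole timing check stays in `uint64_t` arithmetic. The function name and sample values below are illustrative, not taken from the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Integer-only rate computation: multiply the numerator by 1000000
 * first, then divide, so no floating point is needed.
 * bytes * 1000000 only overflows a uint64_t past ~18.4e12 bytes,
 * far beyond anything this kind of timing check would see. */
static uint64_t bytes_per_sec(uint64_t bytes, uint64_t elapsed_us)
{
    return bytes * 1000000ULL / elapsed_us;
}
```

For example, 312500 bytes (2.5 Mbit, the observed bottleneck) over one second gives 312500 bytes/s with no float conversions anywhere on the path.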
> > > > One final thing: using a raw socket, the kernel will take over transmitting and receiving to the NIC itself. That means it is free to use multiple CPUs for the rx and tx. I notice that you only have one rx/tx queue, meaning at most one CPU can send and receive packets. When running your performance test with the raw socket, you may want to see how busy the system is doing packet sends and receives. Is it using more than one CPU's worth of processing? Is it using less, but when combined with your main application's usage, the overall system is still using more than one?
> > >
> > > Along with the floating point math, I would remove all floating point math and use the rte_rdtsc() function to work in cycles. Using something like:
> > >
> > > uint64_t cur_tsc, next_tsc, timo = (rte_get_timer_hz() / 16); /* One 16th of a second; use 2/4/8/16/32 power-of-two numbers to make the math a simple divide */
> > >
> > > cur_tsc = rte_rdtsc();
> > >
> > > next_tsc = cur_tsc + timo; /* Now next_tsc is the next time to flush */
> > >
> > > while (1) {
> > >     cur_tsc = rte_rdtsc();
> > >     if (cur_tsc >= next_tsc) {
> > >         flush();
> > >         next_tsc += timo;
> > >     }
> > >     /* Do other stuff */
> > > }
> > >
> > > For the m_bufPktMap I would use the rte_hash, or do not use a hash at all: grab the buffer address and subtract the mbuf header:
> > >
> > > mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM);
> > >
> > > DpdkNetDevice::Write(uint8_t *buffer, size_t length)
> > > {
> > >     struct rte_mbuf *pkt;
> > >     uint64_t cur_tsc;
> > >
> > >     pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer, sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM);
> > >
> > >     /* No need to test pkt, but buffer may be tested above the math to make sure it is not null */
> > >
> > >     pkt->pkt_len = length;
> > >     pkt->data_len = length;
> > >
> > >     rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt);
> > >
> > >     cur_tsc = rte_rdtsc();
> > >
> > >     /* next_tsc is a private variable */
> > >     if (cur_tsc >= next_tsc) {
> > >         rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer); /* hardcoded the queue id, should be fixed */
> > >         next_tsc = cur_tsc + timo; /* timo is a fixed number of cycles to wait */
> > >     }
> > >     return length;
> > > }
> > >
> > > DpdkNetDevice::Read()
> > > {
> > >     struct rte_mbuf *pkt;
> > >
> > >     if (m_rxBuffer->length == 0) {
> > >         m_rxBuffer->next = 0;
> > >         m_rxBuffer->length = rte_eth_rx_burst(m_portId, 0, m_rxBuffer->pkts, MAX_PKT_BURST);
> > >
> > >         if (m_rxBuffer->length == 0)
> > >             return std::make_pair(NULL, -1);
> > >     }
> > >
> > >     pkt = m_rxBuffer->pkts[m_rxBuffer->next++];
> > >
> > >     /* do not use rte_pktmbuf_read() as it does a copy for the complete packet */
> > >
> > >     return std::make_pair(rte_pktmbuf_mtod(pkt, char *), pkt->pkt_len);
> > > }
> > >
> > > void
> > > DpdkNetDevice::FreeBuf(uint8_t *buf)
> > > {
> > >     struct rte_mbuf *pkt;
> > >
> > >     if (!buf)
> > >         return;
> > >     pkt = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM);
> > >
> > >     rte_pktmbuf_free(pkt);
> > > }
> > >
> > > When your code is done with the buffer, convert the buffer address back to an rte_mbuf pointer and call rte_pktmbuf_free(pkt). This should eliminate the copy and the floating point code. Converting my C code to C++: priceless :-)
> > >
> > > Hopefully the buffer address passed in is the original buffer address and has not been adjusted.
> > >
> > > Regards,
> > > Keith
> >
> > Regards,
> > Keith
>
> Also, rdtsc causes the CPU to stop doing any look-ahead, so there is a Heisenberg effect. Adding more rdtsc will hurt performance. It also looks like your code is not doing bursting correctly. What if multiple packets arrive in one rx_burst?
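One way to handle Stephen's last question, draining every packet of a multi-packet burst before asking for another, can be sketched in plain C. This is an illustrative stand-alone sketch, not the ns-3 code: `stub_rx_burst()` stands in for `rte_eth_rx_burst()` (it fakes `n_avail` arriving packets), and `struct burst_cache` mirrors the role of `m_rxBuffer` with explicit `next`/`length` bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PKT_BURST 32

struct burst_cache {
    void *pkts[MAX_PKT_BURST];
    int next;   /* index of the next cached packet to hand out */
    int length; /* packets remaining from the last burst */
};

/* Stub standing in for rte_eth_rx_burst(): pretends n_avail packets
 * arrived and fills the array with fake, non-NULL mbuf pointers. */
static int stub_rx_burst(void **pkts, int max, int n_avail)
{
    int n = n_avail < max ? n_avail : max;
    for (int i = 0; i < n; i++)
        pkts[i] = (void *)(long)(i + 1);
    return n;
}

/* Return one packet per call; refill from a new burst only when the
 * cache is empty, so no packet of a multi-packet burst is lost. */
static void *read_one(struct burst_cache *c, int n_avail)
{
    if (c->length == 0) {
        c->next = 0;
        c->length = stub_rx_burst(c->pkts, MAX_PKT_BURST, n_avail);
        if (c->length == 0)
            return NULL;
    }
    c->length--;
    return c->pkts[c->next++];
}
```

Note the `length--` on every hand-out: without decrementing (or comparing `next` against `length`), a caller could walk past the end of the cached burst before the next refill.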