From: Harsh Patel
Date: Fri, 14 Dec 2018 23:11:16 +0530
To: "Wiles, Keith"
Cc: stephen@networkplumber.org, Kyle Larose, users@dpdk.org
Subject: Re: [dpdk-users] Query on handling packets
List-Id: DPDK usage discussions

Hello,

It has been a long break since our last message. We have tried a few
things in the meantime, and we want to share some results that we think
might be relevant for making progress.

First, we thought there might be a relation between burst size and
throughput. We took a 10 Mbps flow and a 20 Mbps flow and varied the
burst size over 1, 2, 4, 8, 16, 32 and so on up to 256, which is the
size of the mbuf pool. The throughput we get for all of these flows is
in the range of about 8.5-9.0 Mbps, which is the bottleneck for the
wireless environment.

Secondly, we modified the divisor in the equation that calculates
TX_TIMEOUT, where we used rte_get_timer_hz()/2048; we changed 2048 to
the values 16, 32, 64, ..., 16384. We are not able to see any difference
in the performance. We were trying a lot of things and thought this
might have some effect. We guess now that it doesn't.

Also, as we showed earlier, we replaced the code to use pointer
arithmetic and allocated a memory pool for the Tx/Rx intermediate
buffers that convert the single-packet flow to bursts and vice versa.
In this code, we allocated a single memory pool which was used by both
the Tx buffer and the Rx buffer.
We thought this might have some effect, so we implemented a version with
2 separate memory pools, 1 for Tx and 1 for Rx. But again, in this case
we are not able to see any difference in the performance.

The modified code for these experiments is not available on the
repository for which we gave a link earlier. That code just contains
some tweaks which are not that important; if you need it, you can ask
for it. The main code, which is working and up to date, is on the
repository for you to have a look at.

We wanted to inform you about this and would like to hear from you on
what else we can do to find out where the problem is. It would be really
helpful if you could point out the mistake or problem in the code, or
give an idea as to what might be creating this problem.

We thank you for your time.

Regards,
Harsh and Hrishikesh

On Mon, 3 Dec 2018 at 15:07, Harsh Patel wrote:

> Hello,
> The data mentioned in the previous mails are observations, and the number
> of threads mentioned is what the system creates on its own, not something
> we gave to the system. I'm not sure how to explain this with a picture, so
> I will provide a text explanation.
>
> First, we ran the Linux kernel code which uses raw sockets and we gave it 2
> cores. That example used 2 threads on its own.
> Secondly, we ran our DPDK in ns-3 code and gave it the same number of cores,
> i.e. 2 cores. That example spawned 6 threads on its own.
> (Note: these are observations.)
> All of the above statistics were provided to answer the question of whether
> the two simulations might have been given different numbers of cores, and
> whether that might be the reason for the performance bottleneck. Clearly
> they are both using the same number of cores (2), and the results are what
> I sent earlier (raw socket ~10 Mbps and DPDK ~2.5 Mbps).
>
> Now we thought that we might give more cores to the DPDK in ns-3 code, which
> might improve its performance.
> This is where we gave 4 cores to our DPDK in ns-3 code, which still spawned
> the same 6 threads. It gave the same results as 2 cores for DPDK in ns-3.
> This was the observation.
>
> From this, we assume that the number of cores is not the reason for the low
> performance. This is not the problem; we need to look somewhere else.
> So, the problem that causes the low performance and a bottleneck around
> 2.5 Mbps is somewhere else, and we need to figure that out.
>
> Ask again if this is not clear. If it is clear, we need to see where the
> problem is. Can you help in finding the reason why this is happening?
>
> Thanks & Regards,
> Harsh & Hrishikesh
>
>
> On Fri, 30 Nov 2018 at 21:24, Wiles, Keith wrote:
>
>>
>>
>> > On Nov 30, 2018, at 3:02 AM, Harsh Patel wrote:
>> >
>> > Hello,
>> > Sorry for the long delay, we were busy with some exams.
>> >
>> > 1) About the NUMA sockets
>> > This is the result of the command you mentioned:
>> > ======================================================================
>> > Core and Socket Information (as reported by '/sys/devices/system/cpu')
>> > ======================================================================
>> >
>> > cores = [0, 1, 2, 3]
>> > sockets = [0]
>> >
>> >         Socket 0
>> >         --------
>> > Core 0  [0]
>> > Core 1  [1]
>> > Core 2  [2]
>> > Core 3  [3]
>> >
>> > We don't know much about this and would like your input on what else
>> > should be checked or what we need to do.
>> >
>> > 2) The part where you asked for a graph
>> > We used `ps` to analyse which CPU cores are being utilized.
>> > The raw socket version had two logical threads, which used cores 0 and 1.
>> > The DPDK version had 6 logical threads, which also used cores 0 and 1.
>> > This is the case for which we showed you the results.
>> > As the previous case had 2 cores and was not giving the desired results,
>> > we tried giving more cores to see if the DPDK in ns-3 code can achieve
>> > the desired throughput and pps.
>> > (We thought giving more cores might improve the performance.)
>> > For this new case, we provided 4 cores in total using EAL arguments, upon
>> > which it used cores 0-3. We still got the same results as the ones sent
>> > earlier.
>> > We think this means that the bottleneck is a different problem,
>> > unrelated to the number of cores, as of now. (This whole section is an
>> > answer to the question in the last paragraph raised by Kyle, for which
>> > Keith asked for a graph.)
>>
>> In the CPU output above you are running a four-core system with no
>> hyper-threads. This means you only have four cores and four threads in
>> DPDK terms. Using 6 logical threads will not improve performance in the
>> DPDK case. DPDK normally uses a single thread per core. You can have more
>> than one pthread per core, but having more than one thread per core
>> requires the software to switch threads. Context switching is not a
>> good performance win in most cases.
>>
>> Not sure how your system is set up, and a picture could help.
>>
>> I will be traveling all next week and responses will be slow.
>>
>> >
>> > 3) About updating the TX_TIMEOUT and storing rte_get_timer_hz()
>> > We have not tried this yet; we will try it today and send you the
>> > status some time after that.
>> >
>> > 4) For the suggestion by Stephen
>> > We are not clear on what you suggested, and it would be nice if you
>> > could elaborate on your suggestion.
>> >
>> > Thanks and Regards,
>> > Harsh and Hrishikesh
>> >
>> > PS: We are done with our exams and will now be working on this
>> > regularly.
>> >
>> > On Sun, 25 Nov 2018 at 10:05, Stephen Hemminger <stephen@networkplumber.org> wrote:
>> > On Sat, 24 Nov 2018 16:01:04 +0000
>> > "Wiles, Keith" wrote:
>> >
>> > > > On Nov 22, 2018, at 9:54 AM, Harsh Patel wrote:
>> > > >
>> > > > Hi
>> > > >
>> > > > Thank you so much for the reply and for the solution.
>> > > >
>> > > > We used the given code.
>> > > > We were amazed by the pointer arithmetic you used; we got to learn
>> > > > something new.
>> > > >
>> > > > But we are still underperforming. The same bottleneck of ~2.5 Mbps
>> > > > is seen.
>> > > >
>> > > > We also checked whether the raw socket version was using more
>> > > > (logical) cores than the DPDK version. We found that the raw socket
>> > > > version has 2 logical threads running on 2 logical CPUs, whereas the
>> > > > DPDK version has 6 logical threads on 2 logical CPUs. We also ran the
>> > > > 6 threads on 4 logical CPUs, and we still see the same bottleneck.
>> > > >
>> > > > We have updated our code (you can use the same links from the previous
>> > > > mail). It would be helpful if you could help us find what causes the
>> > > > bottleneck.
>> > >
>> > > I looked at the code for a few seconds and noticed your TX_TIMEOUT is a
>> > > macro that calls (rte_get_timer_hz()/2048). Just to be safe, I would not
>> > > call rte_get_timer_hz() each time, but grab the value, store the hz
>> > > locally, and use that variable instead. My guess is this will not
>> > > improve performance, and I would have to look at the code of that
>> > > routine to see if it buys you anything to store the value locally. If
>> > > getting the hz is just a simple read of a variable then good, but you
>> > > should still use a local variable within the object to hold the
>> > > (rte_get_timer_hz()/2048) instead of doing the call and divide each
>> > > time.
>> > >
>> > > >
>> > > > Thanks and Regards,
>> > > > Harsh and Hrishikesh
>> > > >
>> > > >
>> > > > On Mon, Nov 19, 2018, 19:19 Wiles, Keith wrote:
>> > > >
>> > > >
>> > > > > On Nov 17, 2018, at 4:05 PM, Kyle Larose wrote:
>> > > > >
>> > > > > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel <thadodaharsh10@gmail.com> wrote:
>> > > > >>
>> > > > >> Hello,
>> > > > >> Thanks a lot for going through the code and providing us with so much
>> > > > >> information.
>> > > > >> We removed all the memcpy/malloc from the data path as you suggested and
>> > > > > ...
>> > > > >> After removing this, we are able to see a performance gain but not as good
>> > > > >> as raw socket.
>> > > > >>
>> > > > >
>> > > > > You're using an unordered_map to map your buffer pointers back to the
>> > > > > mbufs. While it may not do a memcpy all the time, it will likely end
>> > > > > up doing a malloc arbitrarily when you insert or remove entries from
>> > > > > the map. If it needs to resize the table, it'll be even worse. You may
>> > > > > want to consider using librte_hash:
>> > > > > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even better,
>> > > > > see if you can design the system to avoid needing to do a lookup like
>> > > > > this. Can you return a handle with the mbuf pointer and the data
>> > > > > together?
>> > > > >
>> > > > > You're also using floating point math where it's unnecessary (the
>> > > > > timing check). Just multiply the numerator by 1000000 prior to doing
>> > > > > the division. I doubt you'll overflow a uint64_t with that. It's not
>> > > > > as efficient as integer math, though I'm not sure offhand it'd cause a
>> > > > > major perf problem.
>> > > > >
>> > > > > One final thing: using a raw socket, the kernel will take over
>> > > > > transmitting and receiving to the NIC itself. That means it is free to
>> > > > > use multiple CPUs for the rx and tx. I notice that you only have one
>> > > > > rx/tx queue, meaning at most one CPU can send and receive packets.
>> > > > > When running your performance test with the raw socket, you may want
>> > > > > to see how busy the system is doing packet sends and receives. Is it
>> > > > > using more than one CPU's worth of processing? Is it using less, but
>> > > > > when combined with your main application's usage, the overall system
>> > > > > is still using more than one?
>> > > >
>> > > > On the floating point math: I would remove all of it and use the
>> > > > rte_rdtsc() function to work in cycles, using something like:
>> > > >
>> > > > uint64_t cur_tsc, next_tsc, timo = (rte_get_timer_hz() / 16); /* One 16th of a second; use a power of two (2/4/8/16/32) to keep the divide simple */
>> > > >
>> > > > cur_tsc = rte_rdtsc();
>> > > >
>> > > > next_tsc = cur_tsc + timo; /* next_tsc is now the next time to flush */
>> > > >
>> > > > while(1) {
>> > > >         cur_tsc = rte_rdtsc();
>> > > >         if (cur_tsc >= next_tsc) {
>> > > >                 flush();
>> > > >                 next_tsc += timo;
>> > > >         }
>> > > >         /* Do other stuff */
>> > > > }
>> > > >
>> > > > For the m_bufPktMap I would use the rte_hash, or not use a hash at all
>> > > > by grabbing the buffer address and subtracting the mbuf header size
>> > > > and headroom:
>> > > > mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM);
>> > > >
>> > > >
>> > > > DpdkNetDevice::Write(uint8_t *buffer, size_t length)
>> > > > {
>> > > >         struct rte_mbuf *pkt;
>> > > >         uint64_t cur_tsc;
>> > > >
>> > > >         /* Step back from the data buffer to the mbuf header in front of it */
>> > > >         pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer, sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM);
>> > > >
>> > > >         /* No need to test pkt, but buffer may be tested to make sure it is not null before the math above */
>> > > >
>> > > >         pkt->pkt_len = length;
>> > > >         pkt->data_len = length;
>> > > >
>> > > >         rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt);
>> > > >
>> > > >         cur_tsc = rte_rdtsc();
>> > > >
>> > > >         /* next_tsc is a private variable */
>> > > >         if (cur_tsc >= next_tsc) {
>> > > >                 rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer); /* hardcoded the queue id, should be fixed */
>> > > >                 next_tsc = cur_tsc + timo; /* timo is a fixed number of cycles to wait */
>> > > >         }
>> > > >         return length;
>> > > > }
>> > > >
>> > > > DpdkNetDevice::Read()
>> > > > {
>> > > >         struct rte_mbuf *pkt;
>> > > >
>> > > >         /* Refill only once every packet of the previous burst has been handed out */
>> > > >         if (m_rxBuffer->next >= m_rxBuffer->length) {
>> > > >                 m_rxBuffer->next = 0;
>> > > >                 m_rxBuffer->length = rte_eth_rx_burst(m_portId, 0, m_rxBuffer->pkts, MAX_PKT_BURST);
>> > > >
>> > > >                 if (m_rxBuffer->length == 0)
>> > > >                         return std::make_pair(NULL, -1);
>> > > >         }
>> > > >
>> > > >         pkt = m_rxBuffer->pkts[m_rxBuffer->next++];
>> > > >
>> > > >         /* do not use rte_pktmbuf_read() as it does a copy for the complete packet */
>> > > >
>> > > >         return std::make_pair(rte_pktmbuf_mtod(pkt, char *), pkt->pkt_len);
>> > > > }
>> > > >
>> > > > void
>> > > > DpdkNetDevice::FreeBuf(uint8_t *buf)
>> > > > {
>> > > >         struct rte_mbuf *pkt;
>> > > >
>> > > >         if (!buf)
>> > > >                 return;
>> > > >         pkt = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(rte_mbuf) + RTE_PKTMBUF_HEADROOM);
>> > > >
>> > > >         rte_pktmbuf_free(pkt);
>> > > > }
>> > > >
>> > > > When your code is done with the buffer, convert the buffer address
>> > > > back to a rte_mbuf pointer and call rte_pktmbuf_free(pkt). This
>> > > > should eliminate the copy and floating point code. Converting my C code
>> > > > to C++: priceless :-)
>> > > >
>> > > > Hopefully the buffer address passed is the original buffer address
>> > > > and has not been adjusted.
>> > > >
>> > > >
>> > > > Regards,
>> > > > Keith
>> > > >
>> > >
>> > > Regards,
>> > > Keith
>> > >
>> >
>> > Also, rdtsc causes the CPU to stop doing any look-ahead, so there is a
>> > Heisenberg effect.
>> > Adding more rdtsc calls will hurt performance. It also looks like your
>> > code is not doing bursting correctly.
>> > What if multiple packets arrive in one rx_burst?
>>
>> Regards,
>> Keith
>>
>>
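
On Stephen's closing question about several packets arriving in one
rx_burst, here is a minimal, DPDK-free sketch of the burst-to-single-packet
pattern that Keith's Read() is aiming at. Everything in it is an assumption
made for illustration: BurstReader and BurstFn are hypothetical names, the
callback merely stands in for rte_eth_rx_burst(), and plain ints stand in
for mbuf pointers. The idea is that the reader hands out one packet per
call and only asks for a new burst once every packet of the previous burst
has been consumed, so no packet from a multi-packet burst is dropped.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical stand-in for rte_eth_rx_burst(): writes up to `max` packet
// handles into `out` and returns how many were written.
using BurstFn = std::function<std::uint16_t(int *out, std::uint16_t max)>;

// Hypothetical adapter: turns a burst-oriented source into a
// one-packet-per-call interface without losing packets.
class BurstReader {
public:
    static constexpr std::uint16_t kMaxBurst = 32;  // assumed burst size

    explicit BurstReader(BurstFn fetch) : fetch_(std::move(fetch)) {
        assert(fetch_ != nullptr);
    }

    // Stores the next packet in `pkt` and returns true, or returns false
    // if the source currently has nothing to deliver.
    bool Read(int &pkt) {
        if (next_ == count_) {  // previous burst fully consumed
            next_ = 0;
            count_ = fetch_(pkts_.data(), kMaxBurst);
            if (count_ == 0)
                return false;
        }
        pkt = pkts_[next_++];
        return true;
    }

private:
    BurstFn fetch_;
    std::array<int, kMaxBurst> pkts_{};
    std::uint16_t count_ = 0;  // packets in the current burst
    std::uint16_t next_ = 0;   // index of the next packet to hand out
};
```

With a source that delivers three packets in its first burst, three
consecutive Read() calls return them in order while the source is polled
only once; the source is polled again only on the fourth call.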