From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: <3chas3@gmail.com> Received: from mail-it0-f67.google.com (mail-it0-f67.google.com [209.85.214.67]) by dpdk.org (Postfix) with ESMTP id 9A3715681; Mon, 27 Aug 2018 17:51:55 +0200 (CEST) Received: by mail-it0-f67.google.com with SMTP id u13-v6so2910540iti.1; Mon, 27 Aug 2018 08:51:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:sender:from:date:message-id :subject:to:cc; bh=4T2MKxpPaUkNyREjBE2B1J5GUc8WC+7fEBtj5eWhJMY=; b=eDZKs4wM1JqS/lkgm9R6DCZptMAQu32h0jIWrAFr85Ctwd4p/ejrgyrqHgAbMfro3B E3y2k25TfYNcK+F1zwbozi16CA3zUNeyvefFUdkI1MIT88HQGQN9nNcYGTZAPE/3ktXT o48HkQwuo9ssH6Gjpnd5HGgjEsfXYAmA1YshxGsYSVEQ1O74wmF42VD5nLkBTVA3nHLG 5qStVpVm2P+0iy5nabnz9dCcVtuT7VGRWr6MLEEpSBhFPhRQRc0wd1dZONoKEoPbL/sv 9seQHoiuIKFZWNpEtiTlVFB8T+aItW4CW1NJ4wsY6QQdlkmYrRRvXR9ctAH6IbfxI6EC VtCg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:sender:from :date:message-id:subject:to:cc; bh=4T2MKxpPaUkNyREjBE2B1J5GUc8WC+7fEBtj5eWhJMY=; b=njaFze18kOseeWvBaf8woXuR4HLEehFFG6qft3hdSCOm21MnVhM+q7/8r6rUS3y4u9 dklelbjbXBM4pZfTeTMwxcx5dxCg48q3H5dVPGby3Lm+QkD2bPwz7VMQRc8ktvIaZYIp t6VzSLoaa8aKY52vhXVsc008YpsXqiHjlaVgURS7/mt8+GUzLmd4ERnNdb7HPC/Dtyn/ cuPmyjeYdI1tBpeAEOYJ/qjYNP0R6e0QvcfER8FNPc5ue4l6yV3CvhO1WANrw9F/iNFq TfVCWp80c2ApVkXVu8AxfM8dVrFmkpnZsKtmEUZJvaiEd6QtqsOVZkBhxmi6vCj3TbM2 zEow== X-Gm-Message-State: APzg51BAiZ1JptpBl5ae3NwF3hXvJC0GWsWJxGLDvR8t2DYwG29DZjmS jIsB7bSWR2Et+tCwkB1GFvkTM1iAiM2+DfOoYO4= X-Google-Smtp-Source: ANB0VdZTlVDsfl1QgVhYQRH2v8GRtsEbdVqP6DhVz7Q7etORChOw1/7axYpqep1hi3sVAIgtbmEJThJBQlqfdLfu4uc= X-Received: by 2002:a24:cfd7:: with SMTP id y206-v6mr7392371itf.112.1535385114775; Mon, 27 Aug 2018 08:51:54 -0700 (PDT) MIME-Version: 1.0 References: <20180816125202.15980-1-bluca@debian.org> <20180816133208.26566-1-bluca@debian.org> <1534933159.5764.107.camel@debian.org> <20180822174316.GA29821@roosta> In-Reply-To: Sender: chasmosaurus@gmail.com X-Google-Sender-Delegation: chasmosaurus@gmail.com From: Chas Williams <3chas3@gmail.com> Date: Mon, 27 Aug 2018 11:51:42 -0400 X-Google-Sender-Auth: TZBGCoXNLn3b8ogedxvyMYh1BiQ Message-ID: To: Matan Azrad Cc: Eric Kinzie , bluca@debian.org, dev@dpdk.org, Declan Doherty , Chas Williams , stable@dpdk.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.15 Subject: Re: [dpdk-dev] [dpdk-stable] [PATCH v4] net/bonding: per-slave intermediate rx ring X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Aug 2018 15:51:56 -0000 On Mon, Aug 27, 2018 at 11:30 AM Matan Azrad wrote: > Hi Chas > > From: Chas Williams <3chas3@gmail.com> > >On Sun, Aug 26, 2018 at 3:40 AM Matan Azrad > wrote: > > > >From: Chas Williams > >>On Thu, Aug 23, 2018 at 3:28 AM Matan Azrad matan@mellanox.com> wrote: > >>Hi > >> > >>From: Eric Kinzie > >>> On Wed Aug 22 11:42:37 +0000 2018, Matan Azrad wrote: > >>> > Hi Luca > >>> > > >>> > From: Luca Boccassi > >>> > > On Wed, 2018-08-22 at 07:09 +0000, Matan Azrad wrote: > >>> > > > Hi Chas > >>> > > > > >>> > > > From: Chas Williams > >>> > > > > On Tue, Aug 21, 2018 at 11:43 AM Matan Azrad > >>> > > > > wrote: > >>> > > > > Hi Chas > >>> > > > > > >>> > > > > From: Chas Williams > >>> > > > > > On Tue, Aug 21, 2018 at 6:56 AM Matan Azrad > >>> > > > > > https://emea01.safelinks.protection.outlook.com/?url=3Dhttp%3A%2F%2Fx.com= &data=3D02%7C01%7Cmatan%40mellanox.com%7C7ee011bf19224cb17d9a08d60c202379%7= Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636709729507525613&sdata=3DTiWa= Rq8A%2FvUaqPIT6ajfJdcY7eIAGPqB%2BiHppcFdZJo%3D&reserved=3D0 > > > >>> > > > > > wrote: > >>> > > > > > Hi > >>> > > > > > > >>> > > > > > From: Chas Williams > >>> > > > > > > This will need to be implemented for some of the other RX > >>> > > > > > > burst methods at some point for other modes to see this > >>> > > > > > > performance improvement (with the exception of > active-backup). > >>> > > > > > > >>> > > > > > Yes, I think it should be done at least to > >>> > > > > > bond_ethdev_rx_burst_8023ad_fast_queue (should be easy) for > >>> now. > >>> > > > > > > >>> > > > > > There is some duplicated code between the various RX paths. > >>> > > > > > I would like to eliminate that as much as possible, so I wa= s > >>> > > > > > going to give that some thought first. > >>> > > > > > >>> > > > > There is no reason to stay this function as is while its twin > is > >>> > > > > changed. > >>> > > > > > >>> > > > > Unfortunately, this is all the patch I have at this time. > >>> > > > > > >>> > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > On Thu, Aug 16, 2018 at 9:32 AM Luca Boccassi > >>> > > > > > > https://emea01.safelinks.protection.outlook.com/?url=3Dhttp%3A%2F%2Fian.o= rg&data=3D02%7C01%7Cmatan%40mellanox.com%7C7ee011bf19224cb17d9a08d60c202379= %7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636709729507525613&sdata=3D0A= LmgN%2B6Xnrl3kKeeoJRGTlJmeLZlmcJwsOTcncMUWE%3D&reserved=3D0> > wrote: > >>> > > > > > > > >>> > > > > > > > During bond 802.3ad receive, a burst of packets is > fetched > >>> > > > > > > > from each slave into a local array and appended to > >>> > > > > > > > per-slave ring buffer. > >>> > > > > > > > Packets are taken from the head of the ring buffer and > >>> > > > > > > > returned to the caller. The number of mbufs provided t= o > >>> > > > > > > > each slave is sufficient to meet the requirements of th= e > >>> > > > > > > > ixgbe vector receive. > >>> > > > > > > >>> > > > > > Luca, > >>> > > > > > > >>> > > > > > Can you explain these requirements of ixgbe? > >>> > > > > > > >>> > > > > > The ixgbe (and some other Intel PMDs) have vectorized RX > >>> > > > > > routines that are more efficient (if not faster) taking > >>> > > > > > advantage of some advanced CPU instructions. I think you > need > >>> > > > > > to be receiving at least 32 packets or more. > >>> > > > > > >>> > > > > So, why to do it in bond which is a generic driver for all th= e > >>> > > > > vendors PMDs, If for ixgbe and other Intel nics it is better > you > >>> > > > > can force those PMDs to receive always 32 packets and to mana= ge > >>> > > > > a ring by themselves. > >>> > > > > > >>> > > > > The drawback of the ring is some additional latency on the > >>> > > > > receive path. > >>> > > > > In testing, the additional latency hasn't been an issue for > bonding. > >>> > > > > >>> > > > When bonding does processing slower it may be a bottleneck for > the > >>> > > > packet processing for some application. > >>> > > > > >>> > > > > The bonding PMD has a fair bit of overhead associated with th= e > >>> > > > > RX and TX path calculations. Most applications can just > arrange > >>> > > > > to call the RX path with a sufficiently large receive. Bondi= ng > >>> > > > > can't do this. > >>> > > > > >>> > > > I didn't talk on application I talked on the slave PMDs, The > slave > >>> > > > PMD can manage a ring by itself if it helps for its own > performance. > >>> > > > The bonding should not be oriented to specific PMDs. > >>> > > > >>> > > The issue though is that the performance problem is not with the > >>> > > individual PMDs - it's with bonding. There were no reports > regarding > >>> > > the individual PMDs. > >>> > > This comes from reports from customers from real world production > >>> > > deployments - the issue of bonding being too slow was raised > multiple > >>> times. > >>> > > This patch addresses those issues, again in production deployment= s, > >>> > > where it's been used for years, to users and customers > satisfaction. > >>> > > >>> > From Chas I understood that using burst of 32 helps for some slave > PMDs > >>> performance which makes sense. > >>> > I can't understand how the extra copy phases improves the bonding > itself > >>> performance: > >>> > > >>> > You added 2 copy phases in the bonding RX function: > >>> > 1. Get packets from the slave to a local array. > >>> > 2. Copy packet pointers from a local array to the ring array. > >>> > 3. Copy packet pointers from the ring array to the application arra= y. > >>> > > >>> > Each packet arriving to the application must pass the above 3 > phases(in a > >>> specific call or in previous calls). > >>> > > >>> > Without this patch we have only - > >>> > Get packets from the slave to the application array. > >>> > > >>> > Can you explain how the extra copies improves the bonding > performance? > >>> > > >>> > Looks like it improves the slaves PMDs and because of that the > bonding > >>> PMD performance becomes better. > >>> > >>> I'm not sure that adding more buffer management to the vector PMDs wi= ll > >>> improve the drivers' performance; it's just that calling the rx > function in such > >>> a way that it returns no data wastes time. > >> > >>Sorry, I don't fully understand what you said here, please rephrase. > >> > >>> The bonding driver is already an exercise in buffer management so > adding this layer of indirection here makes > >>> sense in my opinion, as does hiding the details of the consituent > interfaces where possible. > >> > >>Can you explain how this new buffer management with the extra pointer > copies improves the bonding itself performance? > >>Looks really strange to me. > >> > >>Because rings are generally quite efficient. > > > >But you are using a ring in addition to regular array management, it mus= t > hurt performance of the bonding PMD > >(means the bonding itself - not the slaves PMDs which are called from th= e > bonding) > It adds latency. It increases performance because we spend less CPU time reading from the PMDs. This means we have more CPU to use for post processing (i.e. routing). > > > >> > >>Bonding is in a middle ground between application and PMD. > >Yes. > >>What bonding is doing, may not improve all applications. > >Yes, but it can be solved using some bonding modes. > >> If using a ring to buffer the vectorized receive routines, improves > your particular application, > >>that's great. > >It may be not great and even bad for some other PMDs which are not > vectororized. > > > >> However, I don't think I can say that it would help all > >>applications. As you point out, there is overhead associated with > >>a ring. > >Yes. > >>Bonding's receive burst isn't especially efficient (in mode 4). > > > >Why? > > > >It makes a copy of the slaves, has a fair bit of stack usage, > >needs to check the slave status, and needs to examine each > >packet to see if it is a slow protocol packet. So each > >packet is essentially read twice. The fast queue code for mode 4 > >avoids some of this (and probably ignores checking collecting > >incorrectly). If you find a slow protocol packet, you need to > >chop it out of the array with memmove due to overlap. > > Agree. > So to improve the bonding performance you need to optimize the aboves > problems. > There is no connection to the ring. > And as I have described numerous times, these problems can't be easily fixed and preserve the existing API. > > >> Bonding benefits from being able to read as much as possible (within > limits of > >>course, large reads would blow out caches) from each slave. > > > >The slaves PMDs can benefits in the same way. > > > >>It can't return all that data though because applications tend to use > the > >>burst size that would be efficient for a typical PMD. > > > >What is the preferred burst size of the bonding? Maybe the application > should use it when they are using bonding. > > > >The preferred burst size for bonding would be the sum of all the > >slaves ideal read size. However, that's not likely to be simple > >since most applications decide early the size for the read/write > >burst operations. > > > >>An alternative might be to ask bonding applications to simply issue > larger reads for > >>certain modes. That's probably not as easy as it sounds given the > >>way that the burst length effects multiplexing. > > > >Can you explain it more? > > > >A longer burst size on one PMD will tend to favor that PMD > >over others. It will fill your internal queues with more > >of its packets. > > Agree, it's about fairness. > > > > >>Another solution might be just alternatively poll the individual > >>slaves on each rx burst. But that means you need to poll at a > >>faster rate. Depending on your application, you might not be > >>able to do that. > > >Again, can you be more precise in the above explanation? > > > >If the application knows that there are two slaves backing > >a bonding interface, the application could just read twice > >from the bonding interface, knowing that the bonding > >interface is going to alternate between the slaves. But > >this requires the application to know things about the bonding > >PMD, like the number of slaves. > > Why should the application poll twice? > Poll slave 0, than process it's packets, poll slave 1 than process its > packets... > What is the problem? > B ecause let's say that each slave is 10G and you are using link aggregation with two slaves. If you have tuned your application on the assumption that a PMD is approximately 10G, then you are going to be under polling the bonding PMD. For the above to work, you need to ensure that the application is polling sufficiently to keep up. > > >> We can avoid this scheduling overhead by just > >>doing the extra reads in bonding and buffering in a ring. > >> > >>Since bonding is going to be buffering everything in a ring, > > > >? I don't familiar with it. For now I don't think we need a ring. > > > >We have some empirical testing that shows performance improvements. > >How do you explain this? > > You are using a hack in bonding which hurt the bonding but improves the > vectorized PMD you use. > > >Rings aren't inefficient. > > Only when there is a need to use it. > I believe we have identified a need. > > >There's significant overhead of calling the rx burst routine on any PMD. > >You need to get the PMD data structures into local memory. > > Yes. > > >Reading as much as possible makes sense. > > Agree. > > > Could you generally do this for all PMDs? Yes, but the ring adds > latency. Latency > >that isn't an issue for the bonding PMD because of all the > >other inefficiencies (for mode 4). > > Enlarging the bonding latency by that way makes some vectorized slave PMD= s > happy and makes the bonding worst > for other slaves PMDs - this is not a good idea to put it upstream. > > Improving the other problems in the bonding(reduce the bonding latency) > will do the job for you and for others. > Latency is not the issue with bonding. The issue with bonding is the overhead associated with making an rx burst call. We add latency (via rings) to make part of the bonding driver more efficient. Again, I suspect it even helps the non-vectorized PMDs. Calling a PMD's rx burst routine is costly and we are switching between PMD's inside the bonding driver. Bonding is halfway between an application and a PMD. What we are doing in the bonding PMD is what an application would typically do. But instead of forcing all the bonding users to do this, we are doing it for them. > > > >it makes sense to just read as much as is as efficiently possible. Fo= r > the > >>Intel adapters, this means using a read big enough trigger the vectoriz= ed > >>versions. > >>> > > So I'd like to share this improvement rather than keeping it > private > >>> > > - because I'm nice that way :-P > >>> > > > >>> > > > > > Did you check for other vendor PMDs? It may hurt performanc= e > >>> > > > > > there.. > >>> > > > > > > >>> > > > > > I don't know, but I suspect probably not. For the most par= t > >>> > > > > > you are typically reading almost up to the vector > requirement. > >>> > > > > > But if one slave has just a single packet, then you can't > >>> > > > > > vectorize on the next slave. > >>> > > > > > > >>> > > > > > >>> > > > > I don't think that the ring overhead is better for PMDs which > >>> > > > > are not using the vectorized instructions. > >>> > > > > > >>> > > > > The non-vectorized PMDs are usually quite slow. The addition= al > >>> > > > > overhead doesn't make a difference in their performance. > >>> > > > > >>> > > > We should not do things worse than they are. > >>> > > > >>> > > There were no reports that this made things worse. The feedback > from > >>> > > production was that it made things better. > >>> > > >>> > Yes, It may be good for specific slaves drivers but hurt another > >>> > slaves drivers, So maybe it should stay private to specific > costumers using > >>> specific nics. > >>> > > >>> > Again, I can understand how this patch improves performance of some > >>> > PMDs therefore I think the bonding is not the place to add it but > maybe > >>> some PMDs. > >> > > >