Subject: Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
From: Andrew Rybchenko
Organization: OKTET Labs
To: "Xueming(Steven) Li" , "dev@dpdk.org"
Cc: "jerinjacobk@gmail.com" , NBU-Contact-Thomas Monjalon , Lior Margalit , Slava Ovsiienko , "konstantin.ananyev@intel.com" , "ferruh.yigit@intel.com"
Date: Tue, 12 Oct 2021 14:28:26 +0300
Message-ID: <35b86a0a-3805-582f-85a3-69c3f8907ce0@oktetlabs.ru>
In-Reply-To: <88a4a603bb3ca68898fdac21e4a3effd92a1d6e3.camel@nvidia.com>

On 10/12/21 1:55 PM, Xueming(Steven) Li wrote:
> On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
>> On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
>>> On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
>>>> On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
>>>>> Hi Xueming,
>>>>>
>>>>> On 9/30/21 5:55 PM, Xueming Li wrote:
>>>>>> In the current DPDK framework, all Rx queues are pre-loaded with
>>>>>> mbufs for incoming packets. When the number of representors scales
>>>>>> out in a switch domain, memory consumption becomes significant.
>>>>>> Furthermore, polling all ports leads to high cache miss rates, high
>>>>>> latency and low throughput.
>>>>>>
>>>>>> This patch introduces the shared Rx queue. A PF and representors
>>>>>> with the same configuration in the same switch domain can share an
>>>>>> Rx queue set by specifying the shared Rx queue offload flag and a
>>>>>> share group.
>>>>>>
>>>>>> All member ports of a shared Rx queue actually share one Rx queue
>>>>>> and pre-load mbufs into only that queue, so memory is saved.
>>>>>>
>>>>>> Polling any queue that uses the same shared Rx queue receives
>>>>>> packets from all member ports. The source port is identified by
>>>>>> mbuf->port.
>>>>>>
>>>>>> Multiple groups are supported via a group ID. The number of queues
>>>>>> per port in a shared group should be identical, and queue indexes
>>>>>> are mapped 1:1 within a shared group. An example of polling two
>>>>>> share groups:
>>>>>>
>>>>>>   core  group  queue
>>>>>>      0      0      0
>>>>>>      1      0      1
>>>>>>      2      0      2
>>>>>>      3      0      3
>>>>>>      4      1      0
>>>>>>      5      1      1
>>>>>>      6      1      2
>>>>>>      7      1      3
>>>>>>
>>>>>> A shared Rx queue must be polled on a single thread or core. If
>>>>>> both PF0 and representor0 joined the same share group, you can't
>>>>>> poll pf0rxq0 on core1 and rep0rxq0 on core2. Actually, polling one
>>>>>> port within a share group is sufficient, since polling any port in
>>>>>> the group returns packets for every port in the group.
>>>>>
>>>>> I apologize for jumping into the review process this late.
>>>>
>>>> Appreciate the bold suggestion, never too late :)
>>>>
>>>>> Frankly speaking, I doubt that it is the best design to solve
>>>>> the problem. Yes, I confirm that the problem exists, but I
>>>>> think there is a better and simpler way to solve it.
>>>>>
>>>>> The problem of the suggested solution is that it puts all
>>>>> the headache about consistency on the application and PMDs
>>>>> without any help from the ethdev layer to guarantee that
>>>>> consistency. As a result I believe there will be either
>>>>> missing/lost consistency checks or huge duplication in
>>>>> each PMD which supports the feature. Shared RxQs must be
>>>>> equally configured, including number of queues, offloads
>>>>> (taking device-level Rx offloads into account), RSS
>>>>> settings etc. So applications must care about it and
>>>>> PMDs (or the ethdev layer) must check it.
>>>>
>>>> The name might be confusing; here is my understanding:
>>>> 1. The NIC shares the buffer-supply HW queue between shared RxQs -
>>>> for memory.
>>>> 2. The PMD polls one shared RxQ - for latency and performance.
>>>> 3. Most per-queue features like offloads and RSS are not impacted,
>>>> which is why they are not mentioned. Some offloads might not be
>>>> supported due to PMD or HW limitations; checks need to be added in
>>>> the PMD case by case.
>>>> 4. Multiple groups are defined for service-level flexibility. For
>>>> example, the PF's and VIP customers' load is distributed via
>>>> dedicated queues and cores, while low-priority customers share one
>>>> core with one shared queue. Multiple groups enable more combinations.
>>>> 5. One port could assign queues to different groups for polling
>>>> flexibility. For example, the first 4 queues go to group 0 and the
>>>> next 4 queues to group 1; each group has other member ports with 4
>>>> queues, so the port with 8 queues could be polled by 8 cores without
>>>> the non-shared RxQ penalty. In other words, each core polls only one
>>>> shared RxQ.
>>>>
>>>>> The advantage of the solution is that any device may
>>>>> create a group and subsequent devices join it. The absence of
>>>>> a primary device is nice. But do we really need it?
>>>>> Will the design work if some representors are configured
>>>>> to use a shared RxQ, but some are not? Theoretically it
>>>>> is possible, but it could require extra non-trivial code
>>>>> on the fast path.
>>>>
>>>> With multiple groups, any device could be hot-unplugged.
>>>>
>>>> Mixed configuration is supported; the only difference is how to set
>>>> mbuf->port. Since the group is per queue, mixed configuration is
>>>> better to be supported; I didn't see any difficulty here.
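
(For illustration only - a minimal sketch of the polling model described
in the cover letter above: poll one queue of any member port and attribute
each packet to its real source via mbuf->port. BURST_SIZE, member_port,
queue_id and app_handle_pkt() are hypothetical application-side names, not
part of the series.)

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Application-defined handler; assumed to consume the mbuf. */
void app_handle_pkt(uint16_t src_port, struct rte_mbuf *m);

/* Poll one queue of any port in the share group; packets of all member
 * ports are returned, with mbuf->port identifying the real source port. */
static void
poll_shared_rxq(uint16_t member_port, uint16_t queue_id)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	uint16_t nb, i;

	nb = rte_eth_rx_burst(member_port, queue_id, pkts, BURST_SIZE);
	for (i = 0; i < nb; i++) {
		/* Dispatch on the receiving member port (PF or
		 * representor), not on the polled port_id. */
		app_handle_pkt(pkts[i]->port, pkts[i]);
	}
}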
>>>>
>>>> The PMD could choose to support only group 0, with the same settings
>>>> for each RxQ; that fits most scenarios.
>>>>
>>>>> Also, looking at the first two patches, I don't understand
>>>>> how the application will find out which devices may share
>>>>> RxQs. E.g. if we have two different NICs which support
>>>>> sharing, we can try to set up only one group 0, but
>>>>> we will finally have two devices (not one) which must be
>>>>> polled.
>>>>>
>>>>> 1. We need an extra flag in dev_info->dev_capa,
>>>>> RTE_ETH_DEV_CAPA_RX_SHARE, to advertise that
>>>>> the device supports Rx sharing.
>>>>
>>>> dev_info->rx_queue_offload_capa could be used here, no?
>>
>> It depends. But we definitely need a flag which
>> says that the rx_domain below makes sense. It could be
>> either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
>> capability.
>>
>> The question is whether it is really an offload. An offload is
>> when something can be done by HW/FW and the result is provided
>> to SW. Maybe it is just nit-picking...
>>
>> Maybe we don't need an offload at all. Just have
>> RTE_ETH_DEV_CAPA_RXQ_SHARE and use a non-zero group ID
>> as a flag that an RxQ should be shared (zero - default,
>> no sharing). The ethdev layer may check consistency on
>> its layer to ensure that the device capability is
>> reported if a non-zero group is specified on queue setup.
>>
>>>>> 2. I think we need "rx_domain" in the device info
>>>>> (which should be treated within the boundaries of the
>>>>> switch_domain) if and only if
>>>>> RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
>>>>> Otherwise the rx_domain value does not make sense.
>>>>
>>>> I see, this will give flexibility for different HW; I will add it.
>>>>
>>>>> (1) and (2) will allow the application to find out which
>>>>> devices can share Rx.
>>>>>
>>>>> 3. The primary device (the representors' backing device) should
>>>>> advertise the shared RxQ offload. Enabling the offload
>>>>> tells the device to provide packets to all devices in
>>>>> the Rx domain with mbuf->port filled in appropriately.
>>>>> It also allows the app to identify the primary device in the
>>>>> Rx domain. When the application enables the offload, it
>>>>> must ensure that it does not treat the polled port_id as the
>>>>> input port_id, but always checks mbuf->port for each
>>>>> packet.
>>>>>
>>>>> 4. A new Rx mode should be introduced for secondary
>>>>> devices. It should not allow configuring RSS, specifying
>>>>> any Rx offloads, etc. ethdev must ensure it.
>>>>> It is an open question right now whether it should require
>>>>> providing the primary port_id. In theory representors
>>>>> have it. However, maybe it is nice for consistency
>>>>> to ensure that the application knows that it does.
>>>>> If shared Rx mode is specified for a device, the application
>>>>> does not need to set up RxQs, and attempts to do so
>>>>> should be rejected in ethdev.
>>>>> For consistency it is better to ensure that the number of
>>>>> queues matches.
>>>>
>>>> RSS and Rx offloads should be supported individually; the PMD needs
>>>> to check and reject what is not supported.
>>
>> Thinking a bit more about it, I agree that RSS settings could
>> be individual. Offloads could be individual as well, but I'm
>> not sure about all offloads. E.g. Rx scatter, which is related
>> to the Rx buffer size (shared, since the Rx mempool is shared),
>> vs MTU. Maybe it is acceptable. We just must define the rules for
>> what should happen if offloads contradict each other.
>> It should be highlighted in the description, including the
>> driver callback, to ensure that PMD maintainers are responsible
>> for consistency checks.
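
(To make the RTE_ETH_DEV_CAPA_RXQ_SHARE idea above concrete - a rough
sketch of the kind of check the ethdev layer could do in
rte_eth_rx_queue_setup(). The capability bit and the share_group field in
struct rte_eth_rxconf are proposed additions discussed in this thread;
they do not exist in ethdev yet and the exact names may differ.)

	/* Inside rte_eth_rx_queue_setup(), before the request is passed
	 * to the PMD: a non-zero share group requires the capability. */
	if (rx_conf != NULL && rx_conf->share_group > 0) {
		struct rte_eth_dev_info dev_info;
		int ret;

		ret = rte_eth_dev_info_get(port_id, &dev_info);
		if (ret != 0)
			return ret;

		if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
			/* Rx queue sharing requested but not supported. */
			return -EINVAL;
		}
	}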
>>>>
>>>>> It is an interesting question what should happen if the
>>>>> primary device is reconfigured and shared Rx is
>>>>> disabled on reconfiguration.
>>>>
>>>> I feel better with no primary port/queue assumption in the
>>>> configuration: all members are treated equally, and each queue can
>>>> join or leave a share group. That's important to support multiple
>>>> groups.
>>
>> I agree. The problem of many flexible solutions is the
>> complexity of supporting them. We'll see how it goes.
>>
>>>>> 5. If so, in theory the implementation of the Rx burst
>>>>> in the secondary could simply call Rx burst on the
>>>>> primary device.
>>>>>
>>>>> Andrew.
>>>
>>> Hi Andrew,
>>>
>>> I realized that we are talking about different things. This feature
>>> introduces two kinds of RxQ sharing:
>>> 1. Share the mempool to save memory
>>> 2. Share the polling to save latency
>>>
>>> What you suggested is to reuse all the RxQ configuration, IIUC. Maybe
>>> we should break the flag into 3, so the application could learn the
>>> PMD capability and configure accordingly. What do you think?
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
>>
>> Not sure that I understand. Just specify the same mempool
>> on Rx queue setup. Isn't it sufficient?
>>
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
>>
>> It implies pool sharing if I'm not mistaken. Of course,
>> we can poll many different HW queues in one poll, but it
>> hardly makes sense to care specially about it.
>> IMHO RxQ sharing is sharing of the underlying HW Rx queue.
>>
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL
>>
>> It is hardly a feature. Rather a possible limitation.
>
> Thanks, then I'd drop this suggestion.
>
> Here is the TODO list, let me know if anything is missing:
> 1. change the offload flag to RTE_ETH_DEV_CAPA_RX_SHARE

RTE_ETH_DEV_CAPA_RXQ_SHARE, since it is not sharing of the
entire Rx, but just some queues.

> 2. RxQ share group check in ethdev
> 3. add rx_domain into device info
>
>> Andrew.
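
(To make TODO items 1 and 3 concrete - a hedged sketch of how an
application could discover which ports may share Rx queues. The
RTE_ETH_DEV_CAPA_RXQ_SHARE bit and the rx_domain field are the additions
proposed above; they do not exist in ethdev yet, and the final placement
of rx_domain may differ.)

#include <rte_ethdev.h>

/* Two ports may share Rx queues only if both advertise the proposed
 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability and report the same Rx domain
 * within the same switch domain. */
static int
ports_may_share_rxq(uint16_t port_a, uint16_t port_b)
{
	struct rte_eth_dev_info ia, ib;

	if (rte_eth_dev_info_get(port_a, &ia) != 0 ||
	    rte_eth_dev_info_get(port_b, &ib) != 0)
		return 0;

	if ((ia.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0 ||
	    (ib.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
		return 0;

	/* switch_info.domain_id already exists in dev_info; rx_domain is
	 * the proposed new field (TODO item 3) and is hypothetical here. */
	return ia.switch_info.domain_id == ib.switch_info.domain_id &&
	       ia.rx_domain == ib.rx_domain;
}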