Subject: Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
From: Andrew Rybchenko
Organization: OKTET Labs
To: "Xueming(Steven) Li" , "dev@dpdk.org"
Cc: "jerinjacobk@gmail.com" , NBU-Contact-Thomas Monjalon , Lior Margalit , Slava Ovsiienko , "konstantin.ananyev@intel.com" , "ferruh.yigit@intel.com"
Date: Tue, 12 Oct 2021 14:28:26 +0300
Message-ID: <35b86a0a-3805-582f-85a3-69c3f8907ce0@oktetlabs.ru>
In-Reply-To: <88a4a603bb3ca68898fdac21e4a3effd92a1d6e3.camel@nvidia.com>

On 10/12/21 1:55 PM, Xueming(Steven) Li wrote:
> On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
>> On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
>>> On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
>>>> On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
>>>>> Hi Xueming,
>>>>>
>>>>> On 9/30/21 5:55 PM, Xueming Li wrote:
>>>>>> In the current DPDK framework, all Rx queues are pre-loaded with
>>>>>> mbufs for incoming packets. When the number of representors scales
>>>>>> out in a switch domain, memory consumption becomes significant.
>>>>>> Furthermore, polling all ports leads to high cache miss rates, high
>>>>>> latency and low throughput.
>>>>>>
>>>>>> This patch introduces the shared Rx queue. A PF and representors
>>>>>> with the same configuration in the same switch domain can share an
>>>>>> Rx queue set by specifying the shared Rx queue offload flag and a
>>>>>> share group.
>>>>>>
>>>>>> All member ports of a shared Rx queue actually share one Rx queue
>>>>>> and pre-load mbufs into only that queue, so memory is saved.
>>>>>>
>>>>>> Polling any queue that uses the same shared Rx queue receives
>>>>>> packets from all member ports. The source port is identified by
>>>>>> mbuf->port.
>>>>>>
>>>>>> Multiple groups are supported via a group ID. The number of queues
>>>>>> per port in a shared group should be identical, and queue indexes
>>>>>> are mapped 1:1 within a shared group. An example of polling two
>>>>>> share groups:
>>>>>>
>>>>>>   core  group  queue
>>>>>>      0      0      0
>>>>>>      1      0      1
>>>>>>      2      0      2
>>>>>>      3      0      3
>>>>>>      4      1      0
>>>>>>      5      1      1
>>>>>>      6      1      2
>>>>>>      7      1      3
>>>>>>
>>>>>> A shared Rx queue must be polled on a single thread or core. If
>>>>>> both PF0 and representor0 joined the same share group, you can't
>>>>>> poll pf0rxq0 on core1 and rep0rxq0 on core2. Actually, polling one
>>>>>> port within a share group is sufficient, since polling any port in
>>>>>> the group returns packets for every port in the group.
>>>>>
>>>>> I apologize for jumping into the review process this late.
>>>>
>>>> Appreciate the bold suggestion, never too late :)
>>>>
>>>>> Frankly speaking, I doubt that it is the best design to solve
>>>>> the problem. Yes, I confirm that the problem exists, but I
>>>>> think there is a better and simpler way to solve it.
>>>>>
>>>>> The problem of the suggested solution is that it puts all
>>>>> the headache about consistency on the application and PMDs
>>>>> without any help from the ethdev layer to guarantee that
>>>>> consistency. As a result I believe there will be either
>>>>> missing/lost consistency checks or huge duplication in
>>>>> each PMD which supports the feature. Shared RxQs must be
>>>>> equally configured, including number of queues, offloads
>>>>> (taking device-level Rx offloads into account), RSS
>>>>> settings etc. So applications must care about it and
>>>>> PMDs (or the ethdev layer) must check it.
>>>>
>>>> The name might be confusing; here is my understanding:
>>>> 1. The NIC shares the buffer-supply HW queue between shared RxQs -
>>>> for memory.
>>>> 2. The PMD polls one shared RxQ - for latency and performance.
>>>> 3. Most per-queue features like offloads and RSS are not impacted,
>>>> which is why they are not mentioned. Some offloads might not be
>>>> supported due to PMD or HW limitations; checks need to be added in
>>>> the PMD case by case.
>>>> 4. Multiple groups are defined for service-level flexibility. For
>>>> example, the PF's and VIP customers' load is distributed via
>>>> dedicated queues and cores, while low-priority customers share one
>>>> core with one shared queue. Multiple groups enable more combinations.
>>>> 5. One port could assign queues to different groups for polling
>>>> flexibility. For example, the first 4 queues go to group 0 and the
>>>> next 4 queues to group 1; each group has other member ports with 4
>>>> queues, so the port with 8 queues could be polled by 8 cores without
>>>> the non-shared RxQ penalty. In other words, each core polls only one
>>>> shared RxQ.
>>>>
>>>>> The advantage of the solution is that any device may
>>>>> create a group and subsequent devices join it. The absence of
>>>>> a primary device is nice. But do we really need it?
>>>>> Will the design work if some representors are configured
>>>>> to use a shared RxQ, but some are not? Theoretically it
>>>>> is possible, but it could require extra non-trivial code
>>>>> on the fast path.
>>>>
>>>> With multiple groups, any device could be hot-unplugged.
>>>>
>>>> Mixed configuration is supported; the only difference is how to set
>>>> mbuf->port. Since the group is per queue, mixed configuration is
>>>> better to be supported; I didn't see any difficulty here.
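
(For illustration only - a minimal sketch of the polling model described
in the cover letter above: poll one queue of any member port and attribute
each packet to its real source via mbuf->port. BURST_SIZE, member_port,
queue_id and app_handle_pkt() are hypothetical application-side names, not
part of the series.)

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Application-defined handler; assumed to consume the mbuf. */
void app_handle_pkt(uint16_t src_port, struct rte_mbuf *m);

/* Poll one queue of any port in the share group; packets of all member
 * ports are returned, with mbuf->port identifying the real source port. */
static void
poll_shared_rxq(uint16_t member_port, uint16_t queue_id)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	uint16_t nb, i;

	nb = rte_eth_rx_burst(member_port, queue_id, pkts, BURST_SIZE);
	for (i = 0; i < nb; i++) {
		/* Dispatch on the receiving member port (PF or
		 * representor), not on the polled port_id. */
		app_handle_pkt(pkts[i]->port, pkts[i]);
	}
}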
>>>>
>>>> The PMD could choose to support only group 0, with the same settings
>>>> for each RxQ; that fits most scenarios.
>>>>
>>>>> Also, looking at the first two patches, I don't understand
>>>>> how the application will find out which devices may share
>>>>> RxQs. E.g. if we have two different NICs which support
>>>>> sharing, we can try to set up only one group 0, but
>>>>> we will finally have two devices (not one) which must be
>>>>> polled.
>>>>>
>>>>> 1. We need an extra flag in dev_info->dev_capa,
>>>>> RTE_ETH_DEV_CAPA_RX_SHARE, to advertise that
>>>>> the device supports Rx sharing.
>>>>
>>>> dev_info->rx_queue_offload_capa could be used here, no?
>>
>> It depends. But we definitely need a flag which
>> says that the rx_domain below makes sense. It could be
>> either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
>> capability.
>>
>> The question is whether it is really an offload. An offload is
>> when something can be done by HW/FW and the result is provided
>> to SW. Maybe it is just nit-picking...
>>
>> Maybe we don't need an offload at all. Just have
>> RTE_ETH_DEV_CAPA_RXQ_SHARE and use a non-zero group ID
>> as a flag that an RxQ should be shared (zero - default,
>> no sharing). The ethdev layer may check consistency on
>> its layer to ensure that the device capability is
>> reported if a non-zero group is specified on queue setup.
>>
>>>>> 2. I think we need "rx_domain" in the device info
>>>>> (which should be treated within the boundaries of the
>>>>> switch_domain) if and only if
>>>>> RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
>>>>> Otherwise the rx_domain value does not make sense.
>>>>
>>>> I see, this will give flexibility for different HW; I will add it.
>>>>
>>>>> (1) and (2) will allow the application to find out which
>>>>> devices can share Rx.
>>>>>
>>>>> 3. The primary device (the representors' backing device) should
>>>>> advertise the shared RxQ offload. Enabling the offload
>>>>> tells the device to provide packets to all devices in
>>>>> the Rx domain with mbuf->port filled in appropriately.
>>>>> It also allows the app to identify the primary device in the
>>>>> Rx domain. When the application enables the offload, it
>>>>> must ensure that it does not treat the polled port_id as the
>>>>> input port_id, but always checks mbuf->port for each
>>>>> packet.
>>>>>
>>>>> 4. A new Rx mode should be introduced for secondary
>>>>> devices. It should not allow configuring RSS, specifying
>>>>> any Rx offloads, etc. ethdev must ensure it.
>>>>> It is an open question right now whether it should require
>>>>> providing the primary port_id. In theory representors
>>>>> have it. However, maybe it is nice for consistency
>>>>> to ensure that the application knows that it does.
>>>>> If shared Rx mode is specified for a device, the application
>>>>> does not need to set up RxQs, and attempts to do so
>>>>> should be rejected in ethdev.
>>>>> For consistency it is better to ensure that the number of
>>>>> queues matches.
>>>>
>>>> RSS and Rx offloads should be supported individually; the PMD needs
>>>> to check and reject what is not supported.
>>
>> Thinking a bit more about it, I agree that RSS settings could
>> be individual. Offloads could be individual as well, but I'm
>> not sure about all offloads. E.g. Rx scatter, which is related
>> to the Rx buffer size (shared, since the Rx mempool is shared),
>> vs MTU. Maybe it is acceptable. We just must define the rules for
>> what should happen if offloads contradict each other.
>> It should be highlighted in the description, including the
>> driver callback, to ensure that PMD maintainers are responsible
>> for consistency checks.
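
(To make the RTE_ETH_DEV_CAPA_RXQ_SHARE idea above concrete - a rough
sketch of the kind of check the ethdev layer could do in
rte_eth_rx_queue_setup(). The capability bit and the share_group field in
struct rte_eth_rxconf are proposed additions discussed in this thread;
they do not exist in ethdev yet and the exact names may differ.)

	/* Inside rte_eth_rx_queue_setup(), before the request is passed
	 * to the PMD: a non-zero share group requires the capability. */
	if (rx_conf != NULL && rx_conf->share_group > 0) {
		struct rte_eth_dev_info dev_info;
		int ret;

		ret = rte_eth_dev_info_get(port_id, &dev_info);
		if (ret != 0)
			return ret;

		if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
			/* Rx queue sharing requested but not supported. */
			return -EINVAL;
		}
	}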
>>>>
>>>>> It is an interesting question what should happen if the
>>>>> primary device is reconfigured and shared Rx is
>>>>> disabled on reconfiguration.
>>>>
>>>> I feel better with no primary port/queue assumption in the
>>>> configuration: all members are treated equally, and each queue can
>>>> join or leave a share group. That's important to support multiple
>>>> groups.
>>
>> I agree. The problem of many flexible solutions is the
>> complexity of supporting them. We'll see how it goes.
>>
>>>>> 5. If so, in theory the implementation of the Rx burst
>>>>> in the secondary could simply call Rx burst on the
>>>>> primary device.
>>>>>
>>>>> Andrew.
>>>
>>> Hi Andrew,
>>>
>>> I realized that we are talking about different things. This feature
>>> introduces two kinds of RxQ sharing:
>>> 1. Share the mempool to save memory
>>> 2. Share the polling to save latency
>>>
>>> What you suggested is to reuse all the RxQ configuration, IIUC. Maybe
>>> we should break the flag into 3, so the application could learn the
>>> PMD capability and configure accordingly. What do you think?
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
>>
>> Not sure that I understand. Just specify the same mempool
>> on Rx queue setup. Isn't it sufficient?
>>
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
>>
>> It implies pool sharing if I'm not mistaken. Of course,
>> we can poll many different HW queues in one poll, but it
>> hardly makes sense to care specially about it.
>> IMHO RxQ sharing is sharing of the underlying HW Rx queue.
>>
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL
>>
>> It is hardly a feature. Rather a possible limitation.
>
> Thanks, then I'd drop this suggestion.
>
> Here is the TODO list, let me know if anything is missing:
> 1. change the offload flag to RTE_ETH_DEV_CAPA_RX_SHARE

RTE_ETH_DEV_CAPA_RXQ_SHARE, since it is not sharing of the
entire Rx, but just some queues.

> 2. RxQ share group check in ethdev
> 3. add rx_domain into device info
>
>> Andrew.
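
(To make TODO items 1 and 3 concrete - a hedged sketch of how an
application could discover which ports may share Rx queues. The
RTE_ETH_DEV_CAPA_RXQ_SHARE bit and the rx_domain field are the additions
proposed above; they do not exist in ethdev yet, and the final placement
of rx_domain may differ.)

#include <rte_ethdev.h>

/* Two ports may share Rx queues only if both advertise the proposed
 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability and report the same Rx domain
 * within the same switch domain. */
static int
ports_may_share_rxq(uint16_t port_a, uint16_t port_b)
{
	struct rte_eth_dev_info ia, ib;

	if (rte_eth_dev_info_get(port_a, &ia) != 0 ||
	    rte_eth_dev_info_get(port_b, &ib) != 0)
		return 0;

	if ((ia.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0 ||
	    (ib.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
		return 0;

	/* switch_info.domain_id already exists in dev_info; rx_domain is
	 * the proposed new field (TODO item 3) and is hypothetical here. */
	return ia.switch_info.domain_id == ib.switch_info.domain_id &&
	       ia.rx_domain == ib.rx_domain;
}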