From: Andrew Rybchenko
Organization: OKTET Labs
Date: Mon, 11 Oct 2021 14:49:32 +0300
To: Xueming Li, dev@dpdk.org
Cc: Jerin Jacob, Ferruh Yigit, Viacheslav Ovsiienko, Thomas Monjalon,
 Lior Margalit, Ananyev Konstantin
Subject: Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
Message-ID: <8494d5f3-f134-e9d7-d782-dca9a9efaa03@oktetlabs.ru>
In-Reply-To: <20210930145602.763969-1-xuemingl@nvidia.com>
References: <20210727034204.20649-1-xuemingl@nvidia.com>
 <20210930145602.763969-1-xuemingl@nvidia.com>

Hi Xueming,

On 9/30/21 5:55 PM, Xueming Li wrote:
> In the current DPDK framework, all Rx queues are pre-loaded with
> mbufs for incoming packets. When the number of representors in a
> switch domain scales out, the memory consumption becomes significant.
> Furthermore, polling all ports leads to high cache miss rates, high
> latency and low throughput.
>
> This patch introduces the shared Rx queue. A PF and representors with
> the same configuration in the same switch domain can share an Rx
> queue set by specifying the shared Rx queue offload flag and a share
> group.
>
> All ports in a share group actually share one Rx queue, and mbufs are
> pre-loaded only to that single queue, so memory is saved.
>
> Polling any queue that uses the same shared Rx queue receives packets
> from all member ports. The source port is identified by mbuf->port.
>
> Multiple groups are supported via a group ID. The number of port
> queues in a share group should be identical, and queue indexes are
> mapped 1:1 within the group. An example of polling two share groups:
>
>   core  group  queue
>      0      0      0
>      1      0      1
>      2      0      2
>      3      0      3
>      4      1      0
>      5      1      1
>      6      1      2
>      7      1      3
>
> A shared Rx queue must be polled on a single thread or core. If both
> PF0 and representor0 joined the same share group, pf0rxq0 cannot be
> polled on core1 while rep0rxq0 is polled on core2. Actually, polling
> one port within a share group is sufficient, since polling any port
> in the group returns packets for all ports in the group.

I apologize for jumping into the review process this late.
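
Before commenting, let me restate my understanding of the intended
usage model as a small sketch. The loop below is just how I read the
cover letter: it uses only the existing rte_eth_rx_burst() and
mbuf->port, and the per-port counters stand in for real application
processing. Please correct me if I got the model wrong:

#include <rte_common.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void
poll_shared_rxq(uint16_t any_member_port, uint16_t queue_id,
		uint64_t pkts_per_port[RTE_MAX_ETHPORTS])
{
	struct rte_mbuf *pkts[32];
	uint16_t nb, i;

	/* One burst returns packets for every port in the share group. */
	nb = rte_eth_rx_burst(any_member_port, queue_id, pkts,
			      RTE_DIM(pkts));
	for (i = 0; i < nb; i++) {
		/*
		 * The real input port must be taken from the mbuf,
		 * not from the port_id used for polling.
		 */
		pkts_per_port[pkts[i]->port]++;
		rte_pktmbuf_free(pkts[i]);
	}
}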
Frankly speaking, I doubt that this is the best design to solve the
problem. Yes, I confirm that the problem exists, but I think there is
a better and simpler way to solve it.

The problem with the suggested solution is that it puts all the
headache about consistency on applications and PMDs, without any help
from the ethdev layer to guarantee that consistency. As a result, I
believe we will end up with either missing/lost consistency checks or
huge duplication in every PMD which supports the feature. Shared RxQs
must be configured identically: the same number of queues, the same
offloads (taking device-level Rx offloads into account), the same RSS
settings, etc. So applications must take care of it, and PMDs (or the
ethdev layer) must check it.

The advantage of the suggested solution is that any device may create
a group and subsequent devices simply join it; the absence of a
primary device is nice. But do we really need that? Will the design
work if some representors are configured to use a shared RxQ while
others are not? Theoretically it is possible, but it could require
extra non-trivial code on the fast path.

Also, looking at the first two patches, I don't understand how an
application will find out which devices may share RxQs. E.g. if we
have two different NICs which both support sharing, we can try to set
up just one group 0, but we will still end up with two devices (not
one) which must be polled.

Here is what I would suggest instead:

1. We need an extra flag in dev_info->dev_capa, say
   RTE_ETH_DEV_CAPA_RX_SHARE, to advertise that the device supports
   Rx sharing.

2. I think we need an "rx_domain" in device info (which should be
   treated within the boundaries of the switch_domain) if and only if
   RTE_ETH_DEV_CAPA_RX_SHARE is advertised. Otherwise the rx_domain
   value does not make sense.

(1) and (2) will allow an application to find out which devices can
share Rx.

3. The primary device (the representors' backing device) should
   advertise the shared RxQ offload. Enabling the offload tells the
   device to provide packets to all devices in the Rx domain with
   mbuf->port filled in appropriately. It also allows the application
   to identify the primary device in the Rx domain. When the
   application enables the offload, it must ensure that it does not
   treat the port_id used for polling as the input port, but always
   checks mbuf->port for each packet.

4. A new Rx mode should be introduced for secondary devices. It should
   not allow configuring RSS, specifying any Rx offloads, etc., and
   ethdev must enforce that. It is an open question right now whether
   it should require the primary port_id to be provided. In theory
   representors have it; however, it may still be nice for consistency
   to require it, to ensure that the application is aware of it. If
   the shared Rx mode is specified for a device, the application does
   not need to set up its RxQs, and attempts to do so should be
   rejected by ethdev. For consistency it is better to ensure that the
   numbers of queues match. It is an interesting question what should
   happen if the primary device is reconfigured and shared Rx is
   disabled on reconfiguration.

5. If so, in theory the implementation of the Rx burst in a secondary
   device could simply call Rx burst on the primary device.

Andrew.
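
P.S. To make (5) a bit more concrete, below is a rough, purely
illustrative sketch (not code from the series) of what a secondary
device's Rx burst could look like; struct shared_rxq and its fields
are hypothetical PMD-internal state:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical per-queue state kept by a secondary device. */
struct shared_rxq {
	uint16_t primary_port_id;	/* representors' backing device */
	uint16_t queue_id;		/* 1:1 mapped queue index */
};

/* Matches the eth_rx_burst_t prototype used by PMDs. */
static uint16_t
secondary_rx_burst(void *rx_queue, struct rte_mbuf **rx_pkts,
		   uint16_t nb_pkts)
{
	struct shared_rxq *rxq = rx_queue;

	/*
	 * Simply delegate to the primary device. Packets for all
	 * devices in the Rx domain come back with mbuf->port
	 * identifying the real input port.
	 */
	return rte_eth_rx_burst(rxq->primary_port_id, rxq->queue_id,
				rx_pkts, nb_pkts);
}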