From: Andrzej Ostruszka
To: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v4 12/29] graph: implement fastpath API routines
Date: Fri, 10 Apr 2020 01:07:05 +0200
Message-ID: <491dec21-e05a-2a06-9bb7-f6c5802250ef@semihalf.com>
In-Reply-To: <20200405085613.1336841-13-jerinj@marvell.com>

On 4/5/20 10:55 AM, jerinj@marvell.com wrote:
> From: Jerin Jacob
>
> Adding implementation for rte_graph_walk() API. This will perform a walk
> on the circular buffer and call the process function of each node
> and collect the stats if stats collection is enabled.
>
> Signed-off-by: Jerin Jacob
> Signed-off-by: Kiran Kumar K
> Signed-off-by: Pavan Nikhilesh
> Signed-off-by: Nithin Dabilpuram
> ---
[...]
> +__rte_experimental
> +static inline void
> +rte_graph_walk(struct rte_graph *graph)
> +{
> +	const rte_graph_off_t *cir_start = graph->cir_start;
> +	const rte_node_t mask = graph->cir_mask;
> +	uint32_t head = graph->head;
> +	struct rte_node *node;
> +	uint64_t start;
> +	uint16_t rc;
> +	void **objs;
> +
> +	/*
> +	 * Walk on the source node(s) ((cir_start - head) -> cir_start) and then
> +	 * on the pending streams (cir_start -> (cir_start + mask) -> cir_start)
> +	 * in a circular buffer fashion.
> +	 *
> +	 *	+-----+ <= cir_start - head [number of source nodes]
> +	 *	|     |
> +	 *	| ... | <= source nodes
> +	 *	|     |
> +	 *	+-----+ <= cir_start [head = 0] [tail = 0]
> +	 *	|     |
> +	 *	| ... | <= pending streams
> +	 *	|     |
> +	 *	+-----+ <= cir_start + mask
> +	 */
> +	while (likely(head != graph->tail)) {
> +		node = RTE_PTR_ADD(graph, cir_start[(int32_t)head++]);
> +		RTE_ASSERT(node->fence == RTE_GRAPH_FENCE);
> +		objs = node->objs;
> +		rte_prefetch0(objs);
> +
> +		if (rte_graph_has_stats_feature()) {
> +			start = rte_rdtsc();
> +			rc = node->process(graph, node, objs, node->idx);
> +			node->total_cycles += rte_rdtsc() - start;
> +			node->total_calls++;
> +			node->total_objs += rc;
> +		} else {
> +			node->process(graph, node, objs, node->idx);
> +		}
> +		node->idx = 0;

So I guess it is the responsibility of a node's process function to handle
all of its objects. What should it do if that is not possible? E.g. after
tx_burst we usually drop packets - how do you drop objects in a graph? Do
you simply free them (does the node know how an object was allocated?) or
do you need to pass them to a "null" node? The process function returns
the number of objects processed (e.g. in the later RX/TX nodes), so why is
that return value not used here?
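To illustrate what I am asking about - I would imagine a TX-ish node doing
something like the below (untested sketch; the node name, the "drop" edge
index 0 and the hard-coded port/queue are all made up by me):

#include <rte_ethdev.h>
#include <rte_graph.h>
#include <rte_graph_worker.h>

static uint16_t
my_tx_node_process(struct rte_graph *graph, struct rte_node *node,
		   void **objs, uint16_t nb_objs)
{
	uint16_t sent;

	/* Transmit as many objects (mbufs) as the NIC accepts. */
	sent = rte_eth_tx_burst(0 /* port */, 0 /* queue */,
				(struct rte_mbuf **)objs, nb_objs);

	/* Push the leftovers to a hypothetical "drop"/"null" next node
	 * (edge 0 here) instead of freeing them in place - is this the
	 * intended pattern? */
	if (sent != nb_objs)
		rte_node_enqueue(graph, node, 0 /* drop edge */,
				 &objs[sent], nb_objs - sent);

	return sent;
}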
> +		head = likely((int32_t)head > 0) ? head & mask : head;
> +	}
> +	graph->tail = 0;
> +}
[...]
> +__rte_experimental
> +static inline void
> +rte_node_enqueue(struct rte_graph *graph, struct rte_node *node,
> +		 rte_edge_t next, void **objs, uint16_t nb_objs)
> +{
> +	node = __rte_node_next_node_get(node, next);
> +	const uint16_t idx = node->idx;
> +
> +	__rte_node_enqueue_prologue(graph, node, idx, nb_objs);
> +
> +	rte_memcpy(&node->objs[idx], objs, nb_objs * sizeof(void *));
> +	node->idx = idx + nb_objs;
> +}

I see how this works for the usual scenario, but is there some kind of
fork/join operation? I am trying to imagine a scenario where I have a
bunch of co-processors doing different operations on a given
object/packet. To minimize latency I would like to use them in parallel
and, once all of them are done, enqueue the object to the next node. Is
there support for that in the graph lib, or should that be handled inside
the process function of a single node?

[...]
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Enqueue objs to multiple next nodes for further processing and
> + * set the next nodes to pending state in the circular buffer.
> + * objs[i] will be enqueued to nexts[i].
> + *
> + * @param graph
> + *   Graph pointer returned from rte_graph_lookup().
> + * @param node
> + *   Current node pointer.
> + * @param nexts
> + *   List of relative next node indices to enqueue objs.
> + * @param objs
> + *   List of objs to enqueue.
> + * @param nb_objs
> + *   Number of objs to enqueue.
> + */
> +__rte_experimental
> +static inline void
> +rte_node_enqueue_next(struct rte_graph *graph, struct rte_node *node,
> +		      rte_edge_t *nexts, void **objs, uint16_t nb_objs)
> +{
> +	uint16_t i;
> +
> +	for (i = 0; i < nb_objs; i++)
> +		rte_node_enqueue_x1(graph, node, nexts[i], objs[i]);

I have noticed the comments about the x1/x2/x4 functions, and since you
defended their performance there, why not make some use of them here
(Duff's device like unrolling)? Just like you have in your performance
test.
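E.g. something along these lines (untested; I am assuming here that the
x2/x4 variants take 2/4 objects for a single next edge, so four objects
can only be batched when their next edges match):

static inline void
my_node_enqueue_next(struct rte_graph *graph, struct rte_node *node,
		     rte_edge_t *nexts, void **objs, uint16_t nb_objs)
{
	uint16_t i = 0;

	/* Batch runs of four objects that share the same next edge. */
	while ((uint16_t)(nb_objs - i) >= 4) {
		if (nexts[i] == nexts[i + 1] && nexts[i] == nexts[i + 2] &&
		    nexts[i] == nexts[i + 3]) {
			rte_node_enqueue_x4(graph, node, nexts[i],
					    objs[i], objs[i + 1],
					    objs[i + 2], objs[i + 3]);
			i += 4;
		} else {
			rte_node_enqueue_x1(graph, node, nexts[i], objs[i]);
			i++;
		}
	}
	/* Remaining tail goes one by one. */
	for (; i < nb_objs; i++)
		rte_node_enqueue_x1(graph, node, nexts[i], objs[i]);
}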
[...]
> +__rte_experimental
> +static inline void **
> +rte_node_next_stream_get(struct rte_graph *graph, struct rte_node *node,
> +			 rte_edge_t next, uint16_t nb_objs)
> +{
> +	node = __rte_node_next_node_get(node, next);
> +	const uint16_t idx = node->idx;
> +	uint16_t free_space = node->size - idx;
> +
> +	if (unlikely(free_space < nb_objs))
> +		__rte_node_stream_alloc_size(graph, node, nb_objs);
> +
> +	return &node->objs[idx];
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Put the next stream to pending state in the circular buffer
> + * for further processing. Should be invoked followed by
> + * rte_node_next_stream_get().

Is the last sentence correct? It reads as if _get() should follow _put(),
while I would expect it to be the other way around.

> + *
> + * @param graph
> + *   Graph pointer returned from rte_graph_lookup().
> + * @param node
> + *   Current node pointer.
> + * @param next
> + *   Relative next node index..
> + * @param idx
> + *   Number of objs updated in the stream after getting the stream using
> + *   rte_node_next_stream_get.
> + *
> + * @see rte_node_next_stream_get().
> + */
> +__rte_experimental
> +static inline void
> +rte_node_next_stream_put(struct rte_graph *graph, struct rte_node *node,
> +			 rte_edge_t next, uint16_t idx)

I understood stream_move as an optimized enqueue, but could you describe
how these _get/_put are meant to be used, and why not use
stream_move/enqueue instead? I see in the later (testing) patches that
they are used for source nodes, but I am not sure I understand the
difference between them.

With regards
Andrzej Ostruszka
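PS. To make the last question concrete - from the test patches I gather
that a source node would use the _get/_put pair roughly like this
(untested sketch; port/queue 0, edge 0 and the burst size are my
assumptions, not from the patch):

#include <rte_ethdev.h>
#include <rte_graph.h>
#include <rte_graph_worker.h>

#define MY_BURST 32

static uint16_t
my_source_node_process(struct rte_graph *graph, struct rte_node *node,
		       void **objs, uint16_t nb_objs)
{
	void **stream;
	uint16_t n;

	/* A source node gets no input objects from the walk. */
	RTE_SET_USED(objs);
	RTE_SET_USED(nb_objs);

	/* Reserve room for MY_BURST objects in the next node's stream... */
	stream = rte_node_next_stream_get(graph, node, 0 /* edge */, MY_BURST);

	/* ...fill it in place, e.g. straight from an RX queue... */
	n = rte_eth_rx_burst(0 /* port */, 0 /* queue */,
			     (struct rte_mbuf **)stream, MY_BURST);

	/* ...and mark only the n filled slots as pending. */
	if (n != 0)
		rte_node_next_stream_put(graph, node, 0 /* edge */, n);

	return n;
}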