From: Jerin Jacob
Date: Fri, 10 Apr 2020 14:48:19 +0530
To: Andrzej Ostruszka
Cc: dpdk-dev
Subject: Re: [dpdk-dev] [PATCH v4 12/29] graph: implement fastpath API routines

On Fri, Apr 10, 2020 at 4:37 AM Andrzej Ostruszka wrote:
>
> On 4/5/20 10:55 AM, jerinj@marvell.com wrote:
> > From: Jerin Jacob
> >
> > Adding the implementation for the rte_graph_walk() API. This performs a
> > walk on the circular buffer, calls the process function of each node,
> > and collects stats if stats collection is enabled.
> >
> > Signed-off-by: Jerin Jacob
> > Signed-off-by: Kiran Kumar K
> > Signed-off-by: Pavan Nikhilesh
> > Signed-off-by: Nithin Dabilpuram
> > ---
> [...]
> > +__rte_experimental
> > +static inline void
> > +rte_graph_walk(struct rte_graph *graph)
> > +{
> > +	const rte_graph_off_t *cir_start = graph->cir_start;
> > +	const rte_node_t mask = graph->cir_mask;
> > +	uint32_t head = graph->head;
> > +	struct rte_node *node;
> > +	uint64_t start;
> > +	uint16_t rc;
> > +	void **objs;
> > +
> > +	/*
> > +	 * Walk on the source node(s) ((cir_start - head) -> cir_start) and then
> > +	 * on the pending streams (cir_start -> (cir_start + mask) -> cir_start)
> > +	 * in a circular buffer fashion.
> > +	 *
> > +	 *	+-----+ <= cir_start - head [number of source nodes]
> > +	 *	|     |
> > +	 *	| ... | <= source nodes
> > +	 *	|     |
> > +	 *	+-----+ <= cir_start [head = 0] [tail = 0]
> > +	 *	|     |
> > +	 *	| ... | <= pending streams
> > +	 *	|     |
> > +	 *	+-----+ <= cir_start + mask
> > +	 */
> > +	while (likely(head != graph->tail)) {
> > +		node = RTE_PTR_ADD(graph, cir_start[(int32_t)head++]);
> > +		RTE_ASSERT(node->fence == RTE_GRAPH_FENCE);
> > +		objs = node->objs;
> > +		rte_prefetch0(objs);
> > +
> > +		if (rte_graph_has_stats_feature()) {
> > +			start = rte_rdtsc();
> > +			rc = node->process(graph, node, objs, node->idx);
> > +			node->total_cycles += rte_rdtsc() - start;
> > +			node->total_calls++;
> > +			node->total_objs += rc;
> > +		} else {
> > +			node->process(graph, node, objs, node->idx);
> > +		}
> > +		node->idx = 0;
>
> So I guess it is the responsibility of a node's process function to handle
> all objects. What should it do if that is not possible? E.g. after
> tx_burst we usually drop packets, so how do you drop objects in a graph?
> Do you simply free them (does the node know how the object was
> allocated?) or do you need to pass them to a "null" node? The process
> function returns the number of objects processed (e.g. in the RX/TX
> nodes); why is it not used here?

The graph library deals with (void *) objects, not mbufs, so it can be
used with any data type.
For packet-drop requirements, a "packet drop" node is added in libnode
to free the packets back to the mempool.
A downstream node can send objects to the "null" node if it needs to
drop them on the floor, so that is the downstream node's decision.
The graph library does not dictate any domain-specific details.
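To illustrate, a TX-style node can hand any unsent packets to the drop
node through one of its next edges. A minimal sketch — the my_tx_burst()
helper and the edge layout are illustrative, not part of this patch:

	#include <rte_graph_worker.h>

	/* Hypothetical next-edge layout for this sketch. */
	enum { NEXT_PKT_DROP = 0 };

	static uint16_t my_tx_burst(void **objs, uint16_t nb_objs); /* placeholder */

	static uint16_t
	my_tx_node_process(struct rte_graph *graph, struct rte_node *node,
			   void **objs, uint16_t nb_objs)
	{
		/* Try to transmit the whole burst. */
		uint16_t sent = my_tx_burst(objs, nb_objs);

		/* Enqueue the unsent tail to the "packet drop" next node,
		 * which frees the objects back to their pool. */
		if (sent != nb_objs)
			rte_node_enqueue(graph, node, NEXT_PKT_DROP,
					 &objs[sent], nb_objs - sent);

		return sent;
	}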
> > +		head = likely((int32_t)head > 0) ? head & mask : head;
> > +	}
> > +	graph->tail = 0;
> > +}
> [...]
> > +__rte_experimental
> > +static inline void
> > +rte_node_enqueue(struct rte_graph *graph, struct rte_node *node,
> > +		 rte_edge_t next, void **objs, uint16_t nb_objs)
> > +{
> > +	node = __rte_node_next_node_get(node, next);
> > +	const uint16_t idx = node->idx;
> > +
> > +	__rte_node_enqueue_prologue(graph, node, idx, nb_objs);
> > +
> > +	rte_memcpy(&node->objs[idx], objs, nb_objs * sizeof(void *));
> > +	node->idx = idx + nb_objs;
> > +}
>
> I see how it works for the usual scenario. But is there some kind of
> fork/join operation? I am just trying to imagine a scenario where I have
> a bunch of co-processors doing different operations on a given
> object/packet. Then, to minimize latency, I'd like to use them in
> parallel and, when all are done, enqueue the object to the next node. Is
> there support for that in the graph lib, or should that be handled in
> the process function of a single node?

Yes. We can create multiple graphs (each attached to a core), which can
have multiple instances of source nodes.
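For example, each worker core can look up and walk its own graph
instance. A rough sketch of that pattern — the per-core naming scheme
and the quit flag are illustrative:

	#include <stdbool.h>
	#include <stdio.h>

	#include <rte_graph_worker.h>
	#include <rte_lcore.h>

	static volatile bool force_quit; /* illustrative quit flag */

	/* Hypothetical per-lcore worker: each core walks its own graph,
	 * assuming one was created per core as "worker_<lcore_id>". */
	static int
	graph_worker_main(void *arg __rte_unused)
	{
		char name[RTE_GRAPH_NAMESIZE];
		struct rte_graph *graph;

		snprintf(name, sizeof(name), "worker_%u", rte_lcore_id());
		graph = rte_graph_lookup(name);
		if (graph == NULL)
			return -1;

		while (!force_quit)
			rte_graph_walk(graph);

		return 0;
	}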
> [...]
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Enqueue objs to multiple next nodes for further processing and
> > + * set the next nodes to pending state in the circular buffer.
> > + * objs[i] will be enqueued to nexts[i].
> > + *
> > + * @param graph
> > + *   Graph pointer returned from rte_graph_lookup().
> > + * @param node
> > + *   Current node pointer.
> > + * @param nexts
> > + *   List of relative next node indices to enqueue objs.
> > + * @param objs
> > + *   List of objs to enqueue.
> > + * @param nb_objs
> > + *   Number of objs to enqueue.
> > + */
> > +__rte_experimental
> > +static inline void
> > +rte_node_enqueue_next(struct rte_graph *graph, struct rte_node *node,
> > +		      rte_edge_t *nexts, void **objs, uint16_t nb_objs)
> > +{
> > +	uint16_t i;
> > +
> > +	for (i = 0; i < nb_objs; i++)
> > +		rte_node_enqueue_x1(graph, node, nexts[i], objs[i]);
>
> I have noticed the comments about the x1/2/4 functions, but since you
> defended their performance there, why not make some use of them here
> (Duff's-device-like unrolling)? Just like you have in your performance
> test.

This function can be replaced with more SIMD-style processing in the
future, based on performance needs.

> [...]
> > +__rte_experimental
> > +static inline void **
> > +rte_node_next_stream_get(struct rte_graph *graph, struct rte_node *node,
> > +			 rte_edge_t next, uint16_t nb_objs)
> > +{
> > +	node = __rte_node_next_node_get(node, next);
> > +	const uint16_t idx = node->idx;
> > +	uint16_t free_space = node->size - idx;
> > +
> > +	if (unlikely(free_space < nb_objs))
> > +		__rte_node_stream_alloc_size(graph, node, nb_objs);
> > +
> > +	return &node->objs[idx];
> > +}
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Put the next stream to pending state in the circular buffer
> > + * for further processing. Should be invoked followed by
> > + * rte_node_next_stream_get().
>
> Is the last sentence correct?

I will reword it to "Should be invoked after rte_node_next_stream_get()".
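For reference, the intended get/put pairing in a source-style node looks
roughly like the sketch below. It assumes a single next edge (index 0),
a placeholder my_rx_burst() helper, an illustrative burst size, and that
rte_node_next_stream_put() takes the count of objects actually written —
my reading of the API, not text from this patch:

	#include <rte_graph_worker.h>

	#define MY_RX_BURST 32 /* illustrative burst size */

	static uint16_t my_rx_burst(void **objs, uint16_t nb_objs); /* placeholder */

	static uint16_t
	my_src_node_process(struct rte_graph *graph, struct rte_node *node,
			    void **objs, uint16_t nb_objs)
	{
		void **to_next;
		uint16_t count;

		/* Reserve space directly in the next node's stream (edge 0). */
		to_next = rte_node_next_stream_get(graph, node, 0, MY_RX_BURST);

		/* Fill the reserved stream with newly received objects. */
		count = my_rx_burst(to_next, MY_RX_BURST);

		/* Mark the filled portion pending so rte_graph_walk()
		 * picks up the next node on this walk. */
		rte_node_next_stream_put(graph, node, 0, count);

		return count;
	}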