DPDK patches and discussions
From: Ankur Dwivedi <adwivedi@marvell.com>
To: Vladimir Medvedkin <medvedkinv@gmail.com>
Cc: "Medvedkin, Vladimir" <vladimir.medvedkin@intel.com>,
	"dev@dpdk.org" <dev@dpdk.org>, Jerin Jacob <jerinj@marvell.com>,
	Nithin Kumar Dabilpuram <ndabilpuram@marvell.com>,
	Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com>,
	Sunil Kumar Kori <skori@marvell.com>,
	Rakesh Kudurumalla <rkudurumalla@marvell.com>
Subject: RE: [EXTERNAL] Re: [PATCH v1 04/12] node: add process callback for IP4 FIB
Date: Wed, 23 Apr 2025 09:46:48 +0000
Message-ID: <BY1PR18MB6109AF4D383E22F53030A679DDBA2@BY1PR18MB6109.namprd18.prod.outlook.com>
In-Reply-To: <CANDrEHnPsoiNwwiuCursBKoqGo+e0EXSg20sg5oS4KwwNrdB5A@mail.gmail.com>



Hi Vladimir,

>Hi Ankur,
>
>пт, 18 апр. 2025 г. в 15:45, Ankur Dwivedi <adwivedi@marvell.com
><mailto:adwivedi@marvell.com> >:
>
>
>
>	Hi Vladimir,
>	>> diff --git a/lib/node/ip4_lookup_fib.c b/lib/node/ip4_lookup_fib.c
>	>> index e87864e672..c535b191f8 100644
>	>> --- a/lib/node/ip4_lookup_fib.c
>	>> +++ b/lib/node/ip4_lookup_fib.c
>	>> @@ -40,6 +40,169 @@ static struct ip4_lookup_fib_node_main ip4_lookup_fib_nm;
>	>>   #define IP4_LOOKUP_NODE_PRIV1_OFF(ctx) \
>	>>      (((struct ip4_lookup_fib_node_ctx *)ctx)->mbuf_priv1_off)
>	>>
>	>> +static uint16_t
>	>> +ip4_lookup_fib_node_process(struct rte_graph *graph, struct rte_node *node, void **objs,
>	>> +                        uint16_t nb_objs)
>	>> +{
>	>> +    struct rte_mbuf *mbuf0, *mbuf1, *mbuf2, *mbuf3, **pkts;
>	>> +    struct rte_fib *fib = IP4_LOOKUP_NODE_FIB(node->ctx);
>	>> +    const int dyn = IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx);
>	>> +    struct rte_ipv4_hdr *ipv4_hdr;
>	>> +    uint64_t next_hop[nb_objs];
>	>> +    uint16_t lookup_err = 0;
>	>> +    void **to_next, **from;
>	>> +    uint16_t last_spec = 0;
>	>> +    rte_edge_t next_index;
>	>> +    uint16_t n_left_from;
>	>> +    uint32_t ip[nb_objs];
>	>> +    uint16_t held = 0;
>	>> +    uint32_t drop_nh;
>	>> +    uint16_t next;
>	>> +    int i, rc;
>	>> +
>	>> +    /* Speculative next */
>	>> +    next_index = RTE_NODE_IP4_LOOKUP_NEXT_REWRITE;
>	>> +    /* Drop node */
>	>> +    drop_nh = ((uint32_t)RTE_NODE_IP4_LOOKUP_NEXT_PKT_DROP) << 16;
>	>> +
>	>> +    pkts = (struct rte_mbuf **)objs;
>	>> +    from = objs;
>	>> +    n_left_from = nb_objs;
>	>> +
>	>> +    /* Get stream for the speculated next node */
>	>> +    to_next = rte_node_next_stream_get(graph, node, next_index, nb_objs);
>	>> +
>	>> +    for (i = OBJS_PER_CLINE; i < RTE_GRAPH_BURST_SIZE; i += OBJS_PER_CLINE)
>	>> +            rte_prefetch0(&objs[i]);
>	>
>	>Does this prefetching loop make any sense? Unless objs are passed across
>	>threads, this array is likely to be in the cache already.
>	>
>	>And if objs are passed across threads, then why do you start prefetching
>	>from the next cache line instead of the first, and why don't you stop at
>	>nb_objs?
>	For example, if the cache line size is 64 bytes, then there will be 8 pointers to
>mbuf per cache line (if the pointer size is 8 bytes). If nb_objs is 256, then the
>remaining pointers need to be prefetched into cache. This loop prefetches
>all nb_objs pointers into cache.
>
>
>
>My comment was about necessity. I'll rephrase and enumerate my questions
>to be clearer:
>1. Are the mbuf pointers contained in the objs array not in cache? My assumptions
>here are:
>    - If you run the graph in RTC mode, the objs array is very likely already in your L1
>cache (since some previous node/nodes just put packets there)

Yes, in RTC mode the objs are likely to be in the L1 cache already. But if they are not, the above loop prefetches them.
>    - In dispatch mode it doesn't make much sense to run this lookup node in
>another thread separately from the remaining nodes processing IPv4

If it is run on another thread/lcore, the above prefetching will help.
>
>2. If you still think this prefetching loop is required here, then:
>
>
>	The loop can be changed to stop at nb_objs.
>	for (i = OBJS_PER_CLINE; i < nb_objs; i += OBJS_PER_CLINE) { }
>
>
>
>  why do you start from OBJS_PER_CLINE (the second cache line of the objs array)?
Because the first cache line is prefetched in __rte_node_process(), which is called from rte_graph_walk().

>3. "i < nb_objs". Should be "i < RTE_ALIGN_CEIL(nb_objs, OBJS_PER_CLINE) / OBJS_PER_CLINE"

In that loop, i would be incremented by 1. The for loop I mentioned above does the same thing, but i is incremented by OBJS_PER_CLINE, so stopping at nb_objs still issues one prefetch per cache line.
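To make the discussed bounds concrete, here is a standalone sketch of that loop. This is an illustration only: OBJS_PER_CLINE is hardcoded to 8 (a 64-byte cache line holding 8-byte pointers, as in the example above), and rte_prefetch0() is replaced by a counting stub so the loop's coverage can be checked.

```c
#include <assert.h>

/* Assumed value for illustration: 64-byte cache line / 8-byte pointer. */
#define OBJS_PER_CLINE 8

static int prefetches_issued;

/* Counting stub standing in for rte_prefetch0(). */
static void stub_prefetch0(const void *p)
{
	(void)p;
	prefetches_issued++;
}

/* The loop as proposed: start from the second cache line (the first is
 * already prefetched by __rte_node_process()) and stop at nb_objs
 * instead of RTE_GRAPH_BURST_SIZE. Returns how many cache lines of
 * pointers were prefetched. */
static int prefetch_obj_ptrs(void **objs, unsigned int nb_objs)
{
	unsigned int i;

	prefetches_issued = 0;
	for (i = OBJS_PER_CLINE; i < nb_objs; i += OBJS_PER_CLINE)
		stub_prefetch0(&objs[i]);
	return prefetches_issued;
}
```

With nb_objs = 256 this issues 31 prefetches (32 cache lines of pointers, minus the first one already fetched by the caller), and with nb_objs <= 8 it issues none.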
>
>
>	>
>	>> +
>	>> +#if RTE_GRAPH_BURST_SIZE > 64
>	>> +    for (i = 0; i < 4 && i < n_left_from; i++) {
>	>> +            rte_prefetch0(pkts[i]);
>	>> +            rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
>	>> +                                    sizeof(struct rte_ether_hdr)));
>	>
>	>This construction does not make sense to me. Same for the similar
>	>constructions below.
>	>
>	>The second prefetch has a memory dependency on the data that will
>	>be prefetched by the first one. Does removing this prefetch affect
>	>performance?
>	The first prefetches the data at the mbuf address. The second prefetch is
>for the address containing the ipv4 header. The two can be in separate cache lines, as
>the ipv4 header starts at mbuf->buf_addr + mbuf->data_off + sizeof(struct rte_ether_hdr).
>data_off is generally 64 bytes or 128 bytes depending on the cache line size.
>	>
>
>
>
>Indeed, both of them are in separate cache lines. But my point here is about
>data dependency for the second prefetch. In order to issue the prefetch
>instruction for the cache line containing the ipv4 header you must know the
>address. You calculate this address by:
>rte_pktmbuf_mtod_offset(pkts[i], void *, sizeof(struct rte_ether_hdr))
>rte_pktmbuf_mtod_offset accesses mbuf->buf_addr and mbuf->data_off,
>as you mentioned. But these fields are not in your L1 cache at the moment,
>because you just asked to prefetch them with the previous instruction.
Ack.
>So my suggestion here would be to look at how it is done in
>ipv4_lookup_sse.c:ip4_lookup_node_process_vec()
>Simplifying, in the run loop it prefetches mbuf + 2, then prefetches the CL with the v4
>header of mbuf + 1 (assuming in a previous iteration the mbuf CL was
>already fetched into L1), and finally does the processing of mbuf + 0 (again,
>assuming in the previous iteration the ipv4 CL was fetched).
I have seen that implementation. The code looks fine to me. I will try to add similar code to the FIB lookup.
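For the archive, a minimal sketch of that pipelined pattern (toy types and a stubbed prefetch, not the actual DPDK code): while packet i is processed, the mbuf struct of i + 2 and the header cache line of i + 1 are prefetched, so each access finds its data fetched by an earlier iteration.

```c
#include <stdint.h>

/* Toy stand-in for struct rte_mbuf; only the field the "header" access
 * needs. Illustration only. */
struct toy_mbuf {
	uint32_t dst_addr;
};

/* No-op stub standing in for rte_prefetch0(). */
static void stub_prefetch0(const void *p) { (void)p; }

/* Simplified form of the ip4_lookup_node_process_vec() pipelining:
 * prefetch the mbuf struct two packets ahead and the header cache line
 * one packet ahead, then process the current packet (here, just sum the
 * destination addresses as a placeholder for the lookup). */
static uint64_t process_burst(struct toy_mbuf **pkts, uint16_t n)
{
	uint64_t sum = 0;
	uint16_t i;

	for (i = 0; i < n; i++) {
		if (i + 2 < n)
			stub_prefetch0(pkts[i + 2]);            /* mbuf struct of i + 2 */
		if (i + 1 < n)
			stub_prefetch0(&pkts[i + 1]->dst_addr); /* header CL of i + 1 */
		sum += pkts[i]->dst_addr;                       /* process packet i */
	}
	return sum;
}
```

The point of the stagger is that each prefetch is issued from data that a previous iteration already brought into L1, avoiding the back-to-back dependent prefetches discussed above.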
>
>
>	>> +    }
>	>> +#endif
>	>> +
>	>> +    i = 0;
>	>> +    while (n_left_from >= 4) {
>	>> +#if RTE_GRAPH_BURST_SIZE > 64
>	>> +            if (likely(n_left_from > 7)) {
>	>> +                    rte_prefetch0(pkts[4]);
>	>> +                    rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[4], void *,
>	>> +                                    sizeof(struct rte_ether_hdr)));
>	>> +                    rte_prefetch0(pkts[5]);
>	>> +                    rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[5], void *,
>	>> +                                    sizeof(struct rte_ether_hdr)));
>	>> +                    rte_prefetch0(pkts[6]);
>	>> +                    rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[6], void *,
>	>> +                                    sizeof(struct rte_ether_hdr)));
>	>> +                    rte_prefetch0(pkts[7]);
>	>> +                    rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[7], void *,
>	>> +                                    sizeof(struct rte_ether_hdr)));
>	>> +            }
>	>> +#endif
>	>> +
>	>> +            mbuf0 = pkts[0];
>	>> +            mbuf1 = pkts[1];
>	>> +            mbuf2 = pkts[2];
>	>> +            mbuf3 = pkts[3];
>	>> +            pkts += 4;
>	>> +            n_left_from -= 4;
>	>> +            /* Extract DIP of mbuf0 */
>	>> +            ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf0, struct rte_ipv4_hdr *,
>	>> +                            sizeof(struct rte_ether_hdr));
>	>> +            /* Extract cksum, ttl as ipv4 hdr is in cache */
>	>> +            node_mbuf_priv1(mbuf0, dyn)->cksum = ipv4_hdr->hdr_checksum;
>	>> +            node_mbuf_priv1(mbuf0, dyn)->ttl = ipv4_hdr->time_to_live;
>	>> +
>	>> +            ip[i++] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
>	>> +
>	>> +            /* Extract DIP of mbuf1 */
>	>> +            ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf1, struct rte_ipv4_hdr *,
>	>> +                            sizeof(struct rte_ether_hdr));
>	>> +            /* Extract cksum, ttl as ipv4 hdr is in cache */
>	>> +            node_mbuf_priv1(mbuf1, dyn)->cksum = ipv4_hdr->hdr_checksum;
>	>> +            node_mbuf_priv1(mbuf1, dyn)->ttl = ipv4_hdr->time_to_live;
>	>> +
>	>> +            ip[i++] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
>	>> +
>	>> +            /* Extract DIP of mbuf2 */
>	>> +            ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf2, struct rte_ipv4_hdr *,
>	>> +                            sizeof(struct rte_ether_hdr));
>	>> +            /* Extract cksum, ttl as ipv4 hdr is in cache */
>	>> +            node_mbuf_priv1(mbuf2, dyn)->cksum = ipv4_hdr->hdr_checksum;
>	>> +            node_mbuf_priv1(mbuf2, dyn)->ttl = ipv4_hdr->time_to_live;
>	>> +
>	>> +            ip[i++] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
>	>> +
>	>> +            /* Extract DIP of mbuf3 */
>	>> +            ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf3, struct rte_ipv4_hdr *,
>	>> +                            sizeof(struct rte_ether_hdr));
>	>> +
>	>> +            /* Extract cksum, ttl as ipv4 hdr is in cache */
>	>> +            node_mbuf_priv1(mbuf3, dyn)->cksum = ipv4_hdr->hdr_checksum;
>	>> +            node_mbuf_priv1(mbuf3, dyn)->ttl = ipv4_hdr->time_to_live;
>	>> +
>	>> +            ip[i++] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
>	>> +    }
>	>> +    while (n_left_from > 0) {
>	>> +            mbuf0 = pkts[0];
>	>> +            pkts += 1;
>	>> +            n_left_from -= 1;
>	>> +
>	>> +            /* Extract DIP of mbuf0 */
>	>> +            ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf0, struct rte_ipv4_hdr *,
>	>> +                            sizeof(struct rte_ether_hdr));
>	>> +            /* Extract cksum, ttl as ipv4 hdr is in cache */
>	>> +            node_mbuf_priv1(mbuf0, dyn)->cksum = ipv4_hdr->hdr_checksum;
>	>> +            node_mbuf_priv1(mbuf0, dyn)->ttl = ipv4_hdr->time_to_live;
>	>> +
>	>> +            ip[i++] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
>	>> +    }
>	>> +
>	>> +    rc = rte_fib_lookup_bulk(fib, ip, next_hop, nb_objs);
>	>> +    if (unlikely(rc != 0))
>	>> +            return 0;
>	>> +
>	>> +    for (i = 0; i < nb_objs; i++) {
>	>> +            if (unlikely(next_hop[i] == FIB_DEFAULT_NH)) {
>	>> +                    next_hop[i] = drop_nh;
>	>maybe it is worth just defining FIB_DEFAULT_NH as the drop_nh value
>	>to omit these next_hop reassignments?
>	Ack.
>	>> +                    lookup_err += 1;
>	>> +            }
>	>> +
>	>> +            mbuf0 = (struct rte_mbuf *)objs[i];
>	>> +            node_mbuf_priv1(mbuf0, dyn)->nh = (uint16_t)next_hop[i];
>	>> +            next = (uint16_t)(next_hop[i] >> 16);
>	>> +
>	>> +            if (unlikely(next_index ^ next)) {
>	>> +                    /* Copy things successfully speculated till now */
>	>> +                    rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
>	>> +                    from += last_spec;
>	>> +                    to_next += last_spec;
>	>> +                    held += last_spec;
>	>> +                    last_spec = 0;
>	>> +
>	>> +                    rte_node_enqueue_x1(graph, node, next, from[0]);
>	>> +                    from += 1;
>	>> +            } else {
>	>> +                    last_spec += 1;
>	>> +            }
>	>> +    }
>	>> +
>	>> +    /* !!! Home run !!! */
>	>> +    if (likely(last_spec == nb_objs)) {
>	>> +            rte_node_next_stream_move(graph, node, next_index);
>	>> +            return nb_objs;
>	>> +    }
>	>> +
>	>> +    NODE_INCREMENT_XSTAT_ID(node, 0, lookup_err != 0, lookup_err);
>	>> +    held += last_spec;
>	>> +    rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
>	>> +    rte_node_next_stream_put(graph, node, next_index, held);
>	>> +
>	>> +    return nb_objs;
>	>> +}
>	>> +
>	>> RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_node_ip4_fib_route_add, 25.07)
>	>>   int
>	>>   rte_node_ip4_fib_route_add(uint32_t ip, uint8_t depth, uint16_t next_hop,
>	>>
>	>> @@ -147,6 +310,7 @@ static struct rte_node_xstats ip4_lookup_fib_xstats = {
>	>>   };
>	>>
>	>>   static struct rte_node_register ip4_lookup_fib_node = {
>	>> +    .process = ip4_lookup_fib_node_process,
>	>>      .name = "ip4_lookup_fib",
>	>>
>	>>      .init = ip4_lookup_fib_node_init,
>	>
>	>--
>	>Regards,
>	>Vladimir
>
>
>
>
>
>--
>
>Regards,
>
>Vladimir

Regards,
Ankur



Thread overview: 22+ messages
2025-04-15 12:10 [PATCH v1 00/12] add lookup fib nodes in graph library Ankur Dwivedi
2025-04-15 12:10 ` [PATCH v1 01/12] fib: move macro to header file Ankur Dwivedi
2025-04-15 12:10 ` [PATCH v1 02/12] node: add IP4 lookup FIB node Ankur Dwivedi
2025-04-16  7:32   ` Nitin Saxena
2025-04-16 10:26     ` [EXTERNAL] " Ankur Dwivedi
2025-04-16  9:34   ` Medvedkin, Vladimir
2025-04-16 10:07     ` [EXTERNAL] " Ankur Dwivedi
2025-04-15 12:10 ` [PATCH v1 03/12] node: add IP4 FIB route add Ankur Dwivedi
2025-04-15 12:10 ` [PATCH v1 04/12] node: add process callback for IP4 FIB Ankur Dwivedi
2025-04-16  7:54   ` Nitin Saxena
2025-04-16 12:54   ` Medvedkin, Vladimir
2025-04-18  7:38     ` [EXTERNAL] " Ankur Dwivedi
2025-04-19 18:14       ` Vladimir Medvedkin
2025-04-23  9:46         ` Ankur Dwivedi [this message]
2025-04-15 12:10 ` [PATCH v1 05/12] node: add next node in packet classification Ankur Dwivedi
2025-04-15 12:10 ` [PATCH v1 06/12] app/graph: add IP4 lookup mode command Ankur Dwivedi
2025-04-15 12:10 ` [PATCH v1 07/12] fib: move macro to header file Ankur Dwivedi
2025-04-15 12:10 ` [PATCH v1 08/12] node: add IP6 lookup FIB node Ankur Dwivedi
2025-04-15 12:10 ` [PATCH v1 09/12] node: add IP6 FIB route add Ankur Dwivedi
2025-04-15 12:10 ` [PATCH v1 10/12] node: add process callback for IP6 FIB Ankur Dwivedi
2025-04-15 12:10 ` [PATCH v1 11/12] node: add next node in packet classification Ankur Dwivedi
2025-04-15 12:10 ` [PATCH v1 12/12] app/graph: add IP6 lookup mode command Ankur Dwivedi
