DPDK patches and discussions
 help / color / mirror / Atom feed
From: Yongseok Koh <yskoh@mellanox.com>
To: Andrew Rybchenko <arybchenko@solarflare.com>
Cc: wenzhuo.lu@intel.com, jingjing.wu@intel.com,
	olivier.matz@6wind.com, dev@dpdk.org,
	konstantin.ananyev@intel.com, stephen@networkplumber.org,
	thomas@monjalon.net, adrien.mazarguil@6wind.com,
	nelio.laranjeiro@6wind.com
Subject: Re: [dpdk-dev] [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
Date: Thu, 26 Apr 2018 10:18:14 -0700	[thread overview]
Message-ID: <20180426171813.GA8437@yongseok-MBP.local> (raw)
In-Reply-To: <1a65f081-4c92-78d5-b00c-08e66fdef5c8@solarflare.com>

On Thu, Apr 26, 2018 at 07:05:01PM +0300, Andrew Rybchenko wrote:
> On 04/26/2018 04:10 AM, Yongseok Koh wrote:
> > This patch introduces a new way of attaching an external buffer to a mbuf.
> > 
> > Attaching an external buffer is quite similar to mbuf indirection in
> > replacing buffer addresses and length of a mbuf, but a few differences:
> >    - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> >      2 as long as the direct mbuf itself isn't freed after the attachment.
> >      In such cases, the buffer area of a direct mbuf must be read-only. But
> >      external buffer has its own refcnt and it starts from 1. Unless
> >      multiple mbufs are attached to a mbuf having an external buffer, the
> >      external buffer is writable.
> >    - There's no need to allocate buffer from a mempool. Any buffer can be
> >      attached with appropriate free callback.
> >    - Smaller metadata is required to maintain shared data such as refcnt.
> 
> I'm still unsure how reference counters for external buffer work.
> 
> Let's consider the following example:
> 
>                         |<-mbuf1 buf_len-->|<-mbuf2 buf_len-->|
> +--+-----------+----+--------------+----+--------------+---+- - -
> |  global      |head|mbuf1 data    |head|mbuf2 data    | |
> |  shinfo=2    |room| |room|              |   |
> +--+-----------+----+--------------+----+--------------+---+- - -
>                ^ ^
> +----------+   |      +----------+ |
> | mbuf1    +---+      | mbuf2    +-+
> | refcnt=1 |          | refcnt=1 |
> +----------+          +----------+
> 
> I.e. we have big buffer which is sliced into many small data
> buffers referenced from mbufs.
> 
> shinfo reference counter is used to control when big buffer
> may be freed. But what controls sharing of each block?
> 
> headroom and following mbuf data (buf_len) is owned by
> corresponding mbuf and the mbuf owner can do everything
> with the space (prepend data, append data, modify etc).
> I.e. it is read-write in above terminology.
> 
> What should happen if mbuf1 is cloned? Right now it will result
> in a new mbuf1a with reference counter 1 and incremented shinfo
> reference counter. And what does say that corresponding area
> is read-only now? It looks like nothing.
> 
> As I understand it should be solved using per data area shinfo
> which free callback decrements big buffer reference counter.

I have to admit that I was confused at the moment and I mixed two different
use-cases.

1) Transmitting a large storage block.

                |<--mbuf1 buf_len-->|<--mbuf2 buf_len-->|
 +--+-----------+----+--------------+----+--------------+---+- - -
 |  global      |head|mbuf1 data    |head|mbuf2 data    |   |
 |  shinfo=2    |room|              |room|              |   |
 +--+-----------+----+--------------+----+--------------+---+- - -
                ^                   ^
 +----------+   |      +----------+ |
 | mbuf1    +---+      | mbuf2    +-+
 | refcnt=1 |          | refcnt=1 |
 +----------+          +----------+
       ^                     ^
       |next                 |next
 +-----+----+          +----------+ 
 | mbuf1_hdr|          | mbuf2_hdr|
 | refcnt=1 |          | refcnt=1 |
 +----------+          +----------+

Yes, in this case, the large external buffer should always be read-only. And
necessary network headers should be linked via m->next. Owners of m1 or m2
can't alter any bit in the external buffer because shinfo->refcnt > 1.

2) Slicing a large buffer and provide r-w buffers.

                |<--mbuf1 buf_len-->|      |<--mbuf2 buf_len-->|
 +--+-----------+----+--------------+------+----+--------------+------+----+- - -
 |  user data   |head|mbuf1 data    |shinfo|head|mbuf2 data    |shinfo|    |
 |  refc=2      |room|              |refc=1|room|              |refc=1|    |
 +--+-----------+----+--------------+------+----+--------------+------+----+- - -
                ^                          ^
 +----------+   |             +----------+ |
 | mbuf1    +---+             | mbuf2    +-+
 | refcnt=1 |                 | refcnt=1 |
 +----------+                 +----------+
 
Here, the user data for the large chunk isn't rte_mbuf_ext_shared_info but a
custom structure managed by user in order to free the whole chunk. free_cb would
decrement a custom refcnt in custom way. But librte_mbuf doesn't need to be
aware of it.  It is user's responsibility. The library is just responsible for
calling free_cb when shinfo->refcnt gets to zero.

> So, we have two reference counters per each mbuf with external
> buffer (plus reference counter per big buffer).
> Two reference counters sounds too much and it looks like
> mbuf-with-extbuf reference counter is not really used
> (since on clone/attach update shinfo refcnt).
> It is still two counters to check on free.

Each refcnt implies whether it is r or r-w. Even for direct mbuf, if two users
are accessing it, refcnt is 2 and it is read-only. This should mean both
mbuf metadata and its data area are all read-only. Users can alter neither
various length fields nor packet data, for example. For non-direct mbufs, still
its refcnt should be used, but refcnt only represents the metadata is shared and
read-only if it is more than 1. So, refcnt of mbuf-with-extbuf is still used.
Incrementing refcnt means an entity acquired access to the object, including
cases of attachment (indirec/extbuf).

> Have you considered alternative approach to use mbuf refcnt
> as sharing indicator for extbuf data? However, in this case
> indirect referencing extbuf would logically look like:
> 
> +----------+    +--------+     +--------+
> | indirect +--->| extbuf +---->|  data  |
> |  mbuf    |    |  mbuf  |     |        |
> +----------+    +--------+     +--------+
> 
> It looks like it would allow to avoid two reference counters
> per data block as above. Personally I'm not sure which approach
> is better and would like to hear what you and other reviewers
> think about it.

So, I still think this patch is okay.

> Some minor notes below as well.
> 
> > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > ---
> > 
> > ** This patch can pass the mbuf_autotest. **
> > 
> > Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
> > will be submitted separately rebased on a differnet patchset which
> > accommodates new memory hotplug design to mlx PMDs.
> > 
> > v6:
> > * rte_pktmbuf_attach_extbuf() doesn't take NULL shinfo. Instead,
> >    rte_pktmbuf_ext_shinfo_init_helper() is added.
> > * bug fix in rte_pktmbuf_attach() - shinfo wasn't saved to mi.
> > * minor changes from review.
> > 
> > v5:
> > * rte_pktmbuf_attach_extbuf() sets headroom to 0.
> > * if shinfo is provided when attaching, user should initialize it.
> > * minor changes from review.
> > 
> > v4:
> > * rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
> >    user can pass memory for shared data via shinfo argument.
> > * minor changes from review.
> > 
> > v3:
> > * implement external buffer attachment instead of introducing buf_off for
> >    mbuf indirection.
> > 
> >   lib/librte_mbuf/rte_mbuf.h | 335 +++++++++++++++++++++++++++++++++++++++++----
> >   1 file changed, 306 insertions(+), 29 deletions(-)
> > 
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 43aaa9c5f..0a6885281 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
[...]
> >   /** Mbuf prefetch */
> >   #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
> >   	if ((m) != NULL)                        \
> > @@ -1213,11 +1307,157 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> >   }
> >   /**
> > + * Initialize shared data at the end of an external buffer before attaching
> > + * to a mbuf by ``rte_pktmbuf_attach_extbuf()``. This is not a mandatory
> > + * initialization but a helper function to simply spare a few bytes at the
> > + * end of the buffer for shared data. If shared data is allocated
> > + * separately, this should not be called but application has to properly
> > + * initialize the shared data according to its need.
> > + *
> > + * Free callback and its argument is saved and the refcnt is set to 1.
> > + *
> > + * @warning
> > + * buf_len must be adjusted to RTE_PTR_DIFF(shinfo, buf_addr) after this
> > + * initialization. For example,
> 
> May be buf_len should be inout and it should be done by the function?
> Just a question since current approach looks fragile.

Yeah, I thought about that but I didn't want to alter user's variable, I thought
it could be error-prone. Anyway either way is okay to me. Will wait for a day to
get input because I will send out a new version (hopefully last :-) to fix the
nit you mentioned below.

> > + *
> > + *   struct rte_mbuf_ext_shared_info *shinfo =
> > + *          rte_pktmbuf_ext_shinfo_init_helpfer(buf_addr, buf_len,
> > + *                                              free_cb, fcb_arg);
> > + *   buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> > + *   rte_pktmbuf_attach_extbuf(m, buf_addr, buf_iova, buf_len, shinfo);
> > + *
> > + * @param buf_addr
> > + *   The pointer to the external buffer.
> > + * @param buf_len
> > + *   The size of the external buffer. buf_len must be larger than the size
> > + *   of ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If
> > + *   not enough, this function will return NULL.
> > + * @param free_cb
> > + *   Free callback function to call when the external buffer needs to be
> > + *   freed.
> > + * @param fcb_opaque
> > + *   Argument for the free callback function.
> > + *
> > + * @return
> > + *   A pointer to the initialized shared data on success, return NULL
> > + *   otherwise.
> > + */
> > +static inline struct rte_mbuf_ext_shared_info *
> > +rte_pktmbuf_ext_shinfo_init_helper(void *buf_addr, uint16_t buf_len,
> > +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> > +{
> > +	struct rte_mbuf_ext_shared_info *shinfo;
> > +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> > +
> > +	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> > +				sizeof(*shinfo)), sizeof(uintptr_t));
> > +	if ((void *)shinfo <= buf_addr)
> > +		return NULL;
> > +
> > +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> > +
> > +	shinfo->free_cb = free_cb;
> > +	shinfo->fcb_opaque = fcb_opaque;
> 
> Just a nit, but I'd suggest to initialize in the same order as in the
> struct.
> (if there is no reasons why reference counter should be initialized first)

Will do.

Thanks,
Yongseok

  parent reply	other threads:[~2018-04-26 17:18 UTC|newest]

Thread overview: 86+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-10  1:25 [dpdk-dev] [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
2018-03-10  1:25 ` [dpdk-dev] [PATCH v1 1/6] mbuf: add buffer offset field for flexible indirection Yongseok Koh
2018-03-10  1:25 ` [dpdk-dev] [PATCH v1 2/6] net/mlx5: separate filling Rx flags Yongseok Koh
2018-03-10  1:25 ` [dpdk-dev] [PATCH v1 3/6] net/mlx5: add a function to rdma-core glue Yongseok Koh
2018-03-12  9:13   ` Nélio Laranjeiro
2018-03-10  1:25 ` [dpdk-dev] [PATCH v1 4/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
2018-03-12  9:20   ` Nélio Laranjeiro
2018-03-10  1:25 ` [dpdk-dev] [PATCH v1 5/6] net/mlx5: release Tx queue resource earlier than Rx Yongseok Koh
2018-03-10  1:25 ` [dpdk-dev] [PATCH v1 6/6] app/testpmd: conserve mbuf indirection flag Yongseok Koh
2018-04-02 18:50 ` [dpdk-dev] [PATCH v2 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
2018-04-02 18:50   ` [dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection Yongseok Koh
2018-04-03  8:26     ` Olivier Matz
2018-04-04  0:12       ` Yongseok Koh
2018-04-09 16:04         ` Olivier Matz
2018-04-10  1:59           ` Yongseok Koh
2018-04-11  0:25             ` Ananyev, Konstantin
2018-04-11  5:33               ` Yongseok Koh
2018-04-11 11:39                 ` Ananyev, Konstantin
2018-04-11 14:02                   ` Andrew Rybchenko
2018-04-11 17:18                     ` Yongseok Koh
2018-04-11 17:08                   ` Yongseok Koh
2018-04-12 16:34                     ` Ananyev, Konstantin
2018-04-12 18:58                       ` Yongseok Koh
2018-04-02 18:50   ` [dpdk-dev] [PATCH v2 2/6] net/mlx5: separate filling Rx flags Yongseok Koh
2018-04-02 18:50   ` [dpdk-dev] [PATCH v2 3/6] net/mlx5: add a function to rdma-core glue Yongseok Koh
2018-04-02 18:50   ` [dpdk-dev] [PATCH v2 4/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
2018-04-02 18:50   ` [dpdk-dev] [PATCH v2 5/6] net/mlx5: release Tx queue resource earlier than Rx Yongseok Koh
2018-04-02 18:50   ` [dpdk-dev] [PATCH v2 6/6] app/testpmd: conserve mbuf indirection flag Yongseok Koh
2018-04-19  1:11 ` [dpdk-dev] [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Yongseok Koh
2018-04-19  1:11   ` [dpdk-dev] [PATCH v3 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-23 11:53   ` [dpdk-dev] [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
2018-04-24  2:04     ` Yongseok Koh
2018-04-25 13:16       ` Ananyev, Konstantin
2018-04-25 16:44         ` Yongseok Koh
2018-04-25 18:05           ` Ananyev, Konstantin
2018-04-23 16:18   ` Olivier Matz
2018-04-24  1:29     ` Yongseok Koh
2018-04-24 15:36       ` Olivier Matz
2018-04-24  1:38 ` [dpdk-dev] [PATCH v4 " Yongseok Koh
2018-04-24  1:38   ` [dpdk-dev] [PATCH v4 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-24  5:01   ` [dpdk-dev] [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf Stephen Hemminger
2018-04-24 11:47     ` Yongseok Koh
2018-04-24 12:28   ` Andrew Rybchenko
2018-04-24 16:02     ` Olivier Matz
2018-04-24 18:21       ` [dpdk-dev] ***Spam*** " Andrew Rybchenko
2018-04-24 19:15         ` [dpdk-dev] " Olivier Matz
2018-04-24 20:22           ` Thomas Monjalon
2018-04-24 21:53             ` Yongseok Koh
2018-04-24 22:15               ` Thomas Monjalon
2018-04-25  8:21               ` Olivier Matz
2018-04-25 15:06             ` Stephen Hemminger
2018-04-24 23:34           ` Yongseok Koh
2018-04-25 14:45             ` Andrew Rybchenko
2018-04-25 17:40               ` Yongseok Koh
2018-04-25  8:28       ` Olivier Matz
2018-04-25  9:08         ` Yongseok Koh
2018-04-25  9:19           ` Yongseok Koh
2018-04-25 20:00             ` Olivier Matz
2018-04-25 22:54               ` Yongseok Koh
2018-04-24 22:30     ` Yongseok Koh
2018-04-25  2:53 ` [dpdk-dev] [PATCH v5 " Yongseok Koh
2018-04-25  2:53   ` [dpdk-dev] [PATCH v5 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-25 13:31   ` [dpdk-dev] [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
2018-04-25 17:06     ` Yongseok Koh
2018-04-25 17:23       ` Ananyev, Konstantin
2018-04-25 18:02         ` Yongseok Koh
2018-04-25 18:22           ` Yongseok Koh
2018-04-25 18:30             ` Yongseok Koh
2018-04-26  1:10 ` [dpdk-dev] [PATCH v6 " Yongseok Koh
2018-04-26  1:10   ` [dpdk-dev] [PATCH v6 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-26 11:39   ` [dpdk-dev] [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
2018-04-26 16:05   ` Andrew Rybchenko
2018-04-26 16:10     ` Thomas Monjalon
2018-04-26 19:42       ` Olivier Matz
2018-04-26 19:58         ` Thomas Monjalon
2018-04-26 20:07           ` Olivier Matz
2018-04-26 20:24             ` Thomas Monjalon
2018-04-26 17:18     ` Yongseok Koh [this message]
2018-04-26 19:45       ` Olivier Matz
2018-04-27  0:01 ` [dpdk-dev] [PATCH v7 " Yongseok Koh
2018-04-27  0:01   ` [dpdk-dev] [PATCH v7 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-27  8:00     ` Andrew Rybchenko
2018-04-27  7:22   ` [dpdk-dev] [PATCH v7 1/2] mbuf: support attaching external buffer to mbuf Andrew Rybchenko
2018-04-27 17:22 ` [dpdk-dev] [PATCH v8 " Yongseok Koh
2018-04-27 17:22   ` [dpdk-dev] [PATCH v8 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-27 18:09   ` [dpdk-dev] [PATCH v8 1/2] mbuf: support attaching external buffer to mbuf Thomas Monjalon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180426171813.GA8437@yongseok-MBP.local \
    --to=yskoh@mellanox.com \
    --cc=adrien.mazarguil@6wind.com \
    --cc=arybchenko@solarflare.com \
    --cc=dev@dpdk.org \
    --cc=jingjing.wu@intel.com \
    --cc=konstantin.ananyev@intel.com \
    --cc=nelio.laranjeiro@6wind.com \
    --cc=olivier.matz@6wind.com \
    --cc=stephen@networkplumber.org \
    --cc=thomas@monjalon.net \
    --cc=wenzhuo.lu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).