DPDK patches and discussions
From: Francois Ozog <francois.ozog@linaro.org>
To: Tiwei Bie <tiwei.bie@intel.com>, dev <dev@dpdk.org>
Cc: Alejandro Lucero <alejandro.lucero@netronome.com>,
	 "Liang, Cunming" <cunming.liang@intel.com>,
	Bruce Richardson <bruce.richardson@intel.com>,
	 Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	brouer@redhat.com
Subject: Re: [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK
Date: Wed, 10 Apr 2019 12:02:28 +0200
Message-ID: <CAHFG_=W5eX2UiiGfZzeL+12iGG2juSAJ9wyLneAhy4tDBhqMcw@mail.gmail.com> (raw)
Message-ID: <20190410100228.ruk6AiNj5ChKtCapZOMUDUaanASQcWSho8wVK-a4WzY@z> (raw)
In-Reply-To: <20190408093601.GA12313@dpdk-tbie.sh.intel.com>

Hi all,

I presented an approach at FOSDEM
(https://archive.fosdem.org/2018/schedule/event/netmdev/) and am
happy that someone is picking it up.

If we step back a little, the mdev concept is to give userland direct
control over the hardware data path of a device that is still
controlled by the kernel.
From a code base perspective, this can shrink PMD code size by a
significant amount: only 10% of the PMD code is actual data path, the
rest being device control!
The concept is perfect for DPDK, SPDK and many other scenarios (AI
accelerators).
Should the work be triggered by the DPDK community, it should be
applicable to a broader set of communities: SPDK, VPP, ODP, AF_XDP...

We bumped into many complexities when sharing a device between kernel
and userland, particularly when a single PCI device controls two ports.
So let's assume we try to solve a subset of the cases: coherent IO
memory and a dedicated PCI space (by whatever mechanism) per port.

What are the "things to solve"?

1) enumeration: enumerating and "capturing" an mdev device (which is
what the patch does, I assume)
2) bifurcation: designating the queues to capture in userland (possibly
all of them) with a hardware-driven rule (flow director or something
more generic)
3) memory management: dealing with rings and buffer management on the
rx and tx paths
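
As an illustration of item 1, here is a minimal sketch of the
enumeration step. It is not the code from the patch. It assumes the
Linux sysfs layout <devices_dir>/<uuid>/mdev_type/device_api and takes
the directory as a parameter so it can be pointed at a test tree. The
function names scan_mdev and select_by_api are made up for this
example.

```python
import os


def scan_mdev(devices_dir="/sys/bus/mdev/devices"):
    """Return {uuid: device_api} for every mdev device under devices_dir.

    Assumes the sysfs layout <devices_dir>/<uuid>/mdev_type/device_api.
    """
    found = {}
    if not os.path.isdir(devices_dir):
        return found  # no mdev bus visible on this system
    for uuid in sorted(os.listdir(devices_dir)):
        api_path = os.path.join(devices_dir, uuid, "mdev_type", "device_api")
        try:
            with open(api_path) as f:
                found[uuid] = f.read().strip()
        except OSError:
            continue  # entry vanished or exposes no device_api attribute
    return found


def select_by_api(devices, wanted="vfio-pci"):
    """Keep only the devices whose device_api equals `wanted`."""
    return [uuid for uuid, api in devices.items() if api == wanted]
```

A framework would then hand each selected UUID to the driver that
handles that device-api, which is the "capture" part.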

The bifurcation can be as simple as "all queues in userland", or quite
rich: TCP port 80 goes to userland while the rest (ICMP...) goes to the
kernel. If the kernel gets some of the traffic, there will be a routing
information sharing problem to solve. We ran a few experiments here.
The conclusion is that it's doable, but many corner cases make it a lot
of work. And it would be nice if the queue selection could be made very
generic (and not tied to flow director).
Let's say this is for further study for now.
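
To make the two ends of that spectrum concrete, here is a small
software model of a bifurcation rule table. In practice the matching
would be done by the NIC. The Rule type, the steer function and the
first-match-wins policy are assumptions made for this sketch, not an
existing DPDK or kernel API.

```python
from dataclasses import dataclass
from typing import Optional

TCP, ICMP = 6, 1  # IP protocol numbers


@dataclass(frozen=True)
class Rule:
    target: str                     # "userland" or "kernel"
    ip_proto: Optional[int] = None  # None means "any protocol"
    dst_port: Optional[int] = None  # None means "any port"

    def matches(self, ip_proto, dst_port):
        return ((self.ip_proto is None or self.ip_proto == ip_proto) and
                (self.dst_port is None or self.dst_port == dst_port))


def steer(rules, ip_proto, dst_port=None, default="kernel"):
    """Return the target of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(ip_proto, dst_port):
            return rule.target
    return default


# Simple case: every packet (hence every queue) goes to userland.
ALL_TO_USERLAND = [Rule("userland")]
# Rich case: TCP port 80 goes to userland, the rest stays in the kernel.
WEB_TO_USERLAND = [Rule("userland", ip_proto=TCP, dst_port=80)]
```

With WEB_TO_USERLAND, steer(rules, TCP, 80) selects userland while ICMP
and other TCP ports fall through to the kernel default. That second
case is the one that raises the routing information sharing problem.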

Let's focus on memory management of VFIO exposed devices.
I haven't refreshed my knowledge of the VFIO framework, so you may want
to correct a few points...
First of all, DPDK is made to switch packets, particularly between
ports.
With VFIO, this means all devices are in the same IOVA space, which
is tricky to implement in the kernel.
There are a few strategies to do that, all requiring significant mdev
extensions and more probably a kernel infrastructure change. The good
news is that it can be done in such a way that only selected drivers
implement the change, without requiring all drivers to be touched.
Another big question is: does the kernel allocate the memory and
userland gets a map to it, or does userland allocate the memory
and the kernel just maintains the IOVA mapping?
I would favor kernel allocation, with userland getting a map to it (in
the unified IOVA space). One reason is that the memory allocation
strategy can be very different from hardware to hardware:
- driver allocates packet buffers and populates a single ring of
packets per queue
- driver allocates packet buffers of different sizes and populates
multiple rings per queue (for instance rings of 128, 256, 1024, 2048
byte arrays per queue)
- driver allocates an unstructured memory area (say 32MB) and gives it
to hardware (no prepopulation of rings).
So the userland framework (DPDK, SPDK, ODP, VPP, AF_XDP,
proprietary...) can just query the kernel driver for queues and rings,
since that driver knows what has to be done for the device. The
userland framework just has to create the relevant objects (queues,
rings, packet buffers) according to the information provided by the
kernel.
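
A rough sketch of that last step, under stated assumptions: the kernel
has already allocated the memory and mapped it into userland (modelled
here by a plain bytearray), and a hypothetical query has returned a
per-queue layout. RingInfo, QueueInfo and build_queue are invented
names. No such query interface exists today.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RingInfo:
    offset: int    # byte offset of the first buffer in the mapped region
    entries: int   # number of packet buffers in the ring
    buf_size: int  # size of each packet buffer in bytes


@dataclass
class QueueInfo:
    rings: List[RingInfo] = field(default_factory=list)
    area_offset: int = 0  # start of an unstructured area, if any
    area_bytes: int = 0   # 0 when the hardware uses prepopulated rings


def build_queue(region, info):
    """Create userland views over kernel-owned memory; nothing is copied."""
    view = memoryview(region)
    rings = []
    for r in info.rings:
        if r.offset + r.entries * r.buf_size > len(view):
            raise ValueError("ring does not fit in the mapped region")
        rings.append([view[r.offset + i * r.buf_size:
                           r.offset + (i + 1) * r.buf_size]
                      for i in range(r.entries)])
    if info.area_offset + info.area_bytes > len(view):
        raise ValueError("area does not fit in the mapped region")
    area = view[info.area_offset:info.area_offset + info.area_bytes]
    return rings, area
```

The same function covers the three strategies: one RingInfo for the
single-ring case, several RingInfo entries with different buf_size
values for the multi-ring case, and an empty ring list with a non-zero
area_bytes for the unstructured case.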

Exposing VFIO devices to DPDK and other frameworks is a major topic,
and I suggest that, while enumeration is being done, a broader
discussion on the data path itself takes place.
The data path discussion is about memory management (above) and packet
descriptors. Exposing hardware-dependent structures to userland is
not the most widely accepted wisdom.
So I would rather assume that hardware natively produces hardware-,
vendor- and OS-independent descriptors. Candidates can be: DPDK mbuf,
VPP vlib_buf or virtio 1.1. I would favor a packet descriptor that
supports a combination of inline offloads (VxLAN + IPSec + TSO...): if
virtio 1.1 could be extended with some DPDK mbuf fields, that would be
perfect ;-) That looks like science fiction, but I know that on some
smartNICs and other hardware, the hardware-produced packet descriptor
format can be flexible...
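
Purely as an illustration of "a descriptor that supports a combination
of inline offloads", here is a toy 16-byte layout. It is neither the
virtio 1.1 descriptor nor rte_mbuf. The field order, the flag values
and the names are all invented for this example.

```python
import enum
import struct


class Offload(enum.IntFlag):
    VXLAN = 1 << 0
    IPSEC = 1 << 1
    TSO = 1 << 2


# Hypothetical little-endian layout:
# u64 buffer address, u32 length, u16 offload flags, u16 TSO segment size
DESC = struct.Struct("<QIHH")


def pack_desc(addr, length, offloads, tso_segsz=0):
    """Encode one descriptor; offloads may combine several flags."""
    return DESC.pack(addr, length, int(offloads), tso_segsz)


def unpack_desc(raw):
    """Decode a descriptor back into (addr, length, offloads, tso_segsz)."""
    addr, length, flags, tso_segsz = DESC.unpack(raw)
    return addr, length, Offload(flags), tso_segsz
```

The point is only that the offloads are independent bits of one field,
so VxLAN + IPSec + TSO can be requested on the same packet instead of
being mutually exclusive descriptor types.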

Cheers

FF



On Mon, 8 Apr 2019 at 11:36, Tiwei Bie <tiwei.bie@intel.com> wrote:
>
> On Mon, Apr 08, 2019 at 09:44:07AM +0100, Alejandro Lucero wrote:
> > On Wed, Apr 3, 2019 at 8:19 AM Tiwei Bie <tiwei.bie@intel.com> wrote:
> > > Hi everyone,
> > >
> > > This is a draft implementation of the mdev (Mediated device [1])
> > > bus support in DPDK. Mdev is a way to virtualize devices in Linux
> > > kernel. Based on the device-api (mdev_type/device_api), there could
> > > be different types of mdev devices (e.g. vfio-pci). In this RFC,
> > > one mdev bus is introduced to scan the mdev devices in the system
> > > and do the probe based on the device-api.
> > >
> > > Take the mdev devices whose device-api is "vfio-pci" as an example,
> > > in this RFC, these devices will be probed by a mdev driver provided
> > > by PCI bus, which will plug them to the PCI bus. And they will be
> > > probed with the drivers registered on the PCI bus based on VendorID/
> > > DeviceID/... then.
> > >
> > >                      +----------+
> > >                      | mdev bus |
> > >                      +----+-----+
> > >                           |
> > >          +----------------+----+------+------+
> > >          |                     |      |      |
> > >    mdev_vfio_pci               ......
> > > (device-api: vfio-pci)
> > >
> > > There are also other ways to add mdev device support in DPDK (e.g.
> > > let PCI bus scan /sys/bus/mdev/devices directly). Comments would be
> > > appreciated!
> >
> > Hi Tiwei,
> >
> > Thanks for the patchset. I was close to send a patchset with the same mdev
> > support, but I'm glad to see your patchset first because I think it is
> > interesting to see another view of how to implemented this.
> >
> > After going through your patch I was a bit confused about how the mdev device
> > to mdev driver match was done. But then I realized the approach you are
> > following is different to my implementation, likely due to having different
> > purposes. If I understand the idea behind, you want to have same PCI PMD
> > drivers working with devices, PCI devices, created from mediated devices.
>
> Exactly!
>
> > That
> > is the reason there is just one mdev driver, the one for vfio-pci mediated
> > devices type.
> >
> > My approach was different and I though having specific PMD mdev support was
> > necessary, with the PMD requiring to register a mdev driver. I can see, after
> > reading your patch, it can be perfectly possible to have the same PMDs for
> > "pure" PCI devices and PCI devices made from mediated devices, and if the PMD
> > requires to do something different due to the mediated devices intrinsics, then
> > explicitly supporting that per PMD. I got specific ioctl calls between the PMD
> > and the mediating driver but this can also be done with your approach.
> >
> > I'm working on having a mediated PF, what is a different purpose than the Intel
> > scalable I/O idea, so I will merge this patchset with my code and see if it
> > works.
>
> Cool! Thanks!
>
> >
> > Thanks!
> >
> >
> > > [1] https://github.com/torvalds/linux/blob/master/Documentation/vfio-mediated-device.txt
> > >
> > > Thanks,
> > > Tiwei
> > >
> > > Tiwei Bie (3):
> > >   eal: add a helper for reading string from sysfs
> > >   bus/mdev: add mdev bus support
> > >   bus/pci: add mdev support
> > >
> > >  config/common_base                        |   5 +
> > >  config/common_linux                       |   1 +
> > >  drivers/bus/Makefile                      |   1 +
> > >  drivers/bus/mdev/Makefile                 |  41 +++
> > >  drivers/bus/mdev/linux/Makefile           |   6 +
> > >  drivers/bus/mdev/linux/mdev.c             | 117 ++++++++
> > >  drivers/bus/mdev/mdev.c                   | 310 ++++++++++++++++++++++
> > >  drivers/bus/mdev/meson.build              |  15 ++
> > >  drivers/bus/mdev/private.h                |  90 +++++++
> > >  drivers/bus/mdev/rte_bus_mdev.h           | 141 ++++++++++
> > >  drivers/bus/mdev/rte_bus_mdev_version.map |  12 +
> > >  drivers/bus/meson.build                   |   2 +-
> > >  drivers/bus/pci/Makefile                  |   3 +
> > >  drivers/bus/pci/linux/Makefile            |   4 +
> > >  drivers/bus/pci/linux/pci_vfio.c          |  35 ++-
> > >  drivers/bus/pci/linux/pci_vfio_mdev.c     | 305 +++++++++++++++++++++
> > >  drivers/bus/pci/meson.build               |   4 +-
> > >  drivers/bus/pci/pci_common.c              |  17 +-
> > >  drivers/bus/pci/private.h                 |   9 +
> > >  drivers/bus/pci/rte_bus_pci.h             |  11 +-
> > >  lib/librte_eal/common/eal_filesystem.h    |   7 +
> > >  lib/librte_eal/freebsd/eal/eal.c          |  22 ++
> > >  lib/librte_eal/linux/eal/eal.c            |  22 ++
> > >  lib/librte_eal/rte_eal_version.map        |   1 +
> > >  mk/rte.app.mk                             |   1 +
> > >  25 files changed, 1163 insertions(+), 19 deletions(-)
> > >  create mode 100644 drivers/bus/mdev/Makefile
> > >  create mode 100644 drivers/bus/mdev/linux/Makefile
> > >  create mode 100644 drivers/bus/mdev/linux/mdev.c
> > >  create mode 100644 drivers/bus/mdev/mdev.c
> > >  create mode 100644 drivers/bus/mdev/meson.build
> > >  create mode 100644 drivers/bus/mdev/private.h
> > >  create mode 100644 drivers/bus/mdev/rte_bus_mdev.h
> > >  create mode 100644 drivers/bus/mdev/rte_bus_mdev_version.map
> > >  create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c
> > >
> > > --
> > > 2.17.1
> >
> >



--
François-Frédéric Ozog | Director Linaro Edge & Fog Computing Group
T: +33.67221.6485
francois.ozog@linaro.org | Skype: ffozog
