DPDK patches and discussions
 help / color / mirror / Atom feed
From: Gaetan Rivet <grive@u256.net>
To: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH] eal: fix to set the rte_device ptr's device args before hotplug
Date: Mon, 10 Feb 2020 16:00:15 +0100	[thread overview]
Message-ID: <adcdc359-4c8b-61a0-02e3-5ab9726bbcc0@u256.net> (raw)
In-Reply-To: <83f7e3a5-c36f-a755-1c71-0c248fe48d0e@u256.net>

On 10/02/2020 15:34, Gaetan Rivet wrote:
> Hi Somnath,
> 
> Reformating the mails, to keep with the inner-posting (mixing top-posting and inner-posting
> makes it hard to follow). See the end of the mail.
> 
> 

[...]

>> Any updates on this ?
>> My thoughts on this are just as I'd suspected / suggested earlier the
>> parent device (06:02:0) gets probed once
>> during ovs-vswitchd bring up right after these 2 cmds
>>
>> ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
>> # Now start ovs-vswitchd (--no-ovsdb-server, since it is already started)
>> ovs-ctl --no-ovsdb-server --db-sock="$DB_SOCK" start
>>
>> After this point the pci probe was invoked on the parent device, but
>> without any devargs hence it is NULL.
>>
>> Now when the 'ovs add-port' cmd is issued to add a representor port
>> for example say 06:02:01 (using 06:02:00 as parent device)
>> ovs-vsctl add-port ovsbr0 vfrep1 -- set Interface vfrep1 type=dpdk
>> options:dpdk-devargs=0000:06:02.0,representor=[1]
>>
>> What happens is
>>
>> *before* pci_name_set()
>> p dev
>> $1 = (struct rte_pci_device *) 0x55bd6b0
>> (gdb) p /x *dev
>> $2 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, device = {next =
>> {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x0, driver = 0x0, bus =
>> 0x3a59a00, numa_node = 0x0,
>>      devargs = 0x0}, addr = {domain = 0x0, bus = 0x6, devid = 0x2,
>> function = 0x0}.......}
>>
>> *after* pci_name_set()
>>
>> p /x *dev
>> $3 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, device = {next =
>> {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x0, driver = 0x0, bus =
>> 0x3a59a00, numa_node = 0x0,
>>      devargs = 0x55a4ae0},addr = {domain = 0x0, bus = 0x6, devid = 0x2,
>> function = 0x0}......}
>>
>> As you can see, 'devargs' for the pci_device is now populated ..
>>
>> But if you go further down in pci_scan_one()
>>
>> pci_scan_one (dirname=0x7ffea330a5e0
>> "/sys/bus/pci/devices/0000:06:02.0", addr=0x7ffea330a5d0)
>>      at /root/dpdk-int_nxt/drivers/bus/pci/linux/pci.c:351
>> 351                                     if (!rte_dev_is_probed(&dev2->device)) {
>> (gdb) p dev2
>> $5 = (struct rte_pci_device *) 0x54de5e0
>> (gdb) p /x *dev2
>> $6 = {next = {tqe_next = 0x5307460, tqe_prev = 0x5539f80}, device =
>> {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x54e4f00, driver =
>> 0x3a6b7f0,
>>      bus = 0x3a59a00, numa_node = 0x0, devargs = 0x0}, addr = {domain =
>> 0x0, bus = 0x6, devid = 0x2, function = 0x0}
>>
>> we hit the pci_device ( 0x54de5e0) that was created during the first
>> time probe of the parent device ?
>> Since it is already probed, it goes to line 381 where it does frees
>> the just allocated 'pci_device'  inside pci_scan_one() via 'free(dev)'
>>
>> As you can see this pci_device does not have it's devargs set (rightly
>> so as initially there were no devargs for this device)
>>
>> But as shown in the stack trace in my previous reply, when
>> pci_find_device() walks the rte_pci_devices list , it finds this
>> earlier probed device (without devargs)
>>
> 
> Alright, I think you are right, this is what is happening.
> The issue IMO, is that the PCI scan is thus hitting an already existing device, but we have
> missed the case where the new device has more info that the previous one (linked devargs).
> 
>> pci_find_device (start=0x0, cmp=0x4b92d8 <cmp_dev_name>,
>> data=0x55a4af8) at /root/dpdk-19.11/drivers/bus/pci/pci_common.c:426
>> 426                             return &pdev->device;
>> (gdb) p pdev
>> $15 = (struct rte_pci_device *) 0x54de5e0
>>
>> (gdb) p /x *pdev
>> $16 = {next = {tqe_next = 0x5307460, tqe_prev = 0x5539f80}, device =
>> {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x54e4f00, driver =
>> 0x3a6b7f0,
>>      bus = 0x3a59a00, numa_node = 0x0, devargs = 0x0}, addr = {domain =
>> 0x0, bus = 0x6, devid = 0x2, function = 0x0},
>>
>> Hope this explains why the pci_device has NULL devargs at this point
>> and how my fix to set it at this point helps solve the problem?
>>
>> Please let me know your thoughts
>>
> 
> I think the proper fix is instead to have a clean update during the PCI scan.
> The proper way is to keep the old device, but override its fields as new info was found.
> 
> Something like calling pci_name_set(dev2); line 359 maybe. BSD should also be updated for consistency.
> 
> My issue with your patch is that I think this issue is very specific to the PCI bus and the capabilities
> of some devices on it. It would be better to have a fully compliant scan + probe ops considering
> the supported capabilities, instead of forcing this on all buses.
> 
> I'm wondering whether this can happen with an already existing devargs? If so, is it more correct to keep
> the new one, or ignore the probe, free the new devargs and report an error? If the former,
> please clean up properly the devargs using rte_devargs_remove() (before calling pci_name_set() of course).

Unfortunately, this could probably happen with some devargs. In particular, the rte_devargs_insert() function will test whether devargs name are matching before inserting a new devargs. There is an ambiguity with the BDF / DBDF notation for the PCI devices in particular, which could lead to devargs already existing for some devices, but with a new one being inserted.

The fix would be to have the bus parse the device, and compare the binary blob resulting from this parsing.
The current bus->parse() API might help, but is currently too limited (no way to know how many bytes long the result is).

This is fastidious to have to write this because some leeway was allowed to the PCI back in the day. For now at least, mind this possibility when writing your fix, those above considerations could maybe be worked out with future API evolutions.

  reply	other threads:[~2020-02-10 15:00 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-04  9:15 Somnath Kotur
2020-02-04  9:49 ` Gaetan Rivet
2020-02-05  8:52   ` Somnath Kotur
2020-02-05  9:17     ` Gaetan Rivet
2020-02-06  4:06       ` Somnath Kotur
2020-02-07  5:52         ` Somnath Kotur
2020-02-10 14:34           ` Gaetan Rivet
2020-02-10 15:00             ` Gaetan Rivet [this message]
2020-02-11  4:33             ` Somnath Kotur
2020-02-11  8:38               ` Gaetan Rivet
2020-02-14  6:46                 ` Somnath Kotur

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=adcdc359-4c8b-61a0-02e3-5ab9726bbcc0@u256.net \
    --to=grive@u256.net \
    --cc=dev@dpdk.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).