DPDK patches and discussions
* [dpdk-dev] Question on mlx5 PMD txq memory registration
@ 2017-07-17 13:29 Sagi Grimberg
  2017-07-17 21:02 ` Nélio Laranjeiro
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2017-07-17 13:29 UTC (permalink / raw)
  To: dev, Shahaf Shuler, Nélio Laranjeiro, Yongseok Koh,
	Roy Shterman, Alexander Solganik

Hi,

Looking at the code, it seems that mlx5 keeps an MR cache per TX queue
(each MR registers a rte_mempool).

Once a TX queue is created, mlx5 scans the existing mempools and
pre-registers an MR for each mempool it encounters (using rte_mempool_walk).
For each MR registration that exceeds the TX queue cache size, it removes
the first entry from the cache and deregisters that MR (in txq_mp2mr_reg).

Now, on TX burst, if the driver sees an mbuf from an unknown mempool, it
registers the mempool on the fly and, again, *deregisters* the first MR in
the cache.
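
For readers less familiar with that code path, here is a rough sketch of
the lookup/evict behaviour described above (the structures, the cache
depth and the way the mempool's memory range is obtained are simplified
illustrations of mine, not the actual mlx5 definitions):

    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>
    #include <rte_mempool.h>
    #include <rte_memzone.h>

    #define MR_CACHE_SIZE 8 /* assumed cache depth; a build-time option */

    struct txq_mr_cache {
        struct ibv_pd *pd;
        struct {
            const struct rte_mempool *mp;
            struct ibv_mr *mr;
        } entries[MR_CACHE_SIZE];
    };

    /* Return the lkey to use for an mbuf coming from mempool "mp".
     * If the mempool is unknown, register it on the fly and evict the
     * first cache entry, deregistering its MR -- the behaviour being
     * questioned in this thread. */
    static uint32_t
    txq_mp2mr_sketch(struct txq_mr_cache *c, struct rte_mempool *mp)
    {
        struct ibv_mr *mr;
        unsigned int i;

        for (i = 0; i < MR_CACHE_SIZE && c->entries[i].mp != NULL; i++)
            if (c->entries[i].mp == mp)
                return c->entries[i].mr->lkey;
        if (i == MR_CACHE_SIZE) {
            /* Cache full: drop the first entry. Nothing here checks
             * for send WQEs still in flight using this MR. */
            ibv_dereg_mr(c->entries[0].mr);
            memmove(&c->entries[0], &c->entries[1],
                    (MR_CACHE_SIZE - 1) * sizeof(c->entries[0]));
            i = MR_CACHE_SIZE - 1;
        }
        /* The real code derives the address range from the mempool's
         * memory chunks; registering the pool's memzone is a shortcut
         * taken here for brevity. */
        mr = ibv_reg_mr(c->pd, mp->mz->addr, mp->mz->len,
                        IBV_ACCESS_LOCAL_WRITE);
        if (mr == NULL)
            return (uint32_t)-1;
        c->entries[i].mp = mp;
        c->entries[i].mr = mr;
        return mr->lkey;
    }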

My question is: what guarantees that there are no inflight send operations
posted on the TX queue when we deregister and remove an MR from the cache?

AFAICT, it is the driver's responsibility to guarantee that it never
deregisters a memory region that has inflight send operations posted;
otherwise the send operation *will* complete with a local protection
error. Is that taken care of?

Another question: why is the MR cache maintained per TX queue and not
per device? If the application starts N TX queues, then a single mempool
will be registered N times instead of just once. Having lots of MR
instances will pollute the device ICMC pretty badly. Am I missing
something?

Thanks,
Sagi.


* Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
  2017-07-17 13:29 [dpdk-dev] Question on mlx5 PMD txq memory registration Sagi Grimberg
@ 2017-07-17 21:02 ` Nélio Laranjeiro
  2017-07-19  6:21   ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Nélio Laranjeiro @ 2017-07-17 21:02 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: dev, Shahaf Shuler, Yongseok Koh, Roy Shterman, Alexander Solganik

Hello Sagi,

On Mon, Jul 17, 2017 at 04:29:34PM +0300, Sagi Grimberg wrote:
> Hi,
> 
> Looking at the code, it looks like mlx5 keeps a MR cache per TX queue
> (each MR registers a rte_mempool).
> 
> Once a TX queue is created, mlx5 scans existing mempools and
> pre-registers a MR for each mempool it meets (using rte_mempool_walk).
> For each MR registration that exceeds the TX queue cache, it removes the
> first entry from the cache and deregisters that MR (in txq_mp2mr_reg).
> 
> Now on TX burst, if the driver sees a mbuf from an unknown mempool, it
> registers the mempool on the fly and again *deregister* the first MR in
> the cache.
> 
> My question is, what guarantees that no inflight send operations posted
> on the TX queue when we deregister and remove a MR from the cache?

There is none; if you send a burst of 9 packets, each one coming from a
different mempool, the first one will be dropped.

> AFAICT, it is the driver responsibility to guarantee to never deregister
> a memory region that has inflight send operations posted, otherwise
> the send operation *will* complete with a local protection error. Is
> that taken care of?

Up to now we have assumed that the user knows their application's needs
and can increase this cache size accordingly through the configuration
item.  With a large enough cache, that guarantee holds.

> Another question, why is the MR cache maintained per TX queue and not
> per device? If the application starts N TX queues then a single mempool
> will be registered N times instead of just once. Having lots of MR
> instances will pollute the device ICMC pretty badly. Am I missing
> something?

Having this cache per device would require a lock on the device structure
while threads are sending packets.  Such locks cost cycles, which is why
the cache is per queue.  Another point is that having several mempools per
device is common, whereas having several mempools per queue is not, so it
seems logical to have this cache per queue, for those two reasons.


I am currently re-working this part of the code to improve it, using
reference counters instead. The cache will remain for performance
purposes.  This will fix the issues you are pointing out.

Are you facing some kind of issue?  If so, maybe you can share it; it
could help to improve things.

Thanks,

-- 
Nélio Laranjeiro
6WIND


* Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
  2017-07-17 21:02 ` Nélio Laranjeiro
@ 2017-07-19  6:21   ` Sagi Grimberg
  2017-07-20 13:55     ` Nélio Laranjeiro
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2017-07-19  6:21 UTC (permalink / raw)
  To: Nélio Laranjeiro
  Cc: dev, Shahaf Shuler, Yongseok Koh, Roy Shterman,
	Alexander Solganik, Leon Romanovsky


> There is none, if you send a burst of 9 packets each one coming from a
> different mempool the first one will be dropped.

It's worse than just a drop: without debug enabled the error completion
is ignored, and the wqe_pi is taken from an invalid field, which leads
to bogus mbuf frees (elts_tail is not valid).

>> AFAICT, it is the driver responsibility to guarantee to never deregister
>> a memory region that has inflight send operations posted, otherwise
>> the send operation *will* complete with a local protection error. Is
>> that taken care of?
> 
> Up to now we have assumed that the user knows its application needs and
> he can increase this cache size to its needs through the configuration
> item.
> This way this limit and guarantee becomes true.

That is an undocumented assumption.

>> Another question, why is the MR cache maintained per TX queue and not
>> per device? If the application starts N TX queues then a single mempool
>> will be registered N times instead of just once. Having lots of MR
>> instances will pollute the device ICMC pretty badly. Am I missing
>> something?
> 
> Having this cache per device needs a lock on the device structure while
> threads are sending packets.

Not sure why it needs a lock at all. It *may* need RCU protection or an
rwlock, if anything.
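
As a concrete illustration of what I mean, a device-level cache protected
by a reader/writer lock could look roughly like this (illustrative
structures and sizes, not a patch; DPDK's rte_rwlock is just one of
several suitable primitives):

    #include <stdint.h>
    #include <rte_rwlock.h>
    #include <rte_mempool.h>

    #define DEV_MR_CACHE_SIZE 64 /* arbitrary for the sketch */

    struct dev_mr_cache {
        rte_rwlock_t lock;
        unsigned int n;
        struct {
            const struct rte_mempool *mp;
            uint32_t lkey;
        } e[DEV_MR_CACHE_SIZE];
    };

    /* Fast path, run by every TX queue: shared (read) lock only. */
    static int
    dev_mp2lkey(struct dev_mr_cache *c, const struct rte_mempool *mp,
                uint32_t *lkey)
    {
        unsigned int i;
        int ret = -1;

        rte_rwlock_read_lock(&c->lock);
        for (i = 0; i < c->n; i++) {
            if (c->e[i].mp == mp) {
                *lkey = c->e[i].lkey;
                ret = 0;
                break;
            }
        }
        rte_rwlock_read_unlock(&c->lock);
        return ret;
    }

    /* Slow path, taken only when a new mempool shows up, i.e. rarely
     * compared to the per-packet lookups above. */
    static int
    dev_mr_add(struct dev_mr_cache *c, const struct rte_mempool *mp,
               uint32_t lkey)
    {
        int ret = -1;

        rte_rwlock_write_lock(&c->lock);
        if (c->n < DEV_MR_CACHE_SIZE) {
            c->e[c->n].mp = mp;
            c->e[c->n].lkey = lkey;
            c->n++;
            ret = 0;
        }
        rte_rwlock_write_unlock(&c->lock);
        return ret;
    }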

> Having such locks cost cycles, that is why
> the cache is per queue.  Another point is, having several mempool per
> device is something common, whereas having several mempool per queues is
> not, it seems logical to have this cache per queue for those two
> reasons.
> 
> 
> I am currently re-working this part of the code to improve it using
> reference counters instead. The cache will remain for performance
> purpose.  This will fix the issues you are pointing.

AFAICT, all this caching mechanism is just working around the fact
that mlx5 allocates resources on top of the existing verbs interface.
I think it should work like any other PMD driver, i.e. use the mbuf
physical addresses.

The mlx5 device (like all other RDMA devices) has a global DMA lkey that
spans the entire physical address space. Just about all the kernel
drivers use this lkey heavily. IMO, the mlx5_pmd driver should be able
to query the kernel for this lkey and ask the kernel to create the QP
with a privilege level that allows posting send/recv operations with
that lkey.

And then mlx5_pmd becomes like the other drivers, working with physical
addresses instead of working around memory registration sub-optimally.
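
To make that concrete: with a hypothetical device-wide reserved lkey,
filling a send WQE data segment needs no MR lookup at all. A sketch,
using the mbuf accessors available at the time; the data-segment layout
and the helper name are illustrative, not the exact mlx5 definitions:

    #include <stdint.h>
    #include <rte_mbuf.h>
    #include <rte_byteorder.h>

    /* Illustrative WQE data segment (big-endian on the wire). */
    struct wqe_data_seg {
        uint32_t byte_count;
        uint32_t lkey;
        uint64_t addr;
    } __attribute__((packed));

    /* With a device-wide DMA lkey covering all physical memory, the
     * mbuf's bus address plus the reserved lkey is all that is needed:
     * no per-mempool MR cache, no on-the-fly (de)registration. */
    static inline void
    fill_dseg_reserved_lkey(struct wqe_data_seg *dseg, struct rte_mbuf *m,
                            uint32_t reserved_lkey)
    {
        dseg->byte_count = rte_cpu_to_be_32(rte_pktmbuf_data_len(m));
        dseg->lkey = rte_cpu_to_be_32(reserved_lkey);
        dseg->addr = rte_cpu_to_be_64(rte_mbuf_data_dma_addr(m));
    }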

And while we're on the subject, what is the plan for detaching mlx5_pmd
from its MLNX_OFED dependency? Mellanox has been doing a good job
upstreaming the needed features (rdma-core). CC'ing Leon (who is
co-maintaining the user-space rdma tree).


* Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
  2017-07-19  6:21   ` Sagi Grimberg
@ 2017-07-20 13:55     ` Nélio Laranjeiro
  2017-07-20 14:06       ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Nélio Laranjeiro @ 2017-07-20 13:55 UTC (permalink / raw)
  To: Sagi Grimberg, Shahaf Shuler
  Cc: dev, Yongseok Koh, Roy Shterman, Alexander Solganik, Leon Romanovsky

On Wed, Jul 19, 2017 at 09:21:39AM +0300, Sagi Grimberg wrote:
> 
> > There is none, if you send a burst of 9 packets each one coming from a
> > different mempool the first one will be dropped.
> 
> Its worse than just a drop, without debug enabled the error completion
> is ignored, and the wqe_pi is taken from an invalid field, which leads
> to bogus mbufs free (elts_tail is not valid).

Right
 
> > > AFAICT, it is the driver responsibility to guarantee to never deregister
> > > a memory region that has inflight send operations posted, otherwise
> > > the send operation *will* complete with a local protection error. Is
> > > that taken care of?
> > 
> > Up to now we have assumed that the user knows its application needs and
> > he can increase this cache size to its needs through the configuration
> > item.
> > This way this limit and guarantee becomes true.
> 
> That is an undocumented assumption.

I agree it should be documented; in reality, you are the first one we
know of to hit this issue.
 
> > > Another question, why is the MR cache maintained per TX queue and not
> > > per device? If the application starts N TX queues then a single mempool
> > > will be registered N times instead of just once. Having lots of MR
> > > instances will pollute the device ICMC pretty badly. Am I missing
> > > something?
> > 
> > Having this cache per device needs a lock on the device structure while
> > threads are sending packets.
> 
> Not sure why it needs a lock at all. it *may* need an rcu protection
> or rw_lock if at all.

TX queues may run on several CPUs, so there is a need to make sure this
array cannot be modified by two threads at the same time.  Either way,
it is costly.

> > Having such locks cost cycles, that is why
> > the cache is per queue.  Another point is, having several mempool per
> > device is something common, whereas having several mempool per queues is
> > not, it seems logical to have this cache per queue for those two
> > reasons.
> > 
> > 
> > I am currently re-working this part of the code to improve it using
> > reference counters instead. The cache will remain for performance
> > purpose.  This will fix the issues you are pointing.
> 
> AFAICT, all this caching mechanism is just working around the fact
> that mlx5 allocates resources on top of the existing verbs interface.
> I think it should work like any other pmd driver, i.e. use mbuf the
> physical addresses.
> 
> The mlx5 device (like all other rdma devices) has a global DMA lkey that
> spans the entire physical address space. Just about all the kernel
> drivers heavily use this lkey. IMO, the mlx5_pmd driver should be able
> to query the kernel what this lkey is and ask for the kernel to create
> the QP with privilege level to post send/recv operations with that lkey.
> 
> And then, mlx5_pmd becomes like other drivers working with physical
> addresses instead of working around the memory registration sub-optimally.

It is one possibility, also discussed with the Mellanox guys; the point
is that it breaks the security model, which is also important. If this
is added in the future it will certainly be as an option; this way both
will be possible, and the application could then choose between
security and performance.

I don't know of any planning on this from the Mellanox side; maybe
Shahaf does.

> And while were on the subject, what is the plan of detaching mlx5_pmd
> from its MLNX_OFED dependency? Mellanox has been doing a good job
> upstreaming the needed features (rdma-core). CC'ing Leon (who is
> co-maintaining the user-space rdma tree.

This is also in progress on the PMD side; it should be part of the next
DPDK release.

-- 
Nélio Laranjeiro
6WIND


* Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
  2017-07-20 13:55     ` Nélio Laranjeiro
@ 2017-07-20 14:06       ` Sagi Grimberg
  2017-07-20 15:20         ` Shahaf Shuler
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2017-07-20 14:06 UTC (permalink / raw)
  To: Nélio Laranjeiro, Shahaf Shuler
  Cc: dev, Yongseok Koh, Roy Shterman, Alexander Solganik, Leon Romanovsky


>> Its worse than just a drop, without debug enabled the error completion
>> is ignored, and the wqe_pi is taken from an invalid field, which leads
>> to bogus mbufs free (elts_tail is not valid).
> 
> Right

A simple work-around would be to fill in a correct tail so that error
completions will still have it (although I'm not sure which fields are
reliable in error completions other than the status).
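
Roughly, the idea would be something like the following; the
completion-entry fields and the error opcode value here are simplified
placeholders, not the real mlx5 CQE layout:

    #include <stdint.h>

    /* Simplified completion entry: only the fields relevant here. */
    struct tx_cqe {
        uint16_t wqe_counter; /* index of the completed WQE */
        uint8_t  op_own;      /* opcode in the high nibble */
    };

    #define CQE_OPCODE(op_own) ((op_own) >> 4)
    #define CQE_REQ_ERR        0x0d /* placeholder value */

    /* On a successful completion the reported WQE counter can be
     * trusted; on an error completion it may be garbage, so fall back
     * to an index the driver tracked itself when posting, instead of
     * freeing mbufs based on a bogus value. Endianness handling is
     * omitted for brevity. */
    static uint16_t
    tx_cqe_tail(const struct tx_cqe *cqe, uint16_t last_posted_wqe)
    {
        if (CQE_OPCODE(cqe->op_own) == CQE_REQ_ERR)
            return last_posted_wqe;
        return cqe->wqe_counter;
    }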

>> Not sure why it needs a lock at all. it *may* need an rcu protection
>> or rw_lock if at all.
> 
> Tx queues may run on several CPU there is a need to be sure this array
> cannot be modified by two threads at the same time.  Anyway it is
> costly.

As I said, there are primitives which are designed to handle frequent
reads and rare mutations.

>> AFAICT, all this caching mechanism is just working around the fact
>> that mlx5 allocates resources on top of the existing verbs interface.
>> I think it should work like any other pmd driver, i.e. use mbuf the
>> physical addresses.
>>
>> The mlx5 device (like all other rdma devices) has a global DMA lkey that
>> spans the entire physical address space. Just about all the kernel
>> drivers heavily use this lkey. IMO, the mlx5_pmd driver should be able
>> to query the kernel what this lkey is and ask for the kernel to create
>> the QP with privilege level to post send/recv operations with that lkey.
>>
>> And then, mlx5_pmd becomes like other drivers working with physical
>> addresses instead of working around the memory registration sub-optimally.
> 
> It is one possibility discussed also with Mellanox guys, the point is
> this breaks the security point of view which is also an important stuff.

What security aspect? The entire DPDK model is built on top of
physical-address awareness, running under root permissions. I'm not
suggesting exposing it to the application, nor granting remote
permissions to the physical address space.

mlx5_pmd is a network driver, and as a driver it should be allowed to use
the device DMA lkey as it sees fit. I honestly think it's pretty much
mandatory considering the work-arounds mlx5_pmd tries to do (which we
agreed are broken).

> If this is added in the future it will certainly be as an option, this
> way both will be possible, the application could then choose about
> security vs performance.

Why should the application even be aware of that? Does any other driver
expose to the user how it maps packet mbufs to the NIC? Just like the MR
handling, it's 100% internal to mlx5, and there is no reason why the user
should ever be exposed to any of these details.

> I don't know any planning on this from Mellanox side, maybe Shahaf have.

rdma-core has a very nice vendor extension mechanism (which is important
because we really don't want to pollute the verbs API just for DPDK).
It's very easy to expose the DMA lkey and create the TX queue pairs with
reserved-lkey attributes via this mechanism; the kernel just needs to
verify root permissions before exposing it.

>> And while were on the subject, what is the plan of detaching mlx5_pmd
>> from its MLNX_OFED dependency? Mellanox has been doing a good job
>> upstreaming the needed features (rdma-core). CC'ing Leon (who is
>> co-maintaining the user-space rdma tree.
> 
> This is also a in progress in PMD part, it should be part of the next
> DPDK release.

That is *very* good to hear! Can you guys share a branch? I'm
willing to take it for testing.


* Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
  2017-07-20 14:06       ` Sagi Grimberg
@ 2017-07-20 15:20         ` Shahaf Shuler
  2017-07-20 16:22           ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Shahaf Shuler @ 2017-07-20 15:20 UTC (permalink / raw)
  To: Sagi Grimberg, Nélio Laranjeiro
  Cc: dev, Yongseok Koh, Roy Shterman, Alexander Solganik, Leon Romanovsky

Hi Sagi, 

Thursday, July 20, 2017 5:06 PM, Sagi Grimberg:
> >> Its worse than just a drop, without debug enabled the error
> >> completion is ignored, and the wqe_pi is taken from an invalid field,
> >> which leads to bogus mbufs free (elts_tail is not valid).
> >
> > Right
> 
> A simple work-around would be to simply fill a correct tail so that error
> completions will still have it (although I'm not sure the which fields are
> reliable other than the status in error completions).

As Nélio said, we have patches which solve the issue at the root cause.
In the meanwhile, the workaround is to have a large enough MR cache.
I agree the documentation is missing.
 
> 
> >> Not sure why it needs a lock at all. it *may* need an rcu protection
> >> or rw_lock if at all.
> >
> > Tx queues may run on several CPU there is a need to be sure this array
> > cannot be modified by two threads at the same time.  Anyway it is
> > costly.
> 
> As I said, there are primitives which are designed to handle frequent reads
> and rare mutations.

Even with such primitives, locking rarely is still more than never locking.

> 
> >> AFAICT, all this caching mechanism is just working around the fact
> >> that mlx5 allocates resources on top of the existing verbs interface.
> >> I think it should work like any other pmd driver, i.e. use mbuf the
> >> physical addresses.
> >>
> >> The mlx5 device (like all other rdma devices) has a global DMA lkey
> >> that spans the entire physical address space. Just about all the
> >> kernel drivers heavily use this lkey. IMO, the mlx5_pmd driver should
> >> be able to query the kernel what this lkey is and ask for the kernel
> >> to create the QP with privilege level to post send/recv operations with
> that lkey.
> >>
> >> And then, mlx5_pmd becomes like other drivers working with physical
> >> addresses instead of working around the memory registration sub-
> optimally.
> >
> > It is one possibility discussed also with Mellanox guys, the point is
> > this breaks the security point of view which is also an important stuff.
> 
> What security aspect? The entire dpdk model builds on top of physical
> addresses awareness running under root permissions. 
>I'm not saying
> exposing it to the application nor granting remote permissions to the physical
> space.
> 
> mlx5_pmd is a network driver, and as a driver, it should allowed to use the
> device dma lkey as it sees fit. I honestly think its pretty much mandatory
> considering the work-arounds mlx5_pmd tries to do (which we agreed are
> broken).

True, there are many PMDs which can work only with physical memory.
However, Mellanox NICs have the option to work with virtual memory and thus provide more security.
The fact that you run under root doesn't mean you have privileges to access every physical page on the server (even if you try very hard to be aware).

The issue here, AFAIU, is performance.
We are now looking into ways to provide the same performance as if there were only a single lkey, while preserving the security feature.
 
> 
> > If this is added in the future it will certainly be as an option, this
> > way both will be possible, the application could then choose about
> > security vs performance.
> 
> Why should the application even be aware of that? Does any other driver
> expose the user how it maps pkt mbufs to the NIC? Just like the MR
> handling, its 100% internal to mlx5 and no reason why the user should ever
> be exposed to any of these details.

Another option is the reserved lkey, as you suggested, but it loses the security guarantees.
Like every performance optimization, it should be the application's decision.
In fact, there are some discussions on the ML about exposing the option to use VA instead of PA. [1]

> 
> > I don't know any planning on this from Mellanox side, maybe Shahaf have.
> 
> rdma-core has a very nice vendor extension mechanism (which is important
> because we really don't want to pollute the verbs API just for dpdk).
> Its very easy to expose the dma lkey and create the TX queue-pairs with
> reserved lkey attributes via this mechanism. Just the kernel needs to verify
> root permissions before exposing it.
> 
> >> And while were on the subject, what is the plan of detaching mlx5_pmd
> >> from its MLNX_OFED dependency? Mellanox has been doing a good job
> >> upstreaming the needed features (rdma-core). CC'ing Leon (who is
> >> co-maintaining the user-space rdma tree.
> >
> > This is also a in progress in PMD part, it should be part of the next
> > DPDK release.
> 
> That is *very* good to hear! Can you guys share a branch? I'm willing to take
> it for testing.

The branch is still premature. It may be good enough for external testing in about two weeks.
Contact me directly and I will provide it to you.

[1] http://dpdk.org/ml/archives/dev/2017-June/067156.html



* Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
  2017-07-20 15:20         ` Shahaf Shuler
@ 2017-07-20 16:22           ` Sagi Grimberg
  2017-07-23  8:17             ` Shahaf Shuler
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2017-07-20 16:22 UTC (permalink / raw)
  To: Shahaf Shuler, Nélio Laranjeiro
  Cc: dev, Yongseok Koh, Roy Shterman, Alexander Solganik, Leon Romanovsky


>> As I said, there are primitives which are designed to handle frequent reads
>> and rare mutations.
> 
> Even with such primitives, rarely lock is more than never lock.

You do realize that the cache mutation involves ibv_dereg_mr(), right?
Any locking scheme for mutation is negligible compared to that, and
rcu_read_lock() is practically free.

>>> It is one possibility discussed also with Mellanox guys, the point is
>>> this breaks the security point of view which is also an important stuff.
>>
>> What security aspect? The entire dpdk model builds on top of physical
>> addresses awareness running under root permissions.
>> I'm not saying
>> exposing it to the application nor granting remote permissions to the physical
>> space.
>>
>> mlx5_pmd is a network driver, and as a driver, it should allowed to use the
>> device dma lkey as it sees fit. I honestly think its pretty much mandatory
>> considering the work-arounds mlx5_pmd tries to do (which we agreed are
>> broken).
> 
> True, There are many PMDs which can work only with physical memory.
> However Mellanox NICs have the option to work with virtual one thus provide more security.

I don't understand the security argument. It's completely private to the
driver; anything under librte is equivalent to an OS with respect to
networking, so I fail to see what security feature you're talking about.

> The fact running under root doesn't mean you have privileges to access every physical page on the server (even if you try very hard to be aware).

But the DPDK core mbuf structs are built this way.

> The issue here, AFAIU, is performance.
> We are now looking into ways to provide the same performance as if it was only a single lkey, while preserving the security feature.

Hmm, what exactly do you have in mind?

I'm hoping that you are not referring to ODP. If you are, I think the
latency unpredictability would be a huge non-starter; page faults are
way too expensive for DPDK users.

Again, personally, I don't think that virtual memory registrations are
very useful for a network driver; it's a driver, not an application.

>>> If this is added in the future it will certainly be as an option, this
>>> way both will be possible, the application could then choose about
>>> security vs performance.
>>
>> Why should the application even be aware of that? Does any other driver
>> expose the user how it maps pkt mbufs to the NIC? Just like the MR
>> handling, its 100% internal to mlx5 and no reason why the user should ever
>> be exposed to any of these details.
> 
> Other option is the reserved lkey as you suggested, but it will lose the security guarantees.

OK, obviously I'm missing something. Can you articulate what you mean
by "security guarantees"?

> Like every performance optimization it should be the application decision.

I tend to disagree with you on this. The application must never be aware
of the nitty-gritty details of the underlying driver implementation. In
fact, propagating this information to the application sounds like a
layering violation.

> In fact, there are some discussions on the ML of exposing the option to use va instead of pa. [1]

My understanding is that the correct proposal was to hide this
knowledge from the user. Also, the use case is not security (and I'm
still not sure what that even means here).

>>> This is also a in progress in PMD part, it should be part of the next
>>> DPDK release.
>>
>> That is *very* good to hear! Can you guys share a branch? I'm willing to take
>> it for testing.
> 
> The branch is still pre-mature. It may be good enough for external testing in about two weeks.
> Contact me directly and I will provide it to you.

Awesome!

Thanks.


* Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
  2017-07-20 16:22           ` Sagi Grimberg
@ 2017-07-23  8:17             ` Shahaf Shuler
  2017-07-23  9:03               ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Shahaf Shuler @ 2017-07-23  8:17 UTC (permalink / raw)
  To: Sagi Grimberg, Nélio Laranjeiro
  Cc: dev, Yongseok Koh, Roy Shterman, Alexander Solganik, Leon Romanovsky

Thursday, July 20, 2017 7:22 PM, Sagi Grimberg:
> >> As I said, there are primitives which are designed to handle frequent
> >> reads and rare mutations.
> >
> > Even with such primitives, rarely lock is more than never lock.
> 
> You do realize that the cache mutation involves ibv_dereg_mr() right?
> Any locking scheme for mutation is negligible compared to that, and
> rcu_read_lock is practically free.

The fix for the issue you presented includes not deregistering MRs on the datapath; MRs will be deregistered only on device close.
Replacing an MR in the cache can still happen, but it won't involve any deregistration.
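
Conceptually, the datapath side of such a fix could look like the sketch
below (illustrative only, not the actual patches):

    #include <stddef.h>
    #include <infiniband/verbs.h>
    #include <rte_atomic.h>

    /* A shared MR entry: queues take references; evicting it from a
     * queue's cache only drops a reference, and the MR itself is
     * deregistered only when the device is closed. */
    struct shared_mr {
        struct ibv_mr *mr;
        rte_atomic32_t refcnt;
    };

    static void
    shared_mr_get(struct shared_mr *smr)
    {
        rte_atomic32_inc(&smr->refcnt);
    }

    /* Called on cache eviction: no ibv_dereg_mr() here, so send WQEs
     * still in flight keep referencing a valid lkey. */
    static void
    shared_mr_put(struct shared_mr *smr)
    {
        rte_atomic32_dec(&smr->refcnt);
    }

    /* Called only from device close, when no datapath activity remains. */
    static void
    shared_mr_destroy(struct shared_mr *smr)
    {
        if (smr->mr != NULL && rte_atomic32_read(&smr->refcnt) == 0) {
            ibv_dereg_mr(smr->mr);
            smr->mr = NULL;
        }
    }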

> 
> >>> It is one possibility discussed also with Mellanox guys, the point
> >>> is this breaks the security point of view which is also an important stuff.
> >>
> >> What security aspect? The entire dpdk model builds on top of physical
> >> addresses awareness running under root permissions.
> >> I'm not saying
> >> exposing it to the application nor granting remote permissions to the
> >> physical space.
> >>
> >> mlx5_pmd is a network driver, and as a driver, it should allowed to
> >> use the device dma lkey as it sees fit. I honestly think its pretty
> >> much mandatory considering the work-arounds mlx5_pmd tries to do
> >> (which we agreed are broken).
> >
> > True, There are many PMDs which can work only with physical memory.
> > However Mellanox NICs have the option to work with virtual one thus
> provide more security.
> 
> I don't understand the security argument. Its completely private to the
> driver. anything under librte is equivalent to an OS wrt networking, so I fail to
> see what is the security feature your talking about.

You are correct that as root you are able to do whatever you want on the server.
The security I refer to is protection against badly written code.

The fact that the PMD only registers the mempools, and uses the device engine to translate the VA, provides some protection.
For example, one DPDK process will not be able to access the memory of another DPDK process *by mistake*.

I am not saying the reserved lkey is not a good suggestion, and we plan to test its value.
All I am saying is that there may be other options which provide the same performance with the extra protection mentioned above.
One of them could be to use indirect keys: one indirect key to represent the 64b memory area, and regular keys for the hugepages.
The rest of the memory area can be filled with keys pointing to /dev/null.

> 
> > The fact running under root doesn't mean you have privileges to access
> every physical page on the server (even if you try very hard to be aware).
> 
> But dpdk core mbufs structs are built this way.
> 
> > The issue here, AFAIU, is performance.
> > We are now looking into ways to provide the same performance as if it was
> only a single lkey, while preserving the security feature.
> 
> Hmm, What exactly do you have in mind?
> 
> I'm hoping that you are not referring to ODP. If you are, I think that latency
> unpredictability would be a huge non-starter, page-faults are way too
> expensive for dpdk users.

No ODP :).
As all relevant DPDK memory is on top of hugepages, there is no reason to avoid registration and pinning in advance.
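
For example, with the EAL memseg API available at the time, all hugepage
segments could be registered once at device start, so nothing is ever
registered on the datapath (sketch only; error handling, the size of the
caller's MR array and memory hotplug are ignored):

    #include <stddef.h>
    #include <infiniband/verbs.h>
    #include <rte_config.h>
    #include <rte_memory.h>

    /* Register every hugepage segment known to the EAL against the
     * given protection domain. Returns the number of MRs created, or
     * -1 on failure. */
    static int
    register_all_hugepages(struct ibv_pd *pd, struct ibv_mr **mrs)
    {
        const struct rte_memseg *ms = rte_eal_get_physmem_layout();
        int i, n = 0;

        for (i = 0; i < RTE_MAX_MEMSEG; i++) {
            if (ms[i].addr == NULL || ms[i].len == 0)
                continue;
            mrs[n] = ibv_reg_mr(pd, ms[i].addr, ms[i].len,
                                IBV_ACCESS_LOCAL_WRITE);
            if (mrs[n] == NULL)
                return -1;
            n++;
        }
        return n;
    }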

> 
> Again, personally, I don't think that using virtual memory registration are very
> useful for a network driver, It's a driver, not an application.
> 
> >>> If this is added in the future it will certainly be as an option,
> >>> this way both will be possible, the application could then choose
> >>> about security vs performance.
> >>
> >> Why should the application even be aware of that? Does any other
> >> driver expose the user how it maps pkt mbufs to the NIC? Just like
> >> the MR handling, its 100% internal to mlx5 and no reason why the user
> >> should ever be exposed to any of these details.
> >
> > Other option is the reserved lkey as you suggested, but it will lose the
> security guarantees.
> 
> OK, obviously I'm missing something. Can you articulate what do you mean
> by "security guarantees"?
> 
> > Like every performance optimization it should be the application decision.
> 
> I tend to disagree with you on this. The application must never be aware of
> the nitty details of the underlying driver implementation. In fact, propagating
> this information to the application sounds like a layering violation.
> 
> > In fact, there are some discussions on the ML of exposing the option
> > to use va instead of pa. [1]
> 
> My understanding is that the correct proposal was to hide this knowledge
> from the user. Also the use-case is not security (which I'm not even sure
> what it means yet).
> 
> >>> This is also a in progress in PMD part, it should be part of the
> >>> next DPDK release.
> >>
> >> That is *very* good to hear! Can you guys share a branch? I'm willing
> >> to take it for testing.
> >
> > The branch is still pre-mature. It may be good enough for external testing in
> about two weeks.
> > Contact me directly and I will provide it to you.
> 
> Awesome!
> 
> Thanks.


* Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
  2017-07-23  8:17             ` Shahaf Shuler
@ 2017-07-23  9:03               ` Sagi Grimberg
  2017-07-24 13:44                 ` Bruce Richardson
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2017-07-23  9:03 UTC (permalink / raw)
  To: Shahaf Shuler, Nélio Laranjeiro
  Cc: dev, Yongseok Koh, Roy Shterman, Alexander Solganik, Leon Romanovsky


>> I don't understand the security argument. Its completely private to the
>> driver. anything under librte is equivalent to an OS wrt networking, so I fail to
>> see what is the security feature your talking about.
> 
> You are correct that as a root you are able to do whatever you want on the server.
> The security I refer to is to protect against badly written code.
> 
> The fact the PMD only registers the mempools, and use the device engine to translate the VA, provide some protection.
> For example, one DPDK process will not be able to access the memory of other DPDK process *by mistake*.

Well, this is a fair argument, but without a *complete* solution for all
of the DPDK peripherals, it has very little merit (if any). Badly
written code can just as easily crash a server by passing an mbuf to
a crypto device or another network device that co-exists with mlx5.

So, while I understand the argument, I think its value is not worth the
hassle that mlx5_pmd has to go through to achieve it. Did this come from
a real requirement (from a real implementation)?

> I am not saying using the reserved lkey is not a good suggestion, and we plan to test its value.
> All I am saying is there are maybe other option to provide the same performance with the extra protection mentioned above.
> One of them can be to use indirect keys. One indirect key to represent 64b memory area, and other regular keys for the hugepages.
> The rest of the memory area can be filled with keys pointing to /dev/null.

If I understand what you are suggesting, this would trigger out-of-order
transfers on an indirect memory key just about always (each transfer can
originate from a different hugepage, and SGL resolution alone will
require a walk over the memory key context's SGL list). I'm afraid this
would introduce very bad performance scaling, because an SGL context
(klm) would need to be fetched from the ICM for essentially every send
operation.

Having said that, it's just my 2 cents; if your solution works then I
don't really care. You are the one testing it...

>>> The fact running under root doesn't mean you have privileges to access
>> every physical page on the server (even if you try very hard to be aware).
>>
>> But dpdk core mbufs structs are built this way.
>>
>>> The issue here, AFAIU, is performance.
>>> We are now looking into ways to provide the same performance as if it was
>> only a single lkey, while preserving the security feature.
>>
>> Hmm, What exactly do you have in mind?
>>
>> I'm hoping that you are not referring to ODP. If you are, I think that latency
>> unpredictability would be a huge non-starter, page-faults are way too
>> expensive for dpdk users.
> 
> No ODP :).
> As all relevant DPDK memory is on top of hugepages, there is no reason to avoid registration and pinning  in advance.

Agree.


* Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
  2017-07-23  9:03               ` Sagi Grimberg
@ 2017-07-24 13:44                 ` Bruce Richardson
  2017-07-27 10:48                   ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Bruce Richardson @ 2017-07-24 13:44 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Shahaf Shuler, Nélio Laranjeiro, dev, Yongseok Koh,
	Roy Shterman, Alexander Solganik, Leon Romanovsky

On Sun, Jul 23, 2017 at 12:03:41PM +0300, Sagi Grimberg wrote:
> 
> > > I don't understand the security argument. Its completely private to the
> > > driver. anything under librte is equivalent to an OS wrt networking, so I fail to
> > > see what is the security feature your talking about.
> > 
> > You are correct that as a root you are able to do whatever you want on the server.
> > The security I refer to is to protect against badly written code.
> > 
> > The fact the PMD only registers the mempools, and use the device engine to translate the VA, provide some protection.
> > For example, one DPDK process will not be able to access the memory of other DPDK process *by mistake*.
> 
> Well, this is a fair argument, but without a *complete* solution for all
> of dpdk peripherals, it has very little merit (if at all). A badly
> written code can just as easily crash a server by passing a mbuf to
> a crypto device or another network device that co-exists with mlx5.
> 
> So, while I understand the argument, I think its value is not worth the
> hassle that mlx5_pmd needs to take to achieve it. Did this come from a
> real requirement (from a real implementation)?
> 
Would using VFIO (and the IOMMU) not allow us to provide an equivalent
level of security to what is provided by the current scheme? From what I
see on-list, there are a few folks already looking into that area, and
taking advantage of the IOMMU should improve the security of all devices
in DPDK.

/Bruce


* Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
  2017-07-24 13:44                 ` Bruce Richardson
@ 2017-07-27 10:48                   ` Sagi Grimberg
  0 siblings, 0 replies; 11+ messages in thread
From: Sagi Grimberg @ 2017-07-27 10:48 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Shahaf Shuler, Nélio Laranjeiro, dev, Yongseok Koh,
	Roy Shterman, Alexander Solganik, Leon Romanovsky


>> Well, this is a fair argument, but without a *complete* solution for all
>> of dpdk peripherals, it has very little merit (if at all). A badly
>> written code can just as easily crash a server by passing a mbuf to
>> a crypto device or another network device that co-exists with mlx5.
>>
>> So, while I understand the argument, I think its value is not worth the
>> hassle that mlx5_pmd needs to take to achieve it. Did this come from a
>> real requirement (from a real implementation)?
>>
> Would using VFIO (and the IOMMU) not allow us to provide an equivalent
> level of security to what is provided by the current scheme?

mlx5 does not take over the device with VFIO; it simply asks the
kernel to set up resources for it and sets a MAC steering rule to
direct traffic to its own rings. Also, I'm not aware of any way to
enforce that the IOMMU is enabled.

> From what I
> see on-list there are a few folks already looking into that area, and
> taking advantage of the IOMMU should improve security of all devices in
> DPDK.

I agree that this can be improved in DPDK; I was simply arguing that
the mlx5 guarantees alone are not very valuable, especially considering
the work-arounds taken in mlx5 to achieve them.

mlx5 could be converted to take over the device with VFIO and simply not
deal with memory registration aspects, but that is really up to the mlx5
maintainers.


