DPDK patches and discussions
 help / color / mirror / Atom feed
* [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path
@ 2024-03-28 23:33 Andrey Ignatov
  2024-03-28 23:44 ` Stephen Hemminger
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Andrey Ignatov @ 2024-03-28 23:33 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Chenbo Xia, Wei Shen, Andrey Ignatov

Currently virtio_dev_tx_packed() always allocates requested @count of
packets, no matter how many packets are really available on the virtio
Tx ring. Later it has to free all packets it didn't use and if, for
example, there were zero available packets on the ring, then all @count
mbufs would be allocated just to be freed afterwards.

This wastes CPU cycles since rte_pktmbuf_alloc_bulk() /
rte_pktmbuf_free_bulk() do quite a lot of work.

Optimize it by using the same idea as the virtio_dev_tx_split() uses on
the Tx split path: estimate the number of available entries on the ring
and allocate only that number of mbufs.

On the split path it's pretty easy to estimate.

On the packed path it's more work since it requires checking flags for
up to @count of descriptors. Still it's much less expensive than the
alloc/free pair.

The new get_nb_avail_entries_packed() function doesn't change how
virtio_dev_tx_packed() works with regard to memory barriers since the
barrier between checking flags and other descriptor fields is still in
place later in virtio_dev_tx_batch_packed() and
virtio_dev_tx_single_packed().

The difference for a guest transmitting ~17Gbps with MTU 1500 on a `perf
record` / `perf report` (on lower pps the savings will be bigger):

* Before the change:

    Samples: 18K of event 'cycles:P', Event count (approx.): 19206831288
      Children      Self      Pid:Command
    -  100.00%   100.00%   798808:dpdk-worker1
                <... skip ...>
                - 99.09% pkt_burst_io_forward
                   - 90.26% common_fwd_stream_receive
                      - 90.04% rte_eth_rx_burst
                         - 75.53% eth_vhost_rx
                            - 74.29% rte_vhost_dequeue_burst
                               - 71.48% virtio_dev_tx_packed_compliant
                                  + 17.11% rte_pktmbuf_alloc_bulk
                                  + 11.80% rte_pktmbuf_free_bulk
                                  + 2.11% vhost_user_inject_irq
                                    0.75% rte_pktmbuf_reset
                                    0.53% __rte_pktmbuf_free_seg_via_array
                                 0.88% vhost_queue_stats_update
                         + 13.66% mlx5_rx_burst_vec
                   + 8.69% common_fwd_stream_transmit

* After:

    Samples: 18K of event 'cycles:P', Event count (approx.): 19225310840
      Children      Self      Pid:Command
    -  100.00%   100.00%   859754:dpdk-worker1
                <... skip ...>
                - 98.61% pkt_burst_io_forward
                   - 86.29% common_fwd_stream_receive
                      - 85.84% rte_eth_rx_burst
                         - 61.94% eth_vhost_rx
                            - 60.05% rte_vhost_dequeue_burst
                               - 55.98% virtio_dev_tx_packed_compliant
                                  + 3.43% rte_pktmbuf_alloc_bulk
                                  + 2.50% vhost_user_inject_irq
                                 1.17% vhost_queue_stats_update
                                 0.76% rte_rwlock_read_unlock
                                 0.54% rte_rwlock_read_trylock
                         + 22.21% mlx5_rx_burst_vec
                   + 12.00% common_fwd_stream_transmit

It can be seen that virtio_dev_tx_packed_compliant() goes from 71.48% to
55.98% with rte_pktmbuf_alloc_bulk() going from 17.11% to 3.43% and
rte_pktmbuf_free_bulk() going away completely.

Signed-off-by: Andrey Ignatov <rdna@apple.com>
---
 lib/vhost/virtio_net.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
index 1359c5fb1f..b406b5d7d9 100644
--- a/lib/vhost/virtio_net.c
+++ b/lib/vhost/virtio_net.c
@@ -3484,6 +3484,35 @@ virtio_dev_tx_single_packed(struct virtio_net *dev,
 	return ret;
 }
 
+static __rte_always_inline uint16_t
+get_nb_avail_entries_packed(const struct vhost_virtqueue *__rte_restrict vq,
+			    uint16_t max_nb_avail_entries)
+{
+	const struct vring_packed_desc *descs = vq->desc_packed;
+	bool avail_wrap = vq->avail_wrap_counter;
+	uint16_t avail_idx = vq->last_avail_idx;
+	uint16_t nb_avail_entries = 0;
+	uint16_t flags;
+
+	while (nb_avail_entries < max_nb_avail_entries) {
+		flags = descs[avail_idx].flags;
+
+		if ((avail_wrap != !!(flags & VRING_DESC_F_AVAIL)) ||
+		    (avail_wrap == !!(flags & VRING_DESC_F_USED)))
+			return nb_avail_entries;
+
+		if (!(flags & VRING_DESC_F_NEXT))
+			++nb_avail_entries;
+
+		if (unlikely(++avail_idx >= vq->size)) {
+			avail_idx -= vq->size;
+			avail_wrap = !avail_wrap;
+		}
+	}
+
+	return nb_avail_entries;
+}
+
 __rte_always_inline
 static uint16_t
 virtio_dev_tx_packed(struct virtio_net *dev,
@@ -3497,6 +3526,10 @@ virtio_dev_tx_packed(struct virtio_net *dev,
 {
 	uint32_t pkt_idx = 0;
 
+	count = get_nb_avail_entries_packed(vq, count);
+	if (count == 0)
+		return 0;
+
 	if (rte_pktmbuf_alloc_bulk(mbuf_pool, pkts, count)) {
 		vq->stats.mbuf_alloc_failed += count;
 		return 0;
-- 
2.41.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path
  2024-03-28 23:33 [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path Andrey Ignatov
@ 2024-03-28 23:44 ` Stephen Hemminger
  2024-03-29  0:10   ` Andrey Ignatov
  2024-04-03 10:19 ` [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path Maxime Coquelin
  2024-06-12  8:32 ` Maxime Coquelin
  2 siblings, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2024-03-28 23:44 UTC (permalink / raw)
  To: Andrey Ignatov; +Cc: dev, Maxime Coquelin, Chenbo Xia, Wei Shen

On Thu, 28 Mar 2024 16:33:38 -0700
Andrey Ignatov <rdna@apple.com> wrote:

>  
> +static __rte_always_inline uint16_t
> +get_nb_avail_entries_packed(const struct vhost_virtqueue *__rte_restrict vq,
> +			    uint16_t max_nb_avail_entries)
> +{

You don't need always inline, the compiler will do it anyway.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path
  2024-03-28 23:44 ` Stephen Hemminger
@ 2024-03-29  0:10   ` Andrey Ignatov
  2024-03-29  2:53     ` Stephen Hemminger
  0 siblings, 1 reply; 11+ messages in thread
From: Andrey Ignatov @ 2024-03-29  0:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Maxime Coquelin, Chenbo Xia, Wei Shen

On Thu, Mar 28, 2024 at 04:44:26PM -0700, Stephen Hemminger wrote:
> On Thu, 28 Mar 2024 16:33:38 -0700
> Andrey Ignatov <rdna@apple.com> wrote:
> 
> >  
> > +static __rte_always_inline uint16_t
> > +get_nb_avail_entries_packed(const struct vhost_virtqueue *__rte_restrict vq,
> > +			    uint16_t max_nb_avail_entries)
> > +{
> 
> You don't need always inline, the compiler will do it anyway.

I can remove it in v2, but it's not completely obvious to me how is it
decided when to specify it explicitly and when not?

I see plenty of __rte_always_inline in this file:

% git grep -c '^static __rte_always_inline' lib/vhost/virtio_net.c
lib/vhost/virtio_net.c:66

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path
  2024-03-29  0:10   ` Andrey Ignatov
@ 2024-03-29  2:53     ` Stephen Hemminger
  2024-03-29 13:04       ` Maxime Coquelin
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2024-03-29  2:53 UTC (permalink / raw)
  To: Andrey Ignatov; +Cc: dev, Maxime Coquelin, Chenbo Xia, Wei Shen

On Thu, 28 Mar 2024 17:10:42 -0700
Andrey Ignatov <rdna@apple.com> wrote:

> > 
> > You don't need always inline, the compiler will do it anyway.  
> 
> I can remove it in v2, but it's not completely obvious to me how is it
> decided when to specify it explicitly and when not?
> 
> I see plenty of __rte_always_inline in this file:
> 
> % git grep -c '^static __rte_always_inline' lib/vhost/virtio_net.c
> lib/vhost/virtio_net.c:66


Cargo cult really.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path
  2024-03-29  2:53     ` Stephen Hemminger
@ 2024-03-29 13:04       ` Maxime Coquelin
  2024-03-29 13:42         ` The effect of inlining Morten Brørup
  0 siblings, 1 reply; 11+ messages in thread
From: Maxime Coquelin @ 2024-03-29 13:04 UTC (permalink / raw)
  To: Stephen Hemminger, Andrey Ignatov; +Cc: dev, Chenbo Xia, Wei Shen

Hi Stephen,

On 3/29/24 03:53, Stephen Hemminger wrote:
> On Thu, 28 Mar 2024 17:10:42 -0700
> Andrey Ignatov <rdna@apple.com> wrote:
> 
>>>
>>> You don't need always inline, the compiler will do it anyway.
>>
>> I can remove it in v2, but it's not completely obvious to me how is it
>> decided when to specify it explicitly and when not?
>>
>> I see plenty of __rte_always_inline in this file:
>>
>> % git grep -c '^static __rte_always_inline' lib/vhost/virtio_net.c
>> lib/vhost/virtio_net.c:66
> 
> 
> Cargo cult really.
> 

Cargo cult... really?

Well, I just did a quick test by comparing IO forwarding with testpmd
between main branch and with adding a patch that removes all the
inline/noinline in lib/vhost/virtio_net.c [0].

main branch: 14.63Mpps
main branch - inline/noinline: 10.24Mpps

Andrey, thanks for the patch, I'll have a look at it next week.

Maxime

[0]: https://pastebin.com/72P2npZ0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* The effect of inlining
  2024-03-29 13:04       ` Maxime Coquelin
@ 2024-03-29 13:42         ` Morten Brørup
  2024-03-29 20:26           ` Tyler Retzlaff
  2024-04-01 15:20           ` Mattias Rönnblom
  0 siblings, 2 replies; 11+ messages in thread
From: Morten Brørup @ 2024-03-29 13:42 UTC (permalink / raw)
  To: Maxime Coquelin, Stephen Hemminger, Andrey Ignatov
  Cc: dev, Chenbo Xia, Wei Shen, techboard

+CC techboard

> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Friday, 29 March 2024 14.05
> 
> Hi Stephen,
> 
> On 3/29/24 03:53, Stephen Hemminger wrote:
> > On Thu, 28 Mar 2024 17:10:42 -0700
> > Andrey Ignatov <rdna@apple.com> wrote:
> >
> >>>
> >>> You don't need always inline, the compiler will do it anyway.
> >>
> >> I can remove it in v2, but it's not completely obvious to me how is
> it
> >> decided when to specify it explicitly and when not?
> >>
> >> I see plenty of __rte_always_inline in this file:
> >>
> >> % git grep -c '^static __rte_always_inline' lib/vhost/virtio_net.c
> >> lib/vhost/virtio_net.c:66
> >
> >
> > Cargo cult really.
> >
> 
> Cargo cult... really?
> 
> Well, I just did a quick test by comparing IO forwarding with testpmd
> between main branch and with adding a patch that removes all the
> inline/noinline in lib/vhost/virtio_net.c [0].
> 
> main branch: 14.63Mpps
> main branch - inline/noinline: 10.24Mpps

Thank you for testing this, Maxime. Very interesting!

It is sometimes suggested on techboard meetings that we should convert more inline functions to non-inline for improved API/ABI stability, with the argument that the performance of inlining is negligible.

I think this test proves that the sum of many small (negligible) performance differences it not negligible!

> 
> Andrey, thanks for the patch, I'll have a look at it next week.
> 
> Maxime
> 
> [0]: https://pastebin.com/72P2npZ0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: The effect of inlining
  2024-03-29 13:42         ` The effect of inlining Morten Brørup
@ 2024-03-29 20:26           ` Tyler Retzlaff
  2024-04-01 15:20           ` Mattias Rönnblom
  1 sibling, 0 replies; 11+ messages in thread
From: Tyler Retzlaff @ 2024-03-29 20:26 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Maxime Coquelin, Stephen Hemminger, Andrey Ignatov, dev,
	Chenbo Xia, Wei Shen, techboard

On Fri, Mar 29, 2024 at 02:42:49PM +0100, Morten Brørup wrote:
> +CC techboard
> 
> > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > Sent: Friday, 29 March 2024 14.05
> > 
> > Hi Stephen,
> > 
> > On 3/29/24 03:53, Stephen Hemminger wrote:
> > > On Thu, 28 Mar 2024 17:10:42 -0700
> > > Andrey Ignatov <rdna@apple.com> wrote:
> > >
> > >>>
> > >>> You don't need always inline, the compiler will do it anyway.
> > >>
> > >> I can remove it in v2, but it's not completely obvious to me how is
> > it
> > >> decided when to specify it explicitly and when not?
> > >>
> > >> I see plenty of __rte_always_inline in this file:
> > >>
> > >> % git grep -c '^static __rte_always_inline' lib/vhost/virtio_net.c
> > >> lib/vhost/virtio_net.c:66
> > >
> > >
> > > Cargo cult really.
> > >
> > 
> > Cargo cult... really?
> > 
> > Well, I just did a quick test by comparing IO forwarding with testpmd
> > between main branch and with adding a patch that removes all the
> > inline/noinline in lib/vhost/virtio_net.c [0].
> > 
> > main branch: 14.63Mpps
> > main branch - inline/noinline: 10.24Mpps
> 
> Thank you for testing this, Maxime. Very interesting!
> 
> It is sometimes suggested on techboard meetings that we should convert more inline functions to non-inline for improved API/ABI stability, with the argument that the performance of inlining is negligible.

removing inline functions probably has an even more profound negative
impact when using dynamic linking. for all the value of msvc's dll
scoped security features they do have overheads per-call that can't be
wished away i imagine equivalents in gcc are the same.

> 
> I think this test proves that the sum of many small (negligible) performance differences it not negligible!

sure looks that way, though i think there is some distinction to be made
between inline and *forced* inline.

force inline may be losing us some opportunity for the compiler to
optimize better than is obvious to us.

> 
> > 
> > Andrey, thanks for the patch, I'll have a look at it next week.
> > 
> > Maxime
> > 
> > [0]: https://pastebin.com/72P2npZ0
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: The effect of inlining
  2024-03-29 13:42         ` The effect of inlining Morten Brørup
  2024-03-29 20:26           ` Tyler Retzlaff
@ 2024-04-01 15:20           ` Mattias Rönnblom
  2024-04-03 16:01             ` Morten Brørup
  1 sibling, 1 reply; 11+ messages in thread
From: Mattias Rönnblom @ 2024-04-01 15:20 UTC (permalink / raw)
  To: Morten Brørup, Maxime Coquelin, Stephen Hemminger, Andrey Ignatov
  Cc: dev, Chenbo Xia, Wei Shen, techboard

On 2024-03-29 14:42, Morten Brørup wrote:
> +CC techboard
> 
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Friday, 29 March 2024 14.05
>>
>> Hi Stephen,
>>
>> On 3/29/24 03:53, Stephen Hemminger wrote:
>>> On Thu, 28 Mar 2024 17:10:42 -0700
>>> Andrey Ignatov <rdna@apple.com> wrote:
>>>
>>>>>
>>>>> You don't need always inline, the compiler will do it anyway.
>>>>
>>>> I can remove it in v2, but it's not completely obvious to me how is
>> it
>>>> decided when to specify it explicitly and when not?
>>>>
>>>> I see plenty of __rte_always_inline in this file:
>>>>
>>>> % git grep -c '^static __rte_always_inline' lib/vhost/virtio_net.c
>>>> lib/vhost/virtio_net.c:66
>>>
>>>
>>> Cargo cult really.
>>>
>>
>> Cargo cult... really?
>>
>> Well, I just did a quick test by comparing IO forwarding with testpmd
>> between main branch and with adding a patch that removes all the
>> inline/noinline in lib/vhost/virtio_net.c [0].
>>
>> main branch: 14.63Mpps
>> main branch - inline/noinline: 10.24Mpps
> 
> Thank you for testing this, Maxime. Very interesting!
> 
> It is sometimes suggested on techboard meetings that we should convert more inline functions to non-inline for improved API/ABI stability, with the argument that the performance of inlining is negligible.
> 

I think you are mixing two different (but related) things here.
1) marking functions with the inline family of keywords/attributes
2) keeping function definitions in header files

1) does not affect the ABI, while 2) does. Neither 1) nor 2) affects the 
API (i.e., source-level compatibility).

2) *allows* for function inlining even in non-LTO builds, but doesn't 
force it.

If you don't believe 2) makes a difference performance-wise, it follows 
that you also don't believe LTO makes much of a difference. Both have 
the same effect: allowing the compiler to reason over a larger chunk of 
your program.

Allowing the compiler to inline small, often-called functions is crucial 
for performance, in my experience. If the target symbol tend to be in a 
shared object, the difference is even larger. It's also quite common 
that you see no effect of LTO (other than a reduction of code footprint).

As LTO becomes more practical to use, 2) loses much of its appeal.

If PGO ever becomes practical to use, maybe 1) will as well.

> I think this test proves that the sum of many small (negligible) performance differences it not negligible!
> 
>>
>> Andrey, thanks for the patch, I'll have a look at it next week.
>>
>> Maxime
>>
>> [0]: https://pastebin.com/72P2npZ0
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path
  2024-03-28 23:33 [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path Andrey Ignatov
  2024-03-28 23:44 ` Stephen Hemminger
@ 2024-04-03 10:19 ` Maxime Coquelin
  2024-06-12  8:32 ` Maxime Coquelin
  2 siblings, 0 replies; 11+ messages in thread
From: Maxime Coquelin @ 2024-04-03 10:19 UTC (permalink / raw)
  To: Andrey Ignatov, dev; +Cc: Chenbo Xia, Wei Shen



On 3/29/24 00:33, Andrey Ignatov wrote:
> Currently virtio_dev_tx_packed() always allocates requested @count of
> packets, no matter how many packets are really available on the virtio
> Tx ring. Later it has to free all packets it didn't use and if, for
> example, there were zero available packets on the ring, then all @count
> mbufs would be allocated just to be freed afterwards.
> 
> This wastes CPU cycles since rte_pktmbuf_alloc_bulk() /
> rte_pktmbuf_free_bulk() do quite a lot of work.
> 
> Optimize it by using the same idea as the virtio_dev_tx_split() uses on
> the Tx split path: estimate the number of available entries on the ring
> and allocate only that number of mbufs.
> 
> On the split path it's pretty easy to estimate.
> 
> On the packed path it's more work since it requires checking flags for
> up to @count of descriptors. Still it's much less expensive than the
> alloc/free pair.
> 
> The new get_nb_avail_entries_packed() function doesn't change how
> virtio_dev_tx_packed() works with regard to memory barriers since the
> barrier between checking flags and other descriptor fields is still in
> place later in virtio_dev_tx_batch_packed() and
> virtio_dev_tx_single_packed().
> 
> The difference for a guest transmitting ~17Gbps with MTU 1500 on a `perf
> record` / `perf report` (on lower pps the savings will be bigger):
> 
> * Before the change:
> 
>      Samples: 18K of event 'cycles:P', Event count (approx.): 19206831288
>        Children      Self      Pid:Command
>      -  100.00%   100.00%   798808:dpdk-worker1
>                  <... skip ...>
>                  - 99.09% pkt_burst_io_forward
>                     - 90.26% common_fwd_stream_receive
>                        - 90.04% rte_eth_rx_burst
>                           - 75.53% eth_vhost_rx
>                              - 74.29% rte_vhost_dequeue_burst
>                                 - 71.48% virtio_dev_tx_packed_compliant
>                                    + 17.11% rte_pktmbuf_alloc_bulk
>                                    + 11.80% rte_pktmbuf_free_bulk
>                                    + 2.11% vhost_user_inject_irq
>                                      0.75% rte_pktmbuf_reset
>                                      0.53% __rte_pktmbuf_free_seg_via_array
>                                   0.88% vhost_queue_stats_update
>                           + 13.66% mlx5_rx_burst_vec
>                     + 8.69% common_fwd_stream_transmit
> 
> * After:
> 
>      Samples: 18K of event 'cycles:P', Event count (approx.): 19225310840
>        Children      Self      Pid:Command
>      -  100.00%   100.00%   859754:dpdk-worker1
>                  <... skip ...>
>                  - 98.61% pkt_burst_io_forward
>                     - 86.29% common_fwd_stream_receive
>                        - 85.84% rte_eth_rx_burst
>                           - 61.94% eth_vhost_rx
>                              - 60.05% rte_vhost_dequeue_burst
>                                 - 55.98% virtio_dev_tx_packed_compliant
>                                    + 3.43% rte_pktmbuf_alloc_bulk
>                                    + 2.50% vhost_user_inject_irq
>                                   1.17% vhost_queue_stats_update
>                                   0.76% rte_rwlock_read_unlock
>                                   0.54% rte_rwlock_read_trylock
>                           + 22.21% mlx5_rx_burst_vec
>                     + 12.00% common_fwd_stream_transmit
> 
> It can be seen that virtio_dev_tx_packed_compliant() goes from 71.48% to
> 55.98% with rte_pktmbuf_alloc_bulk() going from 17.11% to 3.43% and
> rte_pktmbuf_free_bulk() going away completely.
> 
> Signed-off-by: Andrey Ignatov <rdna@apple.com>
> ---
>   lib/vhost/virtio_net.c | 33 +++++++++++++++++++++++++++++++++
>   1 file changed, 33 insertions(+)
> 

Thanks for the contribution and the detailed commit message.

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Maxime


^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: The effect of inlining
  2024-04-01 15:20           ` Mattias Rönnblom
@ 2024-04-03 16:01             ` Morten Brørup
  0 siblings, 0 replies; 11+ messages in thread
From: Morten Brørup @ 2024-04-03 16:01 UTC (permalink / raw)
  To: Mattias Rönnblom, Maxime Coquelin, Stephen Hemminger,
	Andrey Ignatov
  Cc: dev, Chenbo Xia, Wei Shen, techboard

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 1 April 2024 17.20
> 
> On 2024-03-29 14:42, Morten Brørup wrote:
> > +CC techboard
> >
> >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >> Sent: Friday, 29 March 2024 14.05
> >>
> >> Hi Stephen,
> >>
> >> On 3/29/24 03:53, Stephen Hemminger wrote:
> >>> On Thu, 28 Mar 2024 17:10:42 -0700
> >>> Andrey Ignatov <rdna@apple.com> wrote:
> >>>
> >>>>>
> >>>>> You don't need always inline, the compiler will do it anyway.
> >>>>
> >>>> I can remove it in v2, but it's not completely obvious to me how is
> >> it
> >>>> decided when to specify it explicitly and when not?
> >>>>
> >>>> I see plenty of __rte_always_inline in this file:
> >>>>
> >>>> % git grep -c '^static __rte_always_inline' lib/vhost/virtio_net.c
> >>>> lib/vhost/virtio_net.c:66
> >>>
> >>>
> >>> Cargo cult really.
> >>>
> >>
> >> Cargo cult... really?
> >>
> >> Well, I just did a quick test by comparing IO forwarding with testpmd
> >> between main branch and with adding a patch that removes all the
> >> inline/noinline in lib/vhost/virtio_net.c [0].
> >>
> >> main branch: 14.63Mpps
> >> main branch - inline/noinline: 10.24Mpps
> >
> > Thank you for testing this, Maxime. Very interesting!
> >
> > It is sometimes suggested on techboard meetings that we should convert
> more inline functions to non-inline for improved API/ABI stability, with
> the argument that the performance of inlining is negligible.
> >
> 
> I think you are mixing two different (but related) things here.
> 1) marking functions with the inline family of keywords/attributes
> 2) keeping function definitions in header files

I'm talking about 2. The reason for wanting to avoid inline function definitions in header files is to hide more of the implementation behind the API, thus making it easier to change the implementation without breaking the API/ABI. Sorry about not making this clear.

> 
> 1) does not affect the ABI, while 2) does. Neither 1) nor 2) affects the
> API (i.e., source-level compatibility).
> 
> 2) *allows* for function inlining even in non-LTO builds, but doesn't
> force it.
> 
> If you don't believe 2) makes a difference performance-wise, it follows
> that you also don't believe LTO makes much of a difference. Both have
> the same effect: allowing the compiler to reason over a larger chunk of
> your program.
> 
> Allowing the compiler to inline small, often-called functions is crucial
> for performance, in my experience. If the target symbol tend to be in a
> shared object, the difference is even larger. It's also quite common
> that you see no effect of LTO (other than a reduction of code
> footprint).
> 
> As LTO becomes more practical to use, 2) loses much of its appeal.
> 
> If PGO ever becomes practical to use, maybe 1) will as well.
> 
> > I think this test proves that the sum of many small (negligible)
> performance differences it not negligible!
> >
> >>
> >> Andrey, thanks for the patch, I'll have a look at it next week.
> >>
> >> Maxime
> >>
> >> [0]: https://pastebin.com/72P2npZ0
> >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path
  2024-03-28 23:33 [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path Andrey Ignatov
  2024-03-28 23:44 ` Stephen Hemminger
  2024-04-03 10:19 ` [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path Maxime Coquelin
@ 2024-06-12  8:32 ` Maxime Coquelin
  2 siblings, 0 replies; 11+ messages in thread
From: Maxime Coquelin @ 2024-06-12  8:32 UTC (permalink / raw)
  To: Andrey Ignatov, dev; +Cc: Chenbo Xia, Wei Shen



On 3/29/24 00:33, Andrey Ignatov wrote:
> Currently virtio_dev_tx_packed() always allocates requested @count of
> packets, no matter how many packets are really available on the virtio
> Tx ring. Later it has to free all packets it didn't use and if, for
> example, there were zero available packets on the ring, then all @count
> mbufs would be allocated just to be freed afterwards.
> 
> This wastes CPU cycles since rte_pktmbuf_alloc_bulk() /
> rte_pktmbuf_free_bulk() do quite a lot of work.
> 
> Optimize it by using the same idea as the virtio_dev_tx_split() uses on
> the Tx split path: estimate the number of available entries on the ring
> and allocate only that number of mbufs.
> 
> On the split path it's pretty easy to estimate.
> 
> On the packed path it's more work since it requires checking flags for
> up to @count of descriptors. Still it's much less expensive than the
> alloc/free pair.
> 
> The new get_nb_avail_entries_packed() function doesn't change how
> virtio_dev_tx_packed() works with regard to memory barriers since the
> barrier between checking flags and other descriptor fields is still in
> place later in virtio_dev_tx_batch_packed() and
> virtio_dev_tx_single_packed().
> 
> The difference for a guest transmitting ~17Gbps with MTU 1500 on a `perf
> record` / `perf report` (on lower pps the savings will be bigger):
> 
> * Before the change:
> 
>      Samples: 18K of event 'cycles:P', Event count (approx.): 19206831288
>        Children      Self      Pid:Command
>      -  100.00%   100.00%   798808:dpdk-worker1
>                  <... skip ...>
>                  - 99.09% pkt_burst_io_forward
>                     - 90.26% common_fwd_stream_receive
>                        - 90.04% rte_eth_rx_burst
>                           - 75.53% eth_vhost_rx
>                              - 74.29% rte_vhost_dequeue_burst
>                                 - 71.48% virtio_dev_tx_packed_compliant
>                                    + 17.11% rte_pktmbuf_alloc_bulk
>                                    + 11.80% rte_pktmbuf_free_bulk
>                                    + 2.11% vhost_user_inject_irq
>                                      0.75% rte_pktmbuf_reset
>                                      0.53% __rte_pktmbuf_free_seg_via_array
>                                   0.88% vhost_queue_stats_update
>                           + 13.66% mlx5_rx_burst_vec
>                     + 8.69% common_fwd_stream_transmit
> 
> * After:
> 
>      Samples: 18K of event 'cycles:P', Event count (approx.): 19225310840
>        Children      Self      Pid:Command
>      -  100.00%   100.00%   859754:dpdk-worker1
>                  <... skip ...>
>                  - 98.61% pkt_burst_io_forward
>                     - 86.29% common_fwd_stream_receive
>                        - 85.84% rte_eth_rx_burst
>                           - 61.94% eth_vhost_rx
>                              - 60.05% rte_vhost_dequeue_burst
>                                 - 55.98% virtio_dev_tx_packed_compliant
>                                    + 3.43% rte_pktmbuf_alloc_bulk
>                                    + 2.50% vhost_user_inject_irq
>                                   1.17% vhost_queue_stats_update
>                                   0.76% rte_rwlock_read_unlock
>                                   0.54% rte_rwlock_read_trylock
>                           + 22.21% mlx5_rx_burst_vec
>                     + 12.00% common_fwd_stream_transmit
> 
> It can be seen that virtio_dev_tx_packed_compliant() goes from 71.48% to
> 55.98% with rte_pktmbuf_alloc_bulk() going from 17.11% to 3.43% and
> rte_pktmbuf_free_bulk() going away completely.
> 
> Signed-off-by: Andrey Ignatov <rdna@apple.com>
> ---
>   lib/vhost/virtio_net.c | 33 +++++++++++++++++++++++++++++++++
>   1 file changed, 33 insertions(+)
> 

Applied to next-virtio/for-next-net.

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-06-12  8:32 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-28 23:33 [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path Andrey Ignatov
2024-03-28 23:44 ` Stephen Hemminger
2024-03-29  0:10   ` Andrey Ignatov
2024-03-29  2:53     ` Stephen Hemminger
2024-03-29 13:04       ` Maxime Coquelin
2024-03-29 13:42         ` The effect of inlining Morten Brørup
2024-03-29 20:26           ` Tyler Retzlaff
2024-04-01 15:20           ` Mattias Rönnblom
2024-04-03 16:01             ` Morten Brørup
2024-04-03 10:19 ` [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path Maxime Coquelin
2024-06-12  8:32 ` Maxime Coquelin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).