* [dpdk-dev] [PATCH 0/2] net/virtio: optimize virtio net header reset @ 2017-01-11 4:27 Yuanhan Liu 2017-01-11 4:27 ` [dpdk-dev] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling Yuanhan Liu 2017-01-11 4:27 ` [dpdk-dev] [PATCH 2/2] net/virtio: optimize header reset on any layout Yuanhan Liu 0 siblings, 2 replies; 16+ messages in thread From: Yuanhan Liu @ 2017-01-11 4:27 UTC (permalink / raw) To: dev; +Cc: Tan Jianfeng, Wang Zhihong, Yuanhan Liu This two patches optimized the virtio net header reset when TSO is not actually used (though it could be enabled). The basic idea is to not reset (assign 0) when it's already 0. This could avoid some severe cache issues. Micro benchmarking shows it could boost the performance up to 20+%. --- Yuanhan Liu (2): net/virtio: fix performance regression due to TSO enabling net/virtio: optimize header reset on any layout drivers/net/virtio/virtio_rxtx.c | 28 ++++++++++++++++++++-------- 1 file changed, 20 insertions(+), 8 deletions(-) -- 1.9.0 ^ permalink raw reply [flat|nested] 16+ messages in thread
* [dpdk-dev] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling 2017-01-11 4:27 [dpdk-dev] [PATCH 0/2] net/virtio: optimize virtio net header reset Yuanhan Liu @ 2017-01-11 4:27 ` Yuanhan Liu 2017-01-11 7:59 ` Maxime Coquelin 2017-01-11 14:51 ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon 2017-01-11 4:27 ` [dpdk-dev] [PATCH 2/2] net/virtio: optimize header reset on any layout Yuanhan Liu 1 sibling, 2 replies; 16+ messages in thread From: Yuanhan Liu @ 2017-01-11 4:27 UTC (permalink / raw) To: dev Cc: Tan Jianfeng, Wang Zhihong, Yuanhan Liu, Olivier Matz, Maxime Coquelin, Michael S. Tsirkin, stable TSO is now enabled, but it's not actually being used by default in a simple L2 forward mode. In such case, we have to zero the virtio net headers, to inform the vhost backend that no offload is being used: hdr->csum_start = 0; hdr->csum_offset = 0; hdr->flags = 0; hdr->gso_type = 0; hdr->gso_size = 0; hdr->hdr_len = 0; Such writes could be very costly; it introduces severe cache issues: The above operations introduce cache write for each packet, which stalls the read operation from the vhost backend. The fact that virtio net header is initiated to zero in PMD driver init stage means that these costly writes are unnecessary and could be avoided: if (hdr->csum_start != 0) hdr->csum_start = 0; And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the performance drop introduced by TSO enabling is recovered: it could be up to 20% in micro benchmarking. Fixes: 58169a9c8153 ("net/virtio: support Tx checksum offload") Fixes: 696573046e9e ("net/virtio: support TSO") Cc: Olivier Matz <olivier.matz@6wind.com> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: stable@dpdk.org Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> --- drivers/net/virtio/virtio_rxtx.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 1e5a6b9..8ec2f1a 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -258,6 +258,12 @@ vtpci_with_feature(hw, VIRTIO_NET_F_HOST_TSO6); } +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + if ((var) != (val)) \ + (var) = (val); \ +} while (0) + static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push) @@ -337,9 +343,9 @@ break; default: - hdr->csum_start = 0; - hdr->csum_offset = 0; - hdr->flags = 0; + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); break; } @@ -355,9 +361,9 @@ cookie->l3_len + cookie->l4_len; } else { - hdr->gso_type = 0; - hdr->gso_size = 0; - hdr->hdr_len = 0; + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); } } -- 1.9.0 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling 2017-01-11 4:27 ` [dpdk-dev] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling Yuanhan Liu @ 2017-01-11 7:59 ` Maxime Coquelin 2017-01-11 8:08 ` Yuanhan Liu 2017-01-11 8:22 ` Olivier MATZ 2017-01-11 14:51 ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon 1 sibling, 2 replies; 16+ messages in thread From: Maxime Coquelin @ 2017-01-11 7:59 UTC (permalink / raw) To: Yuanhan Liu, dev Cc: Tan Jianfeng, Wang Zhihong, Olivier Matz, Michael S. Tsirkin, stable On 01/11/2017 05:27 AM, Yuanhan Liu wrote: > TSO is now enabled, but it's not actually being used by default in a > simple L2 forward mode. In such case, we have to zero the virtio net > headers, to inform the vhost backend that no offload is being used: > > hdr->csum_start = 0; > hdr->csum_offset = 0; > hdr->flags = 0; > > hdr->gso_type = 0; > hdr->gso_size = 0; > hdr->hdr_len = 0; > > Such writes could be very costly; it introduces severe cache issues: > The above operations introduce cache write for each packet, which > stalls the read operation from the vhost backend. > > The fact that virtio net header is initiated to zero in PMD driver > init stage means that these costly writes are unnecessary and could > be avoided: > > if (hdr->csum_start != 0) > hdr->csum_start = 0; > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the > performance drop introduced by TSO enabling is recovered: it could > be up to 20% in micro benchmarking. Very nice! > > Fixes: 58169a9c8153 ("net/virtio: support Tx checksum offload") > Fixes: 696573046e9e ("net/virtio: support TSO") > > Cc: Olivier Matz <olivier.matz@6wind.com> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com> > Cc: Michael S. Tsirkin <mst@redhat.com> > Cc: stable@dpdk.org > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> > --- > drivers/net/virtio/virtio_rxtx.c | 18 ++++++++++++------ > 1 file changed, 12 insertions(+), 6 deletions(-) > > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c > index 1e5a6b9..8ec2f1a 100644 > --- a/drivers/net/virtio/virtio_rxtx.c > +++ b/drivers/net/virtio/virtio_rxtx.c > @@ -258,6 +258,12 @@ > vtpci_with_feature(hw, VIRTIO_NET_F_HOST_TSO6); > } > > +/* avoid write operation when necessary, to lessen cache issues */ > +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ > + if ((var) != (val)) \ > + (var) = (val); \ > +} while (0) As it is intended to go in -stable, I think this is fine to have it only in the driver, but for v17.02, maybe we should have another patch on top that declares it somewhere so that other libs and drivers can make use of it? > + > static inline void > virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, > uint16_t needed, int use_indirect, int can_push) > @@ -337,9 +343,9 @@ > break; > > default: > - hdr->csum_start = 0; > - hdr->csum_offset = 0; > - hdr->flags = 0; > + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); > + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); > + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); > break; > } > > @@ -355,9 +361,9 @@ > cookie->l3_len + > cookie->l4_len; > } else { > - hdr->gso_type = 0; > - hdr->gso_size = 0; > - hdr->hdr_len = 0; > + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); > + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); > + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); > } > } > > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks! Maxime ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling 2017-01-11 7:59 ` Maxime Coquelin @ 2017-01-11 8:08 ` Yuanhan Liu 2017-01-11 8:22 ` Olivier MATZ 1 sibling, 0 replies; 16+ messages in thread From: Yuanhan Liu @ 2017-01-11 8:08 UTC (permalink / raw) To: Maxime Coquelin Cc: dev, Tan Jianfeng, Wang Zhihong, Olivier Matz, Michael S. Tsirkin, stable, Thomas Monjalon On Wed, Jan 11, 2017 at 08:59:28AM +0100, Maxime Coquelin wrote: > >+/* avoid write operation when necessary, to lessen cache issues */ > >+#define ASSIGN_UNLESS_EQUAL(var, val) do { \ > >+ if ((var) != (val)) \ > >+ (var) = (val); \ > >+} while (0) > As it is intended to go in -stable, I think this is fine to have it > only in the driver, but for v17.02, maybe we should have another patch on > top that declares it somewhere so that other libs and drivers can > make use of it? Yes, if it has any other usages :) --yliu ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling 2017-01-11 7:59 ` Maxime Coquelin 2017-01-11 8:08 ` Yuanhan Liu @ 2017-01-11 8:22 ` Olivier MATZ 1 sibling, 0 replies; 16+ messages in thread From: Olivier MATZ @ 2017-01-11 8:22 UTC (permalink / raw) To: Maxime Coquelin Cc: Yuanhan Liu, dev, Tan Jianfeng, Wang Zhihong, Olivier Matz, Michael S. Tsirkin, stable Hi Yuanhan, Thanks for the fix. On Wed, 11 Jan 2017 08:59:28 +0100, Maxime Coquelin <maxime.coquelin@redhat.com> wrote: > On 01/11/2017 05:27 AM, Yuanhan Liu wrote: > > TSO is now enabled, but it's not actually being used by default in a > > simple L2 forward mode. In such case, we have to zero the virtio net > > headers, to inform the vhost backend that no offload is being used: > > > > hdr->csum_start = 0; > > hdr->csum_offset = 0; > > hdr->flags = 0; > > > > hdr->gso_type = 0; > > hdr->gso_size = 0; > > hdr->hdr_len = 0; > > > > Such writes could be very costly; it introduces severe cache issues: > > The above operations introduce cache write for each packet, which > > stalls the read operation from the vhost backend. > > > > The fact that virtio net header is initiated to zero in PMD driver > > init stage means that these costly writes are unnecessary and could > > be avoided: > > > > if (hdr->csum_start != 0) > > hdr->csum_start = 0; > > > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the > > performance drop introduced by TSO enabling is recovered: it could > > be up to 20% in micro benchmarking. > Very nice! Yep, I'm also impressed by the result. I would have think this is something already done by the hardware and transparent to the software. > > > > > Fixes: 58169a9c8153 ("net/virtio: support Tx checksum offload") > > Fixes: 696573046e9e ("net/virtio: support TSO") > > > > Cc: Olivier Matz <olivier.matz@6wind.com> > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com> > > Cc: Michael S. Tsirkin <mst@redhat.com> > > Cc: stable@dpdk.org > > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> > > [...] > > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Reviewed-by: Olivier Matz <olivier.matz@6wind.com> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling 2017-01-11 4:27 ` [dpdk-dev] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling Yuanhan Liu 2017-01-11 7:59 ` Maxime Coquelin @ 2017-01-11 14:51 ` Thomas Monjalon 2017-01-12 2:30 ` Yuanhan Liu 1 sibling, 1 reply; 16+ messages in thread From: Thomas Monjalon @ 2017-01-11 14:51 UTC (permalink / raw) To: Yuanhan Liu, Jianbo Liu, Jerin Jacob, Jan Viktorin, Chao Zhu Cc: dev, Tan Jianfeng, Wang Zhihong, Olivier Matz, Maxime Coquelin, Michael S. Tsirkin 2017-01-11 12:27, Yuanhan Liu: > The fact that virtio net header is initiated to zero in PMD driver > init stage means that these costly writes are unnecessary and could > be avoided: > > if (hdr->csum_start != 0) > hdr->csum_start = 0; > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the > performance drop introduced by TSO enabling is recovered: it could > be up to 20% in micro benchmarking. This patch is adding a condition to assignments. We need a benchmark on other architectures like ARM. Please anyone? [...] > +/* avoid write operation when necessary, to lessen cache issues */ > +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ > + if ((var) != (val)) \ > + (var) = (val); \ > +} while (0) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling 2017-01-11 14:51 ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon @ 2017-01-12 2:30 ` Yuanhan Liu 2017-01-12 15:02 ` Jan Viktorin 0 siblings, 1 reply; 16+ messages in thread From: Yuanhan Liu @ 2017-01-12 2:30 UTC (permalink / raw) To: Thomas Monjalon Cc: Jianbo Liu, Jerin Jacob, Jan Viktorin, Chao Zhu, dev, Tan Jianfeng, Wang Zhihong, Olivier Matz, Maxime Coquelin, Michael S. Tsirkin On Wed, Jan 11, 2017 at 03:51:22PM +0100, Thomas Monjalon wrote: > 2017-01-11 12:27, Yuanhan Liu: > > The fact that virtio net header is initiated to zero in PMD driver > > init stage means that these costly writes are unnecessary and could > > be avoided: > > > > if (hdr->csum_start != 0) > > hdr->csum_start = 0; > > > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the > > performance drop introduced by TSO enabling is recovered: it could > > be up to 20% in micro benchmarking. > > This patch is adding a condition to assignments. > We need a benchmark on other architectures like ARM. Please anyone? I think the cost of condition should be way lower than the cost from the penalty introduced by the cache issue, that I don't see it would perform bad on other platforms. But, of course, testing is always welcome! --yliu > > > [...] > > +/* avoid write operation when necessary, to lessen cache issues */ > > +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ > > + if ((var) != (val)) \ > > + (var) = (val); \ > > +} while (0) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling 2017-01-12 2:30 ` Yuanhan Liu @ 2017-01-12 15:02 ` Jan Viktorin 2017-01-13 6:13 ` Yuanhan Liu 0 siblings, 1 reply; 16+ messages in thread From: Jan Viktorin @ 2017-01-12 15:02 UTC (permalink / raw) To: Yuanhan Liu Cc: Thomas Monjalon, Jianbo Liu, Jerin Jacob, Chao Zhu, dev, Tan Jianfeng, Wang Zhihong, Olivier Matz, Maxime Coquelin, Michael S. Tsirkin, Orsák Michal On Thu, 12 Jan 2017 10:30:58 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote: > On Wed, Jan 11, 2017 at 03:51:22PM +0100, Thomas Monjalon wrote: > > 2017-01-11 12:27, Yuanhan Liu: > > > The fact that virtio net header is initiated to zero in PMD driver > > > init stage means that these costly writes are unnecessary and could > > > be avoided: > > > > > > if (hdr->csum_start != 0) > > > hdr->csum_start = 0; > > > > > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the > > > performance drop introduced by TSO enabling is recovered: it could > > > be up to 20% in micro benchmarking. > > > > This patch is adding a condition to assignments. > > We need a benchmark on other architectures like ARM. Please anyone? > > I think the cost of condition should be way lower than the cost from the > penalty introduced by the cache issue, that I don't see it would perform > bad on other platforms. > > But, of course, testing is always welcome! > > --yliu Hello, we've done a synthetic measurement, principle briefly: == Without condition check == start = gettimeofday(); for (i = 0; i < 1024*1024*128; ++i) { hdr->csum_start = 0; hdr->csum_offset = 0; hdr->flags = 0; } end = gettimeofday(); == With condition check == start = gettimeofday(); for (i = 0; i < 1024*1024*128; ++i) { ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); ASSIGN_UNLESS_EQUAL(hdr->flags, 0); } end = gettimeofday(); == Results == Computed as total time of all threads: for i = 1..THREAD_COUNT: result += end[i] - start[i] cpu threads without-check (ms) with-check Xeon E5-2670 1 516 529 Xeon E5-2670 2 1155 953 Xeon E5-2670 8 8947 5044 Xeon E5-2670 16 23335 16836 Zynq-7020 (armv7) 1 6735 7205 Zynq-7020 (armv7) 2 13753 14418 The advantage for Intel is evident when increasing the number of threads. However, on 32-bit ARMs we might expect some performance drop. Regards Jan > > > > > > [...] > > > +/* avoid write operation when necessary, to lessen cache issues */ > > > +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ > > > + if ((var) != (val)) \ > > > + (var) = (val); \ > > > +} while (0) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling 2017-01-12 15:02 ` Jan Viktorin @ 2017-01-13 6:13 ` Yuanhan Liu 2017-01-16 7:12 ` Yuanhan Liu 0 siblings, 1 reply; 16+ messages in thread From: Yuanhan Liu @ 2017-01-13 6:13 UTC (permalink / raw) To: Jan Viktorin Cc: Thomas Monjalon, Jianbo Liu, Jerin Jacob, Chao Zhu, dev, Tan Jianfeng, Wang Zhihong, Olivier Matz, Maxime Coquelin, Michael S. Tsirkin, Orsák Michal On Thu, Jan 12, 2017 at 04:02:56PM +0100, Jan Viktorin wrote: > On Thu, 12 Jan 2017 10:30:58 +0800 > Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote: > > > On Wed, Jan 11, 2017 at 03:51:22PM +0100, Thomas Monjalon wrote: > > > 2017-01-11 12:27, Yuanhan Liu: > > > > The fact that virtio net header is initiated to zero in PMD driver > > > > init stage means that these costly writes are unnecessary and could > > > > be avoided: > > > > > > > > if (hdr->csum_start != 0) > > > > hdr->csum_start = 0; > > > > > > > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the > > > > performance drop introduced by TSO enabling is recovered: it could > > > > be up to 20% in micro benchmarking. > > > > > > This patch is adding a condition to assignments. > > > We need a benchmark on other architectures like ARM. Please anyone? > > > > I think the cost of condition should be way lower than the cost from the > > penalty introduced by the cache issue, that I don't see it would perform > > bad on other platforms. > > > > But, of course, testing is always welcome! > > > > --yliu > > Hello, > > we've done a synthetic measurement, principle briefly: Thanks! > > == Without condition check == > > start = gettimeofday(); > > for (i = 0; i < 1024*1024*128; ++i) { > hdr->csum_start = 0; > hdr->csum_offset = 0; > hdr->flags = 0; > } > > end = gettimeofday(); > > > == With condition check == > > start = gettimeofday(); > > for (i = 0; i < 1024*1024*128; ++i) { > ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); > ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); > ASSIGN_UNLESS_EQUAL(hdr->flags, 0); > } > > end = gettimeofday(); But it's not the test methodology I'd expect. You are purely testing the instruction cycles. The drop on ARM is something more like "the if instruction takes more cycles than the simple assignment". This macro is used in the case that one process is heavily writing same value (0 here) again and again while another process is heavily read it also again and again. That means cache violation always happen. With this macro, however, this cache issue could be avoided, since no write happens. For such workload, I don't think it would behaviour worse on ARM. --yliu > == Results == > > Computed as total time of all threads: > > for i = 1..THREAD_COUNT: > result += end[i] - start[i] > > cpu threads without-check (ms) with-check > Xeon E5-2670 1 516 529 > Xeon E5-2670 2 1155 953 > Xeon E5-2670 8 8947 5044 > Xeon E5-2670 16 23335 16836 > Zynq-7020 (armv7) 1 6735 7205 > Zynq-7020 (armv7) 2 13753 14418 > > The advantage for Intel is evident when increasing the number > of threads. > > However, on 32-bit ARMs we might expect some performance drop. > > Regards > Jan > > > > > > > > > > [...] > > > > +/* avoid write operation when necessary, to lessen cache issues */ > > > > +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ > > > > + if ((var) != (val)) \ > > > > + (var) = (val); \ > > > > +} while (0) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling 2017-01-13 6:13 ` Yuanhan Liu @ 2017-01-16 7:12 ` Yuanhan Liu [not found] ` <46569522-b2c3-2a33-9111-049b73c79760@stud.fit.vutbr.cz> 0 siblings, 1 reply; 16+ messages in thread From: Yuanhan Liu @ 2017-01-16 7:12 UTC (permalink / raw) To: Jan Viktorin Cc: Thomas Monjalon, Jianbo Liu, Jerin Jacob, Chao Zhu, dev, Tan Jianfeng, Wang Zhihong, Olivier Matz, Maxime Coquelin, Michael S. Tsirkin, Orsák Michal On Fri, Jan 13, 2017 at 02:13:09PM +0800, Yuanhan Liu wrote: > But it's not the test methodology I'd expect. You are purely testing > the instruction cycles. The drop on ARM is something more like "the > if instruction takes more cycles than the simple assignment". > > This macro is used in the case that one process is heavily writing > same value (0 here) again and again while another process is heavily > read it also again and again. That means cache violation always > happen. With this macro, however, this cache issue could be avoided, > since no write happens. > > For such workload, I don't think it would behaviour worse on ARM. No reply yet; I will treat it as no objections, and please shout out if any. Both applied to dpdk-next-virtio. --yliu ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <46569522-b2c3-2a33-9111-049b73c79760@stud.fit.vutbr.cz>]
[parent not found: <20170116111256.GA11439@yliu-dev.sh.intel.com>]
[parent not found: <8e8178c6-caa2-1b6e-10a0-c83820868db5@stud.fit.vutbr.cz>]
* Re: [dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling [not found] ` <8e8178c6-caa2-1b6e-10a0-c83820868db5@stud.fit.vutbr.cz> @ 2017-01-16 11:21 ` Yuanhan Liu [not found] ` <a12848b4-76ef-29bc-f512-81bd8c1b9b76@stud.fit.vutbr.cz> 0 siblings, 1 reply; 16+ messages in thread From: Yuanhan Liu @ 2017-01-16 11:21 UTC (permalink / raw) To: Michal Orsák; +Cc: Thomas Monjalon, Jan Viktorin, dev On Mon, Jan 16, 2017 at 12:14:04PM +0100, Michal Orsák wrote: > On 16.1.2017 12:12, Yuanhan Liu wrote: > >On Mon, Jan 16, 2017 at 12:05:04PM +0100, Michal Orsák wrote: > >>On 16.1.2017 08:12, Yuanhan Liu wrote: > >>>On Fri, Jan 13, 2017 at 02:13:09PM +0800, Yuanhan Liu wrote: > >>>>But it's not the test methodology I'd expect. You are purely testing > >>>>the instruction cycles. The drop on ARM is something more like "the > >>>>if instruction takes more cycles than the simple assignment". > >>>> > >>>>This macro is used in the case that one process is heavily writing > >>>>same value (0 here) again and again while another process is heavily > >>>>read it also again and again. That means cache violation always > >>>>happen. With this macro, however, this cache issue could be avoided, > >>>>since no write happens. > >>>> > >>>>For such workload, I don't think it would behaviour worse on ARM. > >>>No reply yet; I will treat it as no objections, and please shout out if any. > >>> > >>>Both applied to dpdk-next-virtio. > >>> > >>> --yliu > >>Hello, > >> > >> > >>currently I am running short of time. If you have any test prepared which i > >>can just ran, please send me a link. > >No link, but you could try: > > > >- a typical PVP test > > > >- a txonly test: running txonly fwd mode in guest PMD while running > > rxonly in fwd mode. > > > >The second is a micro test, thus I saw way bigger boost. > > > >When are you available for the testing, btw? > 25.1.2017+ Okay, I will hold on a while to apply them. --yliu ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <a12848b4-76ef-29bc-f512-81bd8c1b9b76@stud.fit.vutbr.cz>]
* Re: [dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling [not found] ` <a12848b4-76ef-29bc-f512-81bd8c1b9b76@stud.fit.vutbr.cz> @ 2017-01-30 13:30 ` Yuanhan Liu 2017-01-30 13:54 ` Maxime Coquelin 0 siblings, 1 reply; 16+ messages in thread From: Yuanhan Liu @ 2017-01-30 13:30 UTC (permalink / raw) To: Michal Orsák; +Cc: Thomas Monjalon, Jan Viktorin, dev On Mon, Jan 16, 2017 at 12:26:41PM +0100, Michal Orsák wrote: > >>>>>>For such workload, I don't think it would behaviour worse on ARM. > >>>>>No reply yet; I will treat it as no objections, and please shout out if any. > >>>>> > >>>>>Both applied to dpdk-next-virtio. > >>>>> > >>>>> --yliu > >>>>Hello, > >>>> > >>>> > >>>>currently I am running short of time. If you have any test prepared which i > >>>>can just ran, please send me a link. > >>>No link, but you could try: > >>> > >>>- a typical PVP test > >>> > >>>- a txonly test: running txonly fwd mode in guest PMD while running > >>> rxonly in fwd mode. > >>> > >>>The second is a micro test, thus I saw way bigger boost. > >>> > >>>When are you available for the testing, btw? > >>25.1.2017+ > >Okay, I will hold on a while to apply them. > Ok, I will send you results when I have them. Again, no reply yet. I'm appying them to dpdk-next-virtio. I really don't think it could perform worse in ARM, as it removes a costly cache invalidation operation (which should be more expensive than the instruction cycles). OTOH, again, testing is welcome, if it's later proved to be worse on ARM (which I highly doubt), I could revert them. --yliu ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling 2017-01-30 13:30 ` Yuanhan Liu @ 2017-01-30 13:54 ` Maxime Coquelin 2017-01-30 14:10 ` Yuanhan Liu 0 siblings, 1 reply; 16+ messages in thread From: Maxime Coquelin @ 2017-01-30 13:54 UTC (permalink / raw) To: Yuanhan Liu, Michal Orsák; +Cc: Thomas Monjalon, Jan Viktorin, dev On 01/30/2017 02:30 PM, Yuanhan Liu wrote: > On Mon, Jan 16, 2017 at 12:26:41PM +0100, Michal Orsák wrote: >>>>>>>> For such workload, I don't think it would behaviour worse on ARM. >>>>>>> No reply yet; I will treat it as no objections, and please shout out if any. >>>>>>> >>>>>>> Both applied to dpdk-next-virtio. >>>>>>> >>>>>>> --yliu >>>>>> Hello, >>>>>> >>>>>> >>>>>> currently I am running short of time. If you have any test prepared which i >>>>>> can just ran, please send me a link. >>>>> No link, but you could try: >>>>> >>>>> - a typical PVP test >>>>> >>>>> - a txonly test: running txonly fwd mode in guest PMD while running >>>>> rxonly in fwd mode. >>>>> >>>>> The second is a micro test, thus I saw way bigger boost. >>>>> >>>>> When are you available for the testing, btw? >>>> 25.1.2017+ >>> Okay, I will hold on a while to apply them. >> Ok, I will send you results when I have them. > > Again, no reply yet. I'm appying them to dpdk-next-virtio. I really > don't think it could perform worse in ARM, as it removes a costly > cache invalidation operation (which should be more expensive than the > instruction cycles). > > OTOH, again, testing is welcome, if it's later proved to be worse on > ARM (which I highly doubt), I could revert them. Or have a per-architecture definition of ASSIGN_UNLESS_EQUAL if it proved to regress on ARM? Maxime > > --yliu > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling 2017-01-30 13:54 ` Maxime Coquelin @ 2017-01-30 14:10 ` Yuanhan Liu 0 siblings, 0 replies; 16+ messages in thread From: Yuanhan Liu @ 2017-01-30 14:10 UTC (permalink / raw) To: Maxime Coquelin; +Cc: Michal Orsák, Thomas Monjalon, Jan Viktorin, dev On Mon, Jan 30, 2017 at 02:54:16PM +0100, Maxime Coquelin wrote: > >OTOH, again, testing is welcome, if it's later proved to be worse on > >ARM (which I highly doubt), I could revert them. > Or have a per-architecture definition of ASSIGN_UNLESS_EQUAL if it > proved to regress on ARM? Yes, we could try that. --yliu ^ permalink raw reply [flat|nested] 16+ messages in thread
* [dpdk-dev] [PATCH 2/2] net/virtio: optimize header reset on any layout 2017-01-11 4:27 [dpdk-dev] [PATCH 0/2] net/virtio: optimize virtio net header reset Yuanhan Liu 2017-01-11 4:27 ` [dpdk-dev] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling Yuanhan Liu @ 2017-01-11 4:27 ` Yuanhan Liu 2017-01-11 8:01 ` Maxime Coquelin 1 sibling, 1 reply; 16+ messages in thread From: Yuanhan Liu @ 2017-01-11 4:27 UTC (permalink / raw) To: dev Cc: Tan Jianfeng, Wang Zhihong, Yuanhan Liu, Maxime Coquelin, Michael S. Tsirkin, stable When any layout is used, the header is stored in the head room of mbuf. mbuf is allocated and filled by user, means there is no gurateen the header is all zero for non TSO case. Therefore, we have to do the reset by ourself: memest(hdr, 0, head_size); The memset has two impacts on performance: - memset could not be inlined, which is a bit costly. - more importantly, it touches the mbuf, which could introduce severe cache issues as described by former patch. Similiary, we could do the same trick: reset just when necessary, when the corresponding field is already 0, which is likely true for a simple l2 forward case. It could boost the performance up to 20+% in micro benchmarking. Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: stable@dpdk.org Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> --- drivers/net/virtio/virtio_rxtx.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 8ec2f1a..5ca3a88 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -292,8 +292,14 @@ hdr = (struct virtio_net_hdr *) rte_pktmbuf_prepend(cookie, head_size); /* if offload disabled, it is not zeroed below, do it now */ - if (offload == 0) - memset(hdr, 0, head_size); + if (offload == 0) { + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } } else if (use_indirect) { /* setup tx ring slot to point to indirect * descriptor list stored in reserved region. -- 1.9.0 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/virtio: optimize header reset on any layout 2017-01-11 4:27 ` [dpdk-dev] [PATCH 2/2] net/virtio: optimize header reset on any layout Yuanhan Liu @ 2017-01-11 8:01 ` Maxime Coquelin 0 siblings, 0 replies; 16+ messages in thread From: Maxime Coquelin @ 2017-01-11 8:01 UTC (permalink / raw) To: Yuanhan Liu, dev; +Cc: Tan Jianfeng, Wang Zhihong, Michael S. Tsirkin, stable On 01/11/2017 05:27 AM, Yuanhan Liu wrote: > When any layout is used, the header is stored in the head room of mbuf. > mbuf is allocated and filled by user, means there is no gurateen the > header is all zero for non TSO case. Therefore, we have to do the reset > by ourself: > > memest(hdr, 0, head_size); > > The memset has two impacts on performance: > > - memset could not be inlined, which is a bit costly. > - more importantly, it touches the mbuf, which could introduce severe > cache issues as described by former patch. > > Similiary, we could do the same trick: reset just when necessary, when > the corresponding field is already 0, which is likely true for a simple > l2 forward case. It could boost the performance up to 20+% in micro > benchmarking. > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com> > Cc: Michael S. Tsirkin <mst@redhat.com> > Cc: stable@dpdk.org > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> > --- > drivers/net/virtio/virtio_rxtx.c | 10 ++++++++-- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c > index 8ec2f1a..5ca3a88 100644 > --- a/drivers/net/virtio/virtio_rxtx.c > +++ b/drivers/net/virtio/virtio_rxtx.c > @@ -292,8 +292,14 @@ > hdr = (struct virtio_net_hdr *) > rte_pktmbuf_prepend(cookie, head_size); > /* if offload disabled, it is not zeroed below, do it now */ > - if (offload == 0) > - memset(hdr, 0, head_size); > + if (offload == 0) { > + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); > + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); > + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); > + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); > + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); > + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); > + } > } else if (use_indirect) { > /* setup tx ring slot to point to indirect > * descriptor list stored in reserved region. > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks! Maxime ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2017-01-30 14:08 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-01-11 4:27 [dpdk-dev] [PATCH 0/2] net/virtio: optimize virtio net header reset Yuanhan Liu 2017-01-11 4:27 ` [dpdk-dev] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling Yuanhan Liu 2017-01-11 7:59 ` Maxime Coquelin 2017-01-11 8:08 ` Yuanhan Liu 2017-01-11 8:22 ` Olivier MATZ 2017-01-11 14:51 ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon 2017-01-12 2:30 ` Yuanhan Liu 2017-01-12 15:02 ` Jan Viktorin 2017-01-13 6:13 ` Yuanhan Liu 2017-01-16 7:12 ` Yuanhan Liu [not found] ` <46569522-b2c3-2a33-9111-049b73c79760@stud.fit.vutbr.cz> [not found] ` <20170116111256.GA11439@yliu-dev.sh.intel.com> [not found] ` <8e8178c6-caa2-1b6e-10a0-c83820868db5@stud.fit.vutbr.cz> 2017-01-16 11:21 ` Yuanhan Liu [not found] ` <a12848b4-76ef-29bc-f512-81bd8c1b9b76@stud.fit.vutbr.cz> 2017-01-30 13:30 ` Yuanhan Liu 2017-01-30 13:54 ` Maxime Coquelin 2017-01-30 14:10 ` Yuanhan Liu 2017-01-11 4:27 ` [dpdk-dev] [PATCH 2/2] net/virtio: optimize header reset on any layout Yuanhan Liu 2017-01-11 8:01 ` Maxime Coquelin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).