* [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test @ 2016-12-05 8:26 Zhiyong Yang 2016-12-02 10:00 ` Maxime Coquelin ` (4 more replies) 0 siblings, 5 replies; 44+ messages in thread From: Zhiyong Yang @ 2016-12-05 8:26 UTC (permalink / raw) To: dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev

DPDK code has suffered a severe performance drop in some cases when calling the glibc function memset. See the discussion about memset at http://dpdk.org/ml/archives/dev/2016-October/048628.html
It is necessary to introduce a more efficient function to fix this. One important property of rte_memset is that it gives us clear control over which instruction flow is used.

This patchset introduces rte_memset as a more efficient implementation, and brings an obvious perf improvement, especially for small N bytes, in most application scenarios.

Patch 1 implements rte_memset in the file rte_memset.h on the IA platform. The file supports three instruction sets: sse & avx (128 bits), avx2 (256 bits) and avx512 (512 bits). rte_memset makes use of vectorization and inline functions to improve perf on IA. In addition, cache line and memory alignment are fully taken into consideration.

Patch 2 implements a functional autotest to validate that the function works correctly.

Patch 3 implements performance autotests, separately in cache and in memory.

Patch 4 uses rte_memset instead of copy_virtio_net_hdr, which brings 3%~4% performance improvement on the IA platform in virtio/vhost non-mergeable loopback testing.
Zhiyong Yang (4): eal/common: introduce rte_memset on IA platform app/test: add functional autotest for rte_memset app/test: add performance autotest for rte_memset lib/librte_vhost: improve vhost perf using rte_memset app/test/Makefile | 3 + app/test/test_memset.c | 158 +++++++++ app/test/test_memset_perf.c | 348 +++++++++++++++++++ doc/guides/rel_notes/release_17_02.rst | 11 + .../common/include/arch/x86/rte_memset.h | 376 +++++++++++++++++++++ lib/librte_eal/common/include/generic/rte_memset.h | 51 +++ lib/librte_vhost/virtio_net.c | 18 +- 7 files changed, 958 insertions(+), 7 deletions(-) create mode 100644 app/test/test_memset.c create mode 100644 app/test/test_memset_perf.c create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h -- 2.7.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test 2016-12-05 8:26 [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang @ 2016-12-02 10:00 ` Maxime Coquelin 2016-12-06 6:33 ` Yang, Zhiyong 2016-12-05 8:26 ` [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang ` (3 subsequent siblings) 4 siblings, 1 reply; 44+ messages in thread From: Maxime Coquelin @ 2016-12-02 10:00 UTC (permalink / raw) To: Zhiyong Yang, dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev Hi Zhiyong, On 12/05/2016 09:26 AM, Zhiyong Yang wrote: > DPDK code has met performance drop badly in some case when calling glibc > function memset. Reference to discussions about memset in > http://dpdk.org/ml/archives/dev/2016-October/048628.html > It is necessary to introduce more high efficient function to fix it. > One important thing about rte_memset is that we can get clear control > on what instruction flow is used. > > This patchset introduces rte_memset to bring more high efficient > implementation, and will bring obvious perf improvement, especially > for small N bytes in the most application scenarios. > > Patch 1 implements rte_memset in the file rte_memset.h on IA platform > The file supports three types of instruction sets including sse & avx > (128bits), avx2(256bits) and avx512(512bits). rte_memset makes use of > vectorization and inline function to improve the perf on IA. In addition, > cache line and memory alignment are fully taken into consideration. > > Patch 2 implements functional autotest to validates the function whether > to work in a right way. > > Patch 3 implements performance autotest separately in cache and memory. > > Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring 3%~4% > performance improvements on IA platform from virtio/vhost non-mergeable > loopback testing. 
> > Zhiyong Yang (4): > eal/common: introduce rte_memset on IA platform > app/test: add functional autotest for rte_memset > app/test: add performance autotest for rte_memset > lib/librte_vhost: improve vhost perf using rte_memset > > app/test/Makefile | 3 + > app/test/test_memset.c | 158 +++++++++ > app/test/test_memset_perf.c | 348 +++++++++++++++++++ > doc/guides/rel_notes/release_17_02.rst | 11 + > .../common/include/arch/x86/rte_memset.h | 376 +++++++++++++++++++++ > lib/librte_eal/common/include/generic/rte_memset.h | 51 +++ > lib/librte_vhost/virtio_net.c | 18 +- > 7 files changed, 958 insertions(+), 7 deletions(-) > create mode 100644 app/test/test_memset.c > create mode 100644 app/test/test_memset_perf.c > create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h > create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h > Thanks for the series, idea looks good to me. Wouldn't be worth to also use rte_memset in Virtio PMD (not compiled/tested)? : diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 22d97a4..a5f70c4 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, rte_pktmbuf_prepend(cookie, head_size); /* if offload disabled, it is not zeroed below, do it now */ if (offload == 0) - memset(hdr, 0, head_size); + rte_memset(hdr, 0, head_size); } else if (use_indirect) { /* setup tx ring slot to point to indirect * descriptor list stored in reserved region. Cheers, Maxime ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test 2016-12-02 10:00 ` Maxime Coquelin @ 2016-12-06 6:33 ` Yang, Zhiyong 2016-12-06 8:29 ` Maxime Coquelin 0 siblings, 1 reply; 44+ messages in thread From: Yang, Zhiyong @ 2016-12-06 6:33 UTC (permalink / raw) To: Maxime Coquelin, dev; +Cc: yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin Hi, Maxime: > -----Original Message----- > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] > Sent: Friday, December 2, 2016 6:01 PM > To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org > Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce > <bruce.richardson@intel.com>; Ananyev, Konstantin > <konstantin.ananyev@intel.com> > Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset > and related test > > Hi Zhiyong, > > On 12/05/2016 09:26 AM, Zhiyong Yang wrote: > > DPDK code has met performance drop badly in some case when calling > > glibc function memset. Reference to discussions about memset in > > http://dpdk.org/ml/archives/dev/2016-October/048628.html > > It is necessary to introduce more high efficient function to fix it. > > One important thing about rte_memset is that we can get clear control > > on what instruction flow is used. > > > > This patchset introduces rte_memset to bring more high efficient > > implementation, and will bring obvious perf improvement, especially > > for small N bytes in the most application scenarios. > > > > Patch 1 implements rte_memset in the file rte_memset.h on IA platform > > The file supports three types of instruction sets including sse & avx > > (128bits), avx2(256bits) and avx512(512bits). rte_memset makes use of > > vectorization and inline function to improve the perf on IA. In > > addition, cache line and memory alignment are fully taken into > consideration. > > > > Patch 2 implements functional autotest to validates the function > > whether to work in a right way. 
> > > > Patch 3 implements performance autotest separately in cache and memory. > > > > Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring > > 3%~4% performance improvements on IA platform from virtio/vhost > > non-mergeable loopback testing. > > > > Zhiyong Yang (4): > > eal/common: introduce rte_memset on IA platform > > app/test: add functional autotest for rte_memset > > app/test: add performance autotest for rte_memset > > lib/librte_vhost: improve vhost perf using rte_memset > > > > app/test/Makefile | 3 + > > app/test/test_memset.c | 158 +++++++++ > > app/test/test_memset_perf.c | 348 +++++++++++++++++++ > > doc/guides/rel_notes/release_17_02.rst | 11 + > > .../common/include/arch/x86/rte_memset.h | 376 > +++++++++++++++++++++ > > lib/librte_eal/common/include/generic/rte_memset.h | 51 +++ > > lib/librte_vhost/virtio_net.c | 18 +- > > 7 files changed, 958 insertions(+), 7 deletions(-) create mode > > 100644 app/test/test_memset.c create mode 100644 > > app/test/test_memset_perf.c create mode 100644 > > lib/librte_eal/common/include/arch/x86/rte_memset.h > > create mode 100644 > lib/librte_eal/common/include/generic/rte_memset.h > > > > Thanks for the series, idea looks good to me. > > Wouldn't be worth to also use rte_memset in Virtio PMD (not > compiled/tested)? : > I think rte_memset maybe can bring some benefit here, but , I'm not clear how to enter the branch and test it. 
:) thanks Zhiyong > diff --git a/drivers/net/virtio/virtio_rxtx.c > b/drivers/net/virtio/virtio_rxtx.c > index 22d97a4..a5f70c4 100644 > --- a/drivers/net/virtio/virtio_rxtx.c > +++ b/drivers/net/virtio/virtio_rxtx.c > @@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, > struct rte_mbuf *cookie, > rte_pktmbuf_prepend(cookie, head_size); > /* if offload disabled, it is not zeroed below, do it now */ > if (offload == 0) > - memset(hdr, 0, head_size); > + rte_memset(hdr, 0, head_size); > } else if (use_indirect) { > /* setup tx ring slot to point to indirect > * descriptor list stored in reserved region. > > Cheers, > Maxime ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test 2016-12-06 6:33 ` Yang, Zhiyong @ 2016-12-06 8:29 ` Maxime Coquelin 2016-12-07 9:28 ` Yang, Zhiyong 0 siblings, 1 reply; 44+ messages in thread From: Maxime Coquelin @ 2016-12-06 8:29 UTC (permalink / raw) To: Yang, Zhiyong, dev Cc: yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin, Pierre Pfister (ppfister) On 12/06/2016 07:33 AM, Yang, Zhiyong wrote: > Hi, Maxime: > >> -----Original Message----- >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] >> Sent: Friday, December 2, 2016 6:01 PM >> To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org >> Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce >> <bruce.richardson@intel.com>; Ananyev, Konstantin >> <konstantin.ananyev@intel.com> >> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset >> and related test >> >> Hi Zhiyong, >> >> On 12/05/2016 09:26 AM, Zhiyong Yang wrote: >>> DPDK code has met performance drop badly in some case when calling >>> glibc function memset. Reference to discussions about memset in >>> http://dpdk.org/ml/archives/dev/2016-October/048628.html >>> It is necessary to introduce more high efficient function to fix it. >>> One important thing about rte_memset is that we can get clear control >>> on what instruction flow is used. >>> >>> This patchset introduces rte_memset to bring more high efficient >>> implementation, and will bring obvious perf improvement, especially >>> for small N bytes in the most application scenarios. >>> >>> Patch 1 implements rte_memset in the file rte_memset.h on IA platform >>> The file supports three types of instruction sets including sse & avx >>> (128bits), avx2(256bits) and avx512(512bits). rte_memset makes use of >>> vectorization and inline function to improve the perf on IA. In >>> addition, cache line and memory alignment are fully taken into >> consideration. 
>>> >>> Patch 2 implements functional autotest to validates the function >>> whether to work in a right way. >>> >>> Patch 3 implements performance autotest separately in cache and memory. >>> >>> Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring >>> 3%~4% performance improvements on IA platform from virtio/vhost >>> non-mergeable loopback testing. >>> >>> Zhiyong Yang (4): >>> eal/common: introduce rte_memset on IA platform >>> app/test: add functional autotest for rte_memset >>> app/test: add performance autotest for rte_memset >>> lib/librte_vhost: improve vhost perf using rte_memset >>> >>> app/test/Makefile | 3 + >>> app/test/test_memset.c | 158 +++++++++ >>> app/test/test_memset_perf.c | 348 +++++++++++++++++++ >>> doc/guides/rel_notes/release_17_02.rst | 11 + >>> .../common/include/arch/x86/rte_memset.h | 376 >> +++++++++++++++++++++ >>> lib/librte_eal/common/include/generic/rte_memset.h | 51 +++ >>> lib/librte_vhost/virtio_net.c | 18 +- >>> 7 files changed, 958 insertions(+), 7 deletions(-) create mode >>> 100644 app/test/test_memset.c create mode 100644 >>> app/test/test_memset_perf.c create mode 100644 >>> lib/librte_eal/common/include/arch/x86/rte_memset.h >>> create mode 100644 >> lib/librte_eal/common/include/generic/rte_memset.h >>> >> >> Thanks for the series, idea looks good to me. >> >> Wouldn't be worth to also use rte_memset in Virtio PMD (not >> compiled/tested)? : >> > > I think rte_memset maybe can bring some benefit here, but , I'm not clear how to > enter the branch and test it. 
:) Indeed, you will need Pierre's patch: [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set Thanks, Maxime > > thanks > Zhiyong > >> diff --git a/drivers/net/virtio/virtio_rxtx.c >> b/drivers/net/virtio/virtio_rxtx.c >> index 22d97a4..a5f70c4 100644 >> --- a/drivers/net/virtio/virtio_rxtx.c >> +++ b/drivers/net/virtio/virtio_rxtx.c >> @@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, >> struct rte_mbuf *cookie, >> rte_pktmbuf_prepend(cookie, head_size); >> /* if offload disabled, it is not zeroed below, do it now */ >> if (offload == 0) >> - memset(hdr, 0, head_size); >> + rte_memset(hdr, 0, head_size); >> } else if (use_indirect) { >> /* setup tx ring slot to point to indirect >> * descriptor list stored in reserved region. >> >> Cheers, >> Maxime ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test 2016-12-06 8:29 ` Maxime Coquelin @ 2016-12-07 9:28 ` Yang, Zhiyong 2016-12-07 9:37 ` Yuanhan Liu 0 siblings, 1 reply; 44+ messages in thread From: Yang, Zhiyong @ 2016-12-07 9:28 UTC (permalink / raw) To: Maxime Coquelin, dev Cc: yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin, Pierre Pfister (ppfister) Hi, Maxime: > -----Original Message----- > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] > Sent: Tuesday, December 6, 2016 4:30 PM > To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org > Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce > <bruce.richardson@intel.com>; Ananyev, Konstantin > <konstantin.ananyev@intel.com>; Pierre Pfister (ppfister) > <ppfister@cisco.com> > Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset > and related test > > > > On 12/06/2016 07:33 AM, Yang, Zhiyong wrote: > > Hi, Maxime: > > > >> -----Original Message----- > >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] > >> Sent: Friday, December 2, 2016 6:01 PM > >> To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org > >> Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce > >> <bruce.richardson@intel.com>; Ananyev, Konstantin > >> <konstantin.ananyev@intel.com> > >> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset > >> and related test > >> > >> Hi Zhiyong, > >> > >> On 12/05/2016 09:26 AM, Zhiyong Yang wrote: > >>> DPDK code has met performance drop badly in some case when calling > >>> glibc function memset. Reference to discussions about memset in > >>> http://dpdk.org/ml/archives/dev/2016-October/048628.html > >>> It is necessary to introduce more high efficient function to fix it. > >>> One important thing about rte_memset is that we can get clear > >>> control on what instruction flow is used. 
> >>> > >>> This patchset introduces rte_memset to bring more high efficient > >>> implementation, and will bring obvious perf improvement, especially > >>> for small N bytes in the most application scenarios. > >>> > >>> Patch 1 implements rte_memset in the file rte_memset.h on IA > >>> platform The file supports three types of instruction sets including > >>> sse & avx (128bits), avx2(256bits) and avx512(512bits). rte_memset > >>> makes use of vectorization and inline function to improve the perf > >>> on IA. In addition, cache line and memory alignment are fully taken > >>> into > >> consideration. > >>> > >>> Patch 2 implements functional autotest to validates the function > >>> whether to work in a right way. > >>> > >>> Patch 3 implements performance autotest separately in cache and > memory. > >>> > >>> Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring > >>> 3%~4% performance improvements on IA platform from virtio/vhost > >>> non-mergeable loopback testing. > >>> > >>> Zhiyong Yang (4): > >>> eal/common: introduce rte_memset on IA platform > >>> app/test: add functional autotest for rte_memset > >>> app/test: add performance autotest for rte_memset > >>> lib/librte_vhost: improve vhost perf using rte_memset > >>> > >>> app/test/Makefile | 3 + > >>> app/test/test_memset.c | 158 +++++++++ > >>> app/test/test_memset_perf.c | 348 > +++++++++++++++++++ > >>> doc/guides/rel_notes/release_17_02.rst | 11 + > >>> .../common/include/arch/x86/rte_memset.h | 376 > >> +++++++++++++++++++++ > >>> lib/librte_eal/common/include/generic/rte_memset.h | 51 +++ > >>> lib/librte_vhost/virtio_net.c | 18 +- > >>> 7 files changed, 958 insertions(+), 7 deletions(-) create mode > >>> 100644 app/test/test_memset.c create mode 100644 > >>> app/test/test_memset_perf.c create mode 100644 > >>> lib/librte_eal/common/include/arch/x86/rte_memset.h > >>> create mode 100644 > >> lib/librte_eal/common/include/generic/rte_memset.h > >>> > >> > >> Thanks for the series, idea 
looks good to me. > >> > >> Wouldn't be worth to also use rte_memset in Virtio PMD (not > >> compiled/tested)? : > >> > > > > I think rte_memset maybe can bring some benefit here, but , I'm not > > clear how to enter the branch and test it. :) > > Indeed, you will need Pierre's patch: > [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set > > Thanks, > Maxime > > Thank you Maxime. I can see a small, but not obvious, performance improvement here. You know, memset(hdr, 0, head_size); only consumes a few cycles for the virtio pmd, since head_size is only 10 or 12 bytes. I have optimized rte_memset perf further for N=8~15 bytes. The main purpose of introducing rte_memset is that we can use it to avoid the perf drop issue of glibc memset on some platforms, I think. > > > >> diff --git a/drivers/net/virtio/virtio_rxtx.c > >> b/drivers/net/virtio/virtio_rxtx.c > >> index 22d97a4..a5f70c4 100644 > >> --- a/drivers/net/virtio/virtio_rxtx.c > >> +++ b/drivers/net/virtio/virtio_rxtx.c > >> @@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, > >> struct rte_mbuf *cookie, > >> rte_pktmbuf_prepend(cookie, head_size); > >> /* if offload disabled, it is not zeroed below, do it now */ > >> if (offload == 0) > >> - memset(hdr, 0, head_size); > >> + rte_memset(hdr, 0, head_size); > >> } else if (use_indirect) { > >> /* setup tx ring slot to point to indirect > >> * descriptor list stored in reserved region. > >> > >> Cheers, > >> Maxime ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test 2016-12-07 9:28 ` Yang, Zhiyong @ 2016-12-07 9:37 ` Yuanhan Liu 2016-12-07 9:43 ` Yang, Zhiyong 0 siblings, 1 reply; 44+ messages in thread From: Yuanhan Liu @ 2016-12-07 9:37 UTC (permalink / raw) To: Yang, Zhiyong Cc: Maxime Coquelin, dev, Richardson, Bruce, Ananyev, Konstantin, Pierre Pfister (ppfister) On Wed, Dec 07, 2016 at 09:28:17AM +0000, Yang, Zhiyong wrote: > > >> Wouldn't be worth to also use rte_memset in Virtio PMD (not > > >> compiled/tested)? : > > >> > > > > > > I think rte_memset maybe can bring some benefit here, but , I'm not > > > clear how to enter the branch and test it. :) > > > > Indeed, you will need Pierre's patch: > > [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set I will apply it shortly. > > Thanks, > > Maxime > > > > Thank you Maxime. > I can see a little, but not obviously performance improvement here. Are you sure you have run into that code piece? FYI, you have to enable virtio 1.0 explicitly, which is disabled by default. > You know, memset(hdr, 0, head_size); only consumes fewer cycles for virtio pmd. > head_size only 10 or 12 bytes. > I optimize rte_memset perf further for N=8~15 bytes. > The main purpose of Introducing rte_memset is that we can use it > to avoid perf drop issue instead of glibc memset on some platform, I think. For this case (as well as the 4th patch), it's more about making sure rte_memset is inlined. --yliu ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test 2016-12-07 9:37 ` Yuanhan Liu @ 2016-12-07 9:43 ` Yang, Zhiyong 2016-12-07 9:48 ` Yuanhan Liu 0 siblings, 1 reply; 44+ messages in thread From: Yang, Zhiyong @ 2016-12-07 9:43 UTC (permalink / raw) To: Yuanhan Liu Cc: Maxime Coquelin, dev, Richardson, Bruce, Ananyev, Konstantin, Pierre Pfister (ppfister) Hi, yuanhan: > -----Original Message----- > From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com] > Sent: Wednesday, December 7, 2016 5:38 PM > To: Yang, Zhiyong <zhiyong.yang@intel.com> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org; > Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin > <konstantin.ananyev@intel.com>; Pierre Pfister (ppfister) > <ppfister@cisco.com> > Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset > and related test > > On Wed, Dec 07, 2016 at 09:28:17AM +0000, Yang, Zhiyong wrote: > > > >> Wouldn't be worth to also use rte_memset in Virtio PMD (not > > > >> compiled/tested)? : > > > >> > > > > > > > > I think rte_memset maybe can bring some benefit here, but , I'm > > > > not clear how to enter the branch and test it. :) > > > > > > Indeed, you will need Pierre's patch: > > > [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set > > I will apply it shortly. > > > > Thanks, > > > Maxime > > > > > > Thank you Maxime. > > I can see a little, but not obviously performance improvement here. > > Are you you have run into that code piece? FYI, you have to enable virtio 1.0 > explicitly, which is disabled by deafault. Yes. I use the patch from Pierre and set offload = 0 ; Thanks Zhiyong > > > You know, memset(hdr, 0, head_size); only consumes fewer cycles for > virtio pmd. > > head_size only 10 or 12 bytes. > > I optimize rte_memset perf further for N=8~15 bytes. 
> > The main purpose of Introducing rte_memset is that we can use it to > > avoid perf drop issue instead of glibc memset on some platform, I think. > > For this case (as well as the 4th patch), it's more about making sure > rte_memset is inlined. > > --yliu ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test 2016-12-07 9:43 ` Yang, Zhiyong @ 2016-12-07 9:48 ` Yuanhan Liu 0 siblings, 0 replies; 44+ messages in thread From: Yuanhan Liu @ 2016-12-07 9:48 UTC (permalink / raw) To: Yang, Zhiyong Cc: Maxime Coquelin, dev, Richardson, Bruce, Ananyev, Konstantin, Pierre Pfister (ppfister) On Wed, Dec 07, 2016 at 09:43:06AM +0000, Yang, Zhiyong wrote: > > On Wed, Dec 07, 2016 at 09:28:17AM +0000, Yang, Zhiyong wrote: > > > > >> Wouldn't be worth to also use rte_memset in Virtio PMD (not > > > > >> compiled/tested)? : > > > > >> > > > > > > > > > > I think rte_memset maybe can bring some benefit here, but , I'm > > > > > not clear how to enter the branch and test it. :) > > > > > > > > Indeed, you will need Pierre's patch: > > > > [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set > > > > I will apply it shortly. > > > > > > Thanks, > > > > Maxime > > > > > > > > Thank you Maxime. > > > I can see a little, but not obviously performance improvement here. > > > > Are you you have run into that code piece? FYI, you have to enable virtio 1.0 > > explicitly, which is disabled by deafault. > > Yes. I use the patch from Pierre and set offload = 0 ; I meant virtio 1.0. Have you added following options for the QEMU virtio-net device? disable-modern=false --yliu ^ permalink raw reply [flat|nested] 44+ messages in thread
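The QEMU option mentioned above can be passed on the virtio-net device. The invocation below is only a hedged sketch of where `disable-modern=false` goes; the vhost-user socket path, image name, and other parameters are placeholders:

```shell
# Hypothetical QEMU command-line fragment: 'disable-modern=false' exposes
# a virtio 1.0 ("modern") device, which the VERSION_1/can_push TX path
# needs. Adjust paths and netdev details to your setup.
qemu-system-x86_64 -enable-kvm -m 2G \
  -chardev socket,id=char0,path=/tmp/vhost-user0 \
  -netdev type=vhost-user,id=net0,chardev=char0 \
  -device virtio-net-pci,netdev=net0,disable-modern=false \
  -drive file=guest.img
```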
* [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-05 8:26 [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang 2016-12-02 10:00 ` Maxime Coquelin @ 2016-12-05 8:26 ` Zhiyong Yang 2016-12-02 10:25 ` Thomas Monjalon 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang 2016-12-05 8:26 ` [dpdk-dev] [PATCH 2/4] app/test: add functional autotest for rte_memset Zhiyong Yang ` (2 subsequent siblings) 4 siblings, 2 replies; 44+ messages in thread From: Zhiyong Yang @ 2016-12-05 8:26 UTC (permalink / raw) To: dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev, Zhiyong Yang Performance drop has been caused in some cases when DPDK code calls glibc function memset. reference to discussions about memset in http://dpdk.org/ml/archives/dev/2016-October/048628.html It is necessary to introduce more high efficient function to fix it. One important thing about rte_memset is that we can get clear control on what instruction flow is used. This patch supports instruction sets such as sse & avx(128 bits), avx2(256 bits) and avx512(512bits). rte_memset makes full use of vectorization and inline function to improve the perf on IA. In addition, cache line and memory alignment are fully taken into consideration. 
Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com> --- .../common/include/arch/x86/rte_memset.h | 376 +++++++++++++++++++++ lib/librte_eal/common/include/generic/rte_memset.h | 51 +++ 2 files changed, 427 insertions(+) create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h diff --git a/lib/librte_eal/common/include/arch/x86/rte_memset.h b/lib/librte_eal/common/include/arch/x86/rte_memset.h new file mode 100644 index 0000000..3b2d3a3 --- /dev/null +++ b/lib/librte_eal/common/include/arch/x86/rte_memset.h @@ -0,0 +1,376 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2010-2016 Intel Corporation. All rights reserved. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _RTE_MEMSET_X86_64_H_ +#define _RTE_MEMSET_X86_64_H_ + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * @file + * + * Functions for vectorised implementation of memset(). + */ + +#include <stdio.h> +#include <stdint.h> +#include <string.h> +#include <rte_vect.h> + +static inline void * +rte_memset(void *dst, int a, size_t n) __attribute__((always_inline)); + +static inline void +rte_memset_less16(void *dst, int a, size_t n) +{ + uintptr_t dstu = (uintptr_t)dst; + + if (n & 0x01) { + *(uint8_t *)dstu = (uint8_t)a; + dstu = (uintptr_t)((uint8_t *)dstu + 1); + } + if (n & 0x02) { + *(uint16_t *)dstu = (uint8_t)a | (((uint8_t)a) << 8); + dstu = (uintptr_t)((uint16_t *)dstu + 1); + } + if (n & 0x04) { + uint16_t b = ((uint8_t)a | (((uint8_t)a) << 8)); + + *(uint32_t *)dstu = (uint32_t)(b | (b << 16)); + dstu = (uintptr_t)((uint32_t *)dstu + 1); + } + if (n & 0x08) { + uint16_t b = ((uint8_t)a | (((uint8_t)a) << 8)); + uint32_t c = b | (b << 16); + + *(uint32_t *)dstu = c; + *((uint32_t *)dstu + 1) = c; + dstu = (uintptr_t)((uint32_t *)dstu + 2); + } +} + +static inline void +rte_memset16(uint8_t *dst, int8_t a) +{ + __m128i xmm0; + + xmm0 = _mm_set1_epi8(a); + _mm_storeu_si128((__m128i *)dst, xmm0); +} + +static inline void +rte_memset_17to32(void *dst, int a, size_t n) +{ + rte_memset16((uint8_t *)dst, a); + rte_memset16((uint8_t *)dst - 16 + n, a); +} + +#ifdef RTE_MACHINE_CPUFLAG_AVX512 + +/** + * AVX512 
implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+        __m256i ymm0;
+
+        ymm0 = _mm256_set1_epi8(a);
+        _mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+static inline void
+rte_memset64(uint8_t *dst, int8_t a)
+{
+        __m512i zmm0;
+
+        zmm0 = _mm512_set1_epi8(a);
+        _mm512_storeu_si512((void *)dst, zmm0);
+}
+
+static inline void
+rte_memset128blocks(uint8_t *dst, int8_t a, size_t n)
+{
+        __m512i zmm0;
+
+        zmm0 = _mm512_set1_epi8(a);
+        while (n >= 128) {
+                n -= 128;
+                _mm512_store_si512((void *)(dst + 0 * 64), zmm0);
+                _mm512_store_si512((void *)(dst + 1 * 64), zmm0);
+                dst = dst + 128;
+        }
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+        void *ret = dst;
+        size_t dstofss;
+        size_t bits;
+
+        if (n < 16) {
+                rte_memset_less16(dst, a, n);
+                return ret;
+        } else if (n == 16) {
+                rte_memset16((uint8_t *)dst, a);
+                return ret;
+        }
+        if (n <= 32) {
+                rte_memset_17to32(dst, a, n);
+                return ret;
+        }
+        if (n <= 64) {
+                rte_memset32((uint8_t *)dst, a);
+                rte_memset32((uint8_t *)dst - 32 + n, a);
+                return ret;
+        }
+        if (n >= 256) {
+                dstofss = ((uintptr_t)dst & 0x3F);
+                if (dstofss > 0) {
+                        dstofss = 64 - dstofss;
+                        n -= dstofss;
+                        rte_memset64((uint8_t *)dst, a);
+                        dst = (uint8_t *)dst + dstofss;
+                }
+                rte_memset128blocks((uint8_t *)dst, a, n);
+                bits = n;
+                n = n & 127;
+                bits -= n;
+                dst = (uint8_t *)dst + bits;
+        }
+        if (n > 128) {
+                n -= 128;
+                rte_memset64((uint8_t *)dst, a);
+                rte_memset64((uint8_t *)dst + 64, a);
+                dst = (uint8_t *)dst + 128;
+        }
+        if (n > 64) {
+                rte_memset64((uint8_t *)dst, a);
+                rte_memset64((uint8_t *)dst - 64 + n, a);
+                return ret;
+        }
+        if (n > 0)
+                rte_memset64((uint8_t *)dst - 64 + n, a);
+        return ret;
+}
+
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+
+/**
+ * AVX2 implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+        __m256i ymm0;
+
+        ymm0 = _mm256_set1_epi8(a);
+        _mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+static inline void
+rte_memset_33to64(void *dst, int a, size_t n)
+{
+        rte_memset32((uint8_t *)dst, a);
+        rte_memset32((uint8_t *)dst - 32 + n, a);
+}
+
+static inline void
+rte_memset64blocks(uint8_t *dst, int8_t a, size_t n)
+{
+        __m256i ymm0;
+
+        ymm0 = _mm256_set1_epi8(a);
+        while (n >= 64) {
+                n -= 64;
+                _mm256_store_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+                _mm256_store_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm0);
+                dst = (uint8_t *)dst + 64;
+        }
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+        void *ret = dst;
+        size_t dstofss;
+        size_t bits;
+
+        if (n < 16) {
+                rte_memset_less16(dst, a, n);
+                return ret;
+        } else if (n == 16) {
+                rte_memset16((uint8_t *)dst, a);
+                return ret;
+        }
+        if (n <= 32) {
+                rte_memset_17to32(dst, a, n);
+                return ret;
+        }
+        if (n <= 64) {
+                rte_memset_33to64(dst, a, n);
+                return ret;
+        }
+        if (n > 64) {
+                dstofss = (uintptr_t)dst & 0x1F;
+                if (dstofss > 0) {
+                        dstofss = 32 - dstofss;
+                        n -= dstofss;
+                        rte_memset32((uint8_t *)dst, a);
+                        dst = (uint8_t *)dst + dstofss;
+                }
+                rte_memset64blocks((uint8_t *)dst, a, n);
+                bits = n;
+                n = n & 63;
+                bits -= n;
+                dst = (uint8_t *)dst + bits;
+        }
+        if (n > 32) {
+                rte_memset_33to64(dst, a, n);
+                return ret;
+        }
+        if (n > 0)
+                rte_memset32((uint8_t *)dst - 32 + n, a);
+        return ret;
+}
+
+#else /* RTE_MACHINE_CPUFLAG */
+
+/**
+ * SSE && AVX implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+        __m128i xmm0 = _mm_set1_epi8(a);
+
+        _mm_storeu_si128((__m128i *)dst, xmm0);
+        _mm_storeu_si128((__m128i *)(dst + 16), xmm0);
+}
+
+static inline void
+rte_memset16blocks(uint8_t *dst, int8_t a, size_t n)
+{
+        __m128i xmm0 = _mm_set1_epi8(a);
+
+        while (n >= 16) {
+                n -= 16;
+                _mm_store_si128((__m128i *)(dst + 0 * 16), xmm0);
+                dst = (uint8_t *)dst + 16;
+        }
+}
+
+static inline void
+rte_memset64blocks(uint8_t *dst, int8_t a, size_t n)
+{
+        __m128i xmm0 = _mm_set1_epi8(a);
+
+        while (n >= 64) {
+                n -= 64;
+                _mm_store_si128((__m128i *)(dst + 0 * 16), xmm0);
+                _mm_store_si128((__m128i *)(dst + 1 * 16), xmm0);
+                _mm_store_si128((__m128i *)(dst + 2 * 16), xmm0);
+                _mm_store_si128((__m128i *)(dst + 3 * 16), xmm0);
+                dst = (uint8_t *)dst + 64;
+        }
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+        void *ret = dst;
+        size_t dstofss;
+        size_t bits;
+
+        if (n < 16) {
+                rte_memset_less16(dst, a, n);
+                return ret;
+        } else if (n == 16) {
+                rte_memset16((uint8_t *)dst, a);
+                return ret;
+        }
+        if (n <= 32) {
+                rte_memset_17to32(dst, a, n);
+                return ret;
+        }
+        if (n <= 48) {
+                rte_memset32((uint8_t *)dst, a);
+                rte_memset16((uint8_t *)dst - 16 + n, a);
+                return ret;
+        }
+        if (n <= 64) {
+                rte_memset32((uint8_t *)dst, a);
+                rte_memset16((uint8_t *)dst + 32, a);
+                rte_memset16((uint8_t *)dst - 16 + n, a);
+                return ret;
+        }
+        if (n > 64) {
+                dstofss = (uintptr_t)dst & 0xF;
+                if (dstofss > 0) {
+                        dstofss = 16 - dstofss;
+                        n -= dstofss;
+                        rte_memset16((uint8_t *)dst, a);
+                        dst = (uint8_t *)dst + dstofss;
+                }
+                rte_memset64blocks((uint8_t *)dst, a, n);
+                bits = n;
+                n &= 63;
+                bits -= n;
+                dst = (uint8_t *)dst + bits;
+                rte_memset16blocks((uint8_t *)dst, a, n);
+                bits = n;
+                n &= 0xf;
+                bits -= n;
+                dst = (uint8_t *)dst + bits;
+                if (n > 0) {
+                        rte_memset16((uint8_t *)dst - 16 + n, a);
+                        return ret;
+                }
+        }
+        return ret;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMSET_H_ */
diff --git a/lib/librte_eal/common/include/generic/rte_memset.h b/lib/librte_eal/common/include/generic/rte_memset.h
new file mode 100644
index 0000000..416a638
--- /dev/null
+++ b/lib/librte_eal/common/include/generic/rte_memset.h
@@ -0,0 +1,51 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMSET_H_
+#define _RTE_MEMSET_H_
+
+/**
+ * @file
+ *
+ * Functions for vectorised implementation of memset().
+ */
+
+#ifndef _RTE_MEMSET_X86_64_H_
+
+#define rte_memset memset
+
+#else
+
+static void *
+rte_memset(void *dst, int a, size_t n);
+
+#endif
+#endif /* _RTE_MEMSET_H_ */
--
2.7.4

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-05 8:26 ` [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang @ 2016-12-02 10:25 ` Thomas Monjalon 2016-12-08 7:41 ` Yang, Zhiyong 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang 1 sibling, 1 reply; 44+ messages in thread From: Thomas Monjalon @ 2016-12-02 10:25 UTC (permalink / raw) To: Zhiyong Yang Cc: dev, yuanhan.liu, bruce.richardson, konstantin.ananyev, Pablo de Lara 2016-12-05 16:26, Zhiyong Yang: > +#ifndef _RTE_MEMSET_X86_64_H_ Is this implementation specific to 64-bit? > + > +#define rte_memset memset > + > +#else > + > +static void * > +rte_memset(void *dst, int a, size_t n); > + > +#endif If I understand well, rte_memset (as rte_memcpy) is using the most recent instructions available (and enabled) when compiling. It is not adapting the instructions to the run-time CPU. There is no need to downgrade at run-time the instruction set as it is obviously not a supported case, but it would be nice to be able to upgrade a "default compilation" at run-time as it is done in rte_acl. I explain this case more clearly for reference: We can have AVX512 supported in the compiler but disable it when compiling (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost everywhere. When running this binary on a CPU having AVX512 support, it will not benefit of the AVX512 improvement. Though, we can compile an AVX512 version of some functions and use them only if the running CPU is capable. This kind of miracle can be achieved in two ways: 1/ For generic C code compiled with a recent GCC, a function can be built for several CPUs thanks to the attribute target_clones. 2/ For manually optimized functions using CPU-specific intrinsics or asm, it is possible to build them with non-default flags thanks to the attribute target. 
3/ For manually optimized files using CPU-specific intrinsics or asm, we use
specific flags in the makefile.

The function clone in case 1/ is dynamically chosen at run-time through an
ifunc resolver.
The specific functions in cases 2/ and 3/ must be chosen at run-time by
initializing a function pointer thanks to rte_cpu_get_flag_enabled().

Note that rte_hash and software crypto PMDs have a run-time check with
rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
Next step for these libraries?

Back to rte_memset, I think you should try the solution 2/.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-02 10:25 ` Thomas Monjalon @ 2016-12-08 7:41 ` Yang, Zhiyong 2016-12-08 9:26 ` Ananyev, Konstantin 2016-12-08 15:09 ` Thomas Monjalon 0 siblings, 2 replies; 44+ messages in thread From: Yang, Zhiyong @ 2016-12-08 7:41 UTC (permalink / raw) To: Thomas Monjalon Cc: dev, yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin, De Lara Guarch, Pablo HI, Thomas: Sorry for late reply. I have been being always considering your suggestion. > -----Original Message----- > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > Sent: Friday, December 2, 2016 6:25 PM > To: Yang, Zhiyong <zhiyong.yang@intel.com> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > <bruce.richardson@intel.com>; Ananyev, Konstantin > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > IA platform > > 2016-12-05 16:26, Zhiyong Yang: > > +#ifndef _RTE_MEMSET_X86_64_H_ > > Is this implementation specific to 64-bit? > Yes. > > + > > +#define rte_memset memset > > + > > +#else > > + > > +static void * > > +rte_memset(void *dst, int a, size_t n); > > + > > +#endif > > If I understand well, rte_memset (as rte_memcpy) is using the most recent > instructions available (and enabled) when compiling. > It is not adapting the instructions to the run-time CPU. > There is no need to downgrade at run-time the instruction set as it is > obviously not a supported case, but it would be nice to be able to upgrade a > "default compilation" at run-time as it is done in rte_acl. > I explain this case more clearly for reference: > > We can have AVX512 supported in the compiler but disable it when compiling > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost > everywhere. > When running this binary on a CPU having AVX512 support, it will not benefit > of the AVX512 improvement. 
> Though, we can compile an AVX512 version of some functions and use them
> only if the running CPU is capable.
> This kind of miracle can be achieved in two ways:
>
> 1/ For generic C code compiled with a recent GCC, a function can be built for
> several CPUs thanks to the attribute target_clones.
>
> 2/ For manually optimized functions using CPU-specific intrinsics or asm, it is
> possible to build them with non-default flags thanks to the attribute target.
>
> 3/ For manually optimized files using CPU-specific intrinsics or asm, we use
> specifics flags in the makefile.
>
> The function clone in case 1/ is dynamically chosen at run-time through ifunc
> resolver.
> The specific functions in cases 2/ and 3/ must chosen at run-time by
> initializing a function pointer thanks to rte_cpu_get_flag_enabled().
>
> Note that rte_hash and software crypto PMDs have a run-time check with
> rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> Next step for these libraries?
>
> Back to rte_memset, I think you should try the solution 2/.

I have read the ACL code. If I understand well, it is a good idea for complex
algo implementations, but choosing functions at run time brings some overhead.
For a frequently called function that consumes few cycles, that overhead may be
more than the gains the optimization brings. For example, for most applications
in DPDK, memset only sets N = 10 or 12 bytes, which consumes few cycles.

Thanks
Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-08 7:41 ` Yang, Zhiyong @ 2016-12-08 9:26 ` Ananyev, Konstantin 2016-12-08 9:53 ` Yang, Zhiyong 2016-12-08 15:09 ` Thomas Monjalon 1 sibling, 1 reply; 44+ messages in thread From: Ananyev, Konstantin @ 2016-12-08 9:26 UTC (permalink / raw) To: Yang, Zhiyong, Thomas Monjalon Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo Hi Zhiyong, > > HI, Thomas: > Sorry for late reply. I have been being always considering your suggestion. > > > -----Original Message----- > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > > Sent: Friday, December 2, 2016 6:25 PM > > To: Yang, Zhiyong <zhiyong.yang@intel.com> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > <bruce.richardson@intel.com>; Ananyev, Konstantin > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo > > <pablo.de.lara.guarch@intel.com> > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > > IA platform > > > > 2016-12-05 16:26, Zhiyong Yang: > > > +#ifndef _RTE_MEMSET_X86_64_H_ > > > > Is this implementation specific to 64-bit? > > > > Yes. > > > > + > > > +#define rte_memset memset > > > + > > > +#else > > > + > > > +static void * > > > +rte_memset(void *dst, int a, size_t n); > > > + > > > +#endif > > > > If I understand well, rte_memset (as rte_memcpy) is using the most recent > > instructions available (and enabled) when compiling. > > It is not adapting the instructions to the run-time CPU. > > There is no need to downgrade at run-time the instruction set as it is > > obviously not a supported case, but it would be nice to be able to upgrade a > > "default compilation" at run-time as it is done in rte_acl. > > I explain this case more clearly for reference: > > > > We can have AVX512 supported in the compiler but disable it when compiling > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost > > everywhere. 
> > When running this binary on a CPU having AVX512 support, it will not benefit
> > of the AVX512 improvement.
> > Though, we can compile an AVX512 version of some functions and use them
> > only if the running CPU is capable.
> > This kind of miracle can be achieved in two ways:
> >
> > 1/ For generic C code compiled with a recent GCC, a function can be built for
> > several CPUs thanks to the attribute target_clones.
> >
> > 2/ For manually optimized functions using CPU-specific intrinsics or asm, it is
> > possible to build them with non-default flags thanks to the attribute target.
> >
> > 3/ For manually optimized files using CPU-specific intrinsics or asm, we use
> > specifics flags in the makefile.
> >
> > The function clone in case 1/ is dynamically chosen at run-time through ifunc
> > resolver.
> > The specific functions in cases 2/ and 3/ must chosen at run-time by
> > initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> >
> > Note that rte_hash and software crypto PMDs have a run-time check with
> > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > Next step for these libraries?
> >
> > Back to rte_memset, I think you should try the solution 2/.
>
> I have read the ACL code, if I understand well, for complex algo implementation,
> it is good idea, but choosing functions at run time will bring some overhead.
> For frequently called function which consumes small cycles, the overhead maybe
> is more than the gains optimizations brings. For example, for most applications
> in dpdk, memset only set N = 10 or 12 bytes. It consumes fewer cycles.

But then what is the point of having an rte_memset() using vector instructions
at all?
From what you are saying, the most common case is even less than the SSE
register size.
Konstantin

>
> Thanks
> Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-08 9:26 ` Ananyev, Konstantin @ 2016-12-08 9:53 ` Yang, Zhiyong 2016-12-08 10:27 ` Bruce Richardson 2016-12-08 10:30 ` Ananyev, Konstantin 0 siblings, 2 replies; 44+ messages in thread From: Yang, Zhiyong @ 2016-12-08 9:53 UTC (permalink / raw) To: Ananyev, Konstantin, Thomas Monjalon Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo Hi, Konstantin: > -----Original Message----- > From: Ananyev, Konstantin > Sent: Thursday, December 8, 2016 5:26 PM > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > <thomas.monjalon@6wind.com> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > IA platform > > > Hi Zhiyong, > > > > > HI, Thomas: > > Sorry for late reply. I have been being always considering your > suggestion. > > > > > -----Original Message----- > > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > > > Sent: Friday, December 2, 2016 6:25 PM > > > To: Yang, Zhiyong <zhiyong.yang@intel.com> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > <bruce.richardson@intel.com>; Ananyev, Konstantin > > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo > > > <pablo.de.lara.guarch@intel.com> > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce > rte_memset > > > on IA platform > > > > > > 2016-12-05 16:26, Zhiyong Yang: > > > > +#ifndef _RTE_MEMSET_X86_64_H_ > > > > > > Is this implementation specific to 64-bit? > > > > > > > Yes. 
> > > > > > + > > > > +#define rte_memset memset > > > > + > > > > +#else > > > > + > > > > +static void * > > > > +rte_memset(void *dst, int a, size_t n); > > > > + > > > > +#endif > > > > > > If I understand well, rte_memset (as rte_memcpy) is using the most > > > recent instructions available (and enabled) when compiling. > > > It is not adapting the instructions to the run-time CPU. > > > There is no need to downgrade at run-time the instruction set as it > > > is obviously not a supported case, but it would be nice to be able > > > to upgrade a "default compilation" at run-time as it is done in rte_acl. > > > I explain this case more clearly for reference: > > > > > > We can have AVX512 supported in the compiler but disable it when > > > compiling > > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost > > > everywhere. > > > When running this binary on a CPU having AVX512 support, it will not > > > benefit of the AVX512 improvement. > > > Though, we can compile an AVX512 version of some functions and use > > > them only if the running CPU is capable. > > > This kind of miracle can be achieved in two ways: > > > > > > 1/ For generic C code compiled with a recent GCC, a function can be > > > built for several CPUs thanks to the attribute target_clones. > > > > > > 2/ For manually optimized functions using CPU-specific intrinsics or > > > asm, it is possible to build them with non-default flags thanks to the > attribute target. > > > > > > 3/ For manually optimized files using CPU-specific intrinsics or > > > asm, we use specifics flags in the makefile. > > > > > > The function clone in case 1/ is dynamically chosen at run-time > > > through ifunc resolver. > > > The specific functions in cases 2/ and 3/ must chosen at run-time by > > > initializing a function pointer thanks to rte_cpu_get_flag_enabled(). 
> > >
> > > Note that rte_hash and software crypto PMDs have a run-time check with
> > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > > Next step for these libraries?
> > >
> > > Back to rte_memset, I think you should try the solution 2/.
> >
> > I have read the ACL code, if I understand well, for complex algo
> > implementation, it is good idea, but choosing functions at run time
> > will bring some overhead. For frequently called function which
> > consumes small cycles, the overhead maybe is more than the gains
> > optimizations brings. For example, for most applications in dpdk, memset only
> > set N = 10 or 12 bytes. It consumes fewer cycles.
>
> But then what the point to have an rte_memset() using vector instructions at
> all?
> From what you are saying the most common case is even less then SSE
> register size.
> Konstantin

In most cases, memset is used like memset(address, 0, sizeof(struct xxx));
the use case here happens to be small, and I only gave it as an example.
But rte_memset is introduced with the generic case in mind:
sizeof(struct xxx) is not limited to very small sizes such as less than the
SSE register size.
I just want to say that the size in most use cases is not very large, so the
cycles consumed are few; choosing the function at run time is not suitable
once that overhead is considered.

thanks
Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-08 9:53 ` Yang, Zhiyong @ 2016-12-08 10:27 ` Bruce Richardson 2016-12-08 10:30 ` Ananyev, Konstantin 1 sibling, 0 replies; 44+ messages in thread From: Bruce Richardson @ 2016-12-08 10:27 UTC (permalink / raw) To: Yang, Zhiyong Cc: Ananyev, Konstantin, Thomas Monjalon, dev, yuanhan.liu, De Lara Guarch, Pablo On Thu, Dec 08, 2016 at 09:53:12AM +0000, Yang, Zhiyong wrote: > Hi, Konstantin: > > > -----Original Message----- > > From: Ananyev, Konstantin > > Sent: Thursday, December 8, 2016 5:26 PM > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > > <thomas.monjalon@6wind.com> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > <pablo.de.lara.guarch@intel.com> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > > IA platform > > > > > > Hi Zhiyong, > > > > > > > > HI, Thomas: > > > Sorry for late reply. I have been being always considering your > > suggestion. > > > > > > > -----Original Message----- > > > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > > > > Sent: Friday, December 2, 2016 6:25 PM > > > > To: Yang, Zhiyong <zhiyong.yang@intel.com> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > > <bruce.richardson@intel.com>; Ananyev, Konstantin > > > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo > > > > <pablo.de.lara.guarch@intel.com> > > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce > > rte_memset > > > > on IA platform > > > > > > > > 2016-12-05 16:26, Zhiyong Yang: > > > > > +#ifndef _RTE_MEMSET_X86_64_H_ > > > > > > > > Is this implementation specific to 64-bit? > > > > > > > > > > Yes. 
> > > > > > > > + > > > > > +#define rte_memset memset > > > > > + > > > > > +#else > > > > > + > > > > > +static void * > > > > > +rte_memset(void *dst, int a, size_t n); > > > > > + > > > > > +#endif > > > > > > > > If I understand well, rte_memset (as rte_memcpy) is using the most > > > > recent instructions available (and enabled) when compiling. > > > > It is not adapting the instructions to the run-time CPU. > > > > There is no need to downgrade at run-time the instruction set as it > > > > is obviously not a supported case, but it would be nice to be able > > > > to upgrade a "default compilation" at run-time as it is done in rte_acl. > > > > I explain this case more clearly for reference: > > > > > > > > We can have AVX512 supported in the compiler but disable it when > > > > compiling > > > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost > > > > everywhere. > > > > When running this binary on a CPU having AVX512 support, it will not > > > > benefit of the AVX512 improvement. > > > > Though, we can compile an AVX512 version of some functions and use > > > > them only if the running CPU is capable. > > > > This kind of miracle can be achieved in two ways: > > > > > > > > 1/ For generic C code compiled with a recent GCC, a function can be > > > > built for several CPUs thanks to the attribute target_clones. > > > > > > > > 2/ For manually optimized functions using CPU-specific intrinsics or > > > > asm, it is possible to build them with non-default flags thanks to the > > attribute target. > > > > > > > > 3/ For manually optimized files using CPU-specific intrinsics or > > > > asm, we use specifics flags in the makefile. > > > > > > > > The function clone in case 1/ is dynamically chosen at run-time > > > > through ifunc resolver. > > > > The specific functions in cases 2/ and 3/ must chosen at run-time by > > > > initializing a function pointer thanks to rte_cpu_get_flag_enabled(). 
> > > >
> > > > Note that rte_hash and software crypto PMDs have a run-time check with
> > > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > > > Next step for these libraries?
> > > >
> > > > Back to rte_memset, I think you should try the solution 2/.
> > >
> > > I have read the ACL code, if I understand well, for complex algo
> > > implementation, it is good idea, but choosing functions at run time
> > > will bring some overhead. For frequently called function which
> > > consumes small cycles, the overhead maybe is more than the gains
> > > optimizations brings. For example, for most applications in dpdk, memset only
> > > set N = 10 or 12 bytes. It consumes fewer cycles.
> >
> > But then what the point to have an rte_memset() using vector instructions at
> > all?
> > From what you are saying the most common case is even less then SSE
> > register size.
> > Konstantin
>
> For most cases, memset is used such as memset(address, 0, sizeof(struct xxx));
> The use case here is small by accident, I only give an example here.
> but rte_memset is introduced to need consider generic case.
> sizeof(struct xxx) is not limited to very small size, such as less than SSE register size.
> I just want to say that the size for the most use case is not very large, So cycles consumed
> Is not large. It is not suited to choose function at run-time since overhead is considered.

For small copies with sizes specified at compile time, do compilers not fully
inline the memset call with a fixed-size equivalent? I believe some compilers
used to do so with memcpy - which is why we had a macro for it in DPDK, so
that compile-time constant copies would use regular memcpy. If that is also
the case for memset, then we should perhaps specify that rte_memset is only
for relatively large copies, e.g. >64 bytes. In that case, run-time detection
may be worthwhile.

/Bruce

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-08 9:53 ` Yang, Zhiyong 2016-12-08 10:27 ` Bruce Richardson @ 2016-12-08 10:30 ` Ananyev, Konstantin 2016-12-11 12:32 ` Yang, Zhiyong 1 sibling, 1 reply; 44+ messages in thread From: Ananyev, Konstantin @ 2016-12-08 10:30 UTC (permalink / raw) To: Yang, Zhiyong, Thomas Monjalon Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo > -----Original Message----- > From: Yang, Zhiyong > Sent: Thursday, December 8, 2016 9:53 AM > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas Monjalon <thomas.monjalon@6wind.com> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce <bruce.richardson@intel.com>; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform > > Hi, Konstantin: > > > -----Original Message----- > > From: Ananyev, Konstantin > > Sent: Thursday, December 8, 2016 5:26 PM > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > > <thomas.monjalon@6wind.com> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > <pablo.de.lara.guarch@intel.com> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > > IA platform > > > > > > Hi Zhiyong, > > > > > > > > HI, Thomas: > > > Sorry for late reply. I have been being always considering your > > suggestion. 
> > > > > > > -----Original Message----- > > > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > > > > Sent: Friday, December 2, 2016 6:25 PM > > > > To: Yang, Zhiyong <zhiyong.yang@intel.com> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > > <bruce.richardson@intel.com>; Ananyev, Konstantin > > > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo > > > > <pablo.de.lara.guarch@intel.com> > > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce > > rte_memset > > > > on IA platform > > > > > > > > 2016-12-05 16:26, Zhiyong Yang: > > > > > +#ifndef _RTE_MEMSET_X86_64_H_ > > > > > > > > Is this implementation specific to 64-bit? > > > > > > > > > > Yes. > > > > > > > > + > > > > > +#define rte_memset memset > > > > > + > > > > > +#else > > > > > + > > > > > +static void * > > > > > +rte_memset(void *dst, int a, size_t n); > > > > > + > > > > > +#endif > > > > > > > > If I understand well, rte_memset (as rte_memcpy) is using the most > > > > recent instructions available (and enabled) when compiling. > > > > It is not adapting the instructions to the run-time CPU. > > > > There is no need to downgrade at run-time the instruction set as it > > > > is obviously not a supported case, but it would be nice to be able > > > > to upgrade a "default compilation" at run-time as it is done in rte_acl. > > > > I explain this case more clearly for reference: > > > > > > > > We can have AVX512 supported in the compiler but disable it when > > > > compiling > > > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost > > > > everywhere. > > > > When running this binary on a CPU having AVX512 support, it will not > > > > benefit of the AVX512 improvement. > > > > Though, we can compile an AVX512 version of some functions and use > > > > them only if the running CPU is capable. 
> > > > This kind of miracle can be achieved in two ways: > > > > > > > > 1/ For generic C code compiled with a recent GCC, a function can be > > > > built for several CPUs thanks to the attribute target_clones. > > > > > > > > 2/ For manually optimized functions using CPU-specific intrinsics or > > > > asm, it is possible to build them with non-default flags thanks to the > > attribute target. > > > > > > > > 3/ For manually optimized files using CPU-specific intrinsics or > > > > asm, we use specifics flags in the makefile. > > > > > > > > The function clone in case 1/ is dynamically chosen at run-time > > > > through ifunc resolver. > > > > The specific functions in cases 2/ and 3/ must chosen at run-time by > > > > initializing a function pointer thanks to rte_cpu_get_flag_enabled(). > > > > > > > > Note that rte_hash and software crypto PMDs have a run-time check > > > > with > > > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile. > > > > Next step for these libraries? > > > > > > > > Back to rte_memset, I think you should try the solution 2/. > > > > > > I have read the ACL code, if I understand well , for complex algo > > > implementation, it is good idea, but Choosing functions at run time > > > will bring some overhead. For frequently called function Which > > > consumes small cycles, the overhead maybe is more than the gains > > optimizations brings For example, for most applications in dpdk, memset only > > set N = 10 or 12bytes. It consumes fewer cycles. > > > > But then what the point to have an rte_memset() using vector instructions at > > all? > > From what you are saying the most common case is even less then SSE > > register size. > > Konstantin > > For most cases, memset is used such as memset(address, 0, sizeof(struct xxx)); Ok then I suppose for such cases you don't need any special function and memset() would still be the best choice, right? > The use case here is small by accident, I only give an example here. 
> but rte_memset is introduced to need consider generic case.

We can have rte_memset_huge() or so instead, and document that it should be
used for sizes greater than some cutoff point.
Inside it you can just call a function pointer installed at startup (same as
rte_acl_classify() does).
For big sizes, I suppose the price of the extra function pointer call would not
affect performance much.
For sizes smaller than this cutoff point you still can use either
rte_memset_scalar() or just normal rte_memset().
Something like that:

extern void *(*__rte_memset_vector)(void *s, int c, size_t n);

static inline void *
rte_memset_huge(void *s, int c, size_t n)
{
	return __rte_memset_vector(s, c, n);
}

static inline void *
rte_memset(void *s, int c, size_t n)
{
	if (n < XXX)
		return rte_memset_scalar(s, c, n);
	else
		return rte_memset_huge(s, c, n);
}

XXX could be either a define, or could also be a variable, so it can be set up
at startup, depending on the architecture.
Would that work?
Konstantin

> sizeof(struct xxx) is not limited to very small size, such as less than SSE register size.
> I just want to say that the size for the most use case is not very large, So cycles consumed
> Is not large. It is not suited to choose function at run-time since overhead is considered.
>
> thanks
> Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-08 10:30 ` Ananyev, Konstantin @ 2016-12-11 12:32 ` Yang, Zhiyong 2016-12-15 6:51 ` Yang, Zhiyong 0 siblings, 1 reply; 44+ messages in thread From: Yang, Zhiyong @ 2016-12-11 12:32 UTC (permalink / raw) To: Ananyev, Konstantin, Thomas Monjalon Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo Hi, Konstantin, Bruce: > -----Original Message----- > From: Ananyev, Konstantin > Sent: Thursday, December 8, 2016 6:31 PM > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > <thomas.monjalon@6wind.com> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > IA platform > > > > > -----Original Message----- > > From: Yang, Zhiyong > > Sent: Thursday, December 8, 2016 9:53 AM > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > > Monjalon <thomas.monjalon@6wind.com> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > <pablo.de.lara.guarch@intel.com> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset > > on IA platform > > > > Hi, Konstantin: > > > > > -----Original Message----- > > > From: Ananyev, Konstantin > > > Sent: Thursday, December 8, 2016 5:26 PM > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > > > <thomas.monjalon@6wind.com> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > <pablo.de.lara.guarch@intel.com> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset > > > on IA platform > > > > > > > > > Hi Zhiyong, > > > > > > > > > > > HI, Thomas: > > > > Sorry for late reply. I have been being always considering your > > > suggestion. 
> > > > > > > > > -----Original Message----- > > > > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > > > > > Sent: Friday, December 2, 2016 6:25 PM > > > > > To: Yang, Zhiyong <zhiyong.yang@intel.com> > > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > > > <bruce.richardson@intel.com>; Ananyev, Konstantin > > > > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo > > > > > <pablo.de.lara.guarch@intel.com> > > > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce > > > rte_memset > > > > > on IA platform > > > > > > > > > > 2016-12-05 16:26, Zhiyong Yang: > > > > > > +#ifndef _RTE_MEMSET_X86_64_H_ > > > > > > > > > > Is this implementation specific to 64-bit? > > > > > > > > > > > > > Yes. > > > > > > > > > > + > > > > > > +#define rte_memset memset > > > > > > + > > > > > > +#else > > > > > > + > > > > > > +static void * > > > > > > +rte_memset(void *dst, int a, size_t n); > > > > > > + > > > > > > +#endif > > > > > > > > > > If I understand well, rte_memset (as rte_memcpy) is using the > > > > > most recent instructions available (and enabled) when compiling. > > > > > It is not adapting the instructions to the run-time CPU. > > > > > There is no need to downgrade at run-time the instruction set as > > > > > it is obviously not a supported case, but it would be nice to be > > > > > able to upgrade a "default compilation" at run-time as it is done in > rte_acl. > > > > > I explain this case more clearly for reference: > > > > > > > > > > We can have AVX512 supported in the compiler but disable it when > > > > > compiling > > > > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running > > > > > almost everywhere. > > > > > When running this binary on a CPU having AVX512 support, it will > > > > > not benefit of the AVX512 improvement. > > > > > Though, we can compile an AVX512 version of some functions and > > > > > use them only if the running CPU is capable. 
> > > > > This kind of miracle can be achieved in two ways: > > > > > > > > > > 1/ For generic C code compiled with a recent GCC, a function can > > > > > be built for several CPUs thanks to the attribute target_clones. > > > > > > > > > > 2/ For manually optimized functions using CPU-specific > > > > > intrinsics or asm, it is possible to build them with non-default > > > > > flags thanks to the > > > attribute target. > > > > > > > > > > 3/ For manually optimized files using CPU-specific intrinsics or > > > > > asm, we use specifics flags in the makefile. > > > > > > > > > > The function clone in case 1/ is dynamically chosen at run-time > > > > > through ifunc resolver. > > > > > The specific functions in cases 2/ and 3/ must chosen at > > > > > run-time by initializing a function pointer thanks to > rte_cpu_get_flag_enabled(). > > > > > > > > > > Note that rte_hash and software crypto PMDs have a run-time > > > > > check with > > > > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the > Makefile. > > > > > Next step for these libraries? > > > > > > > > > > Back to rte_memset, I think you should try the solution 2/. > > > > > > > > I have read the ACL code, if I understand well , for complex algo > > > > implementation, it is good idea, but Choosing functions at run > > > > time will bring some overhead. For frequently called function > > > > Which consumes small cycles, the overhead maybe is more than the > > > > gains > > > optimizations brings For example, for most applications in dpdk, > > > memset only set N = 10 or 12bytes. It consumes fewer cycles. > > > > > > But then what the point to have an rte_memset() using vector > > > instructions at all? > > > From what you are saying the most common case is even less then SSE > > > register size. 
> > > Konstantin
> >
> > For most cases, memset is used such as memset(address, 0,
> > sizeof(struct xxx));
>
> Ok then I suppose for such cases you don't need any special function and
> memset() would still be the best choice, right?
>
In fact, a bad performance drop has been found on IVB; please refer to
http://dpdk.org/ml/archives/dev/2016-October/048628.html
The following code causes the perf issue:
memset((void *)(uintptr_t)&(virtio_hdr->hdr), 0, dev->vhost_hlen);
vhost_hlen is 10 or 12 bytes, so glibc memset is not used here.

> > The use case here is small by accident, I only give an example here.
> > but rte_memset is introduced to need consider generic case.
>
> We can have rte_memset_huge() or so instead, and document that it should
> be used for sizes greater than some cutoff point.
> Inside it you can just call a function pointer installed at startup (same as
> rte_acl_classify() does).
> For big sizes, I suppose the price of extra function pointer call would not
> affect performance much.
> For sizes smaller then this cutoff point you still can use either
> rte_memset_scalar() or just normal rte_memset().
> Something like that:
>
> extern void *(*__rte_memset_vector)( (void *s, int c, size_t n);
>
> static inline void*
> rte_memset_huge(void *s, int c, size_t n) {
> 	return __rte_memset_vector(s, c, n);
> }
>
> static inline void *
> rte_memset(void *s, int c, size_t n)
> {
> 	If (n < XXX)
> 		return rte_memset_scalar(s, c, n);
> 	else
> 		return rte_memset_huge(s, c, n);
> }
>
> XXX could be either a define, or could also be a variable, so it can be setuped
> at startup, depending on the architecture.
>
> Would that work?
> Konstantin
>
The idea sounds good. It may be more feasible for rte_memcpy and rte_memset. If I understand well, the idea from Bruce is similar, right?

> > sizeof(struct xxx) is not limited to very small size, such as less than SSE
> > register size.
> > I just want to say that the size for the most use case is not very > > large, So cycles consumed Is not large. It is not suited to choose function at > run-time since overhead is considered. > > > > thanks > > Zhiyong ^ permalink raw reply [flat|nested] 44+ messages in thread
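[Editorial note] The vhost hot spot discussed above is the per-packet virtio net header of dev->vhost_hlen bytes. The struct layout below follows the virtio specification; its 10-byte size (12 bytes with the mergeable-RX-buffers num_buffers field) is exactly the n = 10/12 case the thread keeps returning to:

```c
#include <stdint.h>
#include <string.h>

/* Legacy virtio net header, per the virtio spec: 10 bytes, all fields
 * naturally aligned, so no padding is inserted. */
struct virtio_net_hdr {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* With VIRTIO_NET_F_MRG_RXBUF negotiated a 16-bit num_buffers field
 * follows, giving the 12-byte case. */
struct virtio_net_hdr_mrg_rxbuf {
	struct virtio_net_hdr hdr;
	uint16_t num_buffers;
};
```

Zeroing one of these per packet is the memset(..., 0, dev->vhost_hlen) call quoted above; at 10-12 bytes per call, per-call overhead dominates throughput.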
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-11 12:32 ` Yang, Zhiyong @ 2016-12-15 6:51 ` Yang, Zhiyong 2016-12-15 10:12 ` Bruce Richardson 2016-12-15 10:53 ` Ananyev, Konstantin 0 siblings, 2 replies; 44+ messages in thread From: Yang, Zhiyong @ 2016-12-15 6:51 UTC (permalink / raw) To: Yang, Zhiyong, Ananyev, Konstantin, Thomas Monjalon Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo Hi, Thomas, Konstantin: > -----Original Message----- > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong > Sent: Sunday, December 11, 2016 8:33 PM > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > Monjalon <thomas.monjalon@6wind.com> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > IA platform > > Hi, Konstantin, Bruce: > > > -----Original Message----- > > From: Ananyev, Konstantin > > Sent: Thursday, December 8, 2016 6:31 PM > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > > <thomas.monjalon@6wind.com> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > <pablo.de.lara.guarch@intel.com> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset > > on IA platform > > > > > > > > > -----Original Message----- > > > From: Yang, Zhiyong > > > Sent: Thursday, December 8, 2016 9:53 AM > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > > > Monjalon <thomas.monjalon@6wind.com> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > <pablo.de.lara.guarch@intel.com> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset > > > on IA platform > > > > > extern void *(*__rte_memset_vector)( (void *s, int c, 
size_t n); > > > > static inline void* > > rte_memset_huge(void *s, int c, size_t n) { > > return __rte_memset_vector(s, c, n); } > > > > static inline void * > > rte_memset(void *s, int c, size_t n) > > { > > If (n < XXX) > > return rte_memset_scalar(s, c, n); > > else > > return rte_memset_huge(s, c, n); > > } > > > > XXX could be either a define, or could also be a variable, so it can > > be setuped at startup, depending on the architecture. > > > > Would that work? > > Konstantin > > I have implemented the code for choosing the functions at run time. rte_memcpy is used more frequently, So I test it at run time. typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static inline void * rte_memcpy(void *dst, const void *src, size_t n) { return rte_memcpy_vector(dst, src, n); } In order to reduce the overhead at run time, I assign the function address to var rte_memcpy_vector before main() starts to init the var. static void __attribute__((constructor)) rte_memcpy_init(void) { if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) { rte_memcpy_vector = rte_memcpy_avx2; } else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) { rte_memcpy_vector = rte_memcpy_sse; } else { rte_memcpy_vector = memcpy; } } I run the same virtio/vhost loopback tests without NIC. I can see the throughput drop when running choosing functions at run time compared to original code as following on the same platform(my machine is haswell) Packet size perf drop 64 -4% 256 -5.4% 1024 -5% 1500 -2.5% Another thing, I run the memcpy_perf_autotest, when N= <128, the rte_memcpy perf gains almost disappears When choosing functions at run time. For N=other numbers, the perf gains will become narrow. Thanks Zhiyong ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-15 6:51 ` Yang, Zhiyong @ 2016-12-15 10:12 ` Bruce Richardson 2016-12-16 10:19 ` Yang, Zhiyong 2016-12-15 10:53 ` Ananyev, Konstantin 1 sibling, 1 reply; 44+ messages in thread From: Bruce Richardson @ 2016-12-15 10:12 UTC (permalink / raw) To: Yang, Zhiyong Cc: Ananyev, Konstantin, Thomas Monjalon, dev, yuanhan.liu, De Lara Guarch, Pablo On Thu, Dec 15, 2016 at 06:51:08AM +0000, Yang, Zhiyong wrote: > Hi, Thomas, Konstantin: > > > -----Original Message----- > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong > > Sent: Sunday, December 11, 2016 8:33 PM > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > > Monjalon <thomas.monjalon@6wind.com> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > <pablo.de.lara.guarch@intel.com> > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > > IA platform > > > > Hi, Konstantin, Bruce: > > > > > -----Original Message----- > > > From: Ananyev, Konstantin > > > Sent: Thursday, December 8, 2016 6:31 PM > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > > > <thomas.monjalon@6wind.com> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > <pablo.de.lara.guarch@intel.com> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset > > > on IA platform > > > > > > > > > > > > > -----Original Message----- > > > > From: Yang, Zhiyong > > > > Sent: Thursday, December 8, 2016 9:53 AM > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > > > > Monjalon <thomas.monjalon@6wind.com> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > > <pablo.de.lara.guarch@intel.com> > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] 
eal/common: introduce rte_memset > > > > on IA platform > > > > > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n); > > > > > > static inline void* > > > rte_memset_huge(void *s, int c, size_t n) { > > > return __rte_memset_vector(s, c, n); } > > > > > > static inline void * > > > rte_memset(void *s, int c, size_t n) > > > { > > > If (n < XXX) > > > return rte_memset_scalar(s, c, n); > > > else > > > return rte_memset_huge(s, c, n); > > > } > > > > > > XXX could be either a define, or could also be a variable, so it can > > > be setuped at startup, depending on the architecture. > > > > > > Would that work? > > > Konstantin > > > > I have implemented the code for choosing the functions at run time. > rte_memcpy is used more frequently, So I test it at run time. > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n); > extern rte_memcpy_vector_t rte_memcpy_vector; > static inline void * > rte_memcpy(void *dst, const void *src, size_t n) > { > return rte_memcpy_vector(dst, src, n); > } > In order to reduce the overhead at run time, > I assign the function address to var rte_memcpy_vector before main() starts to init the var. > > static void __attribute__((constructor)) > rte_memcpy_init(void) > { > if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) > { > rte_memcpy_vector = rte_memcpy_avx2; > } > else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) > { > rte_memcpy_vector = rte_memcpy_sse; > } > else > { > rte_memcpy_vector = memcpy; > } > > } > I run the same virtio/vhost loopback tests without NIC. > I can see the throughput drop when running choosing functions at run time > compared to original code as following on the same platform(my machine is haswell) > Packet size perf drop > 64 -4% > 256 -5.4% > 1024 -5% > 1500 -2.5% > Another thing, I run the memcpy_perf_autotest, when N= <128, > the rte_memcpy perf gains almost disappears > When choosing functions at run time. For N=other numbers, the perf gains will become narrow. 
>
How narrow? How significant is the improvement that we gain from having to maintain our own copy of memcpy? If the libc version is nearly as good, we should just use that.

/Bruce

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-15 10:12 ` Bruce Richardson @ 2016-12-16 10:19 ` Yang, Zhiyong 2016-12-19 6:27 ` Yuanhan Liu 0 siblings, 1 reply; 44+ messages in thread From: Yang, Zhiyong @ 2016-12-16 10:19 UTC (permalink / raw) To: Richardson, Bruce Cc: Ananyev, Konstantin, Thomas Monjalon, dev, yuanhan.liu, De Lara Guarch, Pablo Hi, Bruce: > -----Original Message----- > From: Richardson, Bruce > Sent: Thursday, December 15, 2016 6:13 PM > To: Yang, Zhiyong <zhiyong.yang@intel.com> > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org; > yuanhan.liu@linux.intel.com; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > IA platform > > On Thu, Dec 15, 2016 at 06:51:08AM +0000, Yang, Zhiyong wrote: > > Hi, Thomas, Konstantin: > > > > > -----Original Message----- > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong > > > Sent: Sunday, December 11, 2016 8:33 PM > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > > > Monjalon <thomas.monjalon@6wind.com> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > <pablo.de.lara.guarch@intel.com> > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce > rte_memset > > > on IA platform > > > > > > Hi, Konstantin, Bruce: > > > > > > > -----Original Message----- > > > > From: Ananyev, Konstantin > > > > Sent: Thursday, December 8, 2016 6:31 PM > > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > > > > <thomas.monjalon@6wind.com> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > > <pablo.de.lara.guarch@intel.com> > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce > > > > rte_memset on IA 
platform > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Yang, Zhiyong > > > > > Sent: Thursday, December 8, 2016 9:53 AM > > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > > > > > Monjalon <thomas.monjalon@6wind.com> > > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > > > <pablo.de.lara.guarch@intel.com> > > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce > > > > > rte_memset on IA platform > > > > > > > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n); > > > > > > > > static inline void* > > > > rte_memset_huge(void *s, int c, size_t n) { > > > > return __rte_memset_vector(s, c, n); } > > > > > > > > static inline void * > > > > rte_memset(void *s, int c, size_t n) { > > > > If (n < XXX) > > > > return rte_memset_scalar(s, c, n); > > > > else > > > > return rte_memset_huge(s, c, n); } > > > > > > > > XXX could be either a define, or could also be a variable, so it > > > > can be setuped at startup, depending on the architecture. > > > > > > > > Would that work? > > > > Konstantin > > > > > > I have implemented the code for choosing the functions at run time. > > rte_memcpy is used more frequently, So I test it at run time. > > > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, > > size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static inline > > void * rte_memcpy(void *dst, const void *src, size_t n) { > > return rte_memcpy_vector(dst, src, n); } In order to reduce > > the overhead at run time, I assign the function address to var > > rte_memcpy_vector before main() starts to init the var. 
> > > > static void __attribute__((constructor)) > > rte_memcpy_init(void) > > { > > if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) > > { > > rte_memcpy_vector = rte_memcpy_avx2; > > } > > else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) > > { > > rte_memcpy_vector = rte_memcpy_sse; > > } > > else > > { > > rte_memcpy_vector = memcpy; > > } > > > > } > > I run the same virtio/vhost loopback tests without NIC. > > I can see the throughput drop when running choosing functions at run > > time compared to original code as following on the same platform(my > machine is haswell) > > Packet size perf drop > > 64 -4% > > 256 -5.4% > > 1024 -5% > > 1500 -2.5% > > Another thing, I run the memcpy_perf_autotest, when N= <128, the > > rte_memcpy perf gains almost disappears When choosing functions at run > > time. For N=other numbers, the perf gains will become narrow. > > > How narrow. How significant is the improvement that we gain from having to > maintain our own copy of memcpy. If the libc version is nearly as good we > should just use that. > > /Bruce Zhihong sent a patch about rte_memcpy, From the patch, we can see the optimization job for memcpy will bring obvious perf improvements than glibc for DPDK. http://www.dpdk.org/dev/patchwork/patch/17753/ git log as following: This patch is tested on Ivy Bridge, Haswell and Skylake, it provides up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging from 64 to 1500 bytes. thanks Zhiyong ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-16 10:19 ` Yang, Zhiyong
@ 2016-12-19 6:27 ` Yuanhan Liu
  2016-12-20 2:41 ` Yao, Lei A
  0 siblings, 1 reply; 44+ messages in thread
From: Yuanhan Liu @ 2016-12-19 6:27 UTC (permalink / raw)
To: Yang, Zhiyong
Cc: Richardson, Bruce, Ananyev, Konstantin, Thomas Monjalon, dev, De Lara Guarch, Pablo, Wang, Zhihong

On Fri, Dec 16, 2016 at 10:19:43AM +0000, Yang, Zhiyong wrote:
> > > I run the same virtio/vhost loopback tests without NIC.
> > > I can see the throughput drop when running choosing functions at run
> > > time compared to original code as following on the same platform(my
> > machine is haswell)
> > > Packet size perf drop
> > > 64 -4%
> > > 256 -5.4%
> > > 1024 -5%
> > > 1500 -2.5%
> > > Another thing, I run the memcpy_perf_autotest, when N= <128, the
> > > rte_memcpy perf gains almost disappears When choosing functions at run
> > > time. For N=other numbers, the perf gains will become narrow.
> > >
> > How narrow. How significant is the improvement that we gain from having to
> > maintain our own copy of memcpy. If the libc version is nearly as good we
> > should just use that.
> >
> > /Bruce
>
> Zhihong sent a patch about rte_memcpy, From the patch,
> we can see the optimization job for memcpy will bring obvious perf improvements
> than glibc for DPDK.

Just a clarification: it's better than the __original DPDK__ rte_memcpy but not the glibc one. That makes me think: has anyone tested the memcpy with big packets? Does the one from DPDK outperform the one from glibc, even for big packets?

	--yliu

> http://www.dpdk.org/dev/patchwork/patch/17753/
> git log as following:
> This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
> up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
> from 64 to 1500 bytes.
>
> thanks
> Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-19 6:27 ` Yuanhan Liu
@ 2016-12-20 2:41 ` Yao, Lei A
  0 siblings, 0 replies; 44+ messages in thread
From: Yao, Lei A @ 2016-12-20 2:41 UTC (permalink / raw)
To: Yuanhan Liu, Yang, Zhiyong
Cc: Richardson, Bruce, Ananyev, Konstantin, Thomas Monjalon, dev, De Lara Guarch, Pablo, Wang, Zhihong

> On Fri, Dec 16, 2016 at 10:19:43AM +0000, Yang, Zhiyong wrote:
> > > > I run the same virtio/vhost loopback tests without NIC.
> > > > I can see the throughput drop when running choosing functions at run
> > > > time compared to original code as following on the same platform(my
> > > machine is haswell)
> > > > Packet size perf drop
> > > > 64 -4%
> > > > 256 -5.4%
> > > > 1024 -5%
> > > > 1500 -2.5%
> > > > Another thing, I run the memcpy_perf_autotest, when N= <128, the
> > > > rte_memcpy perf gains almost disappears When choosing functions at
> run
> > > > time. For N=other numbers, the perf gains will become narrow.
> > > >
> > > How narrow. How significant is the improvement that we gain from
> having to
> > > maintain our own copy of memcpy. If the libc version is nearly as good we
> > > should just use that.
> > >
> > > /Bruce
> >
> > Zhihong sent a patch about rte_memcpy, From the patch,
> > we can see the optimization job for memcpy will bring obvious perf
> improvements
> > than glibc for DPDK.
>
> Just a clarification: it's better than the __original DPDK__ rte_memcpy
> but not the glibc one. That makes me think have any one tested the memcpy
> with big packets? Does the one from DPDK outweigh the one from glibc,
> even for big packets?
>
> --yliu
>
I have tested the loopback performance of rte_memcpy and glibc memcpy. For both small and big packets, rte_memcpy has better performance.
My test environment is the following:
CPU: BDW
Ubuntu 16.04
Kernel: 4.4.0
gcc: 5.4.0
Path: mergeable

Size	rte_memcpy performance gain
64	31%
128	35%
260	27%
520	33%
1024	18%
1500	12%

--Lei

> > http://www.dpdk.org/dev/patchwork/patch/17753/
> git log as following:
> This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
> up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
> from 64 to 1500 bytes.
>
> thanks
> Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-15 6:51 ` Yang, Zhiyong 2016-12-15 10:12 ` Bruce Richardson @ 2016-12-15 10:53 ` Ananyev, Konstantin 2016-12-16 2:15 ` Yang, Zhiyong 1 sibling, 1 reply; 44+ messages in thread From: Ananyev, Konstantin @ 2016-12-15 10:53 UTC (permalink / raw) To: Yang, Zhiyong, Thomas Monjalon Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo Hi Zhiyong, > -----Original Message----- > From: Yang, Zhiyong > Sent: Thursday, December 15, 2016 6:51 AM > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas Monjalon > <thomas.monjalon@6wind.com> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce <bruce.richardson@intel.com>; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform > > Hi, Thomas, Konstantin: > > > -----Original Message----- > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong > > Sent: Sunday, December 11, 2016 8:33 PM > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > > Monjalon <thomas.monjalon@6wind.com> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > <pablo.de.lara.guarch@intel.com> > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > > IA platform > > > > Hi, Konstantin, Bruce: > > > > > -----Original Message----- > > > From: Ananyev, Konstantin > > > Sent: Thursday, December 8, 2016 6:31 PM > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > > > <thomas.monjalon@6wind.com> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > <pablo.de.lara.guarch@intel.com> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset > > > on IA platform > > > > > > > > > > > > > 
-----Original Message----- > > > > From: Yang, Zhiyong > > > > Sent: Thursday, December 8, 2016 9:53 AM > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > > > > Monjalon <thomas.monjalon@6wind.com> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > > <pablo.de.lara.guarch@intel.com> > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset > > > > on IA platform > > > > > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n); > > > > > > static inline void* > > > rte_memset_huge(void *s, int c, size_t n) { > > > return __rte_memset_vector(s, c, n); } > > > > > > static inline void * > > > rte_memset(void *s, int c, size_t n) > > > { > > > If (n < XXX) > > > return rte_memset_scalar(s, c, n); > > > else > > > return rte_memset_huge(s, c, n); > > > } > > > > > > XXX could be either a define, or could also be a variable, so it can > > > be setuped at startup, depending on the architecture. > > > > > > Would that work? > > > Konstantin > > > > I have implemented the code for choosing the functions at run time. > rte_memcpy is used more frequently, So I test it at run time. > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n); > extern rte_memcpy_vector_t rte_memcpy_vector; > static inline void * > rte_memcpy(void *dst, const void *src, size_t n) > { > return rte_memcpy_vector(dst, src, n); > } > In order to reduce the overhead at run time, > I assign the function address to var rte_memcpy_vector before main() starts to init the var. > > static void __attribute__((constructor)) > rte_memcpy_init(void) > { > if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) > { > rte_memcpy_vector = rte_memcpy_avx2; > } > else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) > { > rte_memcpy_vector = rte_memcpy_sse; > } > else > { > rte_memcpy_vector = memcpy; > } > > } I thought we discussed a bit different approach. 
In which rte_memcpy_vector() (rte_memset_vector) would be called only after some cutoff point, i.e.:

void
rte_memcpy(void *dst, const void *src, size_t len)
{
	if (len < N)
		memcpy(dst, src, len);
	else
		rte_memcpy_vector(dst, src, len);
}

If you just always call rte_memcpy_vector() for every len, then it means that the compiler most likely always has to generate a proper call (no inlining happening).
For small length(s) the price of the extra function call would probably outweigh any potential gain with an SSE/AVX2 implementation.

Konstantin

> I run the same virtio/vhost loopback tests without NIC.
> I can see the throughput drop when running choosing functions at run time
> compared to original code as following on the same platform(my machine is haswell)
> Packet size perf drop
> 64 -4%
> 256 -5.4%
> 1024 -5%
> 1500 -2.5%
> Another thing, I run the memcpy_perf_autotest, when N= <128,
> the rte_memcpy perf gains almost disappears
> When choosing functions at run time. For N=other numbers, the perf gains will become narrow.
>
> Thanks
> Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread
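[Editorial note] Konstantin's cutoff wrapper can be sketched as a compilable unit. Because the wrapper is static inline, a call with a small compile-time-constant length takes the plain memcpy() branch, which the compiler can expand inline; only large copies pay for the indirect call. The cutoff value and the fallback body are illustrative:

```c
#include <stddef.h>
#include <string.h>

#define RTE_MEMCPY_CUTOFF 128 /* hypothetical cutoff */

/* Placeholder for the SSE/AVX2 body installed at startup. */
static void *memcpy_vector_impl(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

void *(*rte_memcpy_vector)(void *dst, const void *src, size_t n) =
	memcpy_vector_impl;

static inline void *rte_memcpy(void *dst, const void *src, size_t n)
{
	if (n < RTE_MEMCPY_CUTOFF)
		return memcpy(dst, src, n);    /* inlinable for small/constant n */
	return rte_memcpy_vector(dst, src, n); /* indirect call amortized over n */
}
```

For a call like rte_memcpy(d, s, 12), the branch is resolved at compile time and the copy typically reduces to a couple of inline move instructions, which is exactly the case the vhost header copy cares about.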
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-15 10:53 ` Ananyev, Konstantin @ 2016-12-16 2:15 ` Yang, Zhiyong 2016-12-16 11:47 ` Ananyev, Konstantin 0 siblings, 1 reply; 44+ messages in thread From: Yang, Zhiyong @ 2016-12-16 2:15 UTC (permalink / raw) To: Ananyev, Konstantin, Thomas Monjalon Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo Hi,Konstantin: > -----Original Message----- > From: Ananyev, Konstantin > Sent: Thursday, December 15, 2016 6:54 PM > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > <thomas.monjalon@6wind.com> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > IA platform > > Hi Zhiyong, > > > -----Original Message----- > > From: Yang, Zhiyong > > Sent: Thursday, December 15, 2016 6:51 AM > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Ananyev, Konstantin > > <konstantin.ananyev@intel.com>; Thomas Monjalon > > <thomas.monjalon@6wind.com> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > <pablo.de.lara.guarch@intel.com> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset > > on IA platform > > > > Hi, Thomas, Konstantin: > > > > > -----Original Message----- > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong > > > Sent: Sunday, December 11, 2016 8:33 PM > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > > > Monjalon <thomas.monjalon@6wind.com> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > <pablo.de.lara.guarch@intel.com> > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce > rte_memset > > > on IA platform > > > > > > Hi, Konstantin, Bruce: > > > > > > > 
-----Original Message----- > > > > From: Ananyev, Konstantin > > > > Sent: Thursday, December 8, 2016 6:31 PM > > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > > > > <thomas.monjalon@6wind.com> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > > <pablo.de.lara.guarch@intel.com> > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce > > > > rte_memset on IA platform > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Yang, Zhiyong > > > > > Sent: Thursday, December 8, 2016 9:53 AM > > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas > > > > > Monjalon <thomas.monjalon@6wind.com> > > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > > > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > > > > > <pablo.de.lara.guarch@intel.com> > > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce > > > > > rte_memset on IA platform > > > > > > > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n); > > > > > > > > static inline void* > > > > rte_memset_huge(void *s, int c, size_t n) { > > > > return __rte_memset_vector(s, c, n); } > > > > > > > > static inline void * > > > > rte_memset(void *s, int c, size_t n) { > > > > If (n < XXX) > > > > return rte_memset_scalar(s, c, n); > > > > else > > > > return rte_memset_huge(s, c, n); } > > > > > > > > XXX could be either a define, or could also be a variable, so it > > > > can be setuped at startup, depending on the architecture. > > > > > > > > Would that work? > > > > Konstantin > > > > > > I have implemented the code for choosing the functions at run time. > > rte_memcpy is used more frequently, So I test it at run time. 
> >
> > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src,
> > size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static inline
> > void * rte_memcpy(void *dst, const void *src, size_t n) {
> > 	return rte_memcpy_vector(dst, src, n); } In order to reduce
> > the overhead at run time, I assign the function address to var
> > rte_memcpy_vector before main() starts to init the var.
> >
> > static void __attribute__((constructor))
> > rte_memcpy_init(void)
> > {
> > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_avx2;
> > 	}
> > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_sse;
> > 	}
> > 	else
> > 	{
> > 		rte_memcpy_vector = memcpy;
> > 	}
> >
> > }
>
> I thought we discussed a bit different approach.
> In which rte_memcpy_vector() (rte_memset_vector) would be called only
> after some cutoff point, i.e:
>
> void
> rte_memcpy(void *dst, const void *src, size_t len) {
> 	if (len < N) memcpy(dst, src, len);
> 	else rte_memcpy_vector(dst, src, len);
> }
>
> If you just always call rte_memcpy_vector() for every len, then it means that
> compiler most likely has always to generate a proper call (not inlining
> happening).
> For small length(s) price of extra function would probably overweight any
> potential gain with SSE/AVX2 implementation.
>
> Konstantin

Yes, in fact, from my tests, for small lengths rte_memset is far better
than glibc memset; for large lengths, rte_memset is only a bit better than
memset, because memset uses AVX2/SSE too. Of course, it will use AVX512 on
future machines.

> For small length(s) price of extra function would probably overweight any
> potential gain.

This is the key point. I think it should include the scalar optimization,
not only the vector optimization.

The value of rte_memset is that it is always inlined, so for small lengths
it will be better; in some cases we are not sure that memset is always
inlined by the compiler.
It seems that choosing the function at run time will lose the gains.
The following was tested on Haswell using the patch code.

** rte_memset() - memset perf tests
   (C = compile-time constant) **

 Size     memset in cache   memset in mem
(bytes)       (ticks)          (ticks)
------- ----------------- ---------------
============= 32B aligned ================
   3          3 - 8           19 - 128
   4          4 - 8           13 - 128
   8          2 - 7           19 - 128
   9          2 - 7           19 - 127
  12          2 - 7           19 - 127
  17          3 - 8           19 - 132
  64          3 - 8           28 - 168
 128          7 - 13          54 - 200
 255          8 - 20         100 - 223
 511         14 - 20         187 - 314
1024         24 - 29         328 - 379
8192        198 - 225       1829 - 2193

Thanks
Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread
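[Editor's note] Zhiyong's point about inlining is the heart of the small-length win. A minimal, hedged sketch of the overlapping-store technique the patch's rte_memset_less16() uses (illustrative names, not the patch's exact code; memcpy() stands in for the patch's direct unaligned pointer-cast stores, as a strict-aliasing-safe form that compilers lower to single mov instructions):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill n < 16 bytes at dst with byte value a using at most two
 * overlapping 8-byte stores (or 1/2/4-byte stores for n < 8), so an
 * inlined call reduces to a handful of movs with no loop. */
static inline void *
memset_less16(void *dst, int a, size_t n)
{
	uint8_t *d = dst;
	/* Broadcast the fill byte into all 8 lanes of a 64-bit word. */
	uint64_t v = (uint64_t)(uint8_t)a * 0x0101010101010101ULL;

	if (n >= 8) {
		memcpy(d, &v, 8);          /* head store            */
		memcpy(d + n - 8, &v, 8);  /* overlapping tail store */
	} else {
		if (n & 1) { *d = (uint8_t)v; d += 1; }
		if (n & 2) { memcpy(d, &v, 2); d += 2; }
		if (n & 4) { memcpy(d, &v, 4); }
	}
	return dst;
}
```

The overlapping stores make every length from 8 to 15 cost the same two writes, so there is no length-dependent branching beyond the first test.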
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-16 2:15 ` Yang, Zhiyong @ 2016-12-16 11:47 ` Ananyev, Konstantin 2016-12-20 9:31 ` Yang, Zhiyong 0 siblings, 1 reply; 44+ messages in thread From: Ananyev, Konstantin @ 2016-12-16 11:47 UTC (permalink / raw) To: Yang, Zhiyong, Thomas Monjalon Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo Hi Zhiyong, > > > > > > > > > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n); > > > > > > > > > > static inline void* > > > > > rte_memset_huge(void *s, int c, size_t n) { > > > > > return __rte_memset_vector(s, c, n); } > > > > > > > > > > static inline void * > > > > > rte_memset(void *s, int c, size_t n) { > > > > > If (n < XXX) > > > > > return rte_memset_scalar(s, c, n); > > > > > else > > > > > return rte_memset_huge(s, c, n); } > > > > > > > > > > XXX could be either a define, or could also be a variable, so it > > > > > can be setuped at startup, depending on the architecture. > > > > > > > > > > Would that work? > > > > > Konstantin > > > > > > > > I have implemented the code for choosing the functions at run time. > > > rte_memcpy is used more frequently, So I test it at run time. > > > > > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, > > > size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static inline > > > void * rte_memcpy(void *dst, const void *src, size_t n) { > > > return rte_memcpy_vector(dst, src, n); } In order to reduce > > > the overhead at run time, I assign the function address to var > > > rte_memcpy_vector before main() starts to init the var. 
> > >
> > > static void __attribute__((constructor))
> > > rte_memcpy_init(void)
> > > {
> > > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > > 	{
> > > 		rte_memcpy_vector = rte_memcpy_avx2;
> > > 	}
> > > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > > 	{
> > > 		rte_memcpy_vector = rte_memcpy_sse;
> > > 	}
> > > 	else
> > > 	{
> > > 		rte_memcpy_vector = memcpy;
> > > 	}
> > >
> > > }
> >
> > I thought we discussed a bit different approach.
> > In which rte_memcpy_vector() (rte_memset_vector) would be called only
> > after some cutoff point, i.e:
> >
> > void
> > rte_memcpy(void *dst, const void *src, size_t len) {
> > 	if (len < N) memcpy(dst, src, len);
> > 	else rte_memcpy_vector(dst, src, len);
> > }
> >
> > If you just always call rte_memcpy_vector() for every len, then it means that
> > compiler most likely has always to generate a proper call (not inlining
> > happening).
> >
> > For small length(s) price of extra function would probably overweight any
> > potential gain with SSE/AVX2 implementation.
> >
> > Konstantin
>
> Yes, in fact, from my tests, For small length(s) rte_memset is far better than glibc memset,
> For large lengths, rte_memset is only a bit better than memset.
> because memset use the AVX2/SSE, too. Of course, it will use AVX512 on future machine.

Ok, thanks for clarification.
From previous mails I got a wrong impression that on big lengths
rte_memset_vector() is significantly faster than memset().

> > For small length(s) price of extra function would probably overweight any
> > potential gain.
> This is the key point. I think it should include the scalar optimization, not only vector optimization.
>
> The value of rte_memset is always inlined and for small lengths it will be better.
> when in some case We are not sure that memset is always inlined by compiler.

Ok, so do you know in what cases memset() does not get inlined?
Is it when the len parameter can't be precomputed by the compiler (is not a
constant)?
So to me it sounds like:
- We don't need to have an optimized version of rte_memset() for big sizes.
- Which probably means we don't need arch-specific versions of
  rte_memset_vector() at all -
  for small sizes (<= 32B) a scalar version would be good enough.
- For big sizes we can just rely on memset().
Is that so?

> It seems that choosing function at run time will lose the gains.
> The following is tested on haswell by patch code.

Not sure what columns 2 and 3 in the table below mean?
Konstantin

> ** rte_memset() - memset perf tests
>    (C = compile-time constant) **
>  Size     memset in cache   memset in mem
> (bytes)       (ticks)          (ticks)
> ------- ----------------- ---------------
> ============= 32B aligned ================
>    3          3 - 8           19 - 128
>    4          4 - 8           13 - 128
>    8          2 - 7           19 - 128
>    9          2 - 7           19 - 127
>   12          2 - 7           19 - 127
>   17          3 - 8           19 - 132
>   64          3 - 8           28 - 168
>  128          7 - 13          54 - 200
>  255          8 - 20         100 - 223
>  511         14 - 20         187 - 314
> 1024         24 - 29         328 - 379
> 8192        198 - 225       1829 - 2193
>
> Thanks
> Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread
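[Editor's note] The two ideas discussed in this thread, an inline length cutoff plus a constructor-initialized function pointer, compose naturally. A hedged sketch under stated assumptions: the cutoff value is a placeholder to be tuned per architecture, and plain memset stands in for the SSE/AVX2/AVX512 routines a real build would select with rte_cpu_get_flag_enabled():

```c
#include <stddef.h>
#include <string.h>

#define MEMSET_CUTOFF 32  /* placeholder cutoff; tune per architecture */

/* Selected once at startup. memset() is a portable stand-in here for a
 * vectorized implementation chosen by CPU-flag checks. */
static void *(*memset_huge)(void *s, int c, size_t n) = memset;

__attribute__((constructor))
static void memset_dispatch_init(void)
{
	/* A real build would do something like:
	 *   if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
	 *           memset_huge = rte_memset_avx2;           */
	memset_huge = memset;
}

/* Inline wrapper: small lengths are handled by inlinable scalar stores
 * and never pay the indirect-call cost; only big lengths take the one
 * indirect call through the startup-selected pointer. */
static inline void *
memset_dispatch(void *s, int c, size_t n)
{
	if (n < MEMSET_CUTOFF) {
		unsigned char *p = s;
		size_t i;

		for (i = 0; i < n; i++)
			p[i] = (unsigned char)c;
		return s;
	}
	return memset_huge(s, c, n);
}
```

This matches Konstantin's summary: the scalar path covers small sizes inline, and big sizes can simply fall through to memset() (or a vector routine) behind the pointer.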
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-16 11:47 ` Ananyev, Konstantin @ 2016-12-20 9:31 ` Yang, Zhiyong 0 siblings, 0 replies; 44+ messages in thread From: Yang, Zhiyong @ 2016-12-20 9:31 UTC (permalink / raw) To: Ananyev, Konstantin, Thomas Monjalon Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo Hi, Konstantin: > -----Original Message----- > From: Ananyev, Konstantin > Sent: Friday, December 16, 2016 7:48 PM > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon > <thomas.monjalon@6wind.com> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > <bruce.richardson@intel.com>; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > IA platform > > Hi Zhiyong, > > > > > > > > > > > > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t > > > > > > n); > > > > > > > > > > > > static inline void* > > > > > > rte_memset_huge(void *s, int c, size_t n) { > > > > > > return __rte_memset_vector(s, c, n); } > > > > > > > > > > > > static inline void * > > > > > > rte_memset(void *s, int c, size_t n) { > > > > > > If (n < XXX) > > > > > > return rte_memset_scalar(s, c, n); > > > > > > else > > > > > > return rte_memset_huge(s, c, n); } > > > > > > > > > > > > XXX could be either a define, or could also be a variable, so > > > > > > it can be setuped at startup, depending on the architecture. > > > > > > > > > > > > Would that work? > > > > > > Konstantin > > > > > > > > > > I have implemented the code for choosing the functions at run time. > > > > rte_memcpy is used more frequently, So I test it at run time. 
> > > > > > > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, > > > > size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static > > > > inline void * rte_memcpy(void *dst, const void *src, size_t n) { > > > > return rte_memcpy_vector(dst, src, n); } In order to > > > > reduce the overhead at run time, I assign the function address to > > > > var rte_memcpy_vector before main() starts to init the var. > > > > > > > > static void __attribute__((constructor)) > > > > rte_memcpy_init(void) > > > > { > > > > if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) > > > > { > > > > rte_memcpy_vector = rte_memcpy_avx2; > > > > } > > > > else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) > > > > { > > > > rte_memcpy_vector = rte_memcpy_sse; > > > > } > > > > else > > > > { > > > > rte_memcpy_vector = memcpy; > > > > } > > > > > > > > } > > > > > > I thought we discussed a bit different approach. > > > In which rte_memcpy_vector() (rte_memeset_vector) would be called > > > only after some cutoff point, i.e: > > > > > > void > > > rte_memcpy(void *dst, const void *src, size_t len) { > > > if (len < N) memcpy(dst, src, len); > > > else rte_memcpy_vector(dst, src, len); } > > > > > > If you just always call rte_memcpy_vector() for every len, then it > > > means that compiler most likely has always to generate a proper call > > > (not inlining happening). > > > > > For small length(s) price of extra function would probably > > > overweight any potential gain with SSE/AVX2 implementation. > > > > > > Konstantin > > > > Yes, in fact, from my tests, For small length(s) rte_memset is far > > better than glibc memset, For large lengths, rte_memset is only a bit better > than memset. > > because memset use the AVX2/SSE, too. Of course, it will use AVX512 on > future machine. > > Ok, thanks for clarification. > From previous mails I got a wrong impression that on big lengths > rte_memset_vector() is significantly faster than memset(). 
> >
> > > For small length(s) price of extra function would probably overweight
> > > any
> > > potential gain.
> > This is the key point. I think it should include the scalar optimization, not
> > only vector optimization.
> >
> > The value of rte_memset is always inlined and for small lengths it will be
> > better.
> > when in some case We are not sure that memset is always inlined by
> > compiler.
>
> Ok, so do you know in what cases memset() is not get inlined?
> Is it when len parameter can't be precomputed by the compiler (is not a
> constant)?
>
> So to me it sounds like:
> - We don't need to have an optimized verision of rte_memset() for big sizes.
> - Which probably means we don't need an arch specific versions of
> rte_memset_vector() at all -
> for small sizes (<= 32B) scalar version would be good enough.
> - For big sizes we can just rely on memset().
> Is that so?

Using memset has actually caused some trouble in some cases; see
http://dpdk.org/ml/archives/dev/2016-October/048628.html

> > It seems that choosing function at run time will lose the gains.
> > The following is tested on haswell by patch code.
>
> Not sure what columns 2 and 3 in the table below mean?
> Konstantin

Column 1 shows the size (bytes).
Column 2 shows rte_memset vs. memset perf results in cache.
Column 3 shows rte_memset vs. memset perf results in memory.
The data is collected using rte_rdtsc().
The test can be run using [PATCH 3/4] app/test: add performance autotest
for rte_memset.

Thanks
Zhiyong

> > ** rte_memset() - memset perf tests
> >    (C = compile-time constant) **
> >  Size     memset in cache   memset in mem
> > (bytes)       (ticks)          (ticks)
> > ------- ----------------- ---------------
> > ============= 32B aligned ================
> >    3          3 - 8           19 - 128
> >    4          4 - 8           13 - 128
> >    8          2 - 7           19 - 128
> >    9          2 - 7           19 - 127
> >   12          2 - 7           19 - 127
> >   17          3 - 8           19 - 132
> >   64          3 - 8           28 - 168
> >  128          7 - 13          54 - 200
> >  255          8 - 20         100 - 223
> >  511         14 - 20         187 - 314
> > 1024         24 - 29         328 - 379
> > 8192        198 - 225       1829 - 2193
> >
> > Thanks
> > Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-08 7:41 ` Yang, Zhiyong 2016-12-08 9:26 ` Ananyev, Konstantin @ 2016-12-08 15:09 ` Thomas Monjalon 2016-12-11 12:04 ` Yang, Zhiyong 1 sibling, 1 reply; 44+ messages in thread From: Thomas Monjalon @ 2016-12-08 15:09 UTC (permalink / raw) To: Yang, Zhiyong Cc: dev, yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin, De Lara Guarch, Pablo

2016-12-08 07:41, Yang, Zhiyong:
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > 2016-12-05 16:26, Zhiyong Yang:
> > > +#ifndef _RTE_MEMSET_X86_64_H_
> >
> > Is this implementation specific to 64-bit?
> >
> Yes.

So should we rename this file?
rte_memset.h -> rte_memset_64.h

You also need to create a rte_memset.h file for each arch.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform 2016-12-08 15:09 ` Thomas Monjalon @ 2016-12-11 12:04 ` Yang, Zhiyong 0 siblings, 0 replies; 44+ messages in thread From: Yang, Zhiyong @ 2016-12-11 12:04 UTC (permalink / raw) To: Thomas Monjalon Cc: dev, yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin, De Lara Guarch, Pablo Hi, Thomas: > -----Original Message----- > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > Sent: Thursday, December 8, 2016 11:10 PM > To: Yang, Zhiyong <zhiyong.yang@intel.com> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce > <bruce.richardson@intel.com>; Ananyev, Konstantin > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on > IA platform > > 2016-12-08 07:41, Yang, Zhiyong: > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > > > 2016-12-05 16:26, Zhiyong Yang: > > > > +#ifndef _RTE_MEMSET_X86_64_H_ > > > > > > Is this implementation specific to 64-bit? > > > > > > > Yes. > > So should we rename this file? > rte_memset.h -> rte_memset_64.h > > You need also to create a file rte_memset.h for each arch. Ok ^ permalink raw reply [flat|nested] 44+ messages in thread
* [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test 2016-12-05 8:26 ` [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang 2016-12-02 10:25 ` Thomas Monjalon @ 2016-12-27 10:04 ` Zhiyong Yang 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang ` (4 more replies) 1 sibling, 5 replies; 44+ messages in thread From: Zhiyong Yang @ 2016-12-27 10:04 UTC (permalink / raw) To: dev Cc: yuanhan.liu, thomas.monjalon, bruce.richardson, konstantin.ananyev, pablo.de.lara.guarch

DPDK code has suffered a bad performance drop in some cases when calling
the glibc function memset. Refer to the discussions about memset in
http://dpdk.org/ml/archives/dev/2016-October/048628.html
It is necessary to introduce a more efficient function to fix it. One
important thing about rte_memset is that we get clear control over which
instruction flow is used.

This patchset introduces rte_memset as a more efficient implementation,
and brings an obvious perf improvement, especially for small N bytes,
which covers most application scenarios.

Patch 1 implements rte_memset in the file rte_memset.h on the IA platform.
The file supports three types of instruction sets: sse & avx (128 bits),
avx2 (256 bits) and avx512 (512 bits). rte_memset makes use of
vectorization and inline functions to improve perf on IA. In addition,
cache line and memory alignment are fully taken into consideration.

Patch 2 implements a functional autotest to validate that the function
works correctly.

Patch 3 implements a performance autotest, run separately in cache and in
memory. We can see that the perf of rte_memset is obviously better than
glibc memset, especially for small N bytes.

Patch 4 uses rte_memset instead of copy_virtio_net_hdr, which brings 3%~4%
performance improvement on the IA platform in virtio/vhost non-mergeable
loopback testing.
Changes in V2:

Patch 1: Rename rte_memset.h -> rte_memset_64.h and create a file
rte_memset.h for each arch.
Patch 3: Add the perf comparison data between rte_memset and memset on
Haswell.
Patch 4: Modify the release_17_02.rst description.

Zhiyong Yang (4):
  eal/common: introduce rte_memset on IA platform
  app/test: add functional autotest for rte_memset
  app/test: add performance autotest for rte_memset
  lib/librte_vhost: improve vhost perf using rte_memset

 app/test/Makefile                                  |   3 +
 app/test/test_memset.c                             | 158 +++++++++
 app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
 doc/guides/rel_notes/release_17_02.rst             |   7 +
 .../common/include/arch/arm/rte_memset.h           |  36 ++
 .../common/include/arch/ppc_64/rte_memset.h        |  36 ++
 .../common/include/arch/tile/rte_memset.h          |  36 ++
 .../common/include/arch/x86/rte_memset.h           |  51 +++
 .../common/include/arch/x86/rte_memset_64.h        | 378 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_memset.h |  52 +++
 lib/librte_vhost/virtio_net.c                      |  18 +-
 11 files changed, 1116 insertions(+), 7 deletions(-)
 create mode 100644 app/test/test_memset.c
 create mode 100644 app/test/test_memset_perf.c
 create mode 100644 lib/librte_eal/common/include/arch/arm/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/ppc_64/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/tile/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset_64.h
 create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h

--
2.7.4

^ permalink raw reply	[flat|nested] 44+ messages in thread
* [dpdk-dev] [PATCH v2 1/4] eal/common: introduce rte_memset on IA platform 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang @ 2016-12-27 10:04 ` Zhiyong Yang 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 2/4] app/test: add functional autotest for rte_memset Zhiyong Yang ` (3 subsequent siblings) 4 siblings, 0 replies; 44+ messages in thread From: Zhiyong Yang @ 2016-12-27 10:04 UTC (permalink / raw) To: dev Cc: yuanhan.liu, thomas.monjalon, bruce.richardson, konstantin.ananyev, pablo.de.lara.guarch, Zhiyong Yang

A performance drop has been caused in some cases when DPDK code calls the
glibc function memset. Please refer to the discussions about memset in
http://dpdk.org/ml/archives/dev/2016-October/048628.html
It is necessary to introduce a more efficient function to fix it. One
important thing about rte_memset is that we get clear control over which
instruction flow is used.

This patch supports instruction sets such as sse & avx (128 bits),
avx2 (256 bits) and avx512 (512 bits). rte_memset makes full use of
vectorization and inline functions to improve perf on IA. In addition,
cache line and memory alignment are fully taken into consideration.

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---
Changes in V2:

Rename rte_memset.h -> rte_memset_64.h and create a file rte_memset.h
for each arch.
.../common/include/arch/arm/rte_memset.h | 36 ++ .../common/include/arch/ppc_64/rte_memset.h | 36 ++ .../common/include/arch/tile/rte_memset.h | 36 ++ .../common/include/arch/x86/rte_memset.h | 51 +++ .../common/include/arch/x86/rte_memset_64.h | 378 +++++++++++++++++++++ lib/librte_eal/common/include/generic/rte_memset.h | 52 +++ 6 files changed, 589 insertions(+) create mode 100644 lib/librte_eal/common/include/arch/arm/rte_memset.h create mode 100644 lib/librte_eal/common/include/arch/ppc_64/rte_memset.h create mode 100644 lib/librte_eal/common/include/arch/tile/rte_memset.h create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset_64.h create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h diff --git a/lib/librte_eal/common/include/arch/arm/rte_memset.h b/lib/librte_eal/common/include/arch/arm/rte_memset.h new file mode 100644 index 0000000..6945f6d --- /dev/null +++ b/lib/librte_eal/common/include/arch/arm/rte_memset.h @@ -0,0 +1,36 @@ +/* + * BSD LICENSE + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of RehiveTech nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _RTE_MEMSET_ARM_H_ +#define _RTE_MEMSET_ARM_H_ + +#define rte_memset memset + +#endif /* _RTE_MEMSET_ARM_H_ */ diff --git a/lib/librte_eal/common/include/arch/ppc_64/rte_memset.h b/lib/librte_eal/common/include/arch/ppc_64/rte_memset.h new file mode 100644 index 0000000..0d73f05 --- /dev/null +++ b/lib/librte_eal/common/include/arch/ppc_64/rte_memset.h @@ -0,0 +1,36 @@ +/* + * BSD LICENSE + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of IBM Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + +#ifndef _RTE_MEMSET_PPC_64_H_ +#define _RTE_MEMSET_PPC_64_H_ + +#define rte_memset memset + +#endif /* _RTE_MEMSET_PPC_64_H_ */ diff --git a/lib/librte_eal/common/include/arch/tile/rte_memset.h b/lib/librte_eal/common/include/arch/tile/rte_memset.h new file mode 100644 index 0000000..e8a1aa1 --- /dev/null +++ b/lib/librte_eal/common/include/arch/tile/rte_memset.h @@ -0,0 +1,36 @@ +/* + * BSD LICENSE + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of EZchip Semiconductor nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + +#ifndef _RTE_MEMSET_TILE_H_ +#define _RTE_MEMSET_TILE_H_ + +#define rte_memset memset + +#endif /* _RTE_MEMSET_TILE_H_ */ diff --git a/lib/librte_eal/common/include/arch/x86/rte_memset.h b/lib/librte_eal/common/include/arch/x86/rte_memset.h new file mode 100644 index 0000000..86e0812 --- /dev/null +++ b/lib/librte_eal/common/include/arch/x86/rte_memset.h @@ -0,0 +1,51 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2010-2016 Intel Corporation. All rights reserved. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. 
+ * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _RTE_MEMSET_X86_H_ +#define _RTE_MEMSET_X86_H_ + +#ifdef __cplusplus +extern "C" { +#endif + +#ifdef RTE_ARCH_X86_64 +#include "rte_memset_64.h" +#else +#define rte_memset memset +#endif + +#ifdef __cplusplus +} +#endif + +#endif /* _RTE_MEMSET_X86_H_ */ diff --git a/lib/librte_eal/common/include/arch/x86/rte_memset_64.h b/lib/librte_eal/common/include/arch/x86/rte_memset_64.h new file mode 100644 index 0000000..f25d344 --- /dev/null +++ b/lib/librte_eal/common/include/arch/x86/rte_memset_64.h @@ -0,0 +1,378 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2010-2016 Intel Corporation. All rights reserved. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _RTE_MEMSET_X86_64_H_ +#define _RTE_MEMSET_X86_64_H_ + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * @file + * + * Functions for vectorised implementation of memset(). 
+ */ + +#include <stdio.h> +#include <stdint.h> +#include <string.h> +#include <rte_vect.h> + +static inline void * +rte_memset(void *dst, int a, size_t n) __attribute__((always_inline)); + +static inline void +rte_memset_less16(void *dst, int a, size_t n) +{ + uintptr_t dstu = (uintptr_t)dst; + + if (n >= 8) { + uint16_t b = ((uint8_t)a | (((uint8_t)a) << 8)); + uint16_t c = ((uint8_t)a | (((uint8_t)a) << 8)); + uint32_t d = b | c << 16; + uint64_t e = d | ((uint64_t)d << 32); + + *(uint64_t *)dstu = e; + *(uint64_t *)((uint8_t *)dstu + n - 8) = e; + } else { + if (n & 0x01) { + *(uint8_t *)dstu = (uint8_t)a; + dstu = (uintptr_t)((uint8_t *)dstu + 1); + } + if (n & 0x02) { + *(uint16_t *)dstu = (uint8_t)a | (((uint8_t)a) << 8); + dstu = (uintptr_t)((uint16_t *)dstu + 1); + } + if (n & 0x04) { + uint16_t b = ((uint8_t)a | (((uint8_t)a) << 8)); + + *(uint32_t *)dstu = (uint32_t)(b | (b << 16)); + dstu = (uintptr_t)((uint32_t *)dstu + 1); + } + } +} + +static inline void +rte_memset16(uint8_t *dst, int8_t a) +{ + __m128i xmm0; + + xmm0 = _mm_set1_epi8(a); + _mm_storeu_si128((__m128i *)dst, xmm0); +} + +static inline void +rte_memset_17to32(void *dst, int a, size_t n) +{ + rte_memset16((uint8_t *)dst, a); + rte_memset16((uint8_t *)dst - 16 + n, a); +} + +#ifdef RTE_MACHINE_CPUFLAG_AVX512 + +/** + * AVX512 implementation below + */ + +static inline void +rte_memset32(uint8_t *dst, int8_t a) +{ + __m256i ymm0; + + ymm0 = _mm256_set1_epi8(a); + _mm256_storeu_si256((__m256i *)dst, ymm0); +} + +static inline void +rte_memset64(uint8_t *dst, int8_t a) +{ + __m512i zmm0; + + zmm0 = _mm512_set1_epi8(a); + _mm512_storeu_si512((void *)dst, zmm0); +} + +static inline void +rte_memset128blocks(uint8_t *dst, int8_t a, size_t n) +{ + __m512i zmm0; + + zmm0 = _mm512_set1_epi8(a); + while (n >= 128) { + n -= 128; + _mm512_store_si512((void *)(dst + 0 * 64), zmm0); + _mm512_store_si512((void *)(dst + 1 * 64), zmm0); + dst = dst + 128; + } +} + +static inline void * +rte_memset(void 
*dst, int a, size_t n) +{ + void *ret = dst; + size_t dstofss; + size_t bits; + + if (n < 16) { + rte_memset_less16(dst, a, n); + return ret; + } else if (n == 16) { + rte_memset16((uint8_t *)dst, a); + return ret; + } + if (n <= 32) { + rte_memset_17to32(dst, a, n); + return ret; + } + if (n <= 64) { + rte_memset32((uint8_t *)dst, a); + rte_memset32((uint8_t *)dst - 32 + n, a); + return ret; + } + if (n >= 256) { + dstofss = ((uintptr_t)dst & 0x3F); + if (dstofss > 0) { + dstofss = 64 - dstofss; + n -= dstofss; + rte_memset64((uint8_t *)dst, a); + dst = (uint8_t *)dst + dstofss; + } + rte_memset128blocks((uint8_t *)dst, a, n); + bits = n; + n = n & 127; + bits -= n; + dst = (uint8_t *)dst + bits; + } + if (n > 128) { + n -= 128; + rte_memset64((uint8_t *)dst, a); + rte_memset64((uint8_t *)dst + 64, a); + dst = (uint8_t *)dst + 128; + } + if (n > 64) { + rte_memset64((uint8_t *)dst, a); + rte_memset64((uint8_t *)dst - 64 + n, a); + return ret; + } + if (n > 0) + rte_memset64((uint8_t *)dst - 64 + n, a); + return ret; +} + +#elif defined RTE_MACHINE_CPUFLAG_AVX2 + +/** + * AVX2 implementation below + */ + +static inline void +rte_memset32(uint8_t *dst, int8_t a) +{ + __m256i ymm0; + + ymm0 = _mm256_set1_epi8(a); + _mm256_storeu_si256((__m256i *)dst, ymm0); +} + +static inline void +rte_memset_33to64(void *dst, int a, size_t n) +{ + rte_memset32((uint8_t *)dst, a); + rte_memset32((uint8_t *)dst - 32 + n, a); +} + +static inline void +rte_memset64blocks(uint8_t *dst, int8_t a, size_t n) +{ + __m256i ymm0; + + ymm0 = _mm256_set1_epi8(a); + while (n >= 64) { + n -= 64; + _mm256_store_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0); + _mm256_store_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm0); + dst = (uint8_t *)dst + 64; + + } +} + +static inline void * +rte_memset(void *dst, int a, size_t n) +{ + void *ret = dst; + size_t dstofss; + size_t bits; + + if (n < 16) { + rte_memset_less16(dst, a, n); + return ret; + } else if (n == 16) { + rte_memset16((uint8_t *)dst, 
a); + return ret; + } + if (n <= 32) { + rte_memset_17to32(dst, a, n); + return ret; + } + if (n <= 64) { + rte_memset_33to64(dst, a, n); + return ret; + } + if (n > 64) { + dstofss = (uintptr_t)dst & 0x1F; + if (dstofss > 0) { + dstofss = 32 - dstofss; + n -= dstofss; + rte_memset32((uint8_t *)dst, a); + dst = (uint8_t *)dst + dstofss; + } + rte_memset64blocks((uint8_t *)dst, a, n); + bits = n; + n = n & 63; + bits -= n; + dst = (uint8_t *)dst + bits; + } + if (n > 32) { + rte_memset_33to64(dst, a, n); + return ret; + } + if (n > 0) + rte_memset32((uint8_t *)dst - 32 + n, a); + return ret; +} + +#else /* RTE_MACHINE_CPUFLAG */ + +/** + * SSE && AVX implementation below + */ + +static inline void +rte_memset32(uint8_t *dst, int8_t a) +{ + __m128i xmm0 = _mm_set1_epi8(a); + + _mm_storeu_si128((__m128i *)dst, xmm0); + _mm_storeu_si128((__m128i *)(dst + 16), xmm0); +} + +static inline void +rte_memset16blocks(uint8_t *dst, int8_t a, size_t n) +{ + __m128i xmm0 = _mm_set1_epi8(a); + + while (n >= 16) { + n -= 16; + _mm_store_si128((__m128i *)(dst + 0 * 16), xmm0); + dst = (uint8_t *)dst + 16; + } +} + +static inline void +rte_memset64blocks(uint8_t *dst, int8_t a, size_t n) +{ + __m128i xmm0 = _mm_set1_epi8(a); + + while (n >= 64) { + n -= 64; + _mm_store_si128((__m128i *)(dst + 0 * 16), xmm0); + _mm_store_si128((__m128i *)(dst + 1 * 16), xmm0); + _mm_store_si128((__m128i *)(dst + 2 * 16), xmm0); + _mm_store_si128((__m128i *)(dst + 3 * 16), xmm0); + dst = (uint8_t *)dst + 64; + } +} + +static inline void * +rte_memset(void *dst, int a, size_t n) +{ + void *ret = dst; + size_t dstofss; + size_t bits; + + if (n < 16) { + rte_memset_less16(dst, a, n); + return ret; + } else if (n == 16) { + rte_memset16((uint8_t *)dst, a); + return ret; + } + if (n <= 32) { + rte_memset_17to32(dst, a, n); + return ret; + } + if (n <= 48) { + rte_memset32((uint8_t *)dst, a); + rte_memset16((uint8_t *)dst - 16 + n, a); + return ret; + } + if (n <= 64) { + rte_memset32((uint8_t *)dst, a); + 
rte_memset16((uint8_t *)dst + 32, a); + rte_memset16((uint8_t *)dst - 16 + n, a); + return ret; + } + if (n > 64) { + dstofss = (uintptr_t)dst & 0xF; + if (dstofss > 0) { + dstofss = 16 - dstofss; + n -= dstofss; + rte_memset16((uint8_t *)dst, a); + dst = (uint8_t *)dst + dstofss; + } + rte_memset64blocks((uint8_t *)dst, a, n); + bits = n; + n &= 63; + bits -= n; + dst = (uint8_t *)dst + bits; + rte_memset16blocks((uint8_t *)dst, a, n); + bits = n; + n &= 0xf; + bits -= n; + dst = (uint8_t *)dst + bits; + if (n > 0) { + rte_memset16((uint8_t *)dst - 16 + n, a); + return ret; + } + } + return ret; +} + +#endif /* RTE_MACHINE_CPUFLAG */ + +#ifdef __cplusplus +} +#endif + +#endif /* _RTE_MEMSET_H_ */ diff --git a/lib/librte_eal/common/include/generic/rte_memset.h b/lib/librte_eal/common/include/generic/rte_memset.h new file mode 100644 index 0000000..b03a7d0 --- /dev/null +++ b/lib/librte_eal/common/include/generic/rte_memset.h @@ -0,0 +1,52 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2010-2014 Intel Corporation. All rights reserved. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _RTE_MEMSET_H_ +#define _RTE_MEMSET_H_ + +/** + * @file + * + * Functions for vectorised implementation of memset(). + */ +#ifdef _RTE_MEMSET_X86_64_H_ + +static void * +rte_memset(void *dst, int a, size_t n); + +#else + +#define rte_memset memset + +#endif +#endif /* _RTE_MEMSET_H_ */ -- 2.7.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
* [dpdk-dev] [PATCH v2 2/4] app/test: add functional autotest for rte_memset 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang @ 2016-12-27 10:04 ` Zhiyong Yang 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 3/4] app/test: add performance " Zhiyong Yang ` (2 subsequent siblings) 4 siblings, 0 replies; 44+ messages in thread From: Zhiyong Yang @ 2016-12-27 10:04 UTC (permalink / raw) To: dev Cc: yuanhan.liu, thomas.monjalon, bruce.richardson, konstantin.ananyev, pablo.de.lara.guarch, Zhiyong Yang The file implements the functional autotest for rte_memset, which validates that the new function rte_memset works correctly. The implementation of test_memcpy.c is used as a reference. Usage: step 1: run ./x86_64-native-linuxapp-gcc/app/test step 2: run the command memset_autotest at runtime. Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com> --- app/test/Makefile | 2 + app/test/test_memset.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 160 insertions(+) create mode 100644 app/test/test_memset.c diff --git a/app/test/Makefile b/app/test/Makefile index 5be023a..82da3f3 100644 --- a/app/test/Makefile +++ b/app/test/Makefile @@ -123,6 +123,8 @@ SRCS-y += test_logs.c SRCS-y += test_memcpy.c SRCS-y += test_memcpy_perf.c +SRCS-y += test_memset.c + SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash.c SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_thash.c SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_perf.c diff --git a/app/test/test_memset.c b/app/test/test_memset.c new file mode 100644 index 0000000..c9020bf --- /dev/null +++ b/app/test/test_memset.c @@ -0,0 +1,158 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2010-2016 Intel Corporation. All rights reserved. + * All rights reserved.
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#include <stdint.h> +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <rte_common.h> +#include <rte_random.h> +#include <rte_memset.h> +#include "test.h" + +/* + * Set this to the maximum buffer size you want to test. If it is 0, then the + * values in the buf_sizes[] array below will be used. 
+ */ +#define TEST_VALUE_RANGE 0 +#define MAX_INT8 127 +#define MIN_INT8 -128 +/* List of buffer sizes to test */ +#if TEST_VALUE_RANGE == 0 +static size_t buf_sizes[] = { + 0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129, + 255, 256, 257, 320, 384, 511, 512, 513, 1023, 1024, 1025, 1518, + 1522, 1600, 2048, 3072, 4096, 5120, 6144, 7168, 8192 +}; +/* MUST be as large as largest packet size above */ +#define BUFFER_SIZE 8192 +#else /* TEST_VALUE_RANGE != 0 */ +static size_t buf_sizes[TEST_VALUE_RANGE]; +#define BUFFER_SIZE TEST_VALUE_RANGE +#endif /* TEST_VALUE_RANGE == 0 */ + +/* Data is aligned on this many bytes (power of 2) */ +#define ALIGNMENT_UNIT 32 + +/* + * Create two buffers: initialize one as the reference buffer with + * random values, then copy it into the other (dest_buff). Set a region + * of dest_buff to ch using rte_memset, then compare to verify that the + * rte_memset succeeded. The bytes outside the set region are + * also checked to make sure they were not changed.
+ */ +static int +test_single_memset(unsigned int off_dst, int ch, size_t size) +{ + unsigned int i; + uint8_t dest_buff[BUFFER_SIZE + ALIGNMENT_UNIT]; + uint8_t ref_buff[BUFFER_SIZE + ALIGNMENT_UNIT]; + void *ret; + + /* Setup buffers */ + for (i = 0; i < BUFFER_SIZE + ALIGNMENT_UNIT; i++) { + ref_buff[i] = (uint8_t) rte_rand(); + dest_buff[i] = ref_buff[i]; + } + /* Do the rte_memset */ + ret = rte_memset(dest_buff + off_dst, ch, size); + if (ret != (dest_buff + off_dst)) { + printf("rte_memset() returned %p, not %p\n", + ret, dest_buff + off_dst); + return -1; + } + /* Check nothing before offset was affected */ + for (i = 0; i < off_dst; i++) { + if (dest_buff[i] != ref_buff[i]) { + printf("rte_memset() failed for %u bytes (offsets=%u): \ + [modified before start of dst].\n", + (unsigned int)size, off_dst); + return -1; + } + } + /* Check every byte was set */ + for (i = 0; i < size; i++) { + if (dest_buff[i + off_dst] != (uint8_t)ch) { + printf("rte_memset() failed for %u bytes (offsets=%u): \ + [didn't memset byte %u].\n", + (unsigned int)size, off_dst, i); + return -1; + } + } + /* Check nothing after memset was affected */ + for (i = off_dst + size; i < BUFFER_SIZE + ALIGNMENT_UNIT; i++) { + if (dest_buff[i] != ref_buff[i]) { + printf("rte_memset() failed for %u bytes (offsets=%u): \ + [memset too many].\n", + (unsigned int)size, off_dst); + return -1; + } + } + return 0; +} + +/* + * Check functionality for various buffer sizes and data offsets/alignments.
+ */ +static int +func_test(void) +{ + unsigned int off_dst, i; + unsigned int num_buf_sizes = sizeof(buf_sizes) / sizeof(buf_sizes[0]); + int ret; + int j; + + for (j = MIN_INT8; j <= MAX_INT8; j++) { + for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) { + for (i = 0; i < num_buf_sizes; i++) { + ret = test_single_memset(off_dst, j, + buf_sizes[i]); + if (ret != 0) + return -1; + } + } + } + return 0; +} + +static int +test_memset(void) +{ + int ret; + + ret = func_test(); + if (ret != 0) + return -1; + return 0; +} + +REGISTER_TEST_COMMAND(memset_autotest, test_memset); -- 2.7.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
* [dpdk-dev] [PATCH v2 3/4] app/test: add performance autotest for rte_memset 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 2/4] app/test: add functional autotest for rte_memset Zhiyong Yang @ 2016-12-27 10:04 ` Zhiyong Yang 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang 2017-01-09 9:48 ` [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test Yang, Zhiyong 4 siblings, 0 replies; 44+ messages in thread From: Zhiyong Yang @ 2016-12-27 10:04 UTC (permalink / raw) To: dev Cc: yuanhan.liu, thomas.monjalon, bruce.richardson, konstantin.ananyev, pablo.de.lara.guarch, Zhiyong Yang The file implements the performance autotest for rte_memset. Running it reports comparative performance data for rte_memset and memset. The results show that rte_memset clearly outperforms glibc memset, especially for small N bytes. The first column shows the size N for memset & rte_memset. The second column lists the rte_memset vs. memset performance numbers in cache. The third column lists the rte_memset vs. memset performance numbers in memory. The following data was gathered on Haswell.
** rte_memset() - memset perf tests
   (C = compile-time constant) **
======== ======= ======== ======= ========
   Size   memset in cache   memset in mem
(bytes)           (ticks)         (ticks)
------- --------------- ---------------
============= 32B aligned ================
      1      3 -   8       14 -  115
      3      4 -   8       19 -  125
      6      3 -   7       19 -  125
      8      3 -   6       19 -  124
     12      3 -   6       19 -  124
     15      3 -   6       19 -  125
     16      3 -   8       13 -  125
     32      3 -   7       19 -  133
     64      3 -   7       28 -  162
     65      6 -   8       41 -  182
    128      6 -  13       54 -  199
    192      8 -  13       77 -  273
    255      8 -  16      100 -  222
    512     17 -  14      187 -  247
    768     22 -  20      270 -  362
   1024     29 -  28      329 -  377
   2048     63 -  57      564 -  601
   4096    104 - 102      993 - 1025
   8192    200 - 211     1831 - 2270
------ -------------- -------------- ------
C     6      2 -   2       19 -   19
C    64      2 -   6       28 -   33
C   128      3 -  12       54 -   59
C   192      5 -  29       77 -   83
C   256      6 -  35      100 -  105
C   512     12 -  60      188 -  195
C   768     18 -  20      271 -  362
C  1024     24 -  29      329 -  377
Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com> --- Change in V2: Add perf comparison data between rte_memset and memset on haswell. app/test/Makefile | 1 + app/test/test_memset_perf.c | 348 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 349 insertions(+) create mode 100644 app/test/test_memset_perf.c diff --git a/app/test/Makefile b/app/test/Makefile index 82da3f3..1c3e7f1 100644 --- a/app/test/Makefile +++ b/app/test/Makefile @@ -124,6 +124,7 @@ SRCS-y += test_memcpy.c SRCS-y += test_memcpy_perf.c SRCS-y += test_memset.c +SRCS-y += test_memset_perf.c SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash.c SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_thash.c diff --git a/app/test/test_memset_perf.c b/app/test/test_memset_perf.c new file mode 100644 index 0000000..83b15b5 --- /dev/null +++ b/app/test/test_memset_perf.c @@ -0,0 +1,348 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2010-2016 Intel Corporation. All rights reserved. + * All rights reserved.
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#include <stdint.h> +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <rte_common.h> +#include <rte_cycles.h> +#include <rte_random.h> +#include <rte_malloc.h> +#include <rte_memset.h> +#include "test.h" + +/* + * Set this to the maximum buffer size you want to test. If it is 0, then the + * values in the buf_sizes[] array below will be used. 
+ */ +#define TEST_VALUE_RANGE 0 + +/* List of buffer sizes to test */ +#if TEST_VALUE_RANGE == 0 +static size_t buf_sizes[] = { + 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, + 70, 85, 96, 105, 115, 127, 128, 129, 161, 191, 192, 193, 255, 256, + 257, 319, 320, 321, 383, 384, 385, 447, 448, 449, 511, 512, 513, + 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600, 2048, 2560, + 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192 +}; +/* MUST be as large as largest packet size above */ +#define SMALL_BUFFER_SIZE 8192 +#else /* TEST_VALUE_RANGE != 0 */ +static size_t buf_sizes[TEST_VALUE_RANGE]; +#define SMALL_BUFFER_SIZE TEST_VALUE_RANGE +#endif /* TEST_VALUE_RANGE == 0 */ + +/* + * Arrays of this size are used for measuring uncached memory accesses by + * picking a random location within the buffer. Make this smaller if there are + * memory allocation errors. + */ +#define LARGE_BUFFER_SIZE (100 * 1024 * 1024) + +/* How many times to run timing loop for performance tests */ +#define TEST_ITERATIONS 1000000 +#define TEST_BATCH_SIZE 100 + +/* Data is aligned on this many bytes (power of 2) */ +#ifdef RTE_MACHINE_CPUFLAG_AVX512F +#define ALIGNMENT_UNIT 64 +#elif defined RTE_MACHINE_CPUFLAG_AVX2 +#define ALIGNMENT_UNIT 32 +#else /* RTE_MACHINE_CPUFLAG */ +#define ALIGNMENT_UNIT 16 +#endif /* RTE_MACHINE_CPUFLAG */ + +/* + * Pointers used in performance tests. The two large buffers are for uncached + * access where random addresses within the buffer are used for each + * memset. The two small buffers are for cached access. + */ +static uint8_t *large_buf_read, *large_buf_write; +static uint8_t *small_buf_read, *small_buf_write; + +/* Initialise data buffers. 
*/ +static int +init_buffers(void) +{ + unsigned int i; + + large_buf_read = rte_malloc("memset", LARGE_BUFFER_SIZE + + ALIGNMENT_UNIT, ALIGNMENT_UNIT); + if (large_buf_read == NULL) + goto error_large_buf_read; + + large_buf_write = rte_malloc("memset", LARGE_BUFFER_SIZE + + ALIGNMENT_UNIT, ALIGNMENT_UNIT); + if (large_buf_write == NULL) + goto error_large_buf_write; + + small_buf_read = rte_malloc("memset", SMALL_BUFFER_SIZE + + ALIGNMENT_UNIT, ALIGNMENT_UNIT); + if (small_buf_read == NULL) + goto error_small_buf_read; + + small_buf_write = rte_malloc("memset", SMALL_BUFFER_SIZE + + ALIGNMENT_UNIT, ALIGNMENT_UNIT); + if (small_buf_write == NULL) + goto error_small_buf_write; + + for (i = 0; i < LARGE_BUFFER_SIZE; i++) + large_buf_read[i] = rte_rand(); + for (i = 0; i < SMALL_BUFFER_SIZE; i++) + small_buf_read[i] = rte_rand(); + + return 0; + +error_small_buf_write: + rte_free(small_buf_read); +error_small_buf_read: + rte_free(large_buf_write); +error_large_buf_write: + rte_free(large_buf_read); +error_large_buf_read: + printf("ERROR: not enough memory\n"); + return -1; +} + +/* Cleanup data buffers */ +static void +free_buffers(void) +{ + rte_free(large_buf_read); + rte_free(large_buf_write); + rte_free(small_buf_read); + rte_free(small_buf_write); +} + +/* + * Get a random offset into large array, with enough space needed to perform + * max memset size. Offset is aligned, uoffset is used for unalignment setting. + */ +static inline size_t +get_rand_offset(size_t uoffset) +{ + return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) & + ~(ALIGNMENT_UNIT - 1)) + uoffset; +} + +/* Fill in destination addresses. */ +static inline void +fill_addr_arrays(size_t *dst_addr, int is_dst_cached, size_t dst_uoffset) +{ + unsigned int i; + + for (i = 0; i < TEST_BATCH_SIZE; i++) + dst_addr[i] = (is_dst_cached) ? 
dst_uoffset : + get_rand_offset(dst_uoffset); +} + +/* + * WORKAROUND: For some reason the first test doing an uncached write + * takes a very long time (~25 times longer than is expected). So we do + * it once without timing. + */ +static void +do_uncached_write(uint8_t *dst, int is_dst_cached, size_t size) +{ + unsigned int i, j; + size_t dst_addrs[TEST_BATCH_SIZE]; + int ch = rte_rand() & 0xff; + + for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) { + fill_addr_arrays(dst_addrs, is_dst_cached, 0); + for (j = 0; j < TEST_BATCH_SIZE; j++) + rte_memset(dst+dst_addrs[j], ch, size); + } +} + +/* + * Run a single memset performance test. This is a macro to ensure that if + * the "size" parameter is a constant it won't be converted to a variable. + */ +#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset, size) \ +do { \ + unsigned int iter, t; \ + size_t dst_addrs[TEST_BATCH_SIZE]; \ + uint64_t start_time, total_time = 0; \ + uint64_t total_time2 = 0; \ + int ch = rte_rand() & 0xff; \ + \ + for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {\ + fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset); \ + start_time = rte_rdtsc(); \ + for (t = 0; t < TEST_BATCH_SIZE; t++) \ + rte_memset(dst+dst_addrs[t], ch, size); \ + total_time += rte_rdtsc() - start_time; \ + } \ + for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {\ + fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset); \ + start_time = rte_rdtsc(); \ + for (t = 0; t < TEST_BATCH_SIZE; t++) \ + memset(dst+dst_addrs[t], ch, size); \ + total_time2 += rte_rdtsc() - start_time; \ + } \ + printf("%8.0f -", (double)total_time / TEST_ITERATIONS); \ + printf("%5.0f", (double)total_time2 / TEST_ITERATIONS); \ +} while (0) + +/* Run aligned memset tests. 
*/ +#define ALL_PERF_TESTS_FOR_SIZE(n) \ +do { \ + if (__builtin_constant_p(n)) \ + printf("\nC%6u", (unsigned int)n); \ + else \ + printf("\n%7u", (unsigned int)n); \ + SINGLE_PERF_TEST(small_buf_write, 1, 0, n); \ + SINGLE_PERF_TEST(large_buf_write, 0, 0, n); \ +} while (0) + +/* Run unaligned memset tests */ +#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n) \ +do { \ + if (__builtin_constant_p(n)) \ + printf("\nC%6u", (unsigned int)n); \ + else \ + printf("\n%7u", (unsigned int)n); \ + SINGLE_PERF_TEST(small_buf_write, 1, 1, n); \ + SINGLE_PERF_TEST(large_buf_write, 0, 1, n); \ +} while (0) + +/* Run memset tests for constant length */ +#define ALL_PERF_TEST_FOR_CONSTANT \ +do { \ + TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U); \ + TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U); \ + TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U); \ +} while (0) + +/* Run all memset tests for aligned constant cases */ +static inline void +perf_test_constant_aligned(void) +{ +#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE + ALL_PERF_TEST_FOR_CONSTANT; +#undef TEST_CONSTANT +} + +/* Run all memset tests for unaligned constant cases */ +static inline void +perf_test_constant_unaligned(void) +{ +#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE_UNALIGNED + ALL_PERF_TEST_FOR_CONSTANT; +#undef TEST_CONSTANT +} + +/* Run all memset tests for aligned variable cases */ +static inline void +perf_test_variable_aligned(void) +{ + unsigned int n = sizeof(buf_sizes) / sizeof(buf_sizes[0]); + unsigned int i; + + for (i = 0; i < n; i++) + ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]); +} + +/* Run all memset tests for unaligned variable cases */ +static inline void +perf_test_variable_unaligned(void) +{ + unsigned int n = sizeof(buf_sizes) / sizeof(buf_sizes[0]); + unsigned int i; + + for (i = 0; i < n; i++) + ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]); +} + +/* Run all memset tests */ +static int +perf_test(void) +{ + int ret; + + ret = 
init_buffers(); + if (ret != 0) + return ret; + +#if TEST_VALUE_RANGE != 0 + /* Set up buf_sizes array, if required */ + unsigned int i; + + for (i = 0; i < TEST_VALUE_RANGE; i++) + buf_sizes[i] = i; +#endif + + /* See function comment */ + do_uncached_write(large_buf_write, 0, SMALL_BUFFER_SIZE); + + printf("\n** rte_memset() - memset perf tests \t\n \ + (C = compile-time constant) **\n" + "======== ======= ======== ======= ========\n" + " Size memset in cache memset in mem\n" + "(bytes) (ticks) (ticks)\n" + "------- -------------- ---------------"); + + printf("\n============= %2dB aligned ================", ALIGNMENT_UNIT); + /* Do aligned tests where size is a variable */ + perf_test_variable_aligned(); + printf("\n------ -------------- -------------- ------"); + /* Do aligned tests where size is a compile-time constant */ + perf_test_constant_aligned(); + printf("\n============= Unaligned ==================="); + /* Do unaligned tests where size is a variable */ + perf_test_variable_unaligned(); + printf("\n------ -------------- -------------- ------"); + /* Do unaligned tests where size is a compile-time constant */ + perf_test_constant_unaligned(); + printf("\n====== ============== ============== =======\n\n"); + + free_buffers(); + + return 0; +} + +static int +test_memset_perf(void) +{ + int ret; + + ret = perf_test(); + if (ret != 0) + return -1; + return 0; +} + +REGISTER_TEST_COMMAND(memset_perf_autotest, test_memset_perf); -- 2.7.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
* [dpdk-dev] [PATCH v2 4/4] lib/librte_vhost: improve vhost perf using rte_memset 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang ` (2 preceding siblings ...) 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 3/4] app/test: add performance " Zhiyong Yang @ 2016-12-27 10:04 ` Zhiyong Yang 2017-01-09 9:48 ` [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test Yang, Zhiyong 4 siblings, 0 replies; 44+ messages in thread From: Zhiyong Yang @ 2016-12-27 10:04 UTC (permalink / raw) To: dev Cc: yuanhan.liu, thomas.monjalon, bruce.richardson, konstantin.ananyev, pablo.de.lara.guarch, Zhiyong Yang Using rte_memset instead of copy_virtio_net_hdr can bring 3%~4% performance improvements on IA platform from virtio/vhost non-mergeable loopback testing. Two key points have been considered: 1. One variable initialization is saved, which avoids a memory store. 2. copy_virtio_net_hdr involves both a load (from the stack, the virtio_hdr var) and a store (to virtio driver memory), while rte_memset involves only stores. Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com> --- Changes in V2: Modify release_17_02.rst description. doc/guides/rel_notes/release_17_02.rst | 7 +++++++ lib/librte_vhost/virtio_net.c | 18 +++++++++++------- 2 files changed, 18 insertions(+), 7 deletions(-) diff --git a/doc/guides/rel_notes/release_17_02.rst b/doc/guides/rel_notes/release_17_02.rst index 180af82..3d39cde 100644 --- a/doc/guides/rel_notes/release_17_02.rst +++ b/doc/guides/rel_notes/release_17_02.rst @@ -52,6 +52,13 @@ New Features See the :ref:`Generic flow API <Generic_flow_API>` documentation for more information. +* **Introduced rte_memset on IA platform.** + + A performance drop occurred in some cases on Ivybridge when DPDK code called the + glibc function memset. It was necessary to introduce a more efficient function + to replace it.
The function rte_memset supported three types of instruction sets + including sse & avx(128 bits), avx2(256 bits) and avx512(512bits) and have better + performance than glibc memset. Resolved Issues --------------- diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c index 595f67c..392b31b 100644 --- a/lib/librte_vhost/virtio_net.c +++ b/lib/librte_vhost/virtio_net.c @@ -37,6 +37,7 @@ #include <rte_mbuf.h> #include <rte_memcpy.h> +#include <rte_memset.h> #include <rte_ether.h> #include <rte_ip.h> #include <rte_virtio_net.h> @@ -194,7 +195,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs, uint32_t cpy_len; struct vring_desc *desc; uint64_t desc_addr; - struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0}; + struct virtio_net_hdr *virtio_hdr; desc = &descs[desc_idx]; desc_addr = gpa_to_vva(dev, desc->addr); @@ -208,8 +209,9 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs, rte_prefetch0((void *)(uintptr_t)desc_addr); - virtio_enqueue_offload(m, &virtio_hdr.hdr); - copy_virtio_net_hdr(dev, desc_addr, virtio_hdr); + virtio_hdr = (struct virtio_net_hdr *)(uintptr_t)desc_addr; + rte_memset(virtio_hdr, 0, sizeof(*virtio_hdr)); + virtio_enqueue_offload(m, virtio_hdr); vhost_log_write(dev, desc->addr, dev->vhost_hlen); PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0); @@ -459,7 +461,6 @@ static inline int __attribute__((always_inline)) copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m, struct buf_vector *buf_vec, uint16_t num_buffers) { - struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0}; uint32_t vec_idx = 0; uint64_t desc_addr; uint32_t mbuf_offset, mbuf_avail; @@ -480,7 +481,6 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m, hdr_phys_addr = buf_vec[vec_idx].buf_addr; rte_prefetch0((void *)(uintptr_t)hdr_addr); - virtio_hdr.num_buffers = num_buffers; LOG_DEBUG(VHOST_DATA, "(%d) RX: num merge buffers %d\n", dev->vid, 
num_buffers); @@ -512,8 +512,12 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m, } if (hdr_addr) { - virtio_enqueue_offload(hdr_mbuf, &virtio_hdr.hdr); - copy_virtio_net_hdr(dev, hdr_addr, virtio_hdr); + struct virtio_net_hdr_mrg_rxbuf *hdr = + (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr; + + rte_memset(&(hdr->hdr), 0, sizeof(hdr->hdr)); + hdr->num_buffers = num_buffers; + virtio_enqueue_offload(hdr_mbuf, &(hdr->hdr)); vhost_log_write(dev, hdr_phys_addr, dev->vhost_hlen); PRINT_PACKET(dev, (uintptr_t)hdr_addr, dev->vhost_hlen, 0); -- 2.7.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang ` (3 preceding siblings ...) 2016-12-27 10:04 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang @ 2017-01-09 9:48 ` Yang, Zhiyong 2017-01-17 6:24 ` Yang, Zhiyong 4 siblings, 1 reply; 44+ messages in thread From: Yang, Zhiyong @ 2017-01-09 9:48 UTC (permalink / raw) To: thomas.monjalon, Richardson, Bruce, Ananyev, Konstantin Cc: yuanhan.liu, De Lara Guarch, Pablo, dev Hi, Thomas, Bruce, Konstantin: Any comments about the patchset? Do I need to modify anything? Thanks Zhiyong > -----Original Message----- > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhiyong Yang > Sent: Tuesday, December 27, 2016 6:05 PM > To: dev@dpdk.org > Cc: yuanhan.liu@linux.intel.com; thomas.monjalon@6wind.com; Richardson, > Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com> > Subject: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and > related test > > DPDK code has met performance drop badly in some case when calling glibc > function memset. Reference to discussions about memset in > http://dpdk.org/ml/archives/dev/2016-October/048628.html > It is necessary to introduce more high efficient function to fix it. > One important thing about rte_memset is that we can get clear control on > what instruction flow is used. > > This patchset introduces rte_memset to bring more high efficient > implementation, and will bring obvious perf improvement, especially for > small N bytes in the most application scenarios. > > Patch 1 implements rte_memset in the file rte_memset.h on IA platform The > file supports three types of instruction sets including sse & avx (128bits), > avx2(256bits) and avx512(512bits). 
rte_memset makes use of vectorization > and inline function to improve the perf on IA. In addition, cache line and > memory alignment are fully taken into consideration. > > Patch 2 implements functional autotest to validates the function whether to > work in a right way. > > Patch 3 implements performance autotest separately in cache and memory. > We can see the perf of rte_memset is obviously better than glibc memset > especially for small N bytes. > > Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring 3%~4% > performance improvements on IA platform from virtio/vhost non-mergeable > loopback testing. > > Changes in V2: > > Patch 1: > Rename rte_memset.h -> rte_memset_64.h and create a file rte_memset.h > for each arch. > > Patch 3: > add the perf comparation data between rte_memset and memset on > haswell. > > Patch 4: > Modify release_17_02.rst description. > > Zhiyong Yang (4): > eal/common: introduce rte_memset on IA platform > app/test: add functional autotest for rte_memset > app/test: add performance autotest for rte_memset > lib/librte_vhost: improve vhost perf using rte_memset > > app/test/Makefile | 3 + > app/test/test_memset.c | 158 +++++++++ > app/test/test_memset_perf.c | 348 +++++++++++++++++++ > doc/guides/rel_notes/release_17_02.rst | 7 + > .../common/include/arch/arm/rte_memset.h | 36 ++ > .../common/include/arch/ppc_64/rte_memset.h | 36 ++ > .../common/include/arch/tile/rte_memset.h | 36 ++ > .../common/include/arch/x86/rte_memset.h | 51 +++ > .../common/include/arch/x86/rte_memset_64.h | 378 > +++++++++++++++++++++ > lib/librte_eal/common/include/generic/rte_memset.h | 52 +++ > lib/librte_vhost/virtio_net.c | 18 +- > 11 files changed, 1116 insertions(+), 7 deletions(-) create mode 100644 > app/test/test_memset.c create mode 100644 app/test/test_memset_perf.c > create mode 100644 > lib/librte_eal/common/include/arch/arm/rte_memset.h > create mode 100644 > lib/librte_eal/common/include/arch/ppc_64/rte_memset.h > create mode 100644 
lib/librte_eal/common/include/arch/tile/rte_memset.h > create mode 100644 > lib/librte_eal/common/include/arch/x86/rte_memset.h > create mode 100644 > lib/librte_eal/common/include/arch/x86/rte_memset_64.h > create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h > > -- > 2.7.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
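The body of rte_memset_64.h is not quoted in this thread, but the core technique the cover letter describes — wide vector stores, with a final overlapping store instead of a scalar tail loop — can be sketched portably. This is a hypothetical illustration using 8-byte stores; the actual patch uses 16/32/64-byte SSE/AVX/AVX-512 registers:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative wide-store memset: fill 8 bytes per store, then cover
 * the remainder with one overlapping store rather than a byte loop. */
static void *
wide_memset(void *dst, int c, size_t n)
{
    uint8_t *p = dst;
    /* Broadcast the fill byte into a 64-bit pattern. */
    uint64_t v = (uint8_t)c * UINT64_C(0x0101010101010101);
    size_t i;

    if (n < 8) {                    /* small N: scalar fallback */
        while (n--)
            p[n] = (uint8_t)c;
        return dst;
    }
    for (i = 0; i + 8 <= n; i += 8)
        memcpy(p + i, &v, 8);       /* unaligned-safe 8-byte store */
    memcpy(p + n - 8, &v, 8);       /* overlapping store covers the tail */
    return dst;
}
```

The overlapping tail store is the key trick for small, non-multiple-of-register sizes: a few bytes are written twice, which is cheaper than branching into a byte-by-byte loop.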
* Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test 2017-01-09 9:48 ` [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test Yang, Zhiyong @ 2017-01-17 6:24 ` Yang, Zhiyong 2017-01-17 20:14 ` Thomas Monjalon 0 siblings, 1 reply; 44+ messages in thread From: Yang, Zhiyong @ 2017-01-17 6:24 UTC (permalink / raw) To: thomas.monjalon, Richardson, Bruce, Ananyev, Konstantin Cc: yuanhan.liu, De Lara Guarch, Pablo, dev Hi, Thomas: Does this patchset have chance to be applied for 1702 release? Thanks Zhiyong > -----Original Message----- > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong > Sent: Monday, January 9, 2017 5:49 PM > To: thomas.monjalon@6wind.com; Richardson, Bruce > <bruce.richardson@intel.com>; Ananyev, Konstantin > <konstantin.ananyev@intel.com> > Cc: yuanhan.liu@linux.intel.com; De Lara Guarch, Pablo > <pablo.de.lara.guarch@intel.com>; dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset > and related test > > Hi, Thomas, Bruce, Konstantin: > > Any comments about the patchset? Do I need to modify anything? > > Thanks > Zhiyong > > > -----Original Message----- > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhiyong Yang > > Sent: Tuesday, December 27, 2016 6:05 PM > > To: dev@dpdk.org > > Cc: yuanhan.liu@linux.intel.com; thomas.monjalon@6wind.com; > > Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo > > <pablo.de.lara.guarch@intel.com> > > Subject: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset > > and related test > > > > DPDK code has met performance drop badly in some case when calling > > glibc function memset. Reference to discussions about memset in > > http://dpdk.org/ml/archives/dev/2016-October/048628.html > > It is necessary to introduce more high efficient function to fix it. 
> > One important thing about rte_memset is that we can get clear control > > on what instruction flow is used. > > > > This patchset introduces rte_memset to bring more high efficient > > implementation, and will bring obvious perf improvement, especially > > for small N bytes in the most application scenarios. > > > > Patch 1 implements rte_memset in the file rte_memset.h on IA platform > > The file supports three types of instruction sets including sse & avx > > (128bits), > > avx2(256bits) and avx512(512bits). rte_memset makes use of > > vectorization and inline function to improve the perf on IA. In > > addition, cache line and memory alignment are fully taken into > consideration. > > > > Patch 2 implements functional autotest to validates the function > > whether to work in a right way. > > > > Patch 3 implements performance autotest separately in cache and memory. > > We can see the perf of rte_memset is obviously better than glibc > > memset especially for small N bytes. > > > > Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring > > 3%~4% performance improvements on IA platform from virtio/vhost > > non-mergeable loopback testing. > > > > Changes in V2: > > > > Patch 1: > > Rename rte_memset.h -> rte_memset_64.h and create a file > rte_memset.h > > for each arch. > > > > Patch 3: > > add the perf comparation data between rte_memset and memset on > > haswell. > > > > Patch 4: > > Modify release_17_02.rst description. 
> > > > Zhiyong Yang (4): > > eal/common: introduce rte_memset on IA platform > > app/test: add functional autotest for rte_memset > > app/test: add performance autotest for rte_memset > > lib/librte_vhost: improve vhost perf using rte_memset > > > > app/test/Makefile | 3 + > > app/test/test_memset.c | 158 +++++++++ > > app/test/test_memset_perf.c | 348 +++++++++++++++++++ > > doc/guides/rel_notes/release_17_02.rst | 7 + > > .../common/include/arch/arm/rte_memset.h | 36 ++ > > .../common/include/arch/ppc_64/rte_memset.h | 36 ++ > > .../common/include/arch/tile/rte_memset.h | 36 ++ > > .../common/include/arch/x86/rte_memset.h | 51 +++ > > .../common/include/arch/x86/rte_memset_64.h | 378 > > +++++++++++++++++++++ > > lib/librte_eal/common/include/generic/rte_memset.h | 52 +++ > > lib/librte_vhost/virtio_net.c | 18 +- > > 11 files changed, 1116 insertions(+), 7 deletions(-) create mode > > 100644 app/test/test_memset.c create mode 100644 > > app/test/test_memset_perf.c create mode 100644 > > lib/librte_eal/common/include/arch/arm/rte_memset.h > > create mode 100644 > > lib/librte_eal/common/include/arch/ppc_64/rte_memset.h > > create mode 100644 > > lib/librte_eal/common/include/arch/tile/rte_memset.h > > create mode 100644 > > lib/librte_eal/common/include/arch/x86/rte_memset.h > > create mode 100644 > > lib/librte_eal/common/include/arch/x86/rte_memset_64.h > > create mode 100644 > lib/librte_eal/common/include/generic/rte_memset.h > > > > -- > > 2.7.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test 2017-01-17 6:24 ` Yang, Zhiyong @ 2017-01-17 20:14 ` Thomas Monjalon 2017-01-18 0:15 ` Vincent JARDIN 2017-01-18 2:42 ` Yang, Zhiyong 0 siblings, 2 replies; 44+ messages in thread From: Thomas Monjalon @ 2017-01-17 20:14 UTC (permalink / raw) To: Yang, Zhiyong Cc: Richardson, Bruce, Ananyev, Konstantin, yuanhan.liu, De Lara Guarch, Pablo, dev 2017-01-17 06:24, Yang, Zhiyong: > Hi, Thomas: > Does this patchset have chance to be applied for 1702 release? It could be part of 17.02 but there are some issues: The x86 part did not receive any ack from x86 maintainers. checkpatch reports some warnings, especially about counting elements of an array. Please use RTE_DIM. The file in generic/ is for doxygen only. Please check how it is done for other files. The description is "Functions for vectorised implementation of memset()." Does it mean memset from glibc does not use vector instructions? The functional autotest is not integrated in the basic test suite. I wish this kind of review would be done by someone else. As it has not a big performance impact, this series could wait the next release. By the way, have you tried to work on glibc, as I had suggested? ^ permalink raw reply [flat|nested] 44+ messages in thread
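For reference, the checkpatch complaint Thomas mentions is about open-coded array element counts. RTE_DIM is defined in DPDK's rte_common.h as the element count of an array, so the fix looks like this (buf_sizes here is a shortened sample, not the patch's full table):

```c
#include <stddef.h>

/* RTE_DIM, as defined in DPDK's rte_common.h. */
#define RTE_DIM(a) (sizeof(a) / sizeof((a)[0]))

static size_t buf_sizes[] = { 0, 1, 7, 8, 9, 15, 16 };

/* Instead of spelling out sizeof(buf_sizes) / sizeof(buf_sizes[0])
 * at each use site, use the macro: */
static unsigned int
num_buf_sizes(void)
{
    return (unsigned int)RTE_DIM(buf_sizes);
}
```

Because the macro operates on the array object itself, it only works on true arrays, not on pointers — one reason checkpatch prefers the single named macro over repeating the idiom by hand.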
* Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test 2017-01-17 20:14 ` Thomas Monjalon @ 2017-01-18 0:15 ` Vincent JARDIN 2017-01-18 2:42 ` Yang, Zhiyong 1 sibling, 0 replies; 44+ messages in thread From: Vincent JARDIN @ 2017-01-18 0:15 UTC (permalink / raw) To: Thomas Monjalon, Yang, Zhiyong Cc: Richardson, Bruce, Ananyev, Konstantin, yuanhan.liu, De Lara Guarch, Pablo, dev On 17/01/2017 at 21:14, Thomas Monjalon wrote: > By the way, have you tried to work on glibc, as I had suggested? Nothing here: https://sourceware.org/cgi-bin/search.cgi?wm=wrd&form=extended&m=all&s=D&ul=%2Fml%2Flibc-alpha%2F%25&q=memset ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test 2017-01-17 20:14 ` Thomas Monjalon 2017-01-18 0:15 ` Vincent JARDIN @ 2017-01-18 2:42 ` Yang, Zhiyong 2017-01-18 7:42 ` Thomas Monjalon 1 sibling, 1 reply; 44+ messages in thread From: Yang, Zhiyong @ 2017-01-18 2:42 UTC (permalink / raw) To: Thomas Monjalon Cc: Richardson, Bruce, Ananyev, Konstantin, yuanhan.liu, De Lara Guarch, Pablo, dev Hi, Thomas: Thanks for your reply. > -----Original Message----- > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > Sent: Wednesday, January 18, 2017 4:14 AM > To: Yang, Zhiyong <zhiyong.yang@intel.com> > Cc: Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin > <konstantin.ananyev@intel.com>; yuanhan.liu@linux.intel.com; De Lara > Guarch, Pablo <pablo.de.lara.guarch@intel.com>; dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset > and related test > > 2017-01-17 06:24, Yang, Zhiyong: > > Hi, Thomas: > > Does this patchset have chance to be applied for 1702 release? > > It could be part of 17.02 but there are some issues: > > The x86 part did not receive any ack from x86 maintainers. Ok > > checkpatch reports some warnings, especially about counting elements of an > array. Please use RTE_DIM. Ok, I ignored these warnings, taking the current release code as a reference. Cleaner code will be sent in the future. > > The file in generic/ is for doxygen only. > Please check how it is done for other files. Ok. I didn't know this before. :) Thank you. > > The description is "Functions for vectorised implementation of memset()." > Does it mean memset from glibc does not use vector instructions? > Sorry for the misleading wording; glibc memset() does, of course, use vector instructions for optimization. I only meant "functions implementing the same functionality as glibc memset()". My poor English expression.
:) > The functional autotest is not integrated in the basic test suite. > I can run the command "memset_autotest"; it seems I left something out. > I wish this kind of review would be done by someone else. > As it has not a big performance impact, this series could wait the next release. Ok. Maybe memset() consumes only a small share of the current DPDK data path. > By the way, have you tried to work on glibc, as I had suggested? I'm not familiar with glibc's contribution process; as far as I know, glibc uses x86 asm rather than intrinsics. I will consider your suggestion. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test 2017-01-18 2:42 ` Yang, Zhiyong @ 2017-01-18 7:42 ` Thomas Monjalon 2017-01-19 1:36 ` Yang, Zhiyong 0 siblings, 1 reply; 44+ messages in thread From: Thomas Monjalon @ 2017-01-18 7:42 UTC (permalink / raw) To: Yang, Zhiyong Cc: Richardson, Bruce, Ananyev, Konstantin, yuanhan.liu, De Lara Guarch, Pablo, dev 2017-01-18 02:42, Yang, Zhiyong: > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > > The functional autotest is not integrated in the basic test suite. > > I can run command line "memset_autotest", It seems that I leave something out. Please check app/test/autotest_data.py ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and related test 2017-01-18 7:42 ` Thomas Monjalon @ 2017-01-19 1:36 ` Yang, Zhiyong 0 siblings, 0 replies; 44+ messages in thread From: Yang, Zhiyong @ 2017-01-19 1:36 UTC (permalink / raw) To: Thomas Monjalon Cc: Richardson, Bruce, Ananyev, Konstantin, yuanhan.liu, De Lara Guarch, Pablo, dev > -----Original Message----- > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > Sent: Wednesday, January 18, 2017 3:43 PM > To: Yang, Zhiyong <zhiyong.yang@intel.com> > Cc: Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin > <konstantin.ananyev@intel.com>; yuanhan.liu@linux.intel.com; De Lara > Guarch, Pablo <pablo.de.lara.guarch@intel.com>; dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset > and related test > > 2017-01-18 02:42, Yang, Zhiyong: > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > > > The functional autotest is not integrated in the basic test suite. > > > > I can run command line "memset_autotest", It seems that I leave > something out. > > Please check app/test/autotest_data.py Thanks, Thomas. ^ permalink raw reply [flat|nested] 44+ messages in thread
* [dpdk-dev] [PATCH 2/4] app/test: add functional autotest for rte_memset 2016-12-05 8:26 [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang 2016-12-02 10:00 ` Maxime Coquelin 2016-12-05 8:26 ` [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang @ 2016-12-05 8:26 ` Zhiyong Yang 2016-12-05 8:26 ` [dpdk-dev] [PATCH 3/4] app/test: add performance " Zhiyong Yang 2016-12-05 8:26 ` [dpdk-dev] [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang 4 siblings, 0 replies; 44+ messages in thread From: Zhiyong Yang @ 2016-12-05 8:26 UTC (permalink / raw) To: dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev, Zhiyong Yang The file implements the functional autotest for rte_memset, which validates that the new function rte_memset works correctly. The implementation of test_memcpy.c is used as a reference. Usage: step 1: run ./x86_64-native-linuxapp-gcc/app/test step 2: run the command memset_autotest at run time. Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com> --- app/test/Makefile | 2 + app/test/test_memset.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 160 insertions(+) create mode 100644 app/test/test_memset.c diff --git a/app/test/Makefile b/app/test/Makefile index 5be023a..82da3f3 100644 --- a/app/test/Makefile +++ b/app/test/Makefile @@ -123,6 +123,8 @@ SRCS-y += test_logs.c SRCS-y += test_memcpy.c SRCS-y += test_memcpy_perf.c +SRCS-y += test_memset.c + SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash.c SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_thash.c SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_perf.c diff --git a/app/test/test_memset.c b/app/test/test_memset.c new file mode 100644 index 0000000..c9020bf --- /dev/null +++ b/app/test/test_memset.c @@ -0,0 +1,158 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2010-2016 Intel Corporation. All rights reserved. + * All rights reserved.
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#include <stdint.h> +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <rte_common.h> +#include <rte_random.h> +#include <rte_memset.h> +#include "test.h" + +/* + * Set this to the maximum buffer size you want to test. If it is 0, then the + * values in the buf_sizes[] array below will be used. 
+ */ +#define TEST_VALUE_RANGE 0 +#define MAX_INT8 127 +#define MIN_INT8 -128 +/* List of buffer sizes to test */ +#if TEST_VALUE_RANGE == 0 +static size_t buf_sizes[] = { + 0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129, + 255, 256, 257, 320, 384, 511, 512, 513, 1023, 1024, 1025, 1518, + 1522, 1600, 2048, 3072, 4096, 5120, 6144, 7168, 8192 +}; +/* MUST be as large as largest packet size above */ +#define BUFFER_SIZE 8192 +#else /* TEST_VALUE_RANGE != 0 */ +static size_t buf_sizes[TEST_VALUE_RANGE]; +#define BUFFER_SIZE TEST_VALUE_RANGE +#endif /* TEST_VALUE_RANGE == 0 */ + +/* Data is aligned on this many bytes (power of 2) */ +#define ALIGNMENT_UNIT 32 + +/* + * Create two buffers: initialize one as the reference buffer with random + * values, and copy it into the other (dest_buff). Set a memory area of + * dest_buff using ch, then compare to check whether the rte_memset + * succeeded. The bytes outside the set area are also checked to make + * sure they were not changed.
+ */ +static int +test_single_memset(unsigned int off_dst, int ch, size_t size) +{ + unsigned int i; + uint8_t dest_buff[BUFFER_SIZE + ALIGNMENT_UNIT]; + uint8_t ref_buff[BUFFER_SIZE + ALIGNMENT_UNIT]; + void *ret; + + /* Setup buffers */ + for (i = 0; i < BUFFER_SIZE + ALIGNMENT_UNIT; i++) { + ref_buff[i] = (uint8_t) rte_rand(); + dest_buff[i] = ref_buff[i]; + } + /* Do the rte_memset */ + ret = rte_memset(dest_buff + off_dst, ch, size); + if (ret != (dest_buff + off_dst)) { + printf("rte_memset() returned %p, not %p\n", + ret, dest_buff + off_dst); + } + /* Check nothing before offset was affected */ + for (i = 0; i < off_dst; i++) { + if (dest_buff[i] != ref_buff[i]) { + printf("rte_memset() failed for %u bytes (offsets=%u): \ + [modified before start of dst].\n", + (unsigned int)size, off_dst); + return -1; + } + } + /* Check every byte was setted */ + for (i = 0; i < size; i++) { + if (dest_buff[i + off_dst] != (uint8_t)ch) { + printf("rte_memset() failed for %u bytes (offsets=%u): \ + [didn't memset byte %u].\n", + (unsigned int)size, off_dst, i); + return -1; + } + } + /* Check nothing after memset was affected */ + for (i = off_dst + size; i < BUFFER_SIZE + ALIGNMENT_UNIT; i++) { + if (dest_buff[i] != ref_buff[i]) { + printf("rte_memset() failed for %u bytes (offsets=%u): \ + [memset too many].\n", + (unsigned int)size, off_dst); + return -1; + } + } + return 0; +} + +/* + * Check functionality for various buffer sizes and data offsets/alignments. 
+ */ +static int +func_test(void) +{ + unsigned int off_dst, i; + unsigned int num_buf_sizes = sizeof(buf_sizes) / sizeof(buf_sizes[0]); + int ret; + int j; + + for (j = MIN_INT8; j <= MAX_INT8; j++) { + for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) { + for (i = 0; i < num_buf_sizes; i++) { + ret = test_single_memset(off_dst, j, + buf_sizes[i]); + if (ret != 0) + return -1; + } + } + } + return 0; +} + +static int +test_memset(void) +{ + int ret; + + ret = func_test(); + if (ret != 0) + return -1; + return 0; +} + +REGISTER_TEST_COMMAND(memset_autotest, test_memset); -- 2.7.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
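The validation pattern the autotest applies above (guard bytes before and after the window that rte_memset must not touch) can be reduced to a miniature, self-contained form. Plain `memset` stands in for `rte_memset` in this sketch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Miniature guard-region check: pattern-fill a buffer, set a window
 * [off, off+n) inside it, then verify the window holds ch and every
 * guard byte kept its original pattern value. Returns 0 on success. */
static int
check_window_set(size_t off, size_t n, int ch)
{
    uint8_t buf[64], ref[64];
    size_t i;

    if (off + n > sizeof(buf))
        return -1;                      /* window out of range */

    for (i = 0; i < sizeof(buf); i++)
        buf[i] = ref[i] = (uint8_t)(i * 7 + 1);   /* deterministic pattern */

    memset(buf + off, ch, n);           /* stand-in for rte_memset() */

    for (i = 0; i < sizeof(buf); i++) {
        uint8_t want = (i >= off && i < off + n) ? (uint8_t)ch : ref[i];
        if (buf[i] != want)
            return -1;      /* set missed a byte or clobbered a guard */
    }
    return 0;
}
```

The full autotest runs this shape of check over all fill values (-128..127), all offsets within one alignment unit, and the whole buf_sizes[] table.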
* [dpdk-dev] [PATCH 3/4] app/test: add performance autotest for rte_memset 2016-12-05 8:26 [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang ` (2 preceding siblings ...) 2016-12-05 8:26 ` [dpdk-dev] [PATCH 2/4] app/test: add functional autotest for rte_memset Zhiyong Yang @ 2016-12-05 8:26 ` Zhiyong Yang 2016-12-05 8:26 ` [dpdk-dev] [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang 4 siblings, 0 replies; 44+ messages in thread From: Zhiyong Yang @ 2016-12-05 8:26 UTC (permalink / raw) To: dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev, Zhiyong Yang The file implements the perf autotest for rte_memset. Running it produces perf data comparing memset and rte_memset. The first column shows the size N for memset. The second column lists the timings for memset in cache; the third column lists the timings for memset in memory. Usage: step 1: run ./x86_64-native-linuxapp-gcc/app/test step 2: run the command memset_perf_autotest at run time. Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com> --- app/test/Makefile | 1 + app/test/test_memset_perf.c | 348 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 349 insertions(+) create mode 100644 app/test/test_memset_perf.c diff --git a/app/test/Makefile b/app/test/Makefile index 82da3f3..1c3e7f1 100644 --- a/app/test/Makefile +++ b/app/test/Makefile @@ -124,6 +124,7 @@ SRCS-y += test_memcpy.c SRCS-y += test_memcpy_perf.c SRCS-y += test_memset.c +SRCS-y += test_memset_perf.c SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash.c SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_thash.c diff --git a/app/test/test_memset_perf.c b/app/test/test_memset_perf.c new file mode 100644 index 0000000..83b15b5 --- /dev/null +++ b/app/test/test_memset_perf.c @@ -0,0 +1,348 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2010-2016 Intel Corporation. All rights reserved. + * All rights reserved.
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#include <stdint.h> +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <rte_common.h> +#include <rte_cycles.h> +#include <rte_random.h> +#include <rte_malloc.h> +#include <rte_memset.h> +#include "test.h" + +/* + * Set this to the maximum buffer size you want to test. If it is 0, then the + * values in the buf_sizes[] array below will be used. 
+ */ +#define TEST_VALUE_RANGE 0 + +/* List of buffer sizes to test */ +#if TEST_VALUE_RANGE == 0 +static size_t buf_sizes[] = { + 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, + 70, 85, 96, 105, 115, 127, 128, 129, 161, 191, 192, 193, 255, 256, + 257, 319, 320, 321, 383, 384, 385, 447, 448, 449, 511, 512, 513, + 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600, 2048, 2560, + 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192 +}; +/* MUST be as large as largest packet size above */ +#define SMALL_BUFFER_SIZE 8192 +#else /* TEST_VALUE_RANGE != 0 */ +static size_t buf_sizes[TEST_VALUE_RANGE]; +#define SMALL_BUFFER_SIZE TEST_VALUE_RANGE +#endif /* TEST_VALUE_RANGE == 0 */ + +/* + * Arrays of this size are used for measuring uncached memory accesses by + * picking a random location within the buffer. Make this smaller if there are + * memory allocation errors. + */ +#define LARGE_BUFFER_SIZE (100 * 1024 * 1024) + +/* How many times to run timing loop for performance tests */ +#define TEST_ITERATIONS 1000000 +#define TEST_BATCH_SIZE 100 + +/* Data is aligned on this many bytes (power of 2) */ +#ifdef RTE_MACHINE_CPUFLAG_AVX512F +#define ALIGNMENT_UNIT 64 +#elif defined RTE_MACHINE_CPUFLAG_AVX2 +#define ALIGNMENT_UNIT 32 +#else /* RTE_MACHINE_CPUFLAG */ +#define ALIGNMENT_UNIT 16 +#endif /* RTE_MACHINE_CPUFLAG */ + +/* + * Pointers used in performance tests. The two large buffers are for uncached + * access where random addresses within the buffer are used for each + * memset. The two small buffers are for cached access. + */ +static uint8_t *large_buf_read, *large_buf_write; +static uint8_t *small_buf_read, *small_buf_write; + +/* Initialise data buffers. 
*/ +static int +init_buffers(void) +{ + unsigned int i; + + large_buf_read = rte_malloc("memset", LARGE_BUFFER_SIZE + + ALIGNMENT_UNIT, ALIGNMENT_UNIT); + if (large_buf_read == NULL) + goto error_large_buf_read; + + large_buf_write = rte_malloc("memset", LARGE_BUFFER_SIZE + + ALIGNMENT_UNIT, ALIGNMENT_UNIT); + if (large_buf_write == NULL) + goto error_large_buf_write; + + small_buf_read = rte_malloc("memset", SMALL_BUFFER_SIZE + + ALIGNMENT_UNIT, ALIGNMENT_UNIT); + if (small_buf_read == NULL) + goto error_small_buf_read; + + small_buf_write = rte_malloc("memset", SMALL_BUFFER_SIZE + + ALIGNMENT_UNIT, ALIGNMENT_UNIT); + if (small_buf_write == NULL) + goto error_small_buf_write; + + for (i = 0; i < LARGE_BUFFER_SIZE; i++) + large_buf_read[i] = rte_rand(); + for (i = 0; i < SMALL_BUFFER_SIZE; i++) + small_buf_read[i] = rte_rand(); + + return 0; + +error_small_buf_write: + rte_free(small_buf_read); +error_small_buf_read: + rte_free(large_buf_write); +error_large_buf_write: + rte_free(large_buf_read); +error_large_buf_read: + printf("ERROR: not enough memory\n"); + return -1; +} + +/* Cleanup data buffers */ +static void +free_buffers(void) +{ + rte_free(large_buf_read); + rte_free(large_buf_write); + rte_free(small_buf_read); + rte_free(small_buf_write); +} + +/* + * Get a random offset into large array, with enough space needed to perform + * max memset size. Offset is aligned, uoffset is used for unalignment setting. + */ +static inline size_t +get_rand_offset(size_t uoffset) +{ + return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) & + ~(ALIGNMENT_UNIT - 1)) + uoffset; +} + +/* Fill in destination addresses. */ +static inline void +fill_addr_arrays(size_t *dst_addr, int is_dst_cached, size_t dst_uoffset) +{ + unsigned int i; + + for (i = 0; i < TEST_BATCH_SIZE; i++) + dst_addr[i] = (is_dst_cached) ? 
dst_uoffset :
+			get_rand_offset(dst_uoffset);
+}
+
+/*
+ * WORKAROUND: For some reason the first test doing an uncached write
+ * takes a very long time (~25 times longer than is expected). So we do
+ * it once without timing.
+ */
+static void
+do_uncached_write(uint8_t *dst, int is_dst_cached, size_t size)
+{
+	unsigned int i, j;
+	size_t dst_addrs[TEST_BATCH_SIZE];
+	int ch = rte_rand() & 0xff;
+
+	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
+		fill_addr_arrays(dst_addrs, is_dst_cached, 0);
+		for (j = 0; j < TEST_BATCH_SIZE; j++)
+			rte_memset(dst+dst_addrs[j], ch, size);
+	}
+}
+
+/*
+ * Run a single memset performance test. This is a macro to ensure that if
+ * the "size" parameter is a constant it won't be converted to a variable.
+ */
+#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset, size) \
+do { \
+	unsigned int iter, t; \
+	size_t dst_addrs[TEST_BATCH_SIZE]; \
+	uint64_t start_time, total_time = 0; \
+	uint64_t total_time2 = 0; \
+	int ch = rte_rand() & 0xff; \
+	\
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {\
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset); \
+		start_time = rte_rdtsc(); \
+		for (t = 0; t < TEST_BATCH_SIZE; t++) \
+			rte_memset(dst+dst_addrs[t], ch, size); \
+		total_time += rte_rdtsc() - start_time; \
+	} \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {\
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset); \
+		start_time = rte_rdtsc(); \
+		for (t = 0; t < TEST_BATCH_SIZE; t++) \
+			memset(dst+dst_addrs[t], ch, size); \
+		total_time2 += rte_rdtsc() - start_time; \
+	} \
+	printf("%8.0f -", (double)total_time / TEST_ITERATIONS); \
+	printf("%5.0f", (double)total_time2 / TEST_ITERATIONS); \
+} while (0)
+
+/* Run aligned memset tests.
+ */
+#define ALL_PERF_TESTS_FOR_SIZE(n) \
+do { \
+	if (__builtin_constant_p(n)) \
+		printf("\nC%6u", (unsigned int)n); \
+	else \
+		printf("\n%7u", (unsigned int)n); \
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, n); \
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, n); \
+} while (0)
+
+/* Run unaligned memset tests */
+#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n) \
+do { \
+	if (__builtin_constant_p(n)) \
+		printf("\nC%6u", (unsigned int)n); \
+	else \
+		printf("\n%7u", (unsigned int)n); \
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, n); \
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, n); \
+} while (0)
+
+/* Run memset tests for constant length */
+#define ALL_PERF_TEST_FOR_CONSTANT \
+do { \
+	TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U); \
+	TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U); \
+	TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U); \
+} while (0)
+
+/* Run all memset tests for aligned constant cases */
+static inline void
+perf_test_constant_aligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE
+	ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memset tests for unaligned constant cases */
+static inline void
+perf_test_constant_unaligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE_UNALIGNED
+	ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memset tests for aligned variable cases */
+static inline void
+perf_test_variable_aligned(void)
+{
+	unsigned int n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
+}
+
+/* Run all memset tests for unaligned variable cases */
+static inline void
+perf_test_variable_unaligned(void)
+{
+	unsigned int n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
+}
+
+/* Run all memset tests */
+static int
+perf_test(void)
+{
+	int ret;
+
+	ret = init_buffers();
+	if (ret != 0)
+		return ret;
+
+#if TEST_VALUE_RANGE != 0
+	/* Set up buf_sizes array, if required */
+	unsigned int i;
+
+	for (i = 0; i < TEST_VALUE_RANGE; i++)
+		buf_sizes[i] = i;
+#endif
+
+	/* See function comment */
+	do_uncached_write(large_buf_write, 0, SMALL_BUFFER_SIZE);
+
+	printf("\n** rte_memset() - memset perf tests \t\n \
+	       (C = compile-time constant) **\n"
+	       "======== ======= ======== ======= ========\n"
+	       "   Size  memset in cache  memset in mem\n"
+	       "(bytes)         (ticks)        (ticks)\n"
+	       "------- -------------- ---------------");
+
+	printf("\n============= %2dB aligned ================", ALIGNMENT_UNIT);
+	/* Do aligned tests where size is a variable */
+	perf_test_variable_aligned();
+	printf("\n------ -------------- -------------- ------");
+	/* Do aligned tests where size is a compile-time constant */
+	perf_test_constant_aligned();
+	printf("\n============= Unaligned ===================");
+	/* Do unaligned tests where size is a variable */
+	perf_test_variable_unaligned();
+	printf("\n------ -------------- -------------- ------");
+	/* Do unaligned tests where size is a compile-time constant */
+	perf_test_constant_unaligned();
+	printf("\n====== ============== ============== =======\n\n");
+
+	free_buffers();
+
+	return 0;
+}
+
+static int
+test_memset_perf(void)
+{
+	int ret;
+
+	ret = perf_test();
+	if (ret != 0)
+		return -1;
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(memset_perf_autotest, test_memset_perf);
--
2.7.4
* [dpdk-dev] [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset
  2016-12-05 8:26 [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
                  ` (3 preceding siblings ...)
  2016-12-05 8:26 ` [dpdk-dev] [PATCH 3/4] app/test: add performance " Zhiyong Yang
@ 2016-12-05 8:26 ` Zhiyong Yang
  2016-12-02 9:46   ` Thomas Monjalon
  4 siblings, 1 reply; 44+ messages in thread
From: Zhiyong Yang @ 2016-12-05 8:26 UTC (permalink / raw)
To: dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev, Zhiyong Yang

Using rte_memset instead of copy_virtio_net_hdr can bring 3%~4% performance
improvements on IA platform from virtio/vhost non-mergeable loopback testing.

Two key points have been considered:
1. One variable initialization could be saved, which involves memory store.
2. copy_virtio_net_hdr involves both load (from stack, the virtio_hdr var)
   and store (to virtio driver memory), while rte_memset just involves store.

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---
 doc/guides/rel_notes/release_17_02.rst | 11 +++++++++++
 lib/librte_vhost/virtio_net.c          | 18 +++++++++++-------
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/doc/guides/rel_notes/release_17_02.rst b/doc/guides/rel_notes/release_17_02.rst
index 3b65038..eecf857 100644
--- a/doc/guides/rel_notes/release_17_02.rst
+++ b/doc/guides/rel_notes/release_17_02.rst
@@ -38,6 +38,17 @@ New Features
      Also, make sure to start the actual text at the margin.
      =========================================================

+* **Introduced rte_memset and related test on IA platform.**
+
+  Performance drop had been caused in some cases on Ivybridge when DPDK code calls glibc
+  function memset. It was necessary to introduce more high efficient function to fix it.
+  The function rte_memset supported three types of instruction sets including sse & avx(128 bits),
+  avx2(256 bits) and avx512(512bits).
+
+  * Added rte_memset support on IA platform.
+  * Added functional autotest support for rte_memset.
+  * Added performance autotest support for rte_memset.
+  * Improved performance to use rte_memset instead of copy_virtio_net_hdr in lib/librte_vhost.

 Resolved Issues
 ---------------

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 595f67c..392b31b 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -37,6 +37,7 @@

 #include <rte_mbuf.h>
 #include <rte_memcpy.h>
+#include <rte_memset.h>
 #include <rte_ether.h>
 #include <rte_ip.h>
 #include <rte_virtio_net.h>
@@ -194,7 +195,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs,
 	uint32_t cpy_len;
 	struct vring_desc *desc;
 	uint64_t desc_addr;
-	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
+	struct virtio_net_hdr *virtio_hdr;

 	desc = &descs[desc_idx];
 	desc_addr = gpa_to_vva(dev, desc->addr);
@@ -208,8 +209,9 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs,

 	rte_prefetch0((void *)(uintptr_t)desc_addr);

-	virtio_enqueue_offload(m, &virtio_hdr.hdr);
-	copy_virtio_net_hdr(dev, desc_addr, virtio_hdr);
+	virtio_hdr = (struct virtio_net_hdr *)(uintptr_t)desc_addr;
+	rte_memset(virtio_hdr, 0, sizeof(*virtio_hdr));
+	virtio_enqueue_offload(m, virtio_hdr);
 	vhost_log_write(dev, desc->addr, dev->vhost_hlen);
 	PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);

@@ -459,7 +461,6 @@ static inline int __attribute__((always_inline))
 copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
 			    struct buf_vector *buf_vec, uint16_t num_buffers)
 {
-	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
 	uint32_t vec_idx = 0;
 	uint64_t desc_addr;
 	uint32_t mbuf_offset, mbuf_avail;
@@ -480,7 +481,6 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
 	hdr_phys_addr = buf_vec[vec_idx].buf_addr;
 	rte_prefetch0((void *)(uintptr_t)hdr_addr);

-	virtio_hdr.num_buffers = num_buffers;
 	LOG_DEBUG(VHOST_DATA, "(%d) RX: num merge buffers %d\n",
 		dev->vid, num_buffers);

@@ -512,8 +512,12 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
 	}

 	if (hdr_addr) {
-		virtio_enqueue_offload(hdr_mbuf, &virtio_hdr.hdr);
-		copy_virtio_net_hdr(dev, hdr_addr, virtio_hdr);
+		struct virtio_net_hdr_mrg_rxbuf *hdr =
+			(struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr;
+
+		rte_memset(&(hdr->hdr), 0, sizeof(hdr->hdr));
+		hdr->num_buffers = num_buffers;
+		virtio_enqueue_offload(hdr_mbuf, &(hdr->hdr));
 		vhost_log_write(dev, hdr_phys_addr, dev->vhost_hlen);
 		PRINT_PACKET(dev, (uintptr_t)hdr_addr, dev->vhost_hlen, 0);

--
2.7.4
* Re: [dpdk-dev] [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset
  2016-12-05 8:26 ` [dpdk-dev] [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang
@ 2016-12-02 9:46   ` Thomas Monjalon
  2016-12-06 8:04     ` Yang, Zhiyong
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Monjalon @ 2016-12-02 9:46 UTC (permalink / raw)
To: Zhiyong Yang; +Cc: dev, yuanhan.liu, bruce.richardson, konstantin.ananyev

2016-12-05 16:26, Zhiyong Yang:
> +* **Introduced rte_memset and related test on IA platform.**
> +
> +  Performance drop had been caused in some cases on Ivybridge when DPDK code calls glibc
> +  function memset. It was necessary to introduce more high efficient function to fix it.
> +  The function rte_memset supported three types of instruction sets including sse & avx(128 bits),
> +  avx2(256 bits) and avx512(512bits).
> +
> +  * Added rte_memset support on IA platform.
> +  * Added functional autotest support for rte_memset.
> +  * Added performance autotest support for rte_memset.

No need to reference autotests in the release notes.

> +  * Improved performance to use rte_memset instead of copy_virtio_net_hdr in lib/librte_vhost.

Please describe this change at a higher level. Which case is it improving?
* Re: [dpdk-dev] [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset
  2016-12-02 9:46 ` Thomas Monjalon
@ 2016-12-06 8:04   ` Yang, Zhiyong
  0 siblings, 0 replies; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-06 8:04 UTC (permalink / raw)
To: Thomas Monjalon; +Cc: dev, yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin

Hi, Thomas:

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Friday, December 2, 2016 5:46 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 4/4] lib/librte_vhost: improve vhost perf
> using rte_memset
>
> 2016-12-05 16:26, Zhiyong Yang:
> > +* **Introduced rte_memset and related test on IA platform.**
> > +
> > +  Performance drop had been caused in some cases on Ivybridge when
> > +  DPDK code calls glibc function memset. It was necessary to introduce
> more high efficient function to fix it.
> > +  The function rte_memset supported three types of instruction sets
> > +  including sse & avx(128 bits),
> > +  avx2(256 bits) and avx512(512bits).
> > +
> > +  * Added rte_memset support on IA platform.
> > +  * Added functional autotest support for rte_memset.
> > +  * Added performance autotest support for rte_memset.
>
> No need to reference autotests in the release notes.

Ok. I will remove the two lines.

>
> > +  * Improved performance to use rte_memset instead of
> copy_virtio_net_hdr in lib/librte_vhost.
>
> Please describe this change at a higher level. Which case is it improving?

Ok, good comments.

* Improved performance to get 3% or so perf improvement on IA platform by
  using rte_memset when running virtio/vhost non-mergeable loopback test
  without NIC.

Thanks
Zhiyong