From: "Ananyev, Konstantin"
To: "Yang, Zhiyong", Thomas Monjalon
Cc: "dev@dpdk.org", "yuanhan.liu@linux.intel.com", "Richardson, Bruce", "De Lara Guarch, Pablo"
Date: Fri, 16 Dec 2016 11:47:37 +0000
Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform

Hi Zhiyong,

> > > > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > > > >
> > > > > static inline void *
> > > > > rte_memset_huge(void *s, int c, size_t n)
> > > > > {
> > > > > 	return __rte_memset_vector(s, c, n);
> > > > > }
> > > > >
> > > > > static inline void *
> > > > > rte_memset(void *s, int c, size_t n)
> > > > > {
> > > > > 	if (n < XXX)
> > > > > 		return rte_memset_scalar(s, c, n);
> > > > > 	else
> > > > > 		return rte_memset_huge(s, c, n);
> > > > > }
> > > > >
> > > > > XXX could be either a define, or could also be a variable, so it
> > > > > can be set up at startup, depending on the architecture.
> > > > >
> > > > > Would that work?
> > > > > Konstantin
> > >
> > > I have implemented the code for choosing the functions at run time.
> > > rte_memcpy is used more frequently, so I tested it at run time.
> > >
> > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> > > extern rte_memcpy_vector_t rte_memcpy_vector;
> > >
> > > static inline void *
> > > rte_memcpy(void *dst, const void *src, size_t n)
> > > {
> > > 	return rte_memcpy_vector(dst, src, n);
> > > }
> > >
> > > In order to reduce the overhead at run time, I assign the function
> > > address to the variable rte_memcpy_vector before main() starts, to
> > > initialize it.
> > >
> > > static void __attribute__((constructor))
> > > rte_memcpy_init(void)
> > > {
> > > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
> > > 		rte_memcpy_vector = rte_memcpy_avx2;
> > > 	} else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > > 		rte_memcpy_vector = rte_memcpy_sse;
> > > 	} else {
> > > 		rte_memcpy_vector = memcpy;
> > > 	}
> > > }
> >
> > I thought we discussed a bit different approach,
> > in which rte_memcpy_vector() (rte_memset_vector) would be called only
> > after some cutoff point, i.e.:
> >
> > void
> > rte_memcpy(void *dst, const void *src, size_t len)
> > {
> > 	if (len < N)
> > 		memcpy(dst, src, len);
> > 	else
> > 		rte_memcpy_vector(dst, src, len);
> > }
> >
> > If you just always call rte_memcpy_vector() for every len, then it means
> > that the compiler most likely always has to generate a proper call (no
> > inlining happening).
> >
> > For small lengths the price of the extra function call would probably
> > outweigh any potential gain from an SSE/AVX2 implementation.
> >
> > Konstantin
>
> Yes, in fact, from my tests, for small lengths rte_memset is far better
> than glibc memset; for large lengths, rte_memset is only a bit better than
> memset, because memset uses AVX2/SSE too. Of course, it will use AVX512 on
> future machines.

Ok, thanks for the clarification.
From previous mails I got a wrong impression that on big lengths
rte_memset_vector() is significantly faster than memset().

> > For small lengths the price of the extra function call would probably
> > outweigh any potential gain.
>
> This is the key point. I think it should include the scalar optimization,
> not only the vector optimization.
>
> The value of rte_memset is that it is always inlined, and for small
> lengths it will be better, since in some cases we are not sure that
> memset is always inlined by the compiler.

Ok, so do you know in what cases memset() does not get inlined?
Is it when the len parameter can't be precomputed by the compiler
(is not a constant)?

So to me it sounds like:
- We don't need an optimized version of rte_memset() for big sizes.
- Which probably means we don't need arch-specific versions of
  rte_memset_vector() at all.
- For small sizes (<= 32B) a scalar version would be good enough.
- For big sizes we can just rely on memset().
Is that so? (A sketch of this combined approach is appended below.)

> It seems that choosing the function at run time will lose the gains.
> The following is tested on Haswell with the patch code.

Not sure what columns 2 and 3 in the table below mean?
Konstantin

> ** rte_memset() - memset perf tests
>    (C = compile-time constant) **
>
> ========  =================  =================
>    Size    memset in cache      memset in mem
> (bytes)            (ticks)            (ticks)
> --------  -----------------  -----------------
> ================ 32B aligned =================
>       3             3 -   8         19 -  128
>       4             4 -   8         13 -  128
>       8             2 -   7         19 -  128
>       9             2 -   7         19 -  127
>      12             2 -   7         19 -  127
>      17             3 -   8         19 -  132
>      64             3 -   8         28 -  168
>     128             7 -  13         54 -  200
>     255             8 -  20        100 -  223
>     511            14 -  20        187 -  314
>    1024            24 -  29        328 -  379
>    8192           198 - 225       1829 - 2193
>
> Thanks
> Zhiyong
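
[Editor's sketch] For illustration, here is a minimal sketch of the combined
approach summarized above: an always-inlined rte_memset() that handles small
sizes with a plain scalar loop and falls back to libc memset() for everything
else. The 32-byte cutoff and the helper name rte_memset_scalar() are
assumptions made for this sketch, not part of the posted patch.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Cutoff below which the scalar path is used; 32 bytes is an assumption. */
#define RTE_MEMSET_SCALAR_MAX 32

/* Simple byte-at-a-time fill; the compiler is free to unroll/vectorize it. */
static inline void *
rte_memset_scalar(void *s, int c, size_t n)
{
	uint8_t *p = (uint8_t *)s;
	size_t i;

	for (i = 0; i != n; i++)
		p[i] = (uint8_t)c;
	return s;
}

/* Always-inlined entry point: scalar loop for small sizes,
 * plain libc memset() for everything else. */
static inline void *
rte_memset(void *s, int c, size_t n)
{
	if (n <= RTE_MEMSET_SCALAR_MAX)
		return rte_memset_scalar(s, c, n);
	return memset(s, c, n);
}

Since both functions are static inline and the cutoff is a compile-time
constant, a caller with a constant small size should see the branch and the
loop folded away, while large sizes keep whatever SSE/AVX2/AVX512 path the
libc memset() already provides.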