From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by dpdk.org (Postfix) with ESMTP id 0D5B11B1C7 for ; Wed, 20 Dec 2017 19:23:50 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 20 Dec 2017 10:23:48 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,432,1508828400"; d="scan'208";a="160477277" Received: from irsmsx153.ger.corp.intel.com ([163.33.192.75]) by orsmga004.jf.intel.com with ESMTP; 20 Dec 2017 10:23:47 -0800 Received: from irsmsx103.ger.corp.intel.com ([169.254.3.49]) by IRSMSX153.ger.corp.intel.com ([169.254.9.34]) with mapi id 14.03.0319.002; Wed, 20 Dec 2017 18:23:46 +0000 From: "Ananyev, Konstantin" To: Jerin Jacob CC: "dev@dpdk.org" , "shahafs@mellanox.com" Thread-Topic: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer Thread-Index: AQHTarRqpnKgXxM3E062tX4SPFnQpqNCWb6AgAB88LCABpE+gIABUw6ggAE32gCAAIr/QA== Date: Wed, 20 Dec 2017 18:23:46 +0000 Message-ID: <2601191342CEEE43887BDE71AB9772588081EC25@IRSMSX103.ger.corp.intel.com> References: <1512140112-13067-1-git-send-email-konstantin.ananyev@intel.com> <20171214045408.GA843@jerin> <2601191342CEEE43887BDE71AB9772585FAC98BC@irsmsx105.ger.corp.intel.com> <20171218163842.GA19052@jerin> <2601191342CEEE43887BDE71AB9772585FACB5CA@irsmsx105.ger.corp.intel.com> <20171220072823.GA13961@jerin> In-Reply-To: <20171220072823.GA13961@jerin> Accept-Language: en-IE, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiZGNmZGI2NGMtNjZmZS00MzdlLWE4MzEtNDNkOTc2YTJjZDdjIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE2LjUuOS4zIiwiVHJ1c3RlZExhYmVsSGFzaCI6InNXNlhvNjF1cWNnOXVhMzdcL04xMEVxSUtId1hWNE1nOUNpWkxIZEFWT2JzPSJ9 x-ctpclassification: CTP_IC dlp-product: dlpe-windows dlp-version: 11.0.0.116 dlp-reaction: no-action x-originating-ip: [163.33.239.182] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Dec 2017 18:23:51 -0000 Hi Jerin, >=20 > Hi Konstantin, >=20 > > > > > > > > > > > > 2. Make it safe to remove rx/tx callback at runtime. > > > > > > Right now it is not possible for the application to figure out > > > > > > when it is safe to free removed callback handle and > > > > > > associated with it resources(unless the queue is stopped). > > > > > > That's probably not a big problem if all callbacks are static > > > > > > hange through whole application lifetime) > > > > > > and/or application doesn't allocate any resources for the callb= ack handler. > > > > > > Though if callbacks have to be added/removed dynamically and > > > > > > callback handler would require more resources to operate proper= ly - > > > > > > then it might become an issue. > > > > > > So patch #2 fixes that problem - now as soon as > > > > > > rte_eth_remove_(rx|tx)_callback() completes successfully, appli= cation > > > > > > can safely free all associated with the removed callback resour= ces. > > > > > > > > > > > > Performance impact: > > > > > > If application doesn't use RX/TX callbacks, then the tests I ru= n didn't > > > > > > reveal any performance degradation. > > > > > > Though if application do use RX/TX callbacks - patch #2 does in= troduce > > > > > > some slowdown. > > > > > > > > > > > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X= 520-4) > > > > > > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed= I got: > > > > > > 1) testpmd ... --latencystats=3D1 - slowdown < 1% > > > > > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1% > > > > > > 3) examples/rxtx_callbacks - slowdown ~8% > > > > > > All that in terms of packet throughput (Mpps). > > > > > > > > > > We tried on an arm64 machine; We got following result on host and= guest > > > > > using l3fwd. > > > > > > > > Thanks for testing it. > > > > So these numbers below - are for l3fwd running with or without rx c= allback installed? > > > > i.e. with or without '--parse-ptype' option (and on a NIC with/with= out ptype HW support)? > > > > > > without '--parse-ptype' option(i.e without rx callback installed). > > > That makes it sad, Even though application is not using the rx callba= ck, > > > We need to pay on a average 3% drop in performance(based on the > > > use case/arch/SoC/topology etc) > > > > Yes it is sad... > > Do you have any idea why that happens? > > Is that the same situation you described before: for arm compiler > > often replaces small branches with conditional execution instructions > > and miss-prediction for such instructions is not free? >=20 > Yes. Probably a dumb question - is it possible to disable such optimization for these particular functions (rx_burst/tx_burst)? Or would it make things even worse? >=20 > Another thing I noticed on these inline functions(rx/tx burst)even if we = add > a lot of code which may note be executed, based on the parent > function layout, the compete register access layout gets change(like > push/pop from stack etc, as more code gets added). > I.e This change does not have much impact with testpmd(iofwd) where the > parent function is not doing much thing.It is showing up only on l3fwd. Ok sort of register spillage... But if compiler does it unconditionally (for the code path when it never ne= eded) seems like a problem with the compile, no? Wonder if we'll move at least RX/TX CBs invocation into a separate function= - would it help somehow? I.E. if (q->cb !=3D NULL) execut_cbs(...); ? Another thought - as I remember a while ago Bruce and other lads suggested = to move rx_burst/tx)burst (and whole struct rte_eth_dev) definitions into rte_ethde= v.c... That would help to avoid such problems, again we'll be able to modify rte_e= th_dev without breaking ABI... Though it would mean extra call overhead, so don't know of-hand how expensi= ve it could be. >=20 > > Or is it something else? > > > > Another thing - the series contains 2 main patches: > > 1) http://dpdk.org/dev/patchwork/patch/31866/ > > Introduces eth_queue_local and moves callback data into it. > > That adds one extra memory reference for rx/tx function to read cb func= and data. > > 2) http://dpdk.org/dev/patchwork/patch/31867/ > > Introduces some synchronization for cb data. > > > > Is it possible to rerun your tests with just patch #1 installed? > > Hopefully it would help us to figure out which of the patches (#1 or #2= ) > > is the main source of slowdown on arm machines. >=20 > I re ran the test as requested: > With first patch - 1.17% regression > With first patch + Second patch - 2.84% regression Thanks >=20 > I am fine with slowdown, but it should be effected only with those > application which request the callback feature. Ok, but you do realize that if we'll go with rx/tx function per queue, patch #1 slowdown (1.17%) will probably reappear anyway?=20 >=20 > > > > > > > > > > > > > > > > > > > > > > > > Note: > > > > > +ve number means "Performance regression with patches" > > > > > -ve number means "Performance improvement with patches" > > > > > > > > > > Relative performance difference in percentage: > > > > > > > > > > % difference on host (2 x 40G) > > > > > pkt_sz 1c 2c 4c 8c > > > > > 64 1.41 0.18 0.51 0.29 > > > > > 72 0.50 0.53 0.09 0.19 > > > > > 128 0.31 0.31 0.42 0.00 > > > > > 256 -0.44 -0.44 0.00 0.00 > > > > > 512 1.94 1.94 0.01 0.01 > > > > > 1024 0.00 0.01 0.00 0.01 > > > > > 1518 0.02 0.01 0.02 0.02 > > > > > > > > > > % difference on guest (2 x 40G) > > > > > pkt_sz 1c 2c 4c 8c > > > > > 64 5.78 2.30 -2.45 -0.01 > > > > > 72 1.13 1.13 1.31 -4.29 > > > > > 128 1.36 1.36 1.54 -0.82 > > > > > 256 2.02 2.03 -0.01 -0.35 > > > > > 512 4.05 4.04 0.00 -0.46 > > > > > 1024 0.39 0.38 0.00 -0.74 > > > > > 1518 0.00 0.00 -0.05 -4.20 > > > > > > > > > > I think, it is because we added more code under > > > > > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by > > > > > default. > > > > > > > > > > I think, the impact will vary based on > > > > > architecture and underneath i-cache, d-cache and IPC. > > > > > > > > > > I am just thinking, How we can avoid such impact? > > > > > How about, > > > > > # Remove RTE_ETHDEV_RXTX_CALLBACKS option > > > > > # Hook Rx/TX callbacks under TX/RX offload flags > > > > > # ethdev layer gives helper function to invoke > > > > > callbacks to driver. > > > > > > > > I suppose you suggest that user would have to decide at queue_setup= () > > > > stage does he wants a callback(s) for RX/TX or no? > > > > > > Yes. queue_setup or configure() time. > > > > > > > As I can see, the main drawback with that approach - it would intro= duce new limitation. > > > > Right now it is possible to add/remove RX/TX callback at runtime > > > > (without stopping and reconfiguring the queue), and I suppose we do= want to keep that ability. > > > > > > For those _application_ where it can not decide or it need to add wit= hout > > > stopping and re configuring the queue still can configure/enable the = ethdev > > > device with RX/TX callback option(and pay price for it) > > > > > > > Again in some cases user can't decide does he wants a callback inst= alled or not, > > > > before he actually calls dev_start(). > > > > l3fwd ptype-parse callback is an example of such situation: > > > > right now user has to start device first to figure out which ptypes= are supported by this PMD > > > > with desired configuration, then based on this information he will = (or will not) setup the callback. > > > > > > Good point. Can we remove that limitation from PMD perspective? IMO, = ptype-parse is more of HW feature, > > > All we need to do a lookup(single load) to convert to DPDK packet typ= e. > > > http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.c#n2= 58 > > > > > > I see almost all the drivers has dummy check like this, I think, That= makes it > > > dependency with start(). > > > > > > if (dev->rx_pkt_burst =3D=3D i40e_recv_pkts || > > > dev->rx_pkt_burst =3D=3D i40e_recv_pkts_bulk_alloc || > > > dev->rx_pkt_burst =3D=3D i40e_recv_scattered_pkts || > > > dev->rx_pkt_burst =3D=3D i40e_recv_scattered_pkts_vec || > > > dev->rx_pkt_burst =3D=3D i40e_recv_pkts_vec) > > > return ptypes; > > > > > > > We probably can, but it would require extra work and changes inside the= PMDs. >=20 > Yes, If we want do it in clean way. Another intermediate solution could > be return some valid error code on those PMD. On such error code, applica= tion can > start the device and get the supported ptype and then stop/reconfigure it= again >=20 > > i40e is a good one - all rx functions support the same ptypes. > > But let say for ixgbe ARM version of vRX doesn't support all ptypes tha= t x86 vRX > > and generic scalar RX do. > > For fm10k - vector RX and scalar RX also support different ptype sets. >=20 > > > > > > > > > What I thought in same direction - introduce a notion of 'static' a= nd 'dynamic' callbacks. > > > > Static ones can't be removed while queue is not stopped while dynam= ic can. > > > > But that also requires user to select at queue_setup() stage does h= e wants dynamic > > > > callbacks for that queue or not (new offload flag or extra field in= rxconf/txconf structure). > > > > So again - same limitation for the user. > > > > Another possibility here - have a flag 'dynamic' or 'static' not pe= r queue, but per individual callback. > > > > Yes, that would mean API change for add/remove callback functions. > > > > > > At least on my test, the regression was caused because adding the cod= e in > > > tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which enabl= ed > > > by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I don= 't > > > see any regression. If so, Will such regression fixed by introducing > > > 'static' or 'dynamic' callbacks? I am happy to test and provide feedb= ack. > > > > Probably no then, but let's try the experiment I mentioned above first. > > > > > > > > For me, I think, The better option is request from application for th= e > > > need for Rx/TX callbacks and set the handler appropriate in the drive= r to > > > have zero impact for the application which does not need callback at = all. > > > > It is possible, but: > > - that would be quite significant change and would affect each RX and T= X function in each PMD. >=20 > May be not, assume if we have helper function which invoke the callbacks= . > PMD can just call that before returning it. Even with helpers it means we'll have to double number of RX/TX functions - one with a CB support, another without. =20 >=20 > > - from other side - even now there are several libs inside DPDK that re= ly on ability to add/remove > > callbacks on the fly - pdump, latencystats. > I think, it is application who decides to use this API,then I think, > application can request callback feature support. I think, we have > followed the same method in latest RX/TX offload flag scheme as well. >=20 > > ethdev rx profiling also relies on ability to install RX callback. > > > > With all these things I presume most apps will configure their devices = with rx/tx callback support enabled anyway. > > Which makes me think - is it worth to introduce such changes at all. > > Anyway, my first suggestion would be let's try to figure out what exact= ly causes such big slowdown on ARM. >=20 > It may not be ARM in general, Under our roof, We have different class of = SoCs based > on the use case like 4 cores, 16 core, 96 cores part etc with different > micro architecture etc based on the use case(embedded/server etc) like > Intel. So it may not very generic with all ARM machines, it is really > based on micro architecture. ie. The same patch may behave > differently(less than < 1%) on IA-Atom vs IA-XEON. So IMO, Common code > change should use helper functions, it up to the driver to use it. Konstantin