From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga01.intel.com (mga01.intel.com [192.55.52.88]) by dpdk.org (Postfix) with ESMTP id AEB6F3250 for ; Mon, 15 Jan 2018 17:29:33 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga101.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 15 Jan 2018 08:29:32 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.46,364,1511856000"; d="scan'208";a="19894036" Received: from irsmsx152.ger.corp.intel.com ([163.33.192.66]) by FMSMGA003.fm.intel.com with ESMTP; 15 Jan 2018 08:29:31 -0800 Received: from irsmsx156.ger.corp.intel.com (10.108.20.68) by IRSMSX152.ger.corp.intel.com (163.33.192.66) with Microsoft SMTP Server (TLS) id 14.3.319.2; Mon, 15 Jan 2018 16:29:30 +0000 Received: from irsmsx105.ger.corp.intel.com ([169.254.7.236]) by IRSMSX156.ger.corp.intel.com ([169.254.3.33]) with mapi id 14.03.0319.002; Mon, 15 Jan 2018 16:29:30 +0000 From: "Ananyev, Konstantin" To: Jerin Jacob CC: "dev@dpdk.org" , "shahafs@mellanox.com" Thread-Topic: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer Thread-Index: AQHTarRqpnKgXxM3E062tX4SPFnQpqNCWb6AgAB88LCABpE+gIABUw6ggAE32gCAAIr/QIAXJ9KAgBG9qmA= Date: Mon, 15 Jan 2018 16:29:29 +0000 Message-ID: <2601191342CEEE43887BDE71AB9772588627DDB3@irsmsx105.ger.corp.intel.com> References: <1512140112-13067-1-git-send-email-konstantin.ananyev@intel.com> <20171214045408.GA843@jerin> <2601191342CEEE43887BDE71AB9772585FAC98BC@irsmsx105.ger.corp.intel.com> <20171218163842.GA19052@jerin> <2601191342CEEE43887BDE71AB9772585FACB5CA@irsmsx105.ger.corp.intel.com> <20171220072823.GA13961@jerin> <2601191342CEEE43887BDE71AB9772588081EC25@IRSMSX103.ger.corp.intel.com> <20180104092221.GA24593@jerin> In-Reply-To: <20180104092221.GA24593@jerin> Accept-Language: en-IE, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiNGZlMGE1NzEtNzkyYS00MDI5LWE4ZjAtYzMzNTQ3ZDE0NjAxIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX05UIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE2LjUuOS4zIiwiVHJ1c3RlZExhYmVsSGFzaCI6IjZoTTNiVzFQOWRrQ2pZZjhaMWxsVlkrditMQ2ZpMnVic0lWcnE0Z0NBdTQ9In0= x-ctpclassification: CTP_NT dlp-product: dlpe-windows dlp-version: 11.0.0.116 dlp-reaction: no-action x-originating-ip: [163.33.239.182] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Jan 2018 16:29:34 -0000 Hi Jerin, >=20 > Hi Konstantin, >=20 > > > > > > > > 2. Make it safe to remove rx/tx callback at runtime. > > > > > > > > Right now it is not possible for the application to figure = out > > > > > > > > when it is safe to free removed callback handle and > > > > > > > > associated with it resources(unless the queue is stopped). > > > > > > > > That's probably not a big problem if all callbacks are stat= ic > > > > > > > > hange through whole application lifetime) > > > > > > > > and/or application doesn't allocate any resources for the c= allback handler. > > > > > > > > Though if callbacks have to be added/removed dynamically an= d > > > > > > > > callback handler would require more resources to operate pr= operly - > > > > > > > > then it might become an issue. > > > > > > > > So patch #2 fixes that problem - now as soon as > > > > > > > > rte_eth_remove_(rx|tx)_callback() completes successfully, a= pplication > > > > > > > > can safely free all associated with the removed callback re= sources. > > > > > > > > > > > > > > > > Performance impact: > > > > > > > > If application doesn't use RX/TX callbacks, then the tests = I run didn't > > > > > > > > reveal any performance degradation. > > > > > > > > Though if application do use RX/TX callbacks - patch #2 doe= s introduce > > > > > > > > some slowdown. > > > > > > > > > > > > > > > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10G= b (X520-4) > > > > > > > > with http://dpdk.org/dev/patchwork/patch/31864/ patch insta= lled I got: > > > > > > > > 1) testpmd ... --latencystats=3D1 - slowdown < 1% > > > > > > > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1% > > > > > > > > 3) examples/rxtx_callbacks - slowdown ~8% > > > > > > > > All that in terms of packet throughput (Mpps). > > > > > > > > > > > > > > We tried on an arm64 machine; We got following result on host= and guest > > > > > > > using l3fwd. > > > > > > > > > > > > Thanks for testing it. > > > > > > So these numbers below - are for l3fwd running with or without = rx callback installed? > > > > > > i.e. with or without '--parse-ptype' option (and on a NIC with/= without ptype HW support)? > > > > > > > > > > without '--parse-ptype' option(i.e without rx callback installed)= . > > > > > That makes it sad, Even though application is not using the rx ca= llback, > > > > > We need to pay on a average 3% drop in performance(based on the > > > > > use case/arch/SoC/topology etc) > > > > > > > > Yes it is sad... > > > > Do you have any idea why that happens? > > > > Is that the same situation you described before: for arm compiler > > > > often replaces small branches with conditional execution instructio= ns > > > > and miss-prediction for such instructions is not free? > > > > > > Yes. > > > > Probably a dumb question - is it possible to disable such optimization > > for these particular functions (rx_burst/tx_burst)? > > Or would it make things even worse? > > > > > > > > Another thing I noticed on these inline functions(rx/tx burst)even if= we add > > > a lot of code which may note be executed, based on the parent > > > function layout, the compete register access layout gets change(like > > > push/pop from stack etc, as more code gets added). > > > I.e This change does not have much impact with testpmd(iofwd) where t= he > > > parent function is not doing much thing.It is showing up only on l3fw= d. > > > > Ok sort of register spillage... > > But if compiler does it unconditionally (for the code path when it neve= r needed) > > seems like a problem with the compile, no? >=20 > I think, it will be depended on parent function usage. >=20 > > Wonder if we'll move at least RX/TX CBs invocation into a separate func= tion - > > would it help somehow? > > I.E. if (q->cb !=3D NULL) execut_cbs(...); > > ? >=20 > Yes.I tried to move both (Rx, Tx) execut_cbs as separate in .c func ton. > That is helping the regression in some extended(Reduced from 2.84% to 1.2= 7%) >=20 > example: > +uint16_t > +__rte_eth_execut_tx_cbs(uint16_t port_id, uint16_t queue_id, > + struct rte_mbuf **pkts, uint16_t nb_pkts, > + struct rte_eth_queue_local *ql); > + > static inline uint16_t > rte_eth_tx_burst(uint16_t port_id, uint16_t queue_id, > struct rte_mbuf **tx_pkts, uint16_t nb_pkts) > @@ -3256,19 +3259,11 @@ rte_eth_tx_burst(uint16_t port_id, uint16_t > queue_id, > #ifdef RTE_ETHDEV_RXTX_CALLBACKS > { > struct rte_eth_queue_local *ql; > - struct rte_eth_rxtx_callback *cb; >=20 > ql =3D dev->tx_ql + queue_id; > - if (unlikely(ql->cbs.head !=3D NULL)) { > - > - __rte_eth_rxtx_cbs_inuse(&ql->cbs); > - > - for (cb =3D ql->cbs.head; cb !=3D NULL; cb =3D cb= ->next) > - nb_pkts =3D cb->fn.tx(port_id, queue_id, = tx_pkts, > - nb_pkts, cb->param); > - > - __rte_eth_rxtx_cbs_unuse(&ql->cbs); > - } > + if (unlikely(ql->cbs.head !=3D NULL)) > + nb_pkts =3D __rte_eth_execut_tx_cbs(port_id,queue= _id, > + tx_pkts, nb_pkts, ql); > } > #endif >=20 >=20 > > Another thought - as I remember a while ago Bruce and other lads sugges= ted to move > > rx_burst/tx)burst (and whole struct rte_eth_dev) definitions into rte_e= thdev.c... > > That would help to avoid such problems, again we'll be able to modify r= te_eth_dev > > without breaking ABI... > > Though it would mean extra call overhead, so don't know of-hand how exp= ensive > > it could be. >=20 > From my experiments it shows more expensive than above scheme. >=20 > > > > > > > > > Or is it something else? > > > > > > > > Another thing - the series contains 2 main patches: > > > > 1) http://dpdk.org/dev/patchwork/patch/31866/ > > > > Introduces eth_queue_local and moves callback data into it. > > > > That adds one extra memory reference for rx/tx function to read cb = func and data. > > > > 2) http://dpdk.org/dev/patchwork/patch/31867/ > > > > Introduces some synchronization for cb data. > > > > > > > > Is it possible to rerun your tests with just patch #1 installed? > > > > Hopefully it would help us to figure out which of the patches (#1 o= r #2) > > > > is the main source of slowdown on arm machines. > > > > > > I re ran the test as requested: > > > With first patch - 1.17% regression > > > With first patch + Second patch - 2.84% regression > > > > Thanks > > > > > > > > I am fine with slowdown, but it should be effected only with those > > > application which request the callback feature. > > > > Ok, but you do realize that if we'll go with rx/tx function per queue, > > patch #1 slowdown (1.17%) will probably reappear anyway? >=20 > May be not, it will be one more redirection to get the rx/tx function > pointer(may be a additionally load instruction) >=20 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Note: > > > > > > > +ve number means "Performance regression with patches" > > > > > > > -ve number means "Performance improvement with patches" > > > > > > > > > > > > > > Relative performance difference in percentage: > > > > > > > > > > > > > > % difference on host (2 x 40G) > > > > > > > pkt_sz 1c 2c 4c 8c > > > > > > > 64 1.41 0.18 0.51 0.29 > > > > > > > 72 0.50 0.53 0.09 0.19 > > > > > > > 128 0.31 0.31 0.42 0.00 > > > > > > > 256 -0.44 -0.44 0.00 0.00 > > > > > > > 512 1.94 1.94 0.01 0.01 > > > > > > > 1024 0.00 0.01 0.00 0.01 > > > > > > > 1518 0.02 0.01 0.02 0.02 > > > > > > > > > > > > > > % difference on guest (2 x 40G) > > > > > > > pkt_sz 1c 2c 4c 8c > > > > > > > 64 5.78 2.30 -2.45 -0.01 > > > > > > > 72 1.13 1.13 1.31 -4.29 > > > > > > > 128 1.36 1.36 1.54 -0.82 > > > > > > > 256 2.02 2.03 -0.01 -0.35 > > > > > > > 512 4.05 4.04 0.00 -0.46 > > > > > > > 1024 0.39 0.38 0.00 -0.74 > > > > > > > 1518 0.00 0.00 -0.05 -4.20 > > > > > > > > > > > > > > I think, it is because we added more code under > > > > > > > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by > > > > > > > default. > > > > > > > > > > > > > > I think, the impact will vary based on > > > > > > > architecture and underneath i-cache, d-cache and IPC. > > > > > > > > > > > > > > I am just thinking, How we can avoid such impact? > > > > > > > How about, > > > > > > > # Remove RTE_ETHDEV_RXTX_CALLBACKS option > > > > > > > # Hook Rx/TX callbacks under TX/RX offload flags > > > > > > > # ethdev layer gives helper function to invoke > > > > > > > callbacks to driver. > > > > > > > > > > > > I suppose you suggest that user would have to decide at queue_s= etup() > > > > > > stage does he wants a callback(s) for RX/TX or no? > > > > > > > > > > Yes. queue_setup or configure() time. > > > > > > > > > > > As I can see, the main drawback with that approach - it would i= ntroduce new limitation. > > > > > > Right now it is possible to add/remove RX/TX callback at runtim= e > > > > > > (without stopping and reconfiguring the queue), and I suppose w= e do want to keep that ability. > > > > > > > > > > For those _application_ where it can not decide or it need to add= without > > > > > stopping and re configuring the queue still can configure/enable = the ethdev > > > > > device with RX/TX callback option(and pay price for it) > > > > > > > > > > > Again in some cases user can't decide does he wants a callback = installed or not, > > > > > > before he actually calls dev_start(). > > > > > > l3fwd ptype-parse callback is an example of such situation: > > > > > > right now user has to start device first to figure out which pt= ypes are supported by this PMD > > > > > > with desired configuration, then based on this information he w= ill (or will not) setup the callback. > > > > > > > > > > Good point. Can we remove that limitation from PMD perspective? I= MO, ptype-parse is more of HW feature, > > > > > All we need to do a lookup(single load) to convert to DPDK packet= type. > > > > > http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.= c#n258 > > > > > > > > > > I see almost all the drivers has dummy check like this, I think, = That makes it > > > > > dependency with start(). > > > > > > > > > > if (dev->rx_pkt_burst =3D=3D i40e_recv_pkts || > > > > > dev->rx_pkt_burst =3D=3D i40e_recv_pkts_bulk_alloc || > > > > > dev->rx_pkt_burst =3D=3D i40e_recv_scattered_pkts || > > > > > dev->rx_pkt_burst =3D=3D i40e_recv_scattered_pkts_vec= || > > > > > dev->rx_pkt_burst =3D=3D i40e_recv_pkts_vec) > > > > > return ptypes; > > > > > > > > > > > > > We probably can, but it would require extra work and changes inside= the PMDs. > > > > > > Yes, If we want do it in clean way. Another intermediate solution cou= ld > > > be return some valid error code on those PMD. On such error code, app= lication can > > > start the device and get the supported ptype and then stop/reconfigur= e it again > > > > > > > i40e is a good one - all rx functions support the same ptypes. > > > > But let say for ixgbe ARM version of vRX doesn't support all ptypes= that x86 vRX > > > > and generic scalar RX do. > > > > For fm10k - vector RX and scalar RX also support different ptype se= ts. > > > > > > > > > > > > > > > > > > What I thought in same direction - introduce a notion of 'stati= c' and 'dynamic' callbacks. > > > > > > Static ones can't be removed while queue is not stopped while d= ynamic can. > > > > > > But that also requires user to select at queue_setup() stage do= es he wants dynamic > > > > > > callbacks for that queue or not (new offload flag or extra fiel= d in rxconf/txconf structure). > > > > > > So again - same limitation for the user. > > > > > > Another possibility here - have a flag 'dynamic' or 'static' no= t per queue, but per individual callback. > > > > > > Yes, that would mean API change for add/remove callback functio= ns. > > > > > > > > > > At least on my test, the regression was caused because adding the= code in > > > > > tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which e= nabled > > > > > by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I= don't > > > > > see any regression. If so, Will such regression fixed by introduc= ing > > > > > 'static' or 'dynamic' callbacks? I am happy to test and provide f= eedback. > > > > > > > > Probably no then, but let's try the experiment I mentioned above fi= rst. > > > > > > > > > > > > > > For me, I think, The better option is request from application fo= r the > > > > > need for Rx/TX callbacks and set the handler appropriate in the d= river to > > > > > have zero impact for the application which does not need callback= at all. > > > > > > > > It is possible, but: > > > > - that would be quite significant change and would affect each RX a= nd TX function in each PMD. > > > > > > May be not, assume if we have helper function which invoke the callb= acks. > > > PMD can just call that before returning it. > > > > Even with helpers it means we'll have to double number of RX/TX functio= ns - > > one with a CB support, another without. >=20 > IMO, It is up to the PMD driver. It can still choose to enable callbacks > all the time irrespective of input from application(i.e no need for doubl= e > number of RX/TX functions). Good point. >=20 > But it will enable a PMD to choose do optimization if the PMD wants > it. As a NW PMD maintainer, I will be OK to make minimal change(if the > helper function is available) in my PMD, if I get 1-2% > improvement. On the upside, We can remove RTE_ETHDEV_RXTX_CALLBACKS > conditional complication flag. Yes, removing RTE_ETHDEV_RXTX_CALLBACKS would definitely be a good thing. I tried the approach you suggested for ixgbe PMD - on IA there seems no reg= ression. The only thing I don't really happy with it - it would require changes for= every RX/TX function in each PMD. Yes, changes are quite mechanical, but still every PMD will need to be rete= sted.=20 Before going forward would be really interesting to hear other people opini= ons here. Might be there are some other ways to overcome that problem. Thanks Konstantin >=20 >=20 > > > > > > > > > - from other side - even now there are several libs inside DPDK tha= t rely on ability to add/remove > > > > callbacks on the fly - pdump, latencystats. > > > I think, it is application who decides to use this API,then I think, > > > application can request callback feature support. I think, we have > > > followed the same method in latest RX/TX offload flag scheme as well. > > > > > > > ethdev rx profiling also relies on ability to install RX callbac= k. > > > > > > > > With all these things I presume most apps will configure their devi= ces with rx/tx callback support enabled anyway. > > > > Which makes me think - is it worth to introduce such changes at all= . > > > > Anyway, my first suggestion would be let's try to figure out what e= xactly causes such big slowdown on ARM. > > > > > > It may not be ARM in general, Under our roof, We have different class= of SoCs based > > > on the use case like 4 cores, 16 core, 96 cores part etc with differe= nt > > > micro architecture etc based on the use case(embedded/server etc) lik= e > > > Intel. So it may not very generic with all ARM machines, it is really > > > based on micro architecture. ie. The same patch may behave > > > differently(less than < 1%) on IA-Atom vs IA-XEON. So IMO, Common cod= e > > > change should use helper functions, it up to the driver to use it. > > > > Konstantin > >