From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) by dpdk.org (Postfix) with ESMTP id 894352C18 for ; Tue, 19 Dec 2017 16:27:30 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga007.jf.intel.com ([10.7.209.58]) by fmsmga105.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 07:27:29 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,427,1508828400"; d="scan'208";a="3959674" Received: from irsmsx102.ger.corp.intel.com ([163.33.3.155]) by orsmga007.jf.intel.com with ESMTP; 19 Dec 2017 07:27:27 -0800 Received: from irsmsx105.ger.corp.intel.com ([169.254.7.67]) by IRSMSX102.ger.corp.intel.com ([169.254.2.180]) with mapi id 14.03.0319.002; Tue, 19 Dec 2017 15:27:27 +0000 From: "Ananyev, Konstantin" To: Jerin Jacob CC: "dev@dpdk.org" , "shahafs@mellanox.com" Thread-Topic: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer Thread-Index: AQHTarRqpnKgXxM3E062tX4SPFnQpqNCWb6AgAB88LCABpE+gIABUw6g Date: Tue, 19 Dec 2017 15:27:26 +0000 Message-ID: <2601191342CEEE43887BDE71AB9772585FACB5CA@irsmsx105.ger.corp.intel.com> References: <1512140112-13067-1-git-send-email-konstantin.ananyev@intel.com> <20171214045408.GA843@jerin> <2601191342CEEE43887BDE71AB9772585FAC98BC@irsmsx105.ger.corp.intel.com> <20171218163842.GA19052@jerin> In-Reply-To: <20171218163842.GA19052@jerin> Accept-Language: en-IE, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiM2VhNGY2NzItNjEyNy00OGIwLTk3OTAtMWFiYTA1YWFjY2NkIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE2LjUuOS4zIiwiVHJ1c3RlZExhYmVsSGFzaCI6IjBrcUVLOGd3Q2lGbUgrOVQ2d3hyRFI2UDlvK2R6cjNcL1IxMDRnbnI1dFIwPSJ9 x-ctpclassification: CTP_IC dlp-product: dlpe-windows dlp-version: 11.0.0.116 dlp-reaction: no-action x-originating-ip: [163.33.239.182] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 19 Dec 2017 15:27:31 -0000 Hi Jerin, > > > > > > > > 2. Make it safe to remove rx/tx callback at runtime. > > > > Right now it is not possible for the application to figure out > > > > when it is safe to free removed callback handle and > > > > associated with it resources(unless the queue is stopped). > > > > That's probably not a big problem if all callbacks are static > > > > hange through whole application lifetime) > > > > and/or application doesn't allocate any resources for the callback = handler. > > > > Though if callbacks have to be added/removed dynamically and > > > > callback handler would require more resources to operate properly - > > > > then it might become an issue. > > > > So patch #2 fixes that problem - now as soon as > > > > rte_eth_remove_(rx|tx)_callback() completes successfully, applicati= on > > > > can safely free all associated with the removed callback resources. > > > > > > > > Performance impact: > > > > If application doesn't use RX/TX callbacks, then the tests I run di= dn't > > > > reveal any performance degradation. > > > > Though if application do use RX/TX callbacks - patch #2 does introd= uce > > > > some slowdown. > > > > > > > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-= 4) > > > > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I g= ot: > > > > 1) testpmd ... --latencystats=3D1 - slowdown < 1% > > > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1% > > > > 3) examples/rxtx_callbacks - slowdown ~8% > > > > All that in terms of packet throughput (Mpps). > > > > > > We tried on an arm64 machine; We got following result on host and gue= st > > > using l3fwd. > > > > Thanks for testing it. > > So these numbers below - are for l3fwd running with or without rx callb= ack installed? > > i.e. with or without '--parse-ptype' option (and on a NIC with/without = ptype HW support)? >=20 > without '--parse-ptype' option(i.e without rx callback installed). > That makes it sad, Even though application is not using the rx callback, > We need to pay on a average 3% drop in performance(based on the > use case/arch/SoC/topology etc) Yes it is sad... Do you have any idea why that happens? Is that the same situation you described before: for arm compiler often replaces small branches with conditional execution instructions and miss-prediction for such instructions is not free? Or is it something else? Another thing - the series contains 2 main patches: 1) http://dpdk.org/dev/patchwork/patch/31866/ Introduces eth_queue_local and moves callback data into it. That adds one extra memory reference for rx/tx function to read cb func and= data. 2) http://dpdk.org/dev/patchwork/patch/31867/ Introduces some synchronization for cb data. Is it possible to rerun your tests with just patch #1 installed? Hopefully it would help us to figure out which of the patches (#1 or #2) =20 is the main source of slowdown on arm machines. >=20 > > > > > > > > > > > Note: > > > +ve number means "Performance regression with patches" > > > -ve number means "Performance improvement with patches" > > > > > > Relative performance difference in percentage: > > > > > > % difference on host (2 x 40G) > > > pkt_sz 1c 2c 4c 8c > > > 64 1.41 0.18 0.51 0.29 > > > 72 0.50 0.53 0.09 0.19 > > > 128 0.31 0.31 0.42 0.00 > > > 256 -0.44 -0.44 0.00 0.00 > > > 512 1.94 1.94 0.01 0.01 > > > 1024 0.00 0.01 0.00 0.01 > > > 1518 0.02 0.01 0.02 0.02 > > > > > > % difference on guest (2 x 40G) > > > pkt_sz 1c 2c 4c 8c > > > 64 5.78 2.30 -2.45 -0.01 > > > 72 1.13 1.13 1.31 -4.29 > > > 128 1.36 1.36 1.54 -0.82 > > > 256 2.02 2.03 -0.01 -0.35 > > > 512 4.05 4.04 0.00 -0.46 > > > 1024 0.39 0.38 0.00 -0.74 > > > 1518 0.00 0.00 -0.05 -4.20 > > > > > > I think, it is because we added more code under > > > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by > > > default. > > > > > > I think, the impact will vary based on > > > architecture and underneath i-cache, d-cache and IPC. > > > > > > I am just thinking, How we can avoid such impact? > > > How about, > > > # Remove RTE_ETHDEV_RXTX_CALLBACKS option > > > # Hook Rx/TX callbacks under TX/RX offload flags > > > # ethdev layer gives helper function to invoke > > > callbacks to driver. > > > > I suppose you suggest that user would have to decide at queue_setup() > > stage does he wants a callback(s) for RX/TX or no? >=20 > Yes. queue_setup or configure() time. >=20 > > As I can see, the main drawback with that approach - it would introduce= new limitation. > > Right now it is possible to add/remove RX/TX callback at runtime > > (without stopping and reconfiguring the queue), and I suppose we do wan= t to keep that ability. >=20 > For those _application_ where it can not decide or it need to add without > stopping and re configuring the queue still can configure/enable the ethd= ev > device with RX/TX callback option(and pay price for it) >=20 > > Again in some cases user can't decide does he wants a callback installe= d or not, > > before he actually calls dev_start(). > > l3fwd ptype-parse callback is an example of such situation: > > right now user has to start device first to figure out which ptypes are= supported by this PMD > > with desired configuration, then based on this information he will (or = will not) setup the callback. >=20 > Good point. Can we remove that limitation from PMD perspective? IMO, ptyp= e-parse is more of HW feature, > All we need to do a lookup(single load) to convert to DPDK packet type. > http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.c#n258 >=20 > I see almost all the drivers has dummy check like this, I think, That mak= es it > dependency with start(). >=20 > if (dev->rx_pkt_burst =3D=3D i40e_recv_pkts || > dev->rx_pkt_burst =3D=3D i40e_recv_pkts_bulk_alloc || > dev->rx_pkt_burst =3D=3D i40e_recv_scattered_pkts || > dev->rx_pkt_burst =3D=3D i40e_recv_scattered_pkts_vec || > dev->rx_pkt_burst =3D=3D i40e_recv_pkts_vec) > return ptypes; >=20 We probably can, but it would require extra work and changes inside the PMD= s. i40e is a good one - all rx functions support the same ptypes. But let say for ixgbe ARM version of vRX doesn't support all ptypes that x8= 6 vRX and generic scalar RX do. For fm10k - vector RX and scalar RX also support different ptype sets. =20 >=20 > > What I thought in same direction - introduce a notion of 'static' and '= dynamic' callbacks. > > Static ones can't be removed while queue is not stopped while dynamic c= an. > > But that also requires user to select at queue_setup() stage does he wa= nts dynamic > > callbacks for that queue or not (new offload flag or extra field in rxc= onf/txconf structure). > > So again - same limitation for the user. > > Another possibility here - have a flag 'dynamic' or 'static' not per qu= eue, but per individual callback. > > Yes, that would mean API change for add/remove callback functions. >=20 > At least on my test, the regression was caused because adding the code in > tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which enabled > by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I don't > see any regression. If so, Will such regression fixed by introducing > 'static' or 'dynamic' callbacks? I am happy to test and provide feedback. Probably no then, but let's try the experiment I mentioned above first. >=20 > For me, I think, The better option is request from application for the > need for Rx/TX callbacks and set the handler appropriate in the driver to > have zero impact for the application which does not need callback at all. It is possible, but: - that would be quite significant change and would affect each RX and TX fu= nction in each PMD. - from other side - even now there are several libs inside DPDK that rely o= n ability to add/remove callbacks on the fly - pdump, latencystats. ethdev rx profiling also relies on ability to install RX callback. With all these things I presume most apps will configure their devices with= rx/tx callback support enabled anyway. Which makes me think - is it worth to introduce such changes at all. Anyway, my first suggestion would be let's try to figure out what exactly c= auses such big slowdown on ARM. Konstantin >=20 > at minimum, on those PMDs that support detecting the supported ptypes wit= hout > device start() >=20 >=20 >=20 >=20 > > > > Konstantin > > > > > But, don't invoke from common code > > > # When application needs the callback support, it > > > configured the given RX/TX queue offloads > > > # If the Rx/TX callback configure by the application > > > then driver calls the helper functions exposed by > > > the common code to invoke RX callbacks. > > > > > > > > > > Jerin > > > > > > > > > > > Ability to safely remove callbacks at runtime implies > > > > some sort of synchronization. > > > > Even I tried to make it as light as possible, > > > > probably some slowdown is unavoidable. > > > > Of course instead of introducing these changes at rte_ethdev layer > > > > similar technique could be applied on individual callback basis. > > > > In that case it would be up to callback writer/installer to decide > > > > does he/she need a removable at runtime callback or not. > > > > Though in that case, each installed callback might introduce extra > > > > synchronization overhead and slowdown. > > > > > > > > Konstantin Ananyev (3): > > > > ethdev: introduce eth_queue_local > > > > ethdev: make it safe to remove rx/tx callback at runtime > > > > doc: ethdev ABI change deprecation notice > > > > > > > > doc/guides/rel_notes/deprecation.rst | 5 + > > > > lib/librte_ether/rte_ethdev.c | 390 ++++++++++++++++++++++-= ------------ > > > > lib/librte_ether/rte_ethdev.h | 174 ++++++++++++---- > > > > 3 files changed, 387 insertions(+), 182 deletions(-) > > > > > > > > -- > > > > 2.13.5 > > > >