From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from EUR01-VE1-obe.outbound.protection.outlook.com (mail-ve1eur01on0083.outbound.protection.outlook.com [104.47.1.83]) by dpdk.org (Postfix) with ESMTP id 98DA53DC for ; Wed, 6 Sep 2017 08:54:37 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Mellanox.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version; bh=ALCYwZUOi7tSg7NadXq4mAmHpffMKkdnNQN7Rm6V2M4=; b=IB5SZUCP/7ExvRfntCSwyA3sB0yiTI7t2LWJ0QSOnHev2ceVTmdMZLvx1zZO+IjTd/6eszAHxXhHOCCFyaFKT/SmSNGyH0Yt6BRi8zQz1Xk/5f2sDHI0tf+VygIFoo0Z1+6HOoFTBW14eGS4e1hB4FgeoJbCcNWoS3Uh9JOGpMY= Received: from DB6PR0502MB3048.eurprd05.prod.outlook.com (10.172.250.136) by DB6PR0502MB3016.eurprd05.prod.outlook.com (10.172.247.139) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P256) id 15.20.13.10; Wed, 6 Sep 2017 06:54:35 +0000 Received: from DB6PR0502MB3048.eurprd05.prod.outlook.com ([fe80::e938:efe9:effc:7ef8]) by DB6PR0502MB3048.eurprd05.prod.outlook.com ([fe80::e938:efe9:effc:7ef8%14]) with mapi id 15.20.0013.018; Wed, 6 Sep 2017 06:54:35 +0000 From: Matan Azrad To: =?iso-8859-1?Q?Ga=EBtan_Rivet?= CC: "dev@dpdk.org" , Thomas Monjalon , Adrien Mazarguil Thread-Topic: [PATCH] net/failsafe: stat support enhancement Thread-Index: AQHTJkoXg9AtWO/g8EmNE6CKdkBdbqKmYMiggAAcSoCAAOBsYA== Date: Wed, 6 Sep 2017 06:54:35 +0000 Message-ID: References: <1504605394-6450-1-git-send-email-matan@mellanox.com> <20170905132244.GF21444@bidouze.vm.6wind.com> <20170905163421.GJ21444@bidouze.vm.6wind.com> In-Reply-To: <20170905163421.GJ21444@bidouze.vm.6wind.com> Accept-Language: en-US, he-IL Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: authentication-results: spf=none (sender IP is ) smtp.mailfrom=matan@mellanox.com; x-originating-ip: [193.47.165.251] x-ms-publictraffictype: Email x-microsoft-exchange-diagnostics: 1; DB6PR0502MB3016; 6:7UaCe/Vw3Ch11rLHrnFyqOrhvfd+FI60gxql5nG0XSW8OXI8Vizda6y3KaUCSb82Cf0rKTdBdvmAH3YUlNZ2iu47mMQyehhif6YrClYuIrmhc2VCvMHOi0J/ryzZZcehULVOizat8BhWFQwQyagcLoxxdr8ju7DL+kke/8gozt85EeoFeEW/1jqoMVsydsiYbJOACq1fNtGBkYLaEyw2CMyLrlzjU8sBRe7SrkaMMcgcYkZz+9PyhwDzVIrLZARGr8a0F7Ua45VlMQl11a/qbp9jRMSuRXYk1rBi5oWgTtX9aINJOg4eH61Y1AAf23xceaeF4hoQINGHcVZP9R8V5w==; 5:NCfPO6fI+vvBhk6QNQc03upSoXAgnjV61TlyhGj92Y79sifLQIH8/7wdG3HvpFGzMitPl7iz0gk8VERfweXOxmyVufB1GRC759VxUeADhecQhMa9Jx/gRaXn8IRQPWXk9YnZqT5wBXvbGB7/m8rLQmoAb1c6IdNzMf93rGIeMU0=; 24:mS0pbz7HPxar0JB4dGEIQwTUopZSIXRgnT9s47MkivIEL5g+/YBKSWEQJZeQjWJCxivul547Xj34QAeTWCwSk8M5RMKuq35/dfmUi8bAcTM=; 7:6jvZhn8DTATKC5QCU/i+gBipBdtXC/Nw0KlWQTdGHey8JdMwIGwYh6ZO/VKOc5YefhpJZesNHeK3OQsfVuroV4oCfu+8DtctaIF9i0qPeduLAE8xhW3gjMPgRasLkp5lDjptT8cwOQtSdq4Bw52vfKbWKnfKr4V5TDPe8iunlGHOZ+mqGy6zKkYZmC+VkErYsnYtsTb2Pp5L+Kq616zXIUm9H0FHuq2WM4AHCnFv/7g= x-ms-exchange-antispam-srfa-diagnostics: SSOS; x-ld-processed: a652971c-7d2e-4d9b-a6a4-d149256f461b,ExtAddr x-ms-office365-filtering-correlation-id: de60fffc-3090-45e2-809e-08d4f4f42277 x-ms-office365-filtering-ht: Tenant x-microsoft-antispam: UriScan:; BCL:0; PCL:0; RULEID:(300000500095)(300135000095)(300000501095)(300135300095)(22001)(300000502095)(300135100095)(2017030254152)(48565401081)(300000503095)(300135400095)(2017052603199)(201703131423075)(201703031133081)(201702281549075)(300000504095)(300135200095)(300000505095)(300135600095)(300000506095)(300135500095); SRVR:DB6PR0502MB3016; x-ms-traffictypediagnostic: DB6PR0502MB3016: x-exchange-antispam-report-test: UriScan:(278428928389397); x-microsoft-antispam-prvs: x-exchange-antispam-report-cfa-test: BCL:0; PCL:0; RULEID:(100000700101)(100105000095)(100000701101)(100105300095)(100000702101)(100105100095)(6040450)(2401047)(5005006)(8121501046)(100000703101)(100105400095)(93006095)(93001095)(3002001)(10201501046)(6055026)(6041248)(201703131423075)(201702281528075)(201703061421075)(201703061406153)(20161123560025)(20161123555025)(20161123562025)(20161123564025)(20161123558100)(6072148)(201708071742011)(100000704101)(100105200095)(100000705101)(100105500095); SRVR:DB6PR0502MB3016; BCL:0; PCL:0; RULEID:(100000800101)(100110000095)(100000801101)(100110300095)(100000802101)(100110100095)(100000803101)(100110400095)(100000804101)(100110200095)(100000805101)(100110500095); SRVR:DB6PR0502MB3016; x-forefront-prvs: 0422860ED4 x-forefront-antispam-report: SFV:NSPM; SFS:(10009020)(6009001)(39860400002)(13464003)(199003)(377454003)(24454002)(189002)(81156014)(106356001)(2906002)(189998001)(66066001)(54356999)(105586002)(561944003)(76176999)(68736007)(3280700002)(101416001)(3660700001)(551984002)(5250100002)(4326008)(99286003)(55016002)(54906002)(50986999)(33656002)(81166006)(6246003)(9686003)(53936002)(2900100001)(2950100002)(6916009)(478600001)(229853002)(110136004)(6506006)(86362001)(8936002)(7696004)(7736002)(8676002)(74316002)(53546010)(305945005)(14454004)(25786009)(6116002)(5660300001)(3846002)(102836003)(93886005)(6436002)(97736004); DIR:OUT; SFP:1101; SCL:1; SRVR:DB6PR0502MB3016; H:DB6PR0502MB3048.eurprd05.prod.outlook.com; FPR:; SPF:None; PTR:InfoNoRecords; A:1; MX:1; LANG:en; received-spf: None (protection.outlook.com: mellanox.com does not designate permitted sender hosts) spamdiagnosticoutput: 1:99 spamdiagnosticmetadata: NSPM Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-OriginatorOrg: Mellanox.com X-MS-Exchange-CrossTenant-originalarrivaltime: 06 Sep 2017 06:54:35.2130 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: a652971c-7d2e-4d9b-a6a4-d149256f461b X-MS-Exchange-Transport-CrossTenantHeadersStamped: DB6PR0502MB3016 Subject: Re: [dpdk-dev] [PATCH] net/failsafe: stat support enhancement X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Sep 2017 06:54:38 -0000 Hi Gaetan, > -----Original Message----- > From: Ga=EBtan Rivet [mailto:gaetan.rivet@6wind.com] > Sent: Tuesday, September 5, 2017 7:34 PM > To: Matan Azrad > Cc: dev@dpdk.org > Subject: Re: [PATCH] net/failsafe: stat support enhancement >=20 > On Tue, Sep 05, 2017 at 03:12:55PM +0000, Matan Azrad wrote: > > Hi Gaetan, > > > > > -----Original Message----- > > > From: Ga=EBtan Rivet [mailto:gaetan.rivet@6wind.com] > > > Sent: Tuesday, September 5, 2017 4:23 PM > > > To: Matan Azrad > > > Cc: dev@dpdk.org > > > Subject: Re: [PATCH] net/failsafe: stat support enhancement > > > > > > Hi Matan, > > > > > > On Tue, Sep 05, 2017 at 12:56:34PM +0300, Matan Azrad wrote: > > > > The previous stats code returned only the current TX sub device sta= ts. > > > > > > > > This enhancement extends it to return the sum of all sub devices > > > > stats with history of removed sub-devices. > > > > > > > > Dedicated stats accumulator saves the stat history of all sub > > > > device remove events. > > > > > > > > Each failsafe sub device contains the last stats asked by the user > > > > and updates the accumulator in removal time. > > > > > > > > I would like to implement ultimate snapshot on removal time. > > > > The stats_get API needs to be changed to return error in the case > > > > it is too late to retrieve statistics. > > > > > > You need to have this API evolution first. It is not assured to be > > > accepted by the community. > > > > > > > I already sent suggestion to mailing list, find "stats_get API: return = value > suggestion": > > No objections for now. > > > > > > > As it is, this version is incorrect, because the only available > > > stats for a removed slave is the last snapshot. > > > > > > > It is for sure better from the previous version, the accuracy can be > > better if my suggestion will be accepted. > > If it is, I will improve it. > > > > > Thus it complicates the rules while still being incorrect. Even if > > > you were able to push for this API evolution, PMDs with hard > > > counters would be the ones to return an error code on stat read while > removed. > > > You may be lucky at the moment because MLX drivers do not support > > > hard counters, but this won't always be true (and it will be false so= on > enough). > > > You will thus be back at square one, with a new useless API and > > > still incorrect stats in the fail-safe. On the other hand, the > > > fail-safe should strive to be as easy to use as possible with most > > > drivers, and not cater specifically to soft- counters ones. > > > > > > So, my take on this: I understand that aggregated stats would be > interesting. > > > Keep it simple & stupid: simple aggregation on stats_get. > > > You will have a rollback at some point on device removal, but this > > > is still easier to detect / deal with than partially / sometimes inco= rrect > stats. > > > > > > > 100% accuracy is impossible when removal event is in the game, but we > > can do it better as we can. > > > > As I wrote in commit log(if stat gets will return value): > > When some drivers can give the stats after RMV interrupt and before > > dev_close, We just can to improve accuracy and be closer to 100%. > > The stats of the sub devices which return error will be taken from the = last > snapshot. > > > > It is just enhancement which is important for customers. > > >=20 > The trade-off is between accuracy and complexity, both at the fail-safe l= evel > and at the application level. >=20 > Your solution | KISS solution > -------------------------+-------------------------------- > Accuracy Closer to reality. | Farther from reality. > Upon device removal, | Upon device removal, > only packets received | all previous accounted received > since the last stats_get | packets are missing, inducing > are missing. | a stats rollback from the app > | PoV. > -------------------------+-------------------------------- > Complexity Medium implementation | Simple implementation. > (fast-path) complexity. | Two hotplug sequences will be > The accumulator induces | identical (from a stats PoV). > a trailing effect | > between device remove / | > plugin, thus incurring | > edge-effects and | > reducing idempotency of | > hotplug sequences. | > -------------------------+-------------------------------- > Complexity Statistical errors | Glaring statistical errors > (application) induced by device | induced by device removal. > removal is hard. | Detecting those (stats rollback) > Device removal is | is trivial and can thus be > pretty much invisible. | mitigated. > | > Plays nice with the | While easier to deal with, > "seamless" feature of | it actually pushes application > fail-safe, but might be | to have custom code-path for > a curse in disguise. | the fail-safe PMD, which goes > | against its design principles. >=20 >=20 Thanks a lot for the analysis. I agree very much with your complexity analysis, I have version to deliver = the accumulator update=20 function from the remove interrupt to the fs_stats_get function with more c= omplexity in fs_stats_get. Is it interesting you? > I consider that your proposed API will only somewhat fix errors with PMDs > having software counters. Thus it being accepted or rejected is irrelevan= t for > this assessment. And improve the applications (and failsafe) report about optional errors in= the current stats_get. >=20 > While I strongly support the simplest solution for any implementation, I > dislike going against the fail-safe core principles. >=20 > This is a very flimsy sheet of ice we are walking upon. Keep in mind that= at > the first complication due to your proposal, it will be reverted and the = KISS > solution used instead. >=20 But sometimes if the accuracy is just beside us and we can take it easily,= =20 why not? =20 > But as it is, I thus accept it. I will do a proper review (there are a fe= w things to > fix still, now that the `why` is dealt with) soon. >=20 > > > > By this way, failsafe can get stats snapshot in removal interrupt > > > > callback for each PMD which can give stats after removal event. > > > > > > > > Signed-off-by: Matan Azrad > > > > --- > > > > drivers/net/failsafe/failsafe_ether.c | 33 > > > +++++++++++++++++++++++++++++++++ > > > > drivers/net/failsafe/failsafe_ops.c | 16 ++++++++++++---- > > > > drivers/net/failsafe/failsafe_private.h | 5 +++++ > > > > 3 files changed, 50 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/drivers/net/failsafe/failsafe_ether.c > > > > b/drivers/net/failsafe/failsafe_ether.c > > > > index a3a8cce..133080d 100644 > > > > --- a/drivers/net/failsafe/failsafe_ether.c > > > > +++ b/drivers/net/failsafe/failsafe_ether.c > > > > @@ -399,6 +399,37 @@ failsafe_eth_dev_state_sync(struct > > > > rte_eth_dev > > > *dev) > > > > return ret; > > > > } > > > > > > > > +void > > > > +fs_increment_stats(struct rte_eth_stats *from, struct > > > > +rte_eth_stats > > > > +*to) { > > > > + uint8_t i; > > > > + > > > > + RTE_ASSERT(from !=3D NULL && to !=3D NULL); > > > > + to->ipackets +=3D from->ipackets; > > > > + to->opackets +=3D from->opackets; > > > > + to->ibytes +=3D from->ibytes; > > > > + to->obytes +=3D from->obytes; > > > > + to->imissed +=3D from->imissed; > > > > + to->ierrors +=3D from->ierrors; > > > > + to->oerrors +=3D from->oerrors; > > > > + to->rx_nombuf +=3D from->rx_nombuf; > > > > + for (i =3D 0; i < RTE_ETHDEV_QUEUE_STAT_CNTRS; i++) { > > > > + to->q_ipackets[i] +=3D from->q_ipackets[i]; > > > > + to->q_opackets[i] +=3D from->q_opackets[i]; > > > > + to->q_ibytes[i] +=3D from->q_ibytes[i]; > > > > + to->q_obytes[i] +=3D from->q_obytes[i]; > > > > + to->q_errors[i] +=3D from->q_errors[i]; > > > > + } > > > > +} > > > > + > > > > +void > > > > +fs_increment_stats_accumulator(struct sub_device *sdev) { > > > > + fs_increment_stats(&sdev->stats_snapshot, > > > > + &PRIV(sdev->fs_dev)->stats_accumulator); > > > > + memset(&sdev->stats_snapshot, 0, sizeof(struct rte_eth_stats)); > > > > +} > > > > + > > > > int > > > > failsafe_eth_rmv_event_callback(uint8_t port_id __rte_unused, > > > > enum rte_eth_event_type event > > > __rte_unused, @@ -410,6 +441,8 @@ > > > > failsafe_eth_rmv_event_callback(uint8_t port_id __rte_unused, > > > > fs_switch_dev(sdev->fs_dev, sdev); > > > > /* Use safe bursts in any case. */ > > > > set_burst_fn(sdev->fs_dev, 1); > > > > + /* Increment the stats accumulator by the last stats snapshot. */ > > > > + fs_increment_stats_accumulator(sdev); > > > > /* > > > > * Async removal, the sub-PMD will try to unregister > > > > * the callback at the source of the current thread context. > > > > diff --git a/drivers/net/failsafe/failsafe_ops.c > > > > b/drivers/net/failsafe/failsafe_ops.c > > > > index ff9ad15..e47cc85 100644 > > > > --- a/drivers/net/failsafe/failsafe_ops.c > > > > +++ b/drivers/net/failsafe/failsafe_ops.c > > > > @@ -586,9 +586,14 @@ static void > > > > fs_stats_get(struct rte_eth_dev *dev, > > > > struct rte_eth_stats *stats) { > > > > - if (TX_SUBDEV(dev) =3D=3D NULL) > > > > - return; > > > > - rte_eth_stats_get(PORT_ID(TX_SUBDEV(dev)), stats); > > > > + struct sub_device *sdev; > > > > + uint8_t i; > > > > + > > > > + memcpy(stats, &PRIV(dev)->stats_accumulator, sizeof(*stats)); > > > > + FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > > > > + rte_eth_stats_get(PORT_ID(sdev), &sdev->stats_snapshot); > > > > + fs_increment_stats(&sdev->stats_snapshot, stats); > > > > + } > > > > } > > > > > > > > static void > > > > @@ -597,8 +602,11 @@ fs_stats_reset(struct rte_eth_dev *dev) > > > > struct sub_device *sdev; > > > > uint8_t i; > > > > > > > > - FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) > > > > + FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > > > > rte_eth_stats_reset(PORT_ID(sdev)); > > > > + memset(&sdev->stats_snapshot, 0, sizeof(struct > > > rte_eth_stats)); > > > > + } > > > > + memset(&PRIV(dev)->stats_accumulator, 0, sizeof(struct > > > > +rte_eth_stats)); > > > > } > > > > > > > > /** > > > > diff --git a/drivers/net/failsafe/failsafe_private.h > > > > b/drivers/net/failsafe/failsafe_private.h > > > > index 0361cf4..267c749 100644 > > > > --- a/drivers/net/failsafe/failsafe_private.h > > > > +++ b/drivers/net/failsafe/failsafe_private.h > > > > @@ -102,6 +102,8 @@ struct sub_device { > > > > uint8_t sid; > > > > /* Device state machine */ > > > > enum dev_state state; > > > > + /* Last stats snapshot passed to user */ > > > > + struct rte_eth_stats stats_snapshot; > > > > /* Some device are defined as a command line */ > > > > char *cmdline; > > > > /* fail-safe device backreference */ @@ -140,6 +142,7 @@ struct > > > > fs_priv { > > > > * synchronized state. > > > > */ > > > > enum dev_state state; > > > > + struct rte_eth_stats stats_accumulator; > > > > unsigned int pending_alarm:1; /* An alarm is pending */ > > > > /* flow isolation state */ > > > > int flow_isolated:1; > > > > @@ -180,6 +183,8 @@ int failsafe_eal_uninit(struct rte_eth_dev > > > > *dev); > > > > > > > > int failsafe_eth_dev_state_sync(struct rte_eth_dev *dev); void > > > > failsafe_dev_remove(struct rte_eth_dev *dev); > > > > +void fs_increment_stats(struct rte_eth_stats *from, struct > > > > +rte_eth_stats *to); void fs_increment_stats_accumulator(struct > > > > +sub_device *sdev); > > > > int failsafe_eth_rmv_event_callback(uint8_t port_id, > > > > enum rte_eth_event_type type, > > > > void *arg, void *out); > > > > -- > > > > 2.7.4 > > > > > > > > > > -- > > > Ga=EBtan Rivet > > > 6WIND >=20 > -- > Ga=EBtan Rivet > 6WIND Thanks, Matan azrad