From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by dpdk.org (Postfix) with ESMTP id 2D48A7E10 for ; Sun, 28 Sep 2014 18:05:36 +0200 (CEST) Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga103.fm.intel.com with ESMTP; 28 Sep 2014 09:02:48 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,862,1389772800"; d="scan'208";a="392727225" Received: from irsmsx103.ger.corp.intel.com ([163.33.3.157]) by FMSMGA003.fm.intel.com with ESMTP; 28 Sep 2014 09:05:55 -0700 Received: from irsmsx151.ger.corp.intel.com (163.33.192.59) by IRSMSX103.ger.corp.intel.com (163.33.3.157) with Microsoft SMTP Server (TLS) id 14.3.195.1; Sun, 28 Sep 2014 17:12:05 +0100 Received: from irsmsx104.ger.corp.intel.com ([169.254.5.248]) by IRSMSX151.ger.corp.intel.com ([169.254.4.146]) with mapi id 14.03.0195.001; Sun, 28 Sep 2014 17:12:05 +0100 From: "Ananyev, Konstantin" To: Neil Horman Thread-Topic: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe: Thread-Index: AQHP2MAqmY9s5RkexE+yRGxj2kXnOpwR4sKAgAAemtCAAAdaAIAAZ8zwgADMQACAAA5cAIAAEWsAgAAnzM6AAAITkIAAAzQAgAApX+CAAA3RgIAC865Q Date: Sun, 28 Sep 2014 16:12:04 +0000 Message-ID: <2601191342CEEE43887BDE71AB9772582138410B@IRSMSX104.ger.corp.intel.com> References: <20140925172358.GG32725@hmsreliant.think-freely.org> <2601191342CEEE43887BDE71AB97725821378B50@IRSMSX104.ger.corp.intel.com> <20140926114630.GA3930@hmsreliant.think-freely.org> <20140926134014.GB3930@hmsreliant.think-freely.org> <20140926150156.GB5619@hmsreliant.think-freely.org> <2601191342CEEE43887BDE71AB9772582137D88E@IRSMSX104.ger.corp.intel.com> <20140926162134.GE5619@hmsreliant.think-freely.org> <2601191342CEEE43887BDE71AB9772582137D95F@IRSMSX104.ger.corp.intel.com> <20140926193905.GH5619@hmsreliant.think-freely.org> In-Reply-To: <20140926193905.GH5619@hmsreliant.think-freely.org> Accept-Language: en-IE, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [163.33.239.182] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Cc: "dev@dpdk.org" Subject: Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe: X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 28 Sep 2014 16:05:38 -0000 > -----Original Message----- > From: Neil Horman [mailto:nhorman@tuxdriver.com] > Sent: Friday, September 26, 2014 8:39 PM > To: Ananyev, Konstantin > Cc: Wodkowski, PawelX; dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread= -safe: >=20 > On Fri, Sep 26, 2014 at 06:07:14PM +0000, Ananyev, Konstantin wrote: > > > > > > > > As I remember the purpose of the patch was to fix the race conditio= n inside rte_alarm library. > > > > I believe that the patch provided by Michal & Pawel fixes the issue= s you discovered. > > > > If you think, that is not the case, could you please provide a list= of remaining issues? > > > > Excluding ones that you just don't like it, and you are not happy w= ith rte_alarm API in total? > > > > > > > Gladly. As Pawel explained the race, its possible that, after callin= g > > > rte_eal_alarm_cancel, an in-flight execution of an alarm callback may= still be > > > running. The problem with that ostensibly is that data which is bein= g accessed > > > by the callback might be then accessed in parallel with another proce= ss leading > > > to data corruption or some other problem. The issue I have with his p= atch is > > > that it doesn't completely close the race. While it does close the r= ace for the > > > condition in whcih thread B is running the alarm callback while threa= d A is > > > executing the cancel operation, it does not close the case for when a= single > > > thread B is running the cancel operation, as the in-flight execution = itself is > > > still active. > > > > A bit puzzled here: > > Are you saying that calling alarm_cancel() for itself inside eal_alarm_= callback() might cause a problem? > > I still don't see how. > > > Potentially yes, by the same race condition that exists when using a seco= ndary > thread to do the cancel call. As I understand it the race that Pawel des= cribed > is as follows: >=20 > Thread A Thread B > alarm_cancel() eal_alarm_callback > block on alarm spinlock drop spinlock > run cancel operation execute callback function > return from cancel > rte_eal_alarm_set >=20 > As Pawel described the problem, there is a desire to not set the new alar= m while > the old alarm is still executing. And his patch accomplishes that for th= e two > thread case above just fine >=20 > The problem with Pawels patch is that its non functional in the case wher= e the > cancel happens within Thread B. Lets change the scenario just a little b= it: >=20 > Thread B Thread C > eal_alarm_callback > callback_function > some_other_common_func > rte_eal_alarm_cancel(this) > pthread_signal(Thread C) wake up > operate on alarm data rte_eal_alarm_set > As I can see, there is an incorrect behaviour in your callback_function exa= mple. It should first finish with " eal_alarm_callback" and only then send a sign= al to other thread. Otherwise we can't help it in any way. But I think, I understand your concern: after rte_eal_aralm_cancel() finishes, the caller can't clearly distinguish= what exactly happen: 1) alarm was cancelled succesfully. 2) alarm was not found (already cancelled or executed). 3) alarm is executing by the same thread and can't be cancelled. Basically right now the caller can distinguish that either #1 or #2,3 happe= ned, but can't distinguish between 2 & 3. Correct? If that's so, then I suppose we can do: make alarm_cancel() to return a ne= gative value for the case #3 (-EINPROGRESS or something). Something like:=20 ... if (ap->executing =3D=3D 0) { LIST_REMOVE(ap,next); rte_free(ap); count++; ap =3D ap_prev;=20 } else if (pthread_equal(ap->executing_id, pthread_self()) =3D=3D 0) { executing++; } else {=20 ret =3D -EINPROGRESS; } ... return ((ret !=3D 0) ? ret : count); So the return value will be > 0 for #1, 0 for #2, <0 for #3. As I remember, you already suggested something similar in one of the previo= us mails. Konstantin =09 >=20 > In this scenario the problem is not fixed because when called from within= the > alarm thread, the executing alarm is skipped (as it must be), but that fa= ct is > invisible to the caller, and because of that its still possible for the s= ame > origional problem to occur. >=20 > Neil >=20