From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 26AD441DD0;
	Fri, 10 Mar 2023 09:23:58 +0100 (CET)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id 4C6E642FA8;
	Fri, 10 Mar 2023 09:21:08 +0100 (CET)
Received: from mx0b-0016f401.pphosted.com (mx0b-0016f401.pphosted.com
 [67.231.156.173])
 by mails.dpdk.org (Postfix) with ESMTP id 0E6DB42D1D
 for <dev@dpdk.org>; Fri, 10 Mar 2023 09:20:34 +0100 (CET)
Received: from pps.filterd (m0045851.ppops.net [127.0.0.1])
 by mx0b-0016f401.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id
 32A7uIdK021875 for <dev@dpdk.org>; Fri, 10 Mar 2023 00:20:34 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=marvell.com;
 h=from : to : cc :
 subject : date : message-id : in-reply-to : references : mime-version :
 content-type; s=pfpt0220; bh=f2xAjJ4TgaG2zcQDqpA3ovW7IPwgrxy2g7o6oBsWNOM=;
 b=flRD8irR7JbCccHQkgIM7KLrgCxZoB3lN7X5AJR+ZNIpSZrVaRuabHCrGhke+rY/Uffd
 PCt9kR7jsJlnOfrRLheFqbQ50QCNZtR1MDHCPfG+CpNV2WPtUq+lkZqNs8ccE0yRXYOv
 JAXsPogFHzPignpnsSLxBhKFaC1GBWuotemMEZqFOZamid6pnl4/IZZg0pM7HzivaGOp
 d9fI1UqpGC32zNCiSjqvdb9kvHSjFj5KWvtv8eefxNFa5y/8PVPMZmuVw5Pi2fXBY1zq
 4/DcyWXqrr1kNucHw+y79uzq0Fy070es6hwr0XxKZq7ptDxfEHXmtofknJb7jCLV1UrH PA== 
Received: from dc5-exch01.marvell.com ([199.233.59.181])
 by mx0b-0016f401.pphosted.com (PPS) with ESMTPS id 3p80k782tf-19
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
 for <dev@dpdk.org>; Fri, 10 Mar 2023 00:20:34 -0800
Received: from DC5-EXCH01.marvell.com (10.69.176.38) by DC5-EXCH01.marvell.com
 (10.69.176.38) with Microsoft SMTP Server (TLS) id 15.0.1497.42;
 Fri, 10 Mar 2023 00:20:24 -0800
Received: from maili.marvell.com (10.69.176.80) by DC5-EXCH01.marvell.com
 (10.69.176.38) with Microsoft SMTP Server id 15.0.1497.42 via Frontend
 Transport; Fri, 10 Mar 2023 00:20:24 -0800
Received: from ml-host-33.caveonetworks.com (unknown [10.110.143.233])
 by maili.marvell.com (Postfix) with ESMTP id 282735C68EF;
 Fri, 10 Mar 2023 00:20:23 -0800 (PST)
From: Srikanth Yalavarthi <syalavarthi@marvell.com>
To: Srikanth Yalavarthi <syalavarthi@marvell.com>
CC: <dev@dpdk.org>, <sshankarnara@marvell.com>, <jerinj@marvell.com>,
 <aprabhu@marvell.com>, <ptakkar@marvell.com>, <pshukla@marvell.com>
Subject: [PATCH v6 29/39] ml/cnxk: enable support for firmware error codes
Date: Fri, 10 Mar 2023 00:20:05 -0800
Message-ID: <20230310082015.20200-30-syalavarthi@marvell.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20230310082015.20200-1-syalavarthi@marvell.com>
References: <20221208200220.20267-1-syalavarthi@marvell.com>
 <20230310082015.20200-1-syalavarthi@marvell.com>
MIME-Version: 1.0
Content-Type: text/plain
X-Proofpoint-GUID: A7BqKijoVYDKtMy1uB0JYPk4GXTsl5Vx
X-Proofpoint-ORIG-GUID: A7BqKijoVYDKtMy1uB0JYPk4GXTsl5Vx
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.254,Aquarius:18.0.942,Hydra:6.0.573,FMLib:17.11.170.22
 definitions=2023-03-10_02,2023-03-09_01,2023-02-09_01
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

Enabled support for error handling. Added error types and subtypes
supported by ML firmware. Enabled support to get device specific
error code and message for a completed ML request.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
 drivers/ml/cnxk/cn10k_ml_dev.c |   4 +-
 drivers/ml/cnxk/cn10k_ml_dev.h |  50 +++++++++++++-
 drivers/ml/cnxk/cn10k_ml_ops.c | 117 ++++++++++++++++++++++++++++++---
 drivers/ml/cnxk/cn10k_ml_ops.h |   2 +
 4 files changed, 160 insertions(+), 13 deletions(-)

diff --git a/drivers/ml/cnxk/cn10k_ml_dev.c b/drivers/ml/cnxk/cn10k_ml_dev.c
index 837f006bf0..76ed853a3c 100644
--- a/drivers/ml/cnxk/cn10k_ml_dev.c
+++ b/drivers/ml/cnxk/cn10k_ml_dev.c
@@ -261,7 +261,7 @@ cn10k_ml_fw_load_asim(struct cn10k_ml_fw *fw)
 	} while (plt_tsc_cycles() < timeout_cycle);
 
 	/* Check firmware load status, clean-up and exit on failure. */
-	if ((!timeout) && (fw->req->result.error_code == 0)) {
+	if ((!timeout) && (fw->req->result.error_code.u64 == 0)) {
 		cn10k_ml_fw_print_info(fw);
 	} else {
 		/* Set ML to disable new jobs */
@@ -452,7 +452,7 @@ cn10k_ml_fw_load_cn10ka(struct cn10k_ml_fw *fw, void *buffer, uint64_t size)
 	} while (plt_tsc_cycles() < timeout_cycle);
 
 	/* Check firmware load status, clean-up and exit on failure. */
-	if ((!timeout) && (fw->req->result.error_code == 0)) {
+	if ((!timeout) && (fw->req->result.error_code.u64 == 0)) {
 		cn10k_ml_fw_print_info(fw);
 	} else {
 		/* Set ML to disable new jobs */
diff --git a/drivers/ml/cnxk/cn10k_ml_dev.h b/drivers/ml/cnxk/cn10k_ml_dev.h
index 8f6bc24370..604a200e26 100644
--- a/drivers/ml/cnxk/cn10k_ml_dev.h
+++ b/drivers/ml/cnxk/cn10k_ml_dev.h
@@ -64,6 +64,54 @@ enum cn10k_ml_dev_state {
 	ML_CN10K_DEV_STATE_CLOSED
 };
 
+/* Error types enumeration */
+enum cn10k_ml_error_etype {
+	/* 0x0 */ ML_ETYPE_NO_ERROR = 0, /* No error */
+	/* 0x1 */ ML_ETYPE_FW_NONFATAL,	 /* Firmware non-fatal error */
+	/* 0x2 */ ML_ETYPE_HW_NONFATAL,	 /* Hardware non-fatal error */
+	/* 0x3 */ ML_ETYPE_HW_FATAL,	 /* Hardware fatal error */
+	/* 0x4 */ ML_ETYPE_HW_WARNING,	 /* Hardware warning */
+	/* 0x5 */ ML_ETYPE_DRIVER,	 /* Driver specific error */
+	/* 0x6 */ ML_ETYPE_UNKNOWN,	 /* Unknown error */
+};
+
+/* Firmware non-fatal error sub-type */
+enum cn10k_ml_error_stype_fw_nf {
+	/* 0x0 */ ML_FW_ERR_NOERR = 0,		 /* No error */
+	/* 0x1 */ ML_FW_ERR_UNLOAD_ID_NOT_FOUND, /* Model ID not found during load */
+	/* 0x2 */ ML_FW_ERR_LOAD_LUT_OVERFLOW,	 /* Lookup table overflow at load */
+	/* 0x3 */ ML_FW_ERR_ID_IN_USE,		 /* Model ID already in use */
+	/* 0x4 */ ML_FW_ERR_INVALID_TILEMASK,	 /* Invalid OCM tilemask */
+	/* 0x5 */ ML_FW_ERR_RUN_LUT_OVERFLOW,	 /* Lookup table overflow at run */
+	/* 0x6 */ ML_FW_ERR_RUN_ID_NOT_FOUND,	 /* Model ID not found during run */
+	/* 0x7 */ ML_FW_ERR_COMMAND_NOTSUP,	 /* Unsupported command */
+	/* 0x8 */ ML_FW_ERR_DDR_ADDR_RANGE,	 /* DDR address out of range */
+	/* 0x9 */ ML_FW_ERR_NUM_BATCHES_INVALID, /* Invalid number of batches */
+	/* 0xA */ ML_FW_ERR_INSSYNC_TIMEOUT,	 /* INS sync timeout */
+};
+
+/* Driver error sub-type */
+enum cn10k_ml_error_stype_driver {
+	/* 0x0 */ ML_DRIVER_ERR_NOERR = 0, /* No error */
+	/* 0x1 */ ML_DRIVER_ERR_UNKNOWN,   /* Unable to determine error sub-type */
+	/* 0x2 */ ML_DRIVER_ERR_EXCEPTION, /* Firmware exception */
+	/* 0x3 */ ML_DRIVER_ERR_FW_ERROR,  /* Unknown firmware error */
+};
+
+/* Error structure */
+union cn10k_ml_error_code {
+	struct {
+		/* Error type */
+		uint64_t etype : 4;
+
+		/* Error sub-type */
+		uint64_t stype : 60;
+	} s;
+
+	/* WORD 0 */
+	uint64_t u64;
+};
+
 /* Firmware stats */
 struct cn10k_ml_fw_stats {
 	/* Firmware start cycle */
@@ -82,7 +130,7 @@ struct cn10k_ml_fw_stats {
 /* Result structure */
 struct cn10k_ml_result {
 	/* Job error code */
-	uint64_t error_code;
+	union cn10k_ml_error_code error_code;
 
 	/* Firmware stats */
 	struct cn10k_ml_fw_stats stats;
diff --git a/drivers/ml/cnxk/cn10k_ml_ops.c b/drivers/ml/cnxk/cn10k_ml_ops.c
index 87778c37bb..23a9ca4ff2 100644
--- a/drivers/ml/cnxk/cn10k_ml_ops.c
+++ b/drivers/ml/cnxk/cn10k_ml_ops.c
@@ -23,6 +23,49 @@
 #define ML_FLAGS_POLL_COMPL BIT(0)
 #define ML_FLAGS_SSO_COMPL  BIT(1)
 
+/* Error message length */
+#define ERRMSG_LEN 32
+
+/* Error type database */
+static const struct cn10k_ml_etype_db {
+	enum cn10k_ml_error_etype etype;
+	char name[ERRMSG_LEN];
+} ml_etype_db[] = {
+	{ML_ETYPE_NO_ERROR, "NO_ERROR"},	{ML_ETYPE_FW_NONFATAL, "FW_NON_FATAL"},
+	{ML_ETYPE_HW_NONFATAL, "HW_NON_FATAL"}, {ML_ETYPE_HW_FATAL, "HW_FATAL"},
+	{ML_ETYPE_HW_WARNING, "HW_WARNING"},	{ML_ETYPE_DRIVER, "DRIVER_ERROR"},
+	{ML_ETYPE_UNKNOWN, "UNKNOWN_ERROR"},
+};
+
+/* Hardware non-fatal error subtype database */
+static const struct cn10k_ml_stype_db_hw_nf {
+	enum cn10k_ml_error_stype_fw_nf stype;
+	char msg[ERRMSG_LEN];
+} ml_stype_db_hw_nf[] = {
+	{ML_FW_ERR_NOERR, "NO ERROR"},
+	{ML_FW_ERR_UNLOAD_ID_NOT_FOUND, "UNLOAD MODEL ID NOT FOUND"},
+	{ML_FW_ERR_LOAD_LUT_OVERFLOW, "LOAD LUT OVERFLOW"},
+	{ML_FW_ERR_ID_IN_USE, "MODEL ID IN USE"},
+	{ML_FW_ERR_INVALID_TILEMASK, "INVALID TILEMASK"},
+	{ML_FW_ERR_RUN_LUT_OVERFLOW, "RUN LUT OVERFLOW"},
+	{ML_FW_ERR_RUN_ID_NOT_FOUND, "RUN MODEL ID NOT FOUND"},
+	{ML_FW_ERR_COMMAND_NOTSUP, "COMMAND NOT SUPPORTED"},
+	{ML_FW_ERR_DDR_ADDR_RANGE, "DDR ADDRESS OUT OF RANGE"},
+	{ML_FW_ERR_NUM_BATCHES_INVALID, "INVALID BATCHES"},
+	{ML_FW_ERR_INSSYNC_TIMEOUT, "INSSYNC TIMEOUT"},
+};
+
+/* Driver error subtype database */
+static const struct cn10k_ml_stype_db_driver {
+	enum cn10k_ml_error_stype_driver stype;
+	char msg[ERRMSG_LEN];
+} ml_stype_db_driver[] = {
+	{ML_DRIVER_ERR_NOERR, "NO ERROR"},
+	{ML_DRIVER_ERR_UNKNOWN, "UNKNOWN ERROR"},
+	{ML_DRIVER_ERR_EXCEPTION, "FW EXCEPTION"},
+	{ML_DRIVER_ERR_FW_ERROR, "UNKNOWN FIRMWARE ERROR"},
+};
+
 static void
 print_line(FILE *fp, int len)
 {
@@ -474,6 +517,7 @@ cn10k_ml_dev_configure(struct rte_ml_dev *dev, const struct rte_ml_dev_config *c
 
 	dev->enqueue_burst = cn10k_ml_enqueue_burst;
 	dev->dequeue_burst = cn10k_ml_dequeue_burst;
+	dev->op_error_get = cn10k_ml_op_error_get;
 
 	mldev->nb_models_loaded = 0;
 	mldev->state = ML_CN10K_DEV_STATE_CONFIGURED;
@@ -758,7 +802,7 @@ cn10k_ml_dev_selftest(struct rte_ml_dev *dev)
 	if (timeout) {
 		ret = -ETIME;
 	} else {
-		if (req->result.error_code != 0)
+		if (req->result.error_code.u64 != 0)
 			ret = -1;
 	}
 
@@ -936,7 +980,7 @@ cn10k_ml_model_start(struct rte_ml_dev *dev, uint16_t model_id)
 	/* Prepare JD */
 	req = model->req;
 	cn10k_ml_prep_sp_job_descriptor(mldev, model, req, ML_CN10K_JOB_TYPE_MODEL_START);
-	req->result.error_code = 0x0;
+	req->result.error_code.u64 = 0x0;
 	req->result.user_ptr = NULL;
 
 	plt_write64(ML_CN10K_POLL_JOB_START, &req->status);
@@ -1017,7 +1061,7 @@ cn10k_ml_model_start(struct rte_ml_dev *dev, uint16_t model_id)
 
 	if (job_dequeued) {
 		if (plt_read64(&req->status) == ML_CN10K_POLL_JOB_FINISH) {
-			if (req->result.error_code == 0)
+			if (req->result.error_code.u64 == 0)
 				ret = 0;
 			else
 				ret = -1;
@@ -1079,7 +1123,7 @@ cn10k_ml_model_stop(struct rte_ml_dev *dev, uint16_t model_id)
 	/* Prepare JD */
 	req = model->req;
 	cn10k_ml_prep_sp_job_descriptor(mldev, model, req, ML_CN10K_JOB_TYPE_MODEL_STOP);
-	req->result.error_code = 0x0;
+	req->result.error_code.u64 = 0x0;
 	req->result.user_ptr = NULL;
 
 	plt_write64(ML_CN10K_POLL_JOB_START, &req->status);
@@ -1134,7 +1178,7 @@ cn10k_ml_model_stop(struct rte_ml_dev *dev, uint16_t model_id)
 
 	if (job_dequeued) {
 		if (plt_read64(&req->status) == ML_CN10K_POLL_JOB_FINISH) {
-			if (req->result.error_code == 0x0)
+			if (req->result.error_code.u64 == 0x0)
 				ret = 0;
 			else
 				ret = -1;
@@ -1426,12 +1470,30 @@ cn10k_ml_result_update(struct rte_ml_dev *dev, int qp_id, struct cn10k_ml_result
 	PLT_SET_USED(dev);
 	PLT_SET_USED(qp_id);
 
-	op->impl_opaque = result->error_code;
+	struct cn10k_ml_dev *mldev;
 
-	if (likely(result->error_code == 0))
+	if (likely(result->error_code.u64 == 0)) {
+		op->impl_opaque = result->error_code.u64;
 		op->status = RTE_ML_OP_STATUS_SUCCESS;
-	else
+	} else {
+		/* Handle driver error */
+		if (result->error_code.s.etype == ML_ETYPE_DRIVER) {
+			mldev = dev->data->dev_private;
+
+			/* Check for exception */
+			if ((roc_ml_reg_read64(&mldev->roc, ML_SCRATCH_EXCEPTION_SP_C0) != 0) ||
+			    (roc_ml_reg_read64(&mldev->roc, ML_SCRATCH_EXCEPTION_SP_C1) != 0))
+				result->error_code.s.stype = ML_DRIVER_ERR_EXCEPTION;
+			else if ((roc_ml_reg_read64(&mldev->roc, ML_CORE_INT_LO) != 0) ||
+				 (roc_ml_reg_read64(&mldev->roc, ML_CORE_INT_HI) != 0))
+				result->error_code.s.stype = ML_DRIVER_ERR_FW_ERROR;
+			else
+				result->error_code.s.stype = ML_DRIVER_ERR_UNKNOWN;
+		}
+
+		op->impl_opaque = result->error_code.u64;
 		op->status = RTE_ML_OP_STATUS_ERROR;
+	}
 
 	op->user_ptr = result->user_ptr;
 }
@@ -1468,6 +1530,7 @@ cn10k_ml_enqueue_burst(struct rte_ml_dev *dev, uint16_t qp_id, struct rte_ml_op
 	cn10k_ml_prep_fp_job_descriptor(dev, req, op);
 
 	memset(&req->result, 0, sizeof(struct cn10k_ml_result));
+	req->result.error_code.s.etype = ML_ETYPE_UNKNOWN;
 	req->result.user_ptr = op->user_ptr;
 
 	plt_write64(ML_CN10K_POLL_JOB_START, &req->status);
@@ -1515,8 +1578,12 @@ cn10k_ml_dequeue_burst(struct rte_ml_dev *dev, uint16_t qp_id, struct rte_ml_op
 dequeue_req:
 	req = &queue->reqs[tail];
 	status = plt_read64(&req->status);
-	if (unlikely(status != ML_CN10K_POLL_JOB_FINISH))
-		goto empty_or_active;
+	if (unlikely(status != ML_CN10K_POLL_JOB_FINISH)) {
+		if (plt_tsc_cycles() < req->timeout)
+			goto empty_or_active;
+		else /* Timeout, set indication of driver error */
+			req->result.error_code.s.etype = ML_ETYPE_DRIVER;
+	}
 
 	cn10k_ml_result_update(dev, qp_id, &req->result, req->op);
 	ops[count] = req->op;
@@ -1533,6 +1600,35 @@ cn10k_ml_dequeue_burst(struct rte_ml_dev *dev, uint16_t qp_id, struct rte_ml_op
 	return count;
 }
 
+__rte_hot int
+cn10k_ml_op_error_get(struct rte_ml_dev *dev, struct rte_ml_op *op, struct rte_ml_op_error *error)
+{
+	union cn10k_ml_error_code *error_code;
+	char msg[RTE_ML_STR_MAX];
+
+	PLT_SET_USED(dev);
+
+	error_code = (union cn10k_ml_error_code *)&op->impl_opaque;
+
+	/* Copy error message */
+	plt_strlcpy(msg, ml_etype_db[error_code->s.etype].name, sizeof(msg));
+
+	/* Copy sub error message */
+	if (error_code->s.etype == ML_ETYPE_HW_NONFATAL) {
+		strcat(msg, " : ");
+		strcat(msg, ml_stype_db_hw_nf[error_code->s.stype].msg);
+	}
+
+	if (error_code->s.etype == ML_ETYPE_DRIVER) {
+		strcat(msg, " : ");
+		strcat(msg, ml_stype_db_driver[error_code->s.stype].msg);
+	}
+
+	plt_strlcpy(error->message, msg, sizeof(error->message));
+
+	return 0;
+}
+
 __rte_hot int
 cn10k_ml_inference_sync(struct rte_ml_dev *dev, struct rte_ml_op *op)
 {
@@ -1549,6 +1645,7 @@ cn10k_ml_inference_sync(struct rte_ml_dev *dev, struct rte_ml_op *op)
 	cn10k_ml_prep_fp_job_descriptor(dev, req, op);
 
 	memset(&req->result, 0, sizeof(struct cn10k_ml_result));
+	req->result.error_code.s.etype = ML_ETYPE_UNKNOWN;
 	req->result.user_ptr = op->user_ptr;
 
 	plt_write64(ML_CN10K_POLL_JOB_START, &req->status);
diff --git a/drivers/ml/cnxk/cn10k_ml_ops.h b/drivers/ml/cnxk/cn10k_ml_ops.h
index 7c35bf7539..1784900cff 100644
--- a/drivers/ml/cnxk/cn10k_ml_ops.h
+++ b/drivers/ml/cnxk/cn10k_ml_ops.h
@@ -75,6 +75,8 @@ __rte_hot uint16_t cn10k_ml_enqueue_burst(struct rte_ml_dev *dev, uint16_t qp_id
 					  struct rte_ml_op **ops, uint16_t nb_ops);
 __rte_hot uint16_t cn10k_ml_dequeue_burst(struct rte_ml_dev *dev, uint16_t qp_id,
 					  struct rte_ml_op **ops, uint16_t nb_ops);
+__rte_hot int cn10k_ml_op_error_get(struct rte_ml_dev *dev, struct rte_ml_op *op,
+				    struct rte_ml_op_error *error);
 __rte_hot int cn10k_ml_inference_sync(struct rte_ml_dev *dev, struct rte_ml_op *op);
 
 #endif /* _CN10K_ML_OPS_H_ */
-- 
2.17.1