From: Aaron Conole
To: jspewock@iol.unh.edu
Cc: ci@dpdk.org, alialnu@nvidia.com, probb@iol.unh.edu, Adam Hassick
Subject: Re: [PATCH 1/1] tools: add get_reruns script
Date: Thu, 07 Sep 2023 08:56:00 -0400
In-Reply-To: <20230905222317.25821-4-jspewock@iol.unh.edu> (jspewock@iol.unh.edu's message of "Tue, 5 Sep 2023 18:13:03 -0400")
References: <20230905222317.25821-2-jspewock@iol.unh.edu> <20230905222317.25821-4-jspewock@iol.unh.edu>

Hi Jeremy,

jspewock@iol.unh.edu writes:

> From: Jeremy Spewock
>
> This script is used to interact with the DPDK Patchwork API to collect a
> list of retests from comments on patches based on a desired list of
> contexts to retest. The script uses regex to scan all of the comments
> since a timestamp that is passed into the script through the CLI for
> any comment that is requesting a retest. These requests are then filtered
> based on the desired contexts that you pass into the script through the
> CLI and then aggregated based on the patch series ID of the series that
> the comment came from. This aggregated list is then outputted to a JSON
> file with a timestamp of the most recent comment on patchworks.
>
> Signed-off-by: Jeremy Spewock
> Signed-off-by: Adam Hassick
> ---

Thanks for the tool.
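Just to check that I'm reading the flow right: a comment containing a
line like

    Recheck-request: iol-unit-testing, iol-compile-testing

should end up aggregated in the output JSON roughly as below (the
series id and timestamp are made up, and the context names are only
illustrative):

    {
        "retests": {
            "12345": {
                "contexts": ["iol-unit-testing", "iol-compile-testing"]
            }
        },
        "last_comment_timestamp": "2023-09-07T12:00:00"
    }

If that matches your intent, great.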
> tools/get_reruns.py | 219 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 219 insertions(+)
> create mode 100755 tools/get_reruns.py
>
> diff --git a/tools/get_reruns.py b/tools/get_reruns.py
> new file mode 100755
> index 0000000..159ff6e
> --- /dev/null
> +++ b/tools/get_reruns.py
> @@ -0,0 +1,219 @@
> +#!/usr/bin/env python3
> +# -*- coding: utf-8 -*-
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(c) 2023 University of New Hampshire
> +
> +import argparse
> +import datetime
> +import json
> +import re
> +from json import JSONEncoder
> +from typing import Dict, List, Set, Optional
> +
> +import requests

I think this block should be cleaned up a bit.  The imports should be in
alphabetical order, and the block shouldn't have extra spaces.

> +
> +
> +class JSONSetEncoder(JSONEncoder):
> +    """Custom JSON encoder to handle sets.
> +
> +    Pythons json module cannot serialize sets so this custom encoder converts
> +    them into lists.
> +
> +    Args:
> +        JSONEncoder: JSON encoder from the json python module.
> +    """
> +
> +    def default(self, input_object):
> +        if isinstance(input_object, set):
> +            return list(input_object)
> +        return input_object
> +
> +
> +class RerunProcessor:
> +    """Class for finding reruns inside an email using the patchworks events
> +    API.
> +
> +    The idea of this class is to use regex to find certain patterns that
> +    represent desired contexts to rerun.
> +
> +    Arguments:
> +        desired_contexts: List of all contexts to search for in the bodies of
> +            the comments
> +        time_since: Get all comments since this timestamp
> +
> +    Attributes:
> +        collection_of_retests: A dictionary that maps patch series IDs to the
> +            set of contexts to be retested for that patch series.
> +        regex: regex used for collecting the contexts from the comment body.
> +        last_comment_timestamp: timestamp of the most recent comment that was
> +            processed
> +    """
> +
> +    _desired_contexts: List[str]
> +    _time_since: str
> +    collection_of_retests: Dict[str, Dict[str, Set]] = {}
> +    last_comment_timestamp: Optional[str] = None
> +    # ^ is start of line
> +    # ((?:[a-zA-Z-]+(?:, ?\n?)?)+) is a capture group that gets all test
> +    #   labels after "Recheck-request: "
> +    # (?:[a-zA-Z-]+(?:, ?\n?)?)+ means 1 or more of the first match group
> +    # [a-zA-Z0-9-_]+ means 1 more more of any character in the ranges a-z,
> +    #   A-Z, 0-9, or the characters '-' or '_'
> +    # (?:, ?\n?)? means 1 or none of this match group which expects
> +    #   exactly 1 comma followed by 1 or no spaces followed by
> +    #   1 or no newlines.

This comment might not be needed.  After all, we can see the regex group
and you are just documenting Python regex syntax.  Instead, maybe we
should just reiterate the understanding around recheck-request.  For
example, the comment we look for must appear at the start of a line, it
is a case-sensitive tag, and it takes a comma-separated list of
contexts.
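Something like the following is the behaviour I understand it to have
(untested, and the context names are made up):

    import re

    regex = r"^Recheck-request: ((?:[a-zA-Z0-9-_]+(?:, ?\n?)?)+)"
    body = ("Looks good otherwise.\n"
            "Recheck-request: iol-unit-testing, iol-compile-testing\n")

    # re.MULTILINE lets ^ match at the start of every line of the comment
    # body, so the tag only counts when it begins a line.
    print(re.findall(regex, body, re.MULTILINE))
    # -> ['iol-unit-testing, iol-compile-testing']

Which lines up with the examples you already list below: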
> +    # VALID MATCHES:
> +    # Recheck-request: iol-unit-testing, iol-something-else, iol-one-more,
> +    # Recheck-request: iol-unit-testing,iol-something-else, iol-one-more
> +    # Recheck-request: iol-unit-testing, iol-example, iol-another-example,
> +    #   more-intel-testing
> +    # INVALID MATCHES:
> +    # Recheck-request: iol-unit-testing, intel-example-testing
> +    # Recheck-request: iol-unit-testing iol-something-else,iol-one-more,
> +    # Recheck-request: iol-unit-testing,iol-something-else,iol-one-more,
> +    #
> +    #   more-intel-testing
> +    regex: str = "^Recheck-request: ((?:[a-zA-Z0-9-_]+(?:, ?\n?)?)+)"
> +
> +    def __init__(self, desired_contexts: List[str], time_since: str) -> None:
> +        self._desired_contexts = desired_contexts
> +        self._time_since = time_since
> +
> +    def process_reruns(self) -> None:
> +        patchwork_url = f"http://patches.dpdk.org/api/events/?since={self._time_since}"

On the off-chance this API URL ever changes, we should make this
configurable.
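Maybe just expose it as an option with the current value as the
default, roughly like this (the option and attribute names are only a
suggestion):

    parser.add_argument(
        "--patchwork-api-url",
        dest="patchwork_api_url",
        default="http://patches.dpdk.org/api",
        help="Base URL of the Patchwork API",
    )

and then pass it into RerunProcessor alongside the contexts and the
timestamp instead of hard-coding it here in process_reruns().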
> +        comment_request_info = []
> +        for item in [
> +            "&category=cover-comment-created",
> +            "&category=patch-comment-created",
> +        ]:
> +            response = requests.get(patchwork_url + item)
> +            response.raise_for_status()
> +            comment_request_info.extend(response.json())
> +        rerun_processor.process_comment_info(comment_request_info)
> +
> +    def process_comment_info(self, list_of_comment_blobs: List[Dict]) -> None:
> +        """Takes the list of json blobs of comment information and associates
> +        them with their patches.
> +
> +        Collects retest labels from a list of comments on patches represented
> +        in list_of_comment_blobs and creates a dictionary that associates them
> +        with their corresponding patch series ID. The labels that need to be
> +        retested are collected by passing the comments body into
> +        get_test_names() method. This method also updates the current UTC
> +        timestamp for the processor to the current time.
> +
> +        Args:
> +            list_of_comment_blobs: a list of JSON blobs that represent comment
> +                information
> +        """
> +
> +        list_of_comment_blobs = sorted(
> +            list_of_comment_blobs,
> +            key=lambda x: datetime.datetime.fromisoformat(x["date"]),
> +            reverse=True,
> +        )
> +
> +        if list_of_comment_blobs:
> +            most_recent_timestamp = datetime.datetime.fromisoformat(
> +                list_of_comment_blobs[0]["date"]
> +            )
> +            # exclude the most recent
> +            most_recent_timestamp = most_recent_timestamp + datetime.timedelta(
> +                microseconds=1
> +            )
> +            self.last_comment_timestamp = most_recent_timestamp.isoformat()
> +
> +        for comment in list_of_comment_blobs:
> +            # before we do any parsing we want to make sure that we are dealing
> +            # with a comment that is associated with a patch series
> +            payload_key = "cover"
> +            if comment["category"] == "patch-comment-created":
> +                payload_key = "patch"
> +            patch_series_arr = requests.get(
> +                comment["payload"][payload_key]["url"]
> +            ).json()["series"]
> +            if not patch_series_arr:
> +                continue
> +            patch_id = patch_series_arr[0]["id"]
> +
> +            comment_info = requests.get(comment["payload"]["comment"]["url"])
> +            comment_info.raise_for_status()
> +            content = comment_info.json()["content"]
> +
> +            labels_to_rerun = self.get_test_names(content)
> +
> +            # appending to the list if it already exists, or creating it if it
> +            # doesn't
> +            if labels_to_rerun:
> +                self.collection_of_retests[patch_id] = self.collection_of_retests.get(
> +                    patch_id, {"contexts": set()}
> +                )
> +                self.collection_of_retests[patch_id]["contexts"].update(labels_to_rerun)
> +
> +    def get_test_names(self, email_body: str) -> Set[str]:
> +        """Uses the regex in the class to get the information from the email.
> +
> +        When it gets the test names from the email, it will all be in one
> +        capture group. We expect a comma separated list of patchwork labels
> +        to be retested.
> +
> +        Returns:
> +            A set of contexts found in the email that match your list of
> +            desired contexts to capture. We use a set here to avoid duplicate
> +            contexts.
> +        """
> +        rerun_section = re.findall(self.regex, email_body, re.MULTILINE)
> +        if not rerun_section:
> +            return set()
> +        rerun_list = list(map(str.strip, rerun_section[0].split(",")))
> +        return set(filter(lambda x: x and x in self._desired_contexts, rerun_list))
> +
> +    def write_to_output_file(self, file_name: str) -> None:
> +        """Write class information to a JSON file.
> +
> +        Takes the collection_of_retests and last_comment_timestamp and outputs
> +        them into a json file.
> +
> +        Args:
> +            file_name: Name of the file to write the output to.
> +        """

Maybe it would also be friendly to support writing to stdout with a
filename like "-" so that we can use it in a script pipeline.
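Untested, but assuming an 'import sys' at the top of the file, the tail
of write_to_output_file() could look something like:

    if file_name == "-":
        json.dump(output_dict, sys.stdout, indent=4, cls=JSONSetEncoder)
    else:
        with open(file_name, "w") as file:
            json.dump(output_dict, file, indent=4, cls=JSONSetEncoder)

keeping the output_dict you build below unchanged.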
> +        output_dict = {
> +            "retests": self.collection_of_retests,
> +            "last_comment_timestamp": self.last_comment_timestamp,
> +        }
> +        with open(file_name, "w") as file:
> +            file.write(json.dumps(output_dict, indent=4, cls=JSONSetEncoder))
> +
> +
> +if __name__ == "__main__":
> +    parser = argparse.ArgumentParser(description="Help text for getting reruns")
> +    parser.add_argument(
> +        "-ts",
> +        "--time-since",
> +        dest="time_since",
> +        required=True,
> +        help="Get all patches since this many days ago (default: 5)",
> +    )
> +    parser.add_argument(
> +        "--contexts",
> +        dest="contexts_to_capture",
> +        nargs="*",
> +        required=True,
> +        help="List of patchwork contexts you would like to capture",
> +    )
> +    parser.add_argument(
> +        "-o",
> +        "--out-file",
> +        dest="out_file",
> +        help=(
> +            "Output file where the list of reruns and the timestamp of the"
> +            "last comment in the list of comments"
> +            "(default: rerun_requests.json)."
> +        ),
> +        default="rerun_requests.json",
> +    )
> +    args = parser.parse_args()
> +    rerun_processor = RerunProcessor(args.contexts_to_capture, args.time_since)
> +    rerun_processor.process_reruns()
> +    rerun_processor.write_to_output_file(args.out_file)
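For reference, this is roughly how I would expect to run it (contexts
picked arbitrarily, and assuming --time-since takes a timestamp, since
it is passed straight through to the events API's "since" parameter):

    ./tools/get_reruns.py \
        --time-since 2023-09-01T00:00:00 \
        --contexts iol-unit-testing iol-compile-testing \
        -o rerun_requests.json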