From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jplsek@iol.unh.edu>
Received: from mail-ot1-f67.google.com (mail-ot1-f67.google.com
 [209.85.210.67]) by dpdk.org (Postfix) with ESMTP id B6DFB1B4CF
 for <ci@dpdk.org>; Fri, 15 Feb 2019 22:29:08 +0100 (CET)
Received: by mail-ot1-f67.google.com with SMTP id m1so18977839otf.5
 for <ci@dpdk.org>; Fri, 15 Feb 2019 13:29:08 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=iol.unh.edu; s=unh-iol;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc; bh=Mc5oEl7rlPyAhFDfnHJOz4bDRK6LdsiJhQqxQsxzgdo=;
 b=FxXQNIqBYx6zdTiQ6I4MiXdvc/ep+XOuCH93HZRpKQiVAInTEVJvRrb8jaTlVGUhh7
 wd1+8fkQVH9KFCVbv0Qb1h1tbc89hb7HVWiDFttHK9Yh/Mp2KgcapF9cRC2igu4gBzkJ
 TudnXMxRjmyh0Go8/0eoS3sFviQvuHOi+Ce60=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=Mc5oEl7rlPyAhFDfnHJOz4bDRK6LdsiJhQqxQsxzgdo=;
 b=Ni+CfkM1TgBthaWbPPljxTHjLqqY0QshztEyY4lKAj1+aEUAeAYr+RGQ4HUVY8BCf8
 mualmyrFD+smdPJR6IesGK1nT06LESz/q8wbmw7yZ70O2WjJeIK+pG1lN82VvJiKOc8K
 RLir5demPjTPZaNQfZzD4aSLmlm0SVKsuinUhL9UJZgD3ITkuhNLAWjMQSWlN/j0U9QI
 hSAbOziEiiq3aJ1KEOASUIm3JgmCMRWVVWdx2X8Th3vAUx5fbHpNCbVUO5imMbUAOZQF
 Bx1JZN3gq1hTEu+BrqXpMkCVxi+b2dhcVRh5OKuzcT6pX+icQYDqHzuyGLTjLFN1DT4O
 Cm+g==
X-Gm-Message-State: AHQUAuZQRQNT3VeNnmwUhTSmLJ6XARRTMU+okQndKYU63z0eDWfO8nco
 0NuNpsD2T1/pkaaQQwUNmQC97ut5dcnMk1FfIPlsbA==
X-Google-Smtp-Source: AHgI3IbWDwdHk3yCke/2jmfW1xlxwDLiHaw4Snbvdsf1ldHF1mgGwB3h/nGtH56Q6gkYzX4AXaGWkRaw7NSJ9UCY+8w=
X-Received: by 2002:a9d:730b:: with SMTP id e11mr22924otk.277.1550266147853;
 Fri, 15 Feb 2019 13:29:07 -0800 (PST)
MIME-Version: 1.0
References: <20190212144828.18122-1-alialnu@mellanox.com>
In-Reply-To: <20190212144828.18122-1-alialnu@mellanox.com>
From: Jeremy Plsek <jplsek@iol.unh.edu>
Date: Fri, 15 Feb 2019 16:28:31 -0500
Message-ID: <CA+xUZB5yTd+b0jEBu1Xj329ogMw-iCd5bVvaseybeRsW0Hsntw@mail.gmail.com>
To: Ali Alnubani <alialnu@mellanox.com>
Cc: "ci@dpdk.org" <ci@dpdk.org>, Thomas Monjalon <thomas@monjalon.net>, 
 "ferruh.yigit@intel.com" <ferruh.yigit@intel.com>, Ori Kam <orika@mellanox.com>
Content-Type: text/plain; charset="UTF-8"
Subject: Re: [dpdk-ci] [PATCH v2] add script to decide best tree match for
	patches
X-BeenThere: ci@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK CI discussions <ci.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/ci>,
 <mailto:ci-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/ci/>
List-Post: <mailto:ci@dpdk.org>
List-Help: <mailto:ci-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/ci>,
 <mailto:ci-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Feb 2019 21:29:09 -0000

Thanks for the script! I tested it locally and seems to work.
Review inline. Extra comments at the end.

On Tue, Feb 12, 2019 at 9:48 AM Ali Alnubani <alialnu@mellanox.com> wrote:
>
> The script can be used to get the trees that best match
> a patch or a series.
>
> Signed-off-by: Ali Alnubani <alialnu@mellanox.com>
> Signed-off-by: Ori Kam <orika@mellanox.com>
> ---
> Changes in v2:
>         - Renamed script.
>         - Updated license.
>
>  tools/guess-git-tree.py | 244 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 244 insertions(+)
>  create mode 100755 tools/guess-git-tree.py
>
> diff --git a/tools/guess-git-tree.py b/tools/guess-git-tree.py
> new file mode 100755
> index 0000000..f4caef5
> --- /dev/null
> +++ b/tools/guess-git-tree.py
> @@ -0,0 +1,244 @@
> +#!/usr/bin/env python
> +
> +# SPDX-License-Identifier: (BSD-3-Clause AND GPL-2.0-or-later AND MIT)
> +# Copyright 2019 Mellanox Technologies, Ltd
> +
> +import os
> +import sys
> +import re
> +import argparse
> +import copy
> +import fnmatch
> +
> +from requests.exceptions import HTTPError
> +
> +from git_pw import config
> +from git_pw import api
> +from git_pw import utils
> +
> +"""
> +This script uses the git-pw API to retrieve Patchwork's series/patches,
> +and find a list of trees/repos that best match the series/patch.
> +
> +The rules on which matches are based, are taken from the MAINTAINERS file,
> +and currently only based on the paths of the changed files. Results can be
> +improved by adding more information to the MAINTAINERS file.
> +
> +TODO:
> +    - Match using the subject of the patch/series.
> +    - Add a configuration file to specify the priority of each tree.
> +
> +Configurations:
> +The script uses tokens for authentication.
> +If the arguments pw_{server,project,token} aren't passed, the environment
> +variables PW_{SERVER,PROJECT,TOKEN} should be set. If not, the script will try
> +to load the git configurations pw.{server,project,token}.
> +
> +Example usage:
> +    ./guess-git-tree.py --command list_trees_for_series 2054
> +    ./guess-git-tree.py --command list_trees_for_patch 2054
> +
> +The output will be a list of trees sorted based on number of matches,
> +with the first line having the highest count.
> +"""
> +
> +CONF = config.CONF
> +CONF.debug = False
> +
> +MAINTAINERS_FILE_PATH = os.environ.get('MAINTAINERS_FILE_PATH')
> +if not MAINTAINERS_FILE_PATH:
> +    print('MAINTAINERS_FILE_PATH is not set.')
> +    sys.exit(1)
> +RULES = {}
> +
> +ignored_files_re = re.compile(r'^doc/|\.sh$|\.py$')
> +
> +def configure_git_pw(args=None):
> +    """Configure git-pw."""
> +    conf = {}
> +    conf_keys = ['server', 'project', 'token']

See comment in the argument parser section.

> +    for key in conf_keys:
> +        value =  getattr(args, 'pw_{}'.format(key))

There are two spaces between = and getattr,

> +        if not value:
> +            print('--{} is a required git-pw configuration'.format(arg))

I think this was meant to be '--pw_{}' instead of '--{}'.

> +            sys.exit(1)
> +        else:
> +            setattr(CONF, key, value)
> +
> +def find_filenames(diff):
> +    """Find file changes in a given diff.
> +
> +    Source: https://github.com/getpatchwork/patchwork/blob/master/patchwork/parser.py
> +    Changes from source:
> +        - Moved _filename_re into the method.
> +        - Reduced newlines.
> +    """
> +    _filename_re = re.compile(r'^(---|\+\+\+) (\S+)')
> +    # normalise spaces
> +    diff = diff.replace('\r', '')
> +    diff = diff.strip() + '\n'
> +    filenames = {}
> +    for line in diff.split('\n'):
> +        if len(line) <= 0:
> +            continue
> +        filename_match = _filename_re.match(line)
> +        if not filename_match:
> +            continue
> +        filename = filename_match.group(2)
> +        if filename.startswith('/dev/null'):
> +            continue
> +        filename = '/'.join(filename.split('/')[1:])
> +        filenames[filename] = True
> +    filenames = sorted(filenames.keys())
> +    return filenames
> +
> +def construct_rules():
> +    """Build a dictionary of rules from the MAINTAINERS file."""
> +    with open(MAINTAINERS_FILE_PATH) as fd:
> +        maintainers = fd.read()
> +    # Split into blocks of text for easier search.
> +    maintainers = maintainers.split('\n\n')
> +
> +    # Extract blocks that have a tree and files.
> +    tree_file_blocks = [_item for _item in maintainers \
> +            if 'T: git://dpdk.org' in _item and 'F: ' in _item]
> +    _dict = {}
> +    for _item in tree_file_blocks:
> +        # Get the tree url.
> +        tree_match = re.search(r'T: (git://dpdk\.org[^\n]+)', _item)
> +        if tree_match:
> +            tree = tree_match.group(1)
> +        else:
> +            continue
> +        if tree not in _dict:
> +            _dict[tree] = {}
> +            _dict[tree]['paths'] = []
> +        paths = re.findall(r'F: ([^\n]+)',  _item)
> +        _paths = copy.deepcopy(paths)
> +        for path in paths:
> +            # Remove don't-care paths
> +            if ignored_files_re.search(path):
> +                _paths.remove(path)
> +        _dict[tree]['paths'] += _paths
> +    return _dict
> +
> +def get_subject(resource):
> +    """Get subject from patch/series object,
> +    remove its prefix and strip it.
> +    """
> +    name = resource['name']
> +    return re.sub('^\[.*\]', '', name).strip()
> +
> +def find_matches(files):
> +    """Find trees that the changed files in a patch match,
> +    and stop at first match for each file."""
> +    matches = []
> +    for _file in files:
> +        if ignored_files_re.search(_file):
> +            continue
> +        match_found = False
> +        for tree in RULES.keys():
> +            for rule in RULES[tree]['paths']:
> +                if rule.endswith('/'):
> +                    rule = '{}*'.format(rule)
> +                if fnmatch.fnmatch(_file, rule):
> +                    matches.append(tree)
> +                    match_found = True
> +                    break
> +            if match_found:
> +                break
> +    return matches
> +
> +def get_ordered_matches(matches):
> +    """Order matches by occurrences."""
> +    match_counts = {item:matches.count(item) for item in matches}
> +    return sorted(match_counts, key=match_counts.get, reverse=True)
> +
> +def list_trees_for_patch(patch):
> +    """Find matching trees for a specific patch.
> +    For a patch to match a tree, both its subject and
> +    at least one changed path has to match the tree.
> +    """
> +    subject = get_subject(patch)

subject is not used and can be removed.

> +    files = find_filenames(patch['diff'])
> +
> +    matches = find_matches(files)
> +    return matches
> +
> +def list_trees_for_series(series):
> +    """Find matching trees for a series."""
> +    patch_list = series['patches']
> +
> +    matches = []
> +
> +    for patch in patch_list:
> +        matches = matches + \
> +                list_trees_for_patch(api_get('patches', patch['id']))
> +
> +    return matches
> +
> +def parse_args():
> +    """Parse command-line arguments."""
> +    parser = argparse.ArgumentParser()
> +    git_pw_conf_parser = parser.add_argument_group('git-pw configurations')
> +    options_parser = parser.add_argument_group('optional arguments')
> +
> +    options_parser.add_argument('--command',
> +            choices=('list_trees_for_patch',
> +                     'list_trees_for_series'),
> +            required=True, help='command to perform on patch/series')
> +
> +    git_pw_conf_parser.add_argument('--pw_server', type=str,
> +            default=os.environ.get('PW_SERVER', utils.git_config('pw.server')),
> +            help='PW.SERVER')
> +    git_pw_conf_parser.add_argument('--pw_project', type=str,
> +            default=os.environ.get('PW_PROJECT', utils.git_config('pw.project')),
> +            help='PW.PROJECT')
> +    git_pw_conf_parser.add_argument('--pw_token', type=str,
> +            default=os.environ.get('PW_TOKEN', utils.git_config('pw.token')),
> +            help='PW.TOKEN')

Depending on the action taken from
https://github.com/getpatchwork/git-pw/pull/37 this argument may be
removed, since the API calls don't require authentication. Then
"token" can be removed from conf_keys. You'll also probably need to
add and set CONF.token/username/password to None, similar to the
CONF.debug = False line. But since authentication in the python
library is required now, this part can be done later.

> +
> +    parser.add_argument('id', type=int,
> +            help='patch/series id')
> +
> +    args = parser.parse_args()
> +
> +    return args
> +
> +def main():
> +    """Main procedure."""
> +    args = parse_args()
> +    configure_git_pw(args)
> +
> +    command = args.command
> +    _id = args.id
> +
> +    global RULES
> +    RULES = construct_rules()
> +
> +    tree_list = []
> +
> +    if command == 'list_trees_for_patch':
> +        patch = api_get('patches', _id)
> +        tree_list = list_trees_for_patch(patch)
> +
> +    elif command == 'list_trees_for_series':
> +        series = api_get('series', _id)
> +        tree_list = list_trees_for_series(series)
> +
> +    tree_list = get_ordered_matches(tree_list)
> +
> +    print('{}'.format('\n'.join(tree_list)))
> +
> +def api_get(resource_type, resource_id):
> +    """Retrieve an API resource."""
> +    try:
> +        return api.detail(resource_type, resource_id)
> +    except HTTPError as err:
> +        if '404' in str(err):
> +            sys.exit(1)
> +        else:
> +            raise
> +
> +if __name__ == '__main__':
> +    main()
> --
> 2.11.0
>

There should also be two blank lines above/under method definitions
outside of classes to follow pep8 more closely.

We have a python script that we use to apply patches. With this
current script, it's a little bit awkward to use inside our apply
script.
Right now it would look something like this inside of our apply script:
(ggt = guess-git-tree)
ggt.configure_git_pw(args)
ggt.RULES = construct_rules()
series = ggt.api_get('series', id)
tree_list = list_trees_for_series(series)
tree_list = get_ordered_matches(tree_list)
# now attempt to apply patch to each tree until successful

I'd rather see this implemented in a class, so that things like the
construct_rules() and configure_git_pw() happens in the constructor,
then I'd like to call something like get_ordered_tree_for_series(id)
to get the list. The constructor arguments would probably contain the
server address and project. (Or you could make the server address also
an environment variable like PATCHWORK_API_URL =
os.getenv("PATCHWORK_API_URL", "http://patches.dpdk.org/api/1.0/"),
which is what we do in our scripts.)

As a note, this my first in-email review. If I was supposed to do
something different, please let me know!

Thanks
-- 
Jeremy Plsek
UNH InterOperability Laboratory