python - parse maven dependency tree

Question

I want to take a Maven dependency tree as input and parse it to determine the groupId, artifactId, and version of each dependency, along with its children (if any) and their groupId, artifactId, and version, and so on recursively. I'm not sure whether it makes the most sense to parse the mvn dependency tree and store the information as a nested dictionary before preparing the data for neo4j.

I'm also unsure of the best way to parse the entire mvn dependency tree. The code below is the furthest I've gotten: it strips the unnecessary prefix at the front and tries to label each entry as a child or a parent.

tree= 
[INFO] +- org.antlr:antlr4:jar:4.7.1:compile
[INFO] |  +- org.antlr:antlr4-runtime:jar:4.7.1:compile
[INFO] |  +- org.antlr:antlr-runtime:jar:3.5.2:compile
[INFO] |  \- com.ibm.icu:icu4j:jar:58.2:compile
[INFO] +- commons-io:commons-io:jar:1.3.2:compile
[INFO] +- brs:dxprog-lang:jar:3.3-SNAPSHOT:compile
[INFO] |  +- brs:libutil:jar:2.51:compile
[INFO] |  |  +- commons-collections:commons-collections:jar:3.2.2:compile
[INFO] |  |  +- org.apache.commons:commons-collections4:jar:4.1:compile
[INFO] |  |  |  +- com.fasterxml.jackson.core:jackson-annotations:jar:2.9.0:compile
[INFO] |  |  |  \- com.fasterxml.jackson.core:jackson-core:jar:2.9.5:compile
.
.
.


fileObj = open("tree", "r")

for line in fileObj.readlines():
    for word in line.split():
        if "[INFO]" in line.split():
            line = line.replace(line.split().__getitem__(0), "")
            print(line)

            if "|" in line.split():
                line = line.replace(line.split().__getitem__(0), "child")
                print(line)

                if "+-" in line.split() and "|" not in line.split():
                    line = line.replace(line.split().__getitem__(0), "")
                    line = line.replace(line.split().__getitem__(0), "parent")
                    print(line, '\n\n')

Output:

 |  |  \- com.google.protobuf:protobuf-java:jar:3.5.1:compile

 child  child  \- com.google.protobuf:protobuf-java:jar:3.5.1:compile

 |  +- com.h2database:h2:jar:1.4.195:compile

 child  +- com.h2database:h2:jar:1.4.195:compile

   parent com.h2database:h2:jar:1.4.195:compile

I would appreciate any insight on the best way to parse & return data in an organized way given that I'm relatively unfamiliar with the capabilities of python. Thank you in advance!

score 7 · Accepted Answer

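I don't know how much programming experience you have, but this is not a trivial task.

First, notice that the nesting level of each dependency is indicated by the | symbols. The simplest thing you can do is build a stack that stores the dependency path from the root to the child, grandchild, and so on: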

def build_stack(text):
    stack = []
    for line in text.split("\n"):
        line = line.strip()  # tolerate stray indentation and blank lines
        if not line:
            continue

        line = line[7:]  # remove the "[INFO] " prefix
        level = line.count("|")  # one "|" per ancestor level
        name = line.split("-", 1)[1].strip()  # the part after the first "-"
        stack = stack[:level] + [name]  # update the stack: everything up to level-1, then name
        yield stack[:level], name  # this is a generator: (ancestors, name)

# DATA is the dependency tree text from the question
for bottom_stack, name in build_stack(DATA):
    print(bottom_stack + [name])

Output:

['org.antlr:antlr4:jar:4.7.1:compile']
['org.antlr:antlr4:jar:4.7.1:compile', 'org.antlr:antlr4-runtime:jar:4.7.1:compile']
['org.antlr:antlr4:jar:4.7.1:compile', 'org.antlr:antlr-runtime:jar:3.5.2:compile']
['org.antlr:antlr4:jar:4.7.1:compile', 'com.ibm.icu:icu4j:jar:58.2:compile']
['commons-io:commons-io:jar:1.3.2:compile']
['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile']
['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile', 'brs:libutil:jar:2.51:compile']
['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile', 'brs:libutil:jar:2.51:compile', 'commons-collections:commons-collections:jar:3.2.2:compile']
['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile', 'brs:libutil:jar:2.51:compile', 'org.apache.commons:commons-collections4:jar:4.1:compile']
['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile', 'brs:libutil:jar:2.51:compile', 'org.apache.commons:commons-collections4:jar:4.1:compile', 'com.fasterxml.jackson.core:jackson-annotations:jar:2.9.0:compile']
['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile', 'brs:libutil:jar:2.51:compile', 'org.apache.commons:commons-collections4:jar:4.1:compile', 'com.fasterxml.jackson.core:jackson-core:jar:2.9.5:compile']
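Counting | characters works for the sample in the question, but Maven draws blank columns instead of | underneath the last child of a node, so a deeper line under a \- entry would be reported at too shallow a level. A minimal sketch of an alternative depth rule follows. It assumes every nesting level is exactly three characters wide and is followed by a "+- " or "\- " marker; parse_line and MARKER are hypothetical names, not part of the answer's code.

```python
import re

# Assumption: each nesting level is drawn with three characters ("|  " or
# three spaces), followed by one of the two branch markers.
MARKER = re.compile(r"(\+- |\\- )")

def parse_line(line):
    """Return (depth, coordinate) for a dependency line, or None otherwise."""
    line = line.rstrip()
    prefix = "[INFO] "
    pos = line.find(prefix)
    if pos != -1:
        # Drop the log prefix and anything before it (e.g. stray indentation).
        line = line[pos + len(prefix):]
    match = MARKER.search(line)
    if match is None:
        return None  # not a dependency line (plugin banner, blank line, ...)
    depth = match.start() // 3  # marker offset divided by the column width
    return depth, line[match.end():].strip()
```

The returned depth can be used in place of level in build_stack; lines that yield None can simply be skipped.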

Second, you can use this stack to build a tree out of nested dictionaries:

def create_tree(text):
    tree = {}
    for stack, name in build_stack(text):
        temp = tree
        for n in stack: # find or create...
            temp = temp.setdefault(n, {}) # ...the most inner dict
        temp[name] = {}
    return tree

from pprint import pprint
pprint(create_tree(DATA))

Output:

{'brs:dxprog-lang:jar:3.3-SNAPSHOT:compile': {'brs:libutil:jar:2.51:compile': {'commons-collections:commons-collections:jar:3.2.2:compile': {},
                                                                               'org.apache.commons:commons-collections4:jar:4.1:compile': {'com.fasterxml.jackson.core:jackson-annotations:jar:2.9.0:compile': {},
                                                                                                                                           'com.fasterxml.jackson.core:jackson-core:jar:2.9.5:compile': {}}}},
 'commons-io:commons-io:jar:1.3.2:compile': {},
 'org.antlr:antlr4:jar:4.7.1:compile': {'com.ibm.icu:icu4j:jar:58.2:compile': {},
                                        'org.antlr:antlr-runtime:jar:3.5.2:compile': {},
                                        'org.antlr:antlr4-runtime:jar:4.7.1:compile': {}}}

An empty dictionary represents a leaf of the tree.
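Since the question mentions preparing the data for neo4j, it may help to flatten the nested dictionary into (parent, child) pairs, which map naturally onto graph relationships. The sketch below only walks the dictionary and makes no neo4j calls; iter_edges is a hypothetical helper name, and top-level dependencies are given a parent of None as an assumption.

```python
def iter_edges(tree, parent=None):
    """Yield (parent, child) pairs from a nested dict of dependencies."""
    for name, subtree in tree.items():
        yield parent, name  # edge from the enclosing dependency to this one
        yield from iter_edges(subtree, name)  # recurse into the children
```

Each distinct name in the pairs would become a node, and each pair with a non-None parent a relationship.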

Third, you need to format the tree, i.e. (1) extract the data and (2) group the children into lists. This is a simple tree traversal (a DFS here):

def format(tree):
    L = []
    for name, subtree in tree.items():
        group, artifact, packaging, version, scope = name.split(":")
        d = {"artifact":artifact} # you can add group, ...
        if subtree: # children are present
            d["children"] = format(subtree)
        L.append(d)
    return L

pprint(format(create_tree(DATA)))

Output:

[{'artifact': 'antlr4',
  'children': [{'artifact': 'antlr4-runtime'},
               {'artifact': 'antlr-runtime'},
               {'artifact': 'icu4j'}]},
 {'artifact': 'commons-io'},
 {'artifact': 'dxprog-lang',
  'children': [{'artifact': 'libutil',
                'children': [{'artifact': 'commons-collections'},
                             {'artifact': 'commons-collections4',
                              'children': [{'artifact': 'jackson-annotations'},
                                           {'artifact': 'jackson-core'}]}]}]}]

You can probably combine these steps.
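One way the three steps might be folded into a single pass is sketched below. It keeps the same rules as build_stack (strip the [INFO] prefix, count | for the level, take the text after the branch marker), so it shares their limits; parse_tree is a hypothetical name, and lines that do not split into five fields are skipped as an assumption.

```python
def parse_tree(text):
    """Build the nested list of {'artifact', 'children'} dicts in one pass."""
    roots = []
    path = []  # path[d] is the most recent node seen at depth d
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("[INFO]") or "- " not in line:
            continue
        line = line[len("[INFO]"):]
        level = line.count("|")  # same depth rule as build_stack
        parts = line.split("- ", 1)[1].strip().split(":")
        if len(parts) != 5:
            continue  # not a group:artifact:type:version:scope line
        group, artifact, packaging, version, scope = parts
        node = {"artifact": artifact}  # group, version, ... can be added here
        if level == 0:
            roots.append(node)
        else:
            # Attach to the latest node one level up, creating its list lazily.
            path[level - 1].setdefault("children", []).append(node)
        path[level:] = [node]  # forget deeper nodes from the previous branch
    return roots
```

Called on the tree text from the question, this is intended to return the same structure as the formatting step applied to create_tree's result.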
