python - 将单词列表解析成树

Question

我有一个单词列表。例如：

reel
road
root
curd

我想以反映以下结构的方式存储这些数据：

Start -> r -> e -> reel
           -> o -> a -> road
                   o -> root
         c -> curd

对我来说很明显我需要实现一棵树。从这棵树中，我必须能够很容易地得到一个节点的高度、一个节点的后代数量、搜索一个节点等统计信息。添加节点应该“自动”将其添加到树中的正确位置，因为该位置是唯一的。

它还希望能够以实际图形树的形式可视化数据。由于树会很大，我需要在可视化上进行缩放/平移控件。当然，漂亮的可视化总是比丑陋的要好。

有谁知道可以让我简单地实现这一切的 Python 包？自己编写代码需要相当长的时间。你认为http://packages.python.org/ete2/适合这个任务吗？

我在 Python 2.x 上，顺便说一句。

我发现 NLTK 有一个 trie 类 - nltk.containers.trie。这对我来说很方便，因为我已经使用了 NLTK。有谁知道如何使用这个类？我在任何地方都找不到任何例子！例如，如何向树中添加单词？

score 4 · Accepted Answer

ETE2 is an environment for tree exploration, in principle made for browsing, building and exploring phylogenetic trees, and i've used it long time ago for these purposes. But its possible that if you set your data properly, you could get it done.

You just have to place paretheses wherever you need to split your tree and create a branch. See the following example, taken from ETE doc. If you change these "(A,B,(C,D));" for your words/letters it should be done.

from ete2 import Tree
unrooted_tree = Tree( "(A,B,(C,D));" )
print unrooted_tree

output:

     /-A
    |
----|--B
    |
    |     /-C
     \---|
          \-D

...and this package will let u do most of the operations you want, giving u the chance to select every branch individually, and operating with it in an easy way. I recommend u to give a look to the tutorial anyway, not pretty difficult :)

score 3 · Accepted Answer

我认为以下示例使用ETE工具包几乎可以满足您的需求。

from ete2 import Tree

words = [ "reel", "road", "root", "curd", "curl", "whatever","whenever", "wherever"]

#Creates a empty tree
tree = Tree()
tree.name = ""
# Lets keep tree structure indexed
name2node = {}
# Make sure there are no duplicates
words = set(words)
# Populate tree
for wd in words:
    # If no similar words exist, add it to the base of tree
    target = tree

    # Find relatives in the tree
    for pos in xrange(len(wd), -1, -1):
        root = wd[:pos]
        if root in name2node:
            target = name2node[root]
            break

    # Add new nodes as necessary
    fullname = root 
    for letter in wd[pos:]:
        fullname += letter 
        new_node = target.add_child(name=letter, dist=1.0)
        name2node[fullname] = new_node

        target = new_node

# Print structure
print tree.get_ascii()
# You can also use all the visualization machinery from ETE
# (http://packages.python.org/ete2/tutorial/tutorial_drawing.html)
# tree.show()

# You can find, isolate and operate with a specific node using the index
wh_node = name2node["whe"]
print wh_node.get_ascii()

# You can rebuild words under a given node
def recontruct_fullname(node):
    name = []
    while node.up:
        name.append(node.name)
        node = node.up
    name = ''.join(reversed(name))
    return name

for leaf in wh_node.iter_leaves():
    print recontruct_fullname(leaf)


                    /n-- /e-- /v-- /e-- /-r
               /e--|
     /w-- /h--|     \r-- /e-- /v-- /e-- /-r
    |         |
    |          \a-- /t-- /e-- /v-- /e-- /-r
    |
    |     /e-- /e-- /-l
----|-r--|
    |    |     /o-- /-t
    |     \o--|
    |          \a-- /-d
    |
    |               /-d
     \c-- /u-- /r--|
                    \-l

python - 将单词列表解析成树

2 回答 2

Related

Reference