python - 在python中生成有向词图

Question

从句子列表中，我想根据以下属性生成有向图以生成子句：

假设有三个句子：

我爱糖果
我爱你
我非常喜欢糖果

每次对出现时，该图将在每个连续单词之间具有权重 1 的边。

然后我找出图中权重最大的路径。这里它返回“我非常喜欢糖果”，重量为 3+2+1+1

str1 = """Man who run in front of car, get tired.
Man who run behind car, get exhausted."""
# create a list of words separated at whitespaces
wordList1 = str1.split(None)
# strip any punctuation marks and build modified word list
# start with an empty list
wordList2 = []
for word1 in wordList1:
# last character of each word
lastchar = word1[-1:]
# use a list of punctuation marks
if lastchar in [",", ".", "!", "?", ";"]:
    word2 = word1.rstrip(lastchar)
else:
    word2 = word1
# build a wordList of lower case modified words
wordList2.append(word2.lower())

现在wordList2有一个连续单词的列表。如何将其转换为图表。我正在使用 networkx 库，但对它不太熟悉。

如何继续制作边缘加权图？

score 8 · Accepted Answer

这是使用networkX解决您的问题的方法。

此代码将生成一个有向图 dG。dG 每个单词都有一个节点，带有一个“count”属性，表示该单词被看到了多少次。每个二元组将有一个有向边，并带有一个“权重”属性，指示二元组出现了多少次。

import networkx as nx
import string
from sys import maxint

str1 = """Man who run in front of car, get tired.
Man who run behind car, get exhausted."""

wordList1 = str1.split(None)

wordList2 = [string.rstrip(x.lower(), ',.!?;') for x in wordList1]

dG = nx.DiGraph()

for i, word in enumerate(wordList2):
    try:
        next_word = wordList2[i + 1]
        if not dG.has_node(word):
            dG.add_node(word)
            dG.node[word]['count'] = 1
        else:
            dG.node[word]['count'] += 1
        if not dG.has_node(next_word):
            dG.add_node(next_word)
            dG.node[next_word]['count'] = 0

        if not dG.has_edge(word, next_word):
            dG.add_edge(word, next_word, weight=maxint - 1)
        else:
            dG.edge[word][next_word]['weight'] -= 1
    except IndexError:
        if not dG.has_node(word):
            dG.add_node(word)
            dG.node[word]['count'] = 1
        else:
            dG.node[word]['count'] += 1
    except:
        raise

要查看图表的内容，您可以打印节点和边。

for node in dG.nodes():
    print '%s:%d\n' % (node, dG.node[node]['count'])

for edge in dG.edges():
    print '%s:%d\n' % (edge, maxint - dG.edge[edge[0]][edge[1]]['weight'])

请注意，二元边权重不是从零开始计数，而是从 maxint 开始向下计数。这是因为 networkX 的最短路径实用程序将使用权重值作为每条边的成本。通过使我们的最高计数成为我们最小的权重，我们可以使用这些实用程序来找到具有最高边缘计数的路径。

例如，我们可以得到单词“man”和“exhausted”之间计数最多的路径：

>>>shortest_path = nx.shortest_path(dG, source='man', target='exhausted', weight='weight')
>>>print shortest_path
['man', 'who', 'run', 'behind', 'car', 'get', 'exhausted']

或者，我们可以得到单词“man”和所有其他单词之间计数最高的路径：

>>>shortest_paths = nx.shortest_path(dG, source='man', weight='weight')
>>>print shortest_paths
{'run': ['man', 'who', 'run'], 
'get': ['man', 'who', 'run', 'behind', 'car', 'get'], 
'exhausted': ['man', 'who', 'run', 'behind', 'car', 'get', 'exhausted'], 
'car': ['man', 'who', 'run', 'behind', 'car'], 
'who': ['man', 'who'], 
'behind': ['man', 'who', 'run', 'behind'], 
'of': ['man', 'who', 'run', 'in', 'front', 'of'], 
'in': ['man', 'who', 'run', 'in'], 
'front': ['man', 'who', 'run', 'in', 'front'], 
'tired': ['man', 'who', 'run', 'behind', 'car', 'get', 'tired'], 
'man': ['man']}

如上所述，有可能在这样的图中得到循环，我不确定 networkx 最短路径算法将如何处理这种情况。

祝你好运！

score 1 · Accepted Answer

使用字典存储二元组，每次遇到二元组时将值加 1。获取字典值的最大值以生成您的起点，然后递归调用一个函数，该函数获取从前一个二元组中的最后一个单词开始的具有最高值的二元组，直到不再存在以最后一个单词开头的二元组。警惕圆形图的可能性，例如I love that I love that I love无限（设置递归限制）。

我从未专门与networkx合作过，但浏览主页后，上述内容仍然适用。

python - 在python中生成有向词图

2 回答 2

Related

Reference