python - How do I generate sentences from an induced grammar using NLTK?

Question

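I have a (large) list of parsed sentences (parsed with the Stanford parser). For example, the sentence "Now you can be entertained" has the following tree: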

(ROOT
  (S
    (ADVP (RB Now))
    (, ,)
    (NP (PRP you))
    (VP (MD can)
      (VP (VB be)
        (VP (VBN entertained))))
    (. .)))

I am using a set of sentence trees to induce a grammar with nltk:

import nltk

# ... for each sentence tree t, add its production to allProductions
allProductions += t.productions()

# Induce the grammar
S = nltk.Nonterminal('S')
grammar = nltk.induce_pcfg(S, allProductions)

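Now I would like to use grammar to generate new, random sentences. My hope is that, since the grammar was learned from a specific set of input examples, the generated sentences will be semantically similar. Can I do this in nltk?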

If I can't use nltk to do this, are there any other tools that can take the (possibly reformatted) grammar and generate sentences?

score 14 · Accepted Answer

In NLTK 2.0 you can use nltk.parse.generate to generate all possible sentences for a given grammar.

The code below defines a function that should generate a single sentence from the production rules of a (P)CFG.

# This example uses choice to pick one of the possible expansions
from random import choice
from nltk import Nonterminal

# This function is loosely based on _generate_all() in nltk.parse.generate
def generate_sample(grammar, items=None):
    # Start from the grammar's start symbol unless told otherwise
    if items is None:
        items = [grammar.start()]
    frags = []
    for item in items:
        if isinstance(item, Nonterminal):
            # This is where we make our change: expand only one
            # randomly chosen production instead of all of them
            chosen_expansion = choice(grammar.productions(lhs=item))
            frags.extend(generate_sample(grammar, chosen_expansion.rhs()))
        else:
            # Terminal symbol: emit the word itself
            frags.append(item)
    return frags

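To make use of the weights in a PCFG, you will obviously want a better sampling method than choice(), which implicitly assumes that all expansions of the current node are equally probable.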
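One way to respect the PCFG weights is to draw each expansion with probability proportional to its rule probability. The sketch below is only an illustration: it assumes the grammar offers start() and productions(lhs=...) and that each production offers rhs() and prob(), as NLTK's PCFG and ProbabilisticProduction do, and it assumes terminals are plain Python strings. The names weighted_production and sample_sentence are made up for this example.

```python
import random


def weighted_production(productions, rng=random):
    """Pick one production with probability proportional to prod.prob()."""
    weights = [prod.prob() for prod in productions]
    return rng.choices(productions, weights=weights, k=1)[0]


def sample_sentence(grammar, symbol=None, rng=random):
    """Expand `symbol` top-down, sampling every expansion by its weight.

    Assumption: terminals are plain strings, anything else is a nonterminal.
    """
    if symbol is None:
        symbol = grammar.start()
    if isinstance(symbol, str):
        # Terminal: nothing left to expand
        return [symbol]
    production = weighted_production(grammar.productions(lhs=symbol), rng)
    words = []
    for child in production.rhs():
        words.extend(sample_sentence(grammar, child, rng))
    return words
```

With a grammar induced as in the question, ' '.join(sample_sentence(grammar)) would give one sampled sentence. A recursive grammar can still produce very deep derivations, so this sketch may hit Python's recursion limit on unlucky draws.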

score 4 · Accepted Answer

First of all, if you generate random sentences, they may be syntactically correct, but they will probably not make much sense.

(It sounds to me a bit like what the MIT students did with their SCIgen program, which automatically generates scientific papers. Very interesting, by the way.)

Anyway, I have never done this myself, but it seems possible using nltk.bigrams; have a look at "Generating Random Text with Bigrams".
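To make the bigram idea concrete, here is a rough sketch, not the code from the referenced section. It uses only the standard library; zip(words, words[1:]) stands in for the word pairs that nltk.bigrams(words) would yield, and the function names are invented for this example.

```python
import random
from collections import defaultdict


def build_bigram_table(words):
    """Map each word to the list of words observed right after it."""
    table = defaultdict(list)
    for first, second in zip(words, words[1:]):
        table[first].append(second)
    return table


def generate_text(table, seed_word, length=10, rng=random):
    """Walk the table from seed_word, picking a random observed follower."""
    word = seed_word
    output = [word]
    for _ in range(length - 1):
        followers = table.get(word)
        if not followers:
            # Dead end: this word was never followed by anything
            break
        # Repeated followers appear several times in the list, so
        # frequent continuations are picked more often
        word = rng.choice(followers)
        output.append(word)
    return output
```

Unlike the grammar-based approaches, this only looks one word back, so the output tends to be locally plausible but has no guaranteed sentence structure.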

You could also generate all subtrees of the current tree, though I am not sure whether that is what you want.

score 3 · Accepted Answer

My solution for generating a random sentence from an existing nltk.CFG grammar:

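import random

def generate_sample(grammar, prod, frags):
    if prod in grammar._lhs_index:  # nonterminal: pick one derivation at random
        derivations = grammar._lhs_index[prod]
        derivation = random.choice(derivations)
        for d in derivation._rhs:
            generate_sample(grammar, d, frags)
    else:
        # terminal (grammar._rhs_index is keyed only by the first right-hand-side
        # symbol, so testing it would skip terminals in later positions)
        frags.append(str(prod))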

It can now be used like this:

frags = []
generate_sample(grammar, grammar.start(), frags)
print(' '.join(frags))
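Grammars induced from treebanks are usually recursive (for example a VP that expands to another VP), so purely random expansion can occasionally run very deep. The following is a hedged sketch of one possible guard, not part of the answer above: it caps the derivation depth and retries when the cap is hit. It assumes terminals are plain strings and that the grammar offers start() and productions(lhs=...) as NLTK's CFG does; DepthExceeded, generate_bounded and generate_with_retries are hypothetical names.

```python
import random


class DepthExceeded(Exception):
    """Raised when a derivation goes deeper than the allowed limit."""


def generate_bounded(grammar, symbol, max_depth, rng=random):
    """Randomly expand `symbol`, giving up once max_depth is used up."""
    if isinstance(symbol, str):
        # Assumption: terminals are plain strings
        return [symbol]
    if max_depth <= 0:
        raise DepthExceeded(symbol)
    # Raises IndexError if a nonterminal has no productions at all
    production = rng.choice(grammar.productions(lhs=symbol))
    words = []
    for child in production.rhs():
        words.extend(generate_bounded(grammar, child, max_depth - 1, rng))
    return words


def generate_with_retries(grammar, max_depth=20, attempts=100, rng=random):
    """Try several random derivations; return None if all run too deep."""
    for _ in range(attempts):
        try:
            return generate_bounded(grammar, grammar.start(), max_depth, rng)
        except DepthExceeded:
            continue
    return None
```

Rejecting over-deep derivations skews the samples toward shorter sentences, which may or may not matter for your use case.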

score 2 · Accepted Answer

With an nltk Text object you can call generate() on it, which will "print random text, generated using a trigram language model". http://nltk.org/_modules/nltk/text.html

score 1 · Accepted Answer

Inspired by the above, here is a version that uses iteration instead of recursion.

import random

def rewrite_at(index, replacements, the_list):
    del the_list[index]
    the_list[index:index] = replacements

def generate_sentence(grammar):
    sentence_list = [grammar.start()]
    all_terminals = False
    while not all_terminals:
        all_terminals = True
        for position, symbol in enumerate(sentence_list):
            if symbol in grammar._lhs_index:
                all_terminals = False
                derivations = grammar._lhs_index[symbol]
                derivation = random.choice(derivations) # or weighted_choice(derivations) if you have a function for that
                rewrite_at(position, derivation.rhs(), sentence_list)
    return sentence_list

Or, if you want a derivation tree, use this one:

from nltk.tree import Tree

def tree_from_production(production):
    return Tree(production.lhs(), production.rhs())

def leaf_positions(the_tree):
    return [the_tree.leaf_treeposition(i) for i in range(len(the_tree.leaves()))]

def generate_tree(grammar):
    initial_derivations = grammar._lhs_index[grammar.start()]
    initial_derivation = random.choice(initial_derivations) # or weighted_choice if you have that function
    running_tree = tree_from_production(initial_derivation)
    all_terminals = False
    while not all_terminals:
        all_terminals = True
        for position in leaf_positions(running_tree):
            node_label = running_tree[position]
            if node_label in grammar._lhs_index:
                all_terminals = False
                derivations = grammar._lhs_index[node_label]
                derivation = random.choice(derivations) # or weighted_choice if you have that function
                running_tree[position] = tree_from_production(derivation)
    return running_tree

Here is a weighted_choice function for NLTK PCFG production rules, to be used with the above. It is adapted from Ned Batchelder's answer on general weighted-choice functions:

def weighted_choice(productions):
    prods_with_probs = [(prod, prod.prob()) for prod in productions]
    total = sum(prob for prod, prob in prods_with_probs)
    r = random.uniform(0, total)
    upto = 0
    for prod, prob in prods_with_probs:
        if upto + prob >= r:
            return prod
        upto += prob
    assert False, "Shouldn't get here"
