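I am trying to process various texts with Python's regular expressions and NLTK, following the book at http://www.nltk.org/book. I am trying to create a random text generator and have run into a small problem. First, here is the flow of my code: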

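  1. Take a sentence as input (this is called the trigger string and is assigned to a variable).

  2. Get the longest word in the trigger string.

  3. Search the whole Project Gutenberg database for sentences that contain this word, regardless of upper or lower case.

  4. Return the longest sentence that contains the word from step 3.

  5. Append the sentences from step 1 and step 4 together.

  6. Assign the sentence from step 4 as the new "trigger" sentence and repeat the process. Note that I then have to find the longest word in the second sentence and continue like that, and so on.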
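
Steps 2 to 4 can be sketched on a small in-memory corpus, without loading NLTK. This is only an illustration written for Python 3: the names `TOY_SENTS`, `longest_word` and `longest_sentence_containing` are made up, the toy sentences merely stand in for `gutenberg.sents()` (which yields each sentence as a list of tokens), and returning `None` when nothing matches is an assumption.

```python
# Toy stand-in for gutenberg.sents(): each sentence is a list of tokens.
TOY_SENTS = [
    ["The", "Thane", "of", "Cawdor", "began", "a", "dismall", "Conflict"],
    ["Hail", ",", "thane", "!"],
    ["A", "sentence", "without", "the", "word"],
]

def longest_word(sentence):
    """Step 2: the first longest whitespace-separated word of a string."""
    return max(sentence.split(), key=len)

def longest_sentence_containing(word, sents):
    """Steps 3-4: longest sentence (by token count) containing word, ignoring case."""
    target = word.lower()
    matches = [s for s in sents if any(target == tok.lower() for tok in s)]
    if not matches:
        return None  # assumption: "no match" is signalled with None
    return max(matches, key=len)

trigger = "Thane of code"
word = longest_word(trigger)
found = longest_sentence_containing(word, TOY_SENTS)
print(word)
print(" ".join(found))
```

This covers a single pass only; repeating it with the found sentence as the new trigger is step 6.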

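So far I have only been able to do this once. When I try to keep it going, the program just keeps printing the first sentence that my search produced. It should instead look for the longest word in this new sentence and keep applying the code flow described above.

Below is my code, along with a sample input/output: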

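Sample input

"Thane of code"

Sample output

"Thane of code Norway himselfe , with terrible numbers , Assisted by that most disloyall Traytor , The Thane of Cawdor , began a dismall Conflict , Till that Bellona ' s Bridegroome , lapt in proofe , Confronted him with selfe - comparisons , Point against Point , rebellious Arme ' gainst Arme , Curbing his lauish spirit : and to conclude , The Victorie fell on vs"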

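Now, this should actually take the sentence that starts with "Norway himselfe ...", look for the longest word in it, carry out the steps above, and so on, but it doesn't. Any suggestions? Thanks.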

import nltk

from nltk.corpus import gutenberg

triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str

split_str = triggerSentence.split()#split the sentence into words

longestLength = 0

longestString = ""

montyPython = 1

while montyPython:

    #code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)


    listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-

    listOfWords = gutenberg.words()# all words in gutenberg books -list format-
    # I tip my hat to Mr.Alex Martelli for this part, which helps me find the longest sentence
    lt = longestString.lower() #this line tells you whether word list has the longest word in a case-insensitive way. 

    longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key = len)
    #get longest sentence -list format with every word of sentence being an actual element-

    longestSent=[longestSentence]

    for word in longestSent:#convert the list longestSentence to an actual string
        sstr = " ".join(word)
    print triggerSentence + " "+ sstr
    triggerSentence = sstr

4 Answers


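How about this?

  1. You find the longest word in the trigger.
  2. You find the longest word in the longest sentence containing the word found in 1.
  3. The word from 1 is the longest word of the sentence from 2.

What happens? Hint: the answer starts with "Infinite". To correct the problem, you may find a set of lower-cased words useful.

By the way, when do you think montyPython becomes False and the program finishes?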
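
One way to act on that hint, sketched here on a toy corpus rather than on Gutenberg: remember every lower-cased word already used as a search key, pick the longest word that is not yet in that set, and stop when no unused word is left, nothing matches, or the search leads back to the current sentence. The names `TOY_SENTS` and `generate`, the `max_steps` cap and the stopping rules are illustrative assumptions, not part of the original code (written for Python 3).

```python
# Toy stand-in for gutenberg.sents(): each sentence is a list of tokens.
TOY_SENTS = [
    ["thane", "of", "cawdor", "met", "bridegroome"],
    ["bridegroome", "began", "a", "dismall", "conflict", "today"],
    ["the", "conflict", "ends", "in", "a", "glorious", "victorie", "."],
]

def generate(trigger_words, sents, max_steps=10):
    """Chain sentences, never reusing a search word (compared in lower case)."""
    seen = set()   # lower-cased words already used as search keys
    chain = []     # (search word, sentence found) pairs
    current = [w.lower() for w in trigger_words]
    for _ in range(max_steps):
        candidates = [w for w in current if w not in seen]
        if not candidates:
            break  # every word of the current sentence has been used
        target = max(candidates, key=len)
        seen.add(target)
        matches = [s for s in sents if target in [t.lower() for t in s]]
        if not matches:
            break  # the word does not occur in the corpus
        best = [t.lower() for t in max(matches, key=len)]
        if best == current:
            break  # the search led back to the same sentence: no progress
        chain.append((target, " ".join(best)))
        current = best
    return chain

result = generate(["Thane", "of", "code"], TOY_SENTS)
for target, sentence in result:
    print(target, "->", sentence)
```

Without the `seen` set, the second toy sentence would select "bridegroome" again and the loop would keep returning that same sentence.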

answered 2010-08-26T04:31:22.767

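Rather than searching the whole corpus every time, it may be faster to build a single map from each word to the longest sentence containing that word. Here is my (untested) attempt at doing so.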

import collections
from nltk.corpus import gutenberg

def words_in(sentence):
    """Generate all words in the sentence (lower-cased)"""
    for word in sentence.split():
        word = word.strip('.,"\'-:;')
        if word:
            yield word.lower()

def make_sentence_map(sentences):
    """Construct a map from words to the longest sentence containing the word."""
    result = collections.defaultdict(str)
    for tokens in sentences:
        # gutenberg.sents() yields each sentence as a list of tokens,
        # so join them into a string before calling words_in()
        sentence = ' '.join(tokens)
        for word in words_in(sentence):
            if len(sentence) > len(result[word]):
                result[word] = sentence
    return result

def generate_random_text(sentence, sentence_map):
    while True:
        yield sentence
        longest_word = max(words_in(sentence), key=len)
        sentence = sentence_map[longest_word]

sentence_map = make_sentence_map(gutenberg.sents())
for sentence in generate_random_text('Thane of code.', sentence_map): 
    print sentence
answered 2010-08-26T05:33:03.633

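You assign "split_str" outside the loop, so it gets the original value and then keeps it. You need to assign it at the beginning of the while loop so that it changes on every pass.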

import nltk

from nltk.corpus import gutenberg

triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str

longestLength = 0

longestString = ""

montyPython = 1

while montyPython:
    #so this is run every time through the loop
    split_str = triggerSentence.split()#split the sentence into words

    #code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)


    listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-

    listOfWords = gutenberg.words()# all words in gutenberg books -list format-
    # I tip my hat to Mr.Alex Martelli for this part, which helps me find the longest sentence
    lt = longestString.lower() #this line tells you whether word list has the longest word in a case-insensitive way. 

    longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key = len)
    #get longest sentence -list format with every word of sentence being an actual element-

    longestSent=[longestSentence]

    for word in longestSent:#convert the list longestSentence to an actual string
        sstr = " ".join(word)
    print triggerSentence + " "+ sstr
    triggerSentence = sstr
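
Moving the split inside the loop is necessary, but it may not be sufficient on its own: `longestLength` and `longestString` are still initialised once before the loop, so a later sentence whose longest word is shorter than the stored one can never replace it. A minimal sketch of the difference, with no corpus involved; the helper names `longest_word_stale` and `longest_word_fresh` and the sample sentences are made up for illustration (Python 3).

```python
def longest_word_stale(sentences):
    """Mimics the loop above: the running maximum survives across sentences."""
    longest_length = 0
    longest_string = ""
    picks = []
    for sentence in sentences:
        for piece in sentence.split():
            if len(piece) > longest_length:
                longest_string = piece
                longest_length = len(piece)
        picks.append(longest_string)
    return picks

def longest_word_fresh(sentences):
    """Recomputes the longest word from scratch for every sentence."""
    return [max(s.split(), key=len) for s in sentences]

sentences = [
    "Thane of code",
    "Bellona 's Bridegroome lapt in proofe",
    "a short one",
]
stale = longest_word_stale(sentences)
fresh = longest_word_fresh(sentences)
print(stale)
print(fresh)
```

For the third sentence the stale version still reports the word from the second one, which is the kind of behaviour that keeps the search stuck on a single sentence.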
answered 2010-08-26T04:07:11.470

Mr. Hankin's answer is more elegant, but the following is more in keeping with the approach you started with:

import sys
import string
import nltk
from nltk.corpus import gutenberg

def longest_element(p):
    """return the first element of p which has the greatest len()"""
    max_len = 0
    elem = None
    for e in p:
        if len(e) > max_len:
            elem = e
            max_len = len(e)
    return elem

def downcase(p):
    """returns a list of words in p shifted to lower case"""
    return map(string.lower, p)


def unique_words():
    """it turns out unique_words was never referenced so this is here
       for pedagogy"""
    # there are 2.6 million words in the Gutenberg corpus but only ~42k unique
    # ignoring case, let's pare that down a bit
    words = set()  # collect each lower-cased word once
    for word in gutenberg.words():
        words.add(word.lower())
    print 'gutenberg.words() has', len(words), 'unique caseless words'
    return words

print 'loading gutenburg corpus...'
sentences = []
for sentence in gutenberg.sents():
    sentences.append(downcase(sentence))

trigger = sys.argv[1:]
target = longest_element(trigger).lower()
last_target = None

while target != last_target:
    matched_sentences = []
    for sentence in sentences:
        if target in sentence:
            matched_sentences.append(sentence)

    print '===', target, 'matched', len(matched_sentences), 'sentences'
    longestSentence = longest_element(matched_sentences)
    print ' '.join(longestSentence)

    trigger = longestSentence
    last_target = target
    target = longest_element(trigger).lower()

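However, given your example sentence, it reaches a fixed point in two cycles:

$ python nltkgut.py Thane of code
loading gutenburg corpus...
=== target thane matched 24 sentences
norway himselfe , with terrible numbers , assisted by that most disloyall traytor , the thane of cawdor , began a dismall conflict , till that bellona ' s bridegroome , lapt in proofe , confronted him with selfe - comparisons , point against point , rebellious arme ' gainst arme , curbing his lauish spirit : and to conclude , the victorie fell on vs
=== target bridegroome matched 1 sentences
norway himselfe , with terrible numbers , assisted by that most disloyall traytor , the thane of cawdor , began a dismall conflict , till that bellona ' s bridegroome , lapt in proofe , confronted him with selfe - comparisons , point against point , rebellious arme ' gainst arme , curbing his lauish spirit : and to conclude , the victorie fell on vs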

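Part of the problem with the responses to your last question is that they did what you asked, but you asked a question more specific than the answer you wanted. As a result, the response got bogged down in some fairly complicated list expressions that I am not sure you understood. I suggest you make more liberal use of print statements, and don't import code if you don't know what it does. While unwrapping the list expressions I found (as noted) that you never used the corpus word list. Functions are a help, too.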

answered 2010-08-26T05:49:44.297