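I'm working through various text-processing exercises with Python regular expressions and NLTK, following the book at http://www.nltk.org/book. I'm trying to create a random text generator and have run into a small problem. First, here is my code flow: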

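  1. Read a sentence as input. This is called the trigger string and is assigned to a variable.

  2. Get the longest word in the trigger string.

  3. Search the whole Project Gutenberg corpus for sentences that contain this word, regardless of case.

  4. Return the longest sentence that contains the word from step 3.

  5. Append the sentence from step 4 to the sentence from step 1.

  6. Make the sentence from step 4 the new "trigger" sentence and repeat the process. Note that I then have to find the longest word in that second sentence, and so on.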
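One pass through steps 2 to 5 can be sketched as follows. This is a minimal illustration, not the code from this post: it uses Python 3 syntax, the helper names `longest_word` and `longest_sentence_with` are hypothetical, and `TOY_SENTS` is a small invented list standing in for `gutenberg.sents()` so the example runs without downloading the corpus.

```python
# Toy stand-in for gutenberg.sents(): each sentence is a list of tokens.
# These sentences are invented for illustration only.
TOY_SENTS = [
    ["The", "thane", "rode", "north"],
    ["A", "Thane", "of", "great", "renown", "kept", "watch"],
    ["Watchmen", "slept"],
]

def longest_word(sentence):
    """Step 2: return the first longest whitespace-separated word."""
    return max(sentence.split(), key=len)

def longest_sentence_with(word, sents):
    """Steps 3-4: longest sentence (by token count) containing `word`,
    compared case-insensitively. Returns None when nothing matches."""
    target = word.lower()
    matches = [s for s in sents if any(target == w.lower() for w in s)]
    return " ".join(max(matches, key=len)) if matches else None

trigger = "Thane of code"
word = longest_word(trigger)
found = longest_sentence_with(word, TOY_SENTS)
# Step 5: append the found sentence to the trigger.
print(trigger + " " + found)
```

Step 6 would then feed `found` back in as the next trigger.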






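"Thane of code Norway himselfe , with terrible numbers , Assisted by that most disloyall Traytor , The Thane of Cawdor , began a dismall Conflict , Till that Bellona ' s Bridegroome , lapt in proofe , Confronted him with selfe - comparisons , Point against Point , rebellious Arme ' gainst Arme , Curbing his lauish spirit : and to conclude , The Victorie fell on vs"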


import nltk

from nltk.corpus import gutenberg

triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str

split_str = triggerSentence.split()#split the sentence into words

longestLength = 0

longestString = ""

montyPython = 1

while montyPython:

    #code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)

    listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-

    listOfWords = gutenberg.words()# all words in gutenberg books -list format-
    # I tip my hat to Mr.Alex Martelli for this part, which helps me find the longest sentence
    lt = longestString.lower() #this line tells you whether word list has the longest word in a case-insensitive way. 

    longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key = len)
    #get longest sentence -list format with every word of sentence being an actual element-


    #convert the list longestSentence to an actual string
    sstr = " ".join(longestSentence)
    print triggerSentence + " "+ sstr
    triggerSentence = sstr

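  1. You find the longest word in the trigger.
  2. You find the longest word in the longest sentence containing the word found in 1.
  3. The word from 1 is the longest word of the sentence from 2.


By the way, when do you think montyPython becomes False and the program finishes?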
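The three steps above describe a fixed point: once the longest word of a sentence only leads back to that same sentence, every later pass repeats it. The sketch below makes this visible on a tiny invented corpus. It is written in Python 3 syntax, and `TOY_SENTS`, `next_sentence` and `chain` are hypothetical names used only for this illustration.

```python
# Invented two-sentence corpus; each sentence is a list of tokens.
TOY_SENTS = [
    ["the", "thane", "met", "the", "bridegroome", "at", "dawn"],
    ["thane", "fled"],
]

def next_sentence(tokens, sents):
    """Longest sentence containing the longest word of `tokens`, or None."""
    target = max(tokens, key=len).lower()
    matches = (s for s in sents if target in (w.lower() for w in s))
    return max(matches, key=len, default=None)

def chain(trigger_tokens, sents, max_steps=10):
    """Follow trigger -> sentence -> sentence ... and stop at the first repeat."""
    seen = []
    current = trigger_tokens
    for _ in range(max_steps):
        current = next_sentence(current, sents)
        if current is None or current in seen:
            break  # nothing new can ever be produced from here
        seen.append(current)
    return seen

distinct = chain(["Thane", "of", "code"], TOY_SENTS)
print(len(distinct), "distinct sentence(s) before the chain repeats")
```

Here the second pass picks "bridegroome", which occurs only in the sentence already found, so the chain stops after one distinct sentence.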

import collections
from nltk.corpus import gutenberg

def words_in(sentence):
    """Generate all words in the sentence (lower-cased)"""
    for word in sentence.split():
        word = word.strip('.,"\'-:;')
        if word:
            yield word.lower()

def make_sentence_map(sentences):
    """Construct a map from words to the longest sentence containing the word."""
    result = collections.defaultdict(str)
    for tokens in sentences:
        # gutenberg.sents() yields lists of tokens; join them into one string
        sentence = ' '.join(tokens)
        for word in words_in(sentence):
            if len(sentence) > len(result[word]):
                result[word] = sentence
    return result

def generate_random_text(sentence, sentence_map):
    while True:
        yield sentence
        longest_word = max(words_in(sentence), key=len)
        sentence = sentence_map[longest_word]

sentence_map = make_sentence_map(gutenberg.sents())
for sentence in generate_random_text('Thane of code.', sentence_map): 
    print sentence
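You assign "split_str" outside the loop, so it gets the original value and then keeps it. You need to assign it at the beginning of the while loop so that it changes on every pass.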

import nltk

from nltk.corpus import gutenberg

triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str

longestLength = 0

longestString = ""

montyPython = 1

while montyPython:
    #so this is run every time through the loop
    split_str = triggerSentence.split()#split the sentence into words

    #code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)

    listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-

    listOfWords = gutenberg.words()# all words in gutenberg books -list format-
    # I tip my hat to Mr.Alex Martelli for this part, which helps me find the longest sentence
    lt = longestString.lower() #this line tells you whether word list has the longest word in a case-insensitive way. 

    longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key = len)
    #get longest sentence -list format with every word of sentence being an actual element-


    #convert the list longestSentence to an actual string
    sstr = " ".join(longestSentence)
    print triggerSentence + " "+ sstr
    triggerSentence = sstr
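The effect of where `split_str` is assigned can be seen in isolation with a small sketch. It uses Python 3 syntax and a hypothetical helper, `longest_words_seen`, that takes the sentences of successive passes as a plain list instead of querying the corpus.

```python
def longest_words_seen(sentences, resplit):
    """Return the longest word chosen on each pass over `sentences`.

    resplit=False mimics assigning split_str once, before the loop;
    resplit=True mimics assigning it at the top of every iteration.
    """
    chosen = []
    split_str = sentences[0].split()
    for sentence in sentences:
        if resplit:
            split_str = sentence.split()
        chosen.append(max(split_str, key=len))
    return chosen

passes = ["Thane of code", "began a dismall Conflict"]
print(longest_words_seen(passes, resplit=False))  # same word on both passes
print(longest_words_seen(passes, resplit=True))   # word changes with the sentence
```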
import sys
import string
import nltk
from nltk.corpus import gutenberg

def longest_element(p):
    """return the first element of p which has the greatest len()"""
    max_len = 0
    elem = None
    for e in p:
        if len(e) > max_len:
            elem = e
            max_len = len(e)
    return elem

def downcase(p):
    """returns a list of words in p shifted to lower case"""
    return map(string.lower, p)

def unique_words():
    """it turns out unique_words was never referenced so this is here
       for pedagogy"""
    # there are 2.6 million words in the Gutenberg corpus but only ~42k unique
    # ignoring case, let's pare that down a bit
    words = set()
    for word in gutenberg.words():
        words.add(word.lower())
    print 'gutenberg.words() has', len(words), 'unique caseless words'
    return words

print 'loading gutenburg corpus...'
sentences = []
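for sentence in gutenberg.sents():
    # keep a lower-cased copy of every sentence for caseless matching
    sentences.append(downcase(sentence))
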
trigger = sys.argv[1:]
target = longest_element(trigger).lower()
last_target = None

while target != last_target:
    matched_sentences = []
    for sentence in sentences:
        if target in sentence:
            matched_sentences.append(sentence)

    print '===', target, 'matched', len(matched_sentences), 'sentences'
    longestSentence = longest_element(matched_sentences)
    print ' '.join(longestSentence)

    trigger = longestSentence
    last_target = target
    target = longest_element(trigger).lower()


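$ python nltkgut.py Thane of code
=== target thane matched 24 sentences
norway himselfe , with terrible numbers , assisted by that most disloyall traytor , the thane of cawdor , began a dismall conflict , till that bellona ' s bridegroome , lapt in proofe , confronted him with selfe - comparisons , point against point , rebellious arme ' gainst arme , curbing his lauish spirit : and to conclude , the victorie fell on vs
norway himselfe , with terrible numbers , assisted by that most disloyall traytor , the thane of cawdor , began a dismall conflict , till that bellona ' s bridegroome , lapt in proofe , confronted him with selfe - comparisons , point against point , rebellious arme ' gainst arme , curbing his lauish spirit : and to conclude , the victorie fell on vs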

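Part of the problem with the response to your last question is that it did what you asked, but you asked a question more specific than the answer you actually wanted. As a result, the response got bogged down in some fairly complicated list expressions that I'm not sure you understood. I suggest that you use print statements more liberally, and that you don't import code if you don't know what it does. While unwrapping the list expressions I found (as noted) that you never used the corpus word list. Functions are a help, too.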
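As one illustration of that advice, the `max(... for ... if any(...))` expression from the question can be unrolled into an explicit loop inside a function, with a print at each match. This is a sketch in Python 3 syntax, not code from any answer in this thread; `longest_matching_sentence` and `TOY_SENTS` are hypothetical names, and the sentences are invented.

```python
def longest_matching_sentence(target, sents, verbose=True):
    """Explicit-loop version of
    max((s for s in sents if any(t == w.lower() for w in s)), key=len),
    returning None instead of raising when nothing matches."""
    target = target.lower()
    best = None
    for sent in sents:
        lowered = [w.lower() for w in sent]
        if target not in lowered:
            continue
        if verbose:
            print("match:", " ".join(sent))  # shows what the search is doing
        if best is None or len(sent) > len(best):
            best = sent
    return best

# Invented token lists standing in for gutenberg.sents().
TOY_SENTS = [
    ["a", "Thane"],
    ["the", "thane", "of", "cawdor"],
    ["no", "such", "word", "in", "this", "one"],
]
print(longest_matching_sentence("THANE", TOY_SENTS))
```

Written this way, each intermediate value can be printed or tested on its own.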

