python - Discovering poetic form with NLTK and CMU Dict

Question

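Edit: This code has since been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools

I'm a linguist who has recently picked up python, and I'm working on a project that aims to analyze poems automatically, including detecting the form of the poem. That is, if it found a 10-syllable line with the stress pattern 0101010101, it would declare it iambic pentameter. A poem with a 5-7-5 syllable pattern would be a haiku.

I'm using the following code, part of a larger script, but I have a number of problems, which are listed below the program:

The corpus in the script is just the raw text input of the poem.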

import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit

...

def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        # Return the word's stress pattern as a digit string, or 0 if the
        # word is not in the CMU dictionary.
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            # Keep only the stress digits; treat secondary stress (2) as 1.
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
            #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print(a, nsyl(a), end=' ')
            sum1 = sum1 + len(str(nsyl(a)))

    print("\nTotal syllables:", sum1)

I guess the output I want would look like this:

1101111101

0101111001

1101010111

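The first problem is that I lose the line breaks during tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with, though. The bigger problems are these:

I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllable count of the line will probably come out too low.
In addition, the CMU dictionary often says that a word is stressed - '1' - when it is not - '0'. That is why the output looks like this: 1101111101, when it should be the stress pattern of iambic pentameter: 0101010101
So how would I add some fudge factor so that the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good coding a function that identifies lines of 01's when the CMU dictionary is never going to output such a clean result. I suppose I'm asking how to code a "partial match" algorithm.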

score 10 · Accepted Answer

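Welcome to Stack Overflow. I'm not that familiar with Python, but I see you have not received many answers yet, so I'll try to help you with your queries.

First, some advice: you'll find that if you focus your questions, your chances of getting answers improve greatly. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.

Back on topic:

Before you revised your question, you asked how to make it less messy. That's a big question, but you might want to use a top-down procedural approach and break your code into functional units:

Split the corpus into lines.
For each line: find the syllable length and the stress pattern.
Classify the stress pattern.

You'll find that the first step is a single function call in python: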

corpus.split("\n")

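and can remain in the main function, but the second step would be better placed in its own function, and the third step would need to be split up itself and would probably be better tackled with an object-oriented approach. If you're in academia, you might be able to convince the CS faculty to lend you a graduate student for a couple of months to help you, in place of some workshop requirement.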
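The three steps above could be laid out roughly as follows. This is only a sketch: `TOY_DICT` is a hypothetical stand-in for `cmudict.dict()` (same shape: word mapped to a list of phoneme lists), and the function names, the `?` marker for unknown words, and the exact-match classifier are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for cmudict.dict(): word -> list of pronunciations,
# each a list of ARPAbet phonemes whose vowels end in a stress digit.
TOY_DICT = {
    "the": [["DH", "AH0"]],
    "world": [["W", "ER1", "L", "D"]],
    "into": [["IH1", "N", "T", "UW0"], ["IH0", "N", "T", "UW1"]],
    "fell": [["F", "EH1", "L"]],
}


def split_lines(corpus):
    """Step 1: one entry per non-empty line of the poem."""
    return [line for line in corpus.split("\n") if line.strip()]


def stress_of_line(line, prondict):
    """Step 2: concatenate the stress digits of every word in the line.

    Words missing from the dictionary are marked with '?' so the caller
    can see that the pattern is incomplete.
    """
    pattern = ""
    for word in re.findall(r"[a-z']+", line.lower()):
        if word not in prondict:
            pattern += "?"
            continue
        first_pron = prondict[word][0]  # only the first pronunciation is used
        pattern += "".join(ph[-1] for ph in first_pron if ph[-1].isdigit())
    return pattern


def classify(pattern):
    """Step 3: placeholder classifier that only knows exact matches."""
    known = {"0101010101": "iambic pentameter", "01010101": "iambic tetrameter"}
    return known.get(pattern, "unknown")


def analyse(corpus, prondict):
    """Run steps 1 and 2 and return (line, stress pattern) pairs."""
    return [(line, stress_of_line(line, prondict)) for line in split_lines(corpus)]
```

Because the line structure is kept from step 1 onward, the syllable count per line (needed for something like a 5-7-5 check) is simply the length of each pattern.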

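Now on to your other questions:

Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.

Words not in the dictionary / errors: the CMU dictionary home page says:

Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)

There is probably a way to add custom words to the dictionary or change existing ones; look on their site, or contact the dictionary maintainers directly. You can also ask here in a separate question if you can't figure it out. There is bound to be someone on Stack Overflow who knows the answer or can point you to the right resource. Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections to improve the dictionary.

Classifying the input corpus when it doesn't exactly match a pattern: you might want to look at the link ykaganovich provided for fuzzy string comparison. Some algorithms to look for:

Levenshtein distance: gives you a measure of how different two strings are, as the number of changes needed to turn one string into the other. Pros: easy to implement. Cons: not normalized; a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on the number of character sequences appearing in the same order in both strings. It is a bit harder to implement, but it gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying stress patterns. A CS postgrad or final-year undergrad should not have too much trouble with it (hint hint).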
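To make the normalization point concrete, here is a minimal sketch of Levenshtein distance rescaled by the longer string's length, which is one simple way to get a 0.0 to 1.0 score (it is not Jaro-Winkler). The `METRES` table, the `best_metre` helper, and the 0.6 threshold are hypothetical choices for illustration, not tuned values.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Distance rescaled to 0.0 (completely different) .. 1.0 (identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical reference patterns; extend as needed.
METRES = {
    "iambic pentameter": "0101010101",
    "trochaic tetrameter": "10101010",
    "iambic tetrameter": "01010101",
}


def best_metre(pattern, threshold=0.6):
    """Return (name, score) of the closest metre, or (None, score) if the
    best score falls below the threshold."""
    name, score = max(((n, similarity(pattern, ref)) for n, ref in METRES.items()),
                      key=lambda pair: pair[1])
    return (name, score) if score >= threshold else (None, score)
```

For the line from the question, `1101111101` is three substitutions away from `0101010101`, so it scores 0.7 against iambic pentameter and is still picked up by `best_metre` under the assumed threshold.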

I think those were all your questions. Hope this helps a bit.

score 4 · Accepted Answer

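To preserve the line breaks, parse line by line before sending each line to the cmu parser.

For dealing with single-syllable words, you probably want to try both 0 and 1 when nltk returns 1 (it looks like nltk already returns 0 for some words that would never be stressed, like "the"). So you'll end up with multiple permutations: 1101111101 0101010101 1101010101

and so forth. Then you have to pick the ones that look like known forms.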
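A rough sketch of that permutation idea, assuming each word's stress has already been extracted as a digit string; the function names and the `known_forms` mapping are made up for illustration:

```python
from itertools import product


def candidate_patterns(word_stresses):
    """Expand per-word stress strings into every plausible line pattern.

    word_stresses: list of digit strings, one per word, e.g. ["1", "1", "01"].
    A monosyllable marked "1" is treated as ambiguous and may be "0" or "1";
    everything else is kept as given.
    """
    options = [("0", "1") if s == "1" else (s,) for s in word_stresses]
    return {"".join(choice) for choice in product(*options)}


def matching_forms(word_stresses, known_forms):
    """Names of known forms whose pattern is among the candidates."""
    candidates = candidate_patterns(word_stresses)
    return sorted(name for name, ref in known_forms.items() if ref in candidates)
```

The number of candidates doubles with every ambiguous monosyllable, which is fine for a ten-syllable line but worth keeping in mind for longer ones.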

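For non-dictionary words, I'd fudge it the same way: figure out the number of syllables (the dumbest way would be to count the vowels), and permute all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...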
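Something along these lines could serve as a starting point. It is a crude heuristic, not a real syllabifier: it counts runs of vowels and drops a trailing silent "e", and it will miscount plenty of words. Both function names are hypothetical.

```python
import re
from itertools import product


def estimate_syllables(word):
    """Rough syllable count for a word that is not in the dictionary.

    Heuristic only: each run of consecutive vowels counts as one syllable
    (so "ea" is a single syllable), and a trailing silent "e" is dropped.
    """
    word = word.lower()
    # Drop a final "e" unless it follows consonant + "l" ("table")
    # or is the only vowel in the word ("the").
    if word.endswith("e") and not re.search(r"[^aeiouy]le$", word):
        stem = word[:-1]
        if re.search(r"[aeiouy]", stem):
            word = stem
    groups = re.findall(r"[aeiouy]+", word)
    return max(1, len(groups))


def possible_stresses(word):
    """Every 0/1 stress pattern for the estimated number of syllables."""
    n = estimate_syllables(word)
    return ["".join(p) for p in product("01", repeat=n)]
```

The patterns from `possible_stresses` can then be fed into the same candidate expansion used for ambiguous dictionary words, so an unknown word no longer shortens the line.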

I've never worked with other kinds of fuzzy matching, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.

score 2 · Accepted Answer

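This is my first post on Stack Overflow, and I'm a python newbie, so please excuse any deficiencies in code style. But I too am trying to extract accurate metre from poems, and the code included in this question helped me, so I'm posting what I came up with that builds on that foundation. It is one way to extract the stress as a single string, apply a "fudge factor" to correct for the cmudict bias, and not lose words that are not in cmudict.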

import nltk
from nltk.corpus import cmudict

prondict = cmudict.dict()

#
# parseStressOfLine(line) 
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings 
#
# 'stress' in form '0101*,*110110'
#   -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'


def parseStressOfLine(line):

    stress=""
    stress_no_punct=""
    print(line)

    tokens = [words.lower() for words in nltk.word_tokenize(line)] 
    for word in tokens:        

        word_punct =  strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']

        #print word

        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress= stress+"*"+word+"*"
        else:
            zero_bool=True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1
                # search for a zero in array returned from prondict
                # if it exists use it
                # print strip_letters(s),word
                if strip_letters(s)=="0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool=False
                    break

            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct=stress_no_punct + strip_letters(prondict[word][0])

        if len(punct)>0:
            stress= stress+"*"+punct+"*"

    return {'stress':stress,'stress_no_punct':stress_no_punct}



# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\\,<>./?@#$%^&*_~'
    my_str = word

    # remove punctuations from the string
    no_punct = ""
    punct=""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct+char

    return {'word':no_punct,'punct':punct}


# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    #print "strip_letters"
    nm = ''
    for ws in ls:
        #print "ws",ws
        for ch in list(ws):
            #print "ch",ch
            if ch.isdigit():
                nm=nm+ch
                #print "ad to nm",nm, type(nm)
    return nm


# TESTING  results 
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print(parseStressOfLine(line))
line = "Apollo play'd the midwife's part;"
print(parseStressOfLine(line))
line = "Into the world Corinna fell,"
print(parseStressOfLine(line))


""" 

OUTPUT 

This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}

"""
