Here is the code for my function:

def calcVowelProportion(wordList):
    """
    Calculates the proportion of vowels in each word in wordList.
    """

    VOWELS = 'aeiou'
    ratios = []

    for word in wordList:
        numVowels = 0
        for char in word:
            if char in VOWELS:
                numVowels += 1
        ratios.append(numVowels/float(len(word)))

    return ratios

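I am now working with a list of more than 87,000 words, and this algorithm is clearly very slow.

Is there a better way to do this?

Edit:

I tested the algorithms suggested by @ExP using the following class: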

    import time

    class vowelProportions(object):
        """
        A series of methods that all calculate the vowel/word length ratio
        in a list of words.
        """

        WORDLIST_FILENAME = "words_short.txt"

        def __init__(self):
            self.wordList = self.buildWordList()
            print "Original: " + str(self.calcMeanTime(10000, self.cvpOriginal, self.wordList))
            print "Generator: " + str(self.calcMeanTime(10000, self.cvpGenerator, self.wordList))
            print "Count: " + str(self.calcMeanTime(10000, self.cvpCount, self.wordList))
            print "Translate: " + str(self.calcMeanTime(10000, self.cvpTranslate, self.wordList))

        def buildWordList(self):
            inFile = open(self.WORDLIST_FILENAME, 'r', 0)
            wordList = []
            for line in inFile:
                wordList.append(line.strip().lower())
            return wordList

        def cvpOriginal(self, wordList):
            """ My original, slow algorithm"""
            VOWELS = 'aeiou'
            ratios = []

            for word in wordList:
                numVowels = 0
                for char in word:
                    if char in VOWELS:
                        numVowels += 1
                ratios.append(numVowels/float(len(word)))

            return ratios

        def cvpGenerator(self, wordList):
            """ Using a generator expression """
            return [sum(char in 'aeiou' for char in word)/float(len(word)) for word in wordList]

        def cvpCount(self, wordList):
            """ Using str.count() """
            return [sum(word.count(char) for char in 'aeiou')/float(len(word)) for word in wordList]

        def cvpTranslate(self, wordList):
            """ Using str.translate() """
            # Deletes every consonant (Python 2 str.translate signature); 'v' and 'w' must be listed too
            return [len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))/float(len(word)) for word in wordList]

        def timeFunc(self, func, *args):
            start = time.clock()
            func(*args)
            return time.clock() - start

        def calcMeanTime(self, numTrials, func, *args):
            times = [self.timeFunc(func, *args) for x in range(numTrials)]
            return sum(times)/len(times)

The output (mean time per call in seconds, for a list of 200 words) is:

Original: 0.0005613667
Generator: 0.0008402738
Count: 0.0012531976
Translate: 0.0003343548

Surprisingly, Generator and Count are even slower than the original (please let me know if my implementation is incorrect).
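
One way to cross-check timings like these is the standard `timeit` module, taking the best of several repeats rather than a mean. The sketch below assumes Python 3; the function names and the sample words are made up for illustration and are not part of the class above.

```python
import timeit

VOWELS = "aeiou"

def cvp_loop(word_list):
    """Explicit nested loops, as in the original function."""
    ratios = []
    for word in word_list:
        num_vowels = 0
        for char in word:
            if char in VOWELS:
                num_vowels += 1
        ratios.append(num_vowels / len(word))
    return ratios

def cvp_count(word_list):
    """One str.count() call per vowel."""
    return [sum(word.count(v) for v in VOWELS) / len(word) for word in word_list]

def best_time(func, word_list, number=200, repeat=3):
    """Return the best per-call time in seconds over `repeat` runs."""
    timer = timeit.Timer(lambda: func(word_list))
    return min(timer.repeat(repeat=repeat, number=number)) / number

if __name__ == "__main__":
    sample = ["banana", "rhythm", "queue", "python"] * 50  # 200 made-up words
    for func in (cvp_loop, cvp_count):
        print(func.__name__, best_time(func, sample))
```

The minimum is usually less sensitive to background load than the mean, which may matter for calls this short.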

I would like to test @John's solution, but I don't know anything about trees.

6 Answers

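You should optimize the innermost loop.

I'm fairly sure there are several alternatives. Here are the ones I can think of right now. I'm not sure how they compare in speed, either to each other or to your solution.

  • Using a generator expression: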

    numVowels = sum(x in 'aeiou' for x in word)
    
  • Using str.count():

    numVowels = sum(word.count(x) for x in 'aeiou')
    
  • Using str.translate() (assuming there are no uppercase letters or special characters):

    numVowels = len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))
    

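With any of these, you can even write the whole function in a single line, without list.append().

I'd be curious to know which one turns out to be fastest.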
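
For reference, here is how the three options might look as one-line functions. This is a sketch that assumes Python 3, where `str.translate` no longer accepts `None` plus a delete string and instead takes a table built with `str.maketrans`. The function and constant names are illustrative only.

```python
VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"

# Python 3: the third argument of str.maketrans lists the characters to delete.
DELETE_CONSONANTS = str.maketrans("", "", CONSONANTS)

def ratios_generator(word_list):
    # True counts as 1 when summed, so this counts the vowels in each word
    return [sum(c in VOWELS for c in w) / len(w) for w in word_list]

def ratios_count(word_list):
    # One str.count() scan per vowel
    return [sum(w.count(v) for v in VOWELS) / len(w) for w in word_list]

def ratios_translate(word_list):
    # Assumes lowercase a-z only: whatever is not a listed consonant is counted as a vowel
    return [len(w.translate(DELETE_CONSONANTS)) / len(w) for w in word_list]

words = ["banana", "view", "rhythm"]
print(ratios_generator(words))
print(ratios_count(words))
print(ratios_translate(words))
```

All three should agree on lowercase input. The translate variant would miscount digits, punctuation, or uppercase letters as vowels.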

answered 2013-04-22T20:48:14.000

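Since you only care about the ratio of vowels to letters in each word, you could first replace every vowel with a. Then there are a few things you can try that might be faster:

  • You test for one letter at each step instead of five. That is bound to be faster.
  • You may be able to sort the whole list and search for the points where a vowel (now always a) changes to a non-vowel. This gives a tree structure: the number of letters in a word is the level in the tree, and the number of vowels is the number of left branches.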
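
A minimal sketch of the first idea, assuming Python 3 and lowercase words: every other vowel is mapped onto `a`, so a single `count('a')` is enough. The names are hypothetical, and the sorted-list/tree idea is not attempted here.

```python
# Map the other vowels onto 'a' (str.maketrans with two equal-length strings).
TO_A = str.maketrans("eiou", "aaaa")

def vowel_ratios_single_letter(word_list):
    ratios = []
    for word in word_list:
        collapsed = word.translate(TO_A)  # e.g. "queue" -> "qaaaa"
        ratios.append(collapsed.count("a") / len(word))
    return ratios

print(vowel_ratios_single_letter(["queue", "sky", "banana"]))
```

Whether this beats the other approaches depends on the cost of the extra translate pass, so it is worth timing before relying on it.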
answered 2013-04-22T20:48:17.080

Use a regular expression that matches the set of vowels, and count the matches.

>>> import re
>>> s = 'supercalifragilisticexpialidocious'
>>> len(re.findall('[aeiou]', s))
16
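
Applied to a whole word list, this could look like the sketch below. It assumes Python 3 and lowercase input; `vowel_ratios_regex` is an illustrative name, and the pattern is compiled once, outside the loop.

```python
import re

# Compile the character class once so it is not looked up again for every word.
VOWEL_RE = re.compile(r"[aeiou]")

def vowel_ratios_regex(word_list):
    # findall returns one match per vowel, so its length is the vowel count
    return [len(VOWEL_RE.findall(word)) / len(word) for word in word_list]

print(vowel_ratios_regex(["supercalifragilisticexpialidocious", "sky"]))
```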
answered 2013-04-22T20:55:14.927

import timeit

words = 'This is a test string'

def vowelProportions(words):
    counts, vowels = {}, 'aeiou'
    wordLst = words.lower().split()
    for word in wordLst:
        counts[word] = float(sum(word.count(v) for v in vowels)) / len(word)
    return counts

def f():
    return vowelProportions(words)

print timeit.timeit(stmt = f, number = 17400) # 5 (len of words) * 17400 = 87,000
# 0.838676
answered 2013-04-22T22:31:29.700

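VOWELS = 'aeiou'
ratios = []
# wordList is the list of words from the question
for word in wordList:
    numVowels = 0
    for letter in VOWELS:
        # str.count() scans the whole word once per vowel
        numVowels += word.count(letter)
    ratios.append(numVowels/float(len(word)))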

Fewer decisions should mean less time. It also relies on built-ins, which I believe run faster.

answered 2013-04-22T20:49:00.913

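Here is how to compute it on Linux with a single command line. tr deletes the vowels, paste puts the stripped word next to the original, and gawk prints one minus the ratio of the two lengths:

tr -d 'aeiouAEIOU' < wordlist.txt | paste - wordlist.txt | gawk -F'\t' '{ print $2, 1 - length($1) / length($2) }'

Output:

aa 1
ab 0.5
abs 0.333333

Note: each line of wordlist.txt must contain exactly one word. An empty line causes a division-by-zero error.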
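
For comparison, the same per-line computation can be sketched in Python, skipping empty lines instead of failing on them. This assumes Python 3; `vowel_ratios_from_lines` is a hypothetical helper that accepts any iterable of lines, such as an open file.

```python
def vowel_ratios_from_lines(lines):
    """Yield (word, vowel ratio) for each non-empty line."""
    vowels = set("aeiouAEIOU")
    for line in lines:
        word = line.strip()
        if not word:
            continue  # skip blank lines instead of dividing by zero
        yield word, sum(ch in vowels for ch in word) / len(word)

# Usage with a file (wordlist.txt as in the command above):
# with open("wordlist.txt") as handle:
#     for word, ratio in vowel_ratios_from_lines(handle):
#         print(word, ratio)
```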

answered 2013-04-22T23:21:35.880