python - Finding all unique words from a list using loops

Question

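I'm trying to build a list of unique words from a list of all the words taken from a text file. My only problem is the algorithm used to iterate over the two lists.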

def getUniqueWords(allWords):
    uniqueWords = []
    uniqueWords.append(allWords[0])
    for i in range(len(allWords)):
        for j in range(len(uniqueWords)):
            if allWords[i] == uniqueWords[j]:
                pass
            else:
                uniqueWords.append(allWords[i])
                print uniqueWords[j]
    print uniqueWords
    return uniqueWords

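As you can see, I create an empty list and start iterating over both lists. I also append the first item of the list up front, because otherwise it would not try to match any words, which I assume is because list[0] does not exist in an empty list. If someone could help me figure out how to iterate over this correctly, I could produce a great list of words.

print uniqueWords[j] is just for debugging, so I can see what comes up while the list is being processed.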

score 18 · Accepted Answer

I'm no Python expert, but I think this should work:

uniqueWords = []
for i in allWords:
    if i not in uniqueWords:
        uniqueWords.append(i)

return uniqueWords

Edit:

I tested it and it works; it returns only the unique words in the list:

def getUniqueWords(allWords):
    uniqueWords = []
    for i in allWords:
        if i not in uniqueWords:
            uniqueWords.append(i)
    return uniqueWords

print(getUniqueWords(['a','b','c','a','b']))

['a', 'b', 'c']

score 2 · Accepted Answer

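I don't like homework questions that (try to) make you choose a bad algorithm. For example, a set or a trie would be a better choice.

You can fix your program with 2 small changes: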
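For what it's worth, here is a minimal sketch of the set-based approach, assuming you also want to keep the words in the order they first appear. The names get_unique_words and seen are my own, not from the question.

```python
def get_unique_words(all_words):
    """Return the unique words in first-seen order (hypothetical helper name)."""
    seen = set()          # average O(1) membership tests
    unique_words = []     # keeps the order of first appearance
    for word in all_words:
        if word not in seen:
            seen.add(word)
            unique_words.append(word)
    return unique_words


print(get_unique_words(['a', 'b', 'c', 'a', 'b']))  # ['a', 'b', 'c']
```

The set does the membership checks and the list only records the order, so each word is handled in roughly constant time instead of rescanning the list.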

def getUniqueWords(allWords):
    uniqueWords = []
    uniqueWords.append(allWords[0])
    for i in range(len(allWords)):
        for j in range(len(uniqueWords)):
            if allWords[i] == uniqueWords[j]:
                break
        else:
            uniqueWords.append(allWords[i])
            print uniqueWords[j]
    print uniqueWords
    return uniqueWords

First, you need to stop the loop as soon as you see that the word is already present:

        for j in range(len(uniqueWords)):
            if allWords[i] == uniqueWords[j]:
                break  # break out of the loop since you found a match

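The second is to use a for/else construct instead of if/else, so the append runs only when the loop finishes without hitting break: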

        for j in range(len(uniqueWords)):
            if allWords[i] == uniqueWords[j]:
                break
        else:
            uniqueWords.append(allWords[i])
            print uniqueWords[j]
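In case for/else is unfamiliar, here is a small standalone sketch of how the else branch behaves. The function name contains is made up for illustration and is not part of the question's program.

```python
def contains(items, target):
    """Hypothetical helper that shows which branch of for/else runs."""
    for item in items:
        if item == target:
            break          # match found: the else block is skipped
    else:
        return False       # loop finished without break (or items was empty)
    return True


print(contains(['a', 'b'], 'b'))  # True
print(contains(['a', 'b'], 'z'))  # False
```

The else block belongs to the for statement, not to the if, which is why its indentation matters in the fix above.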

score 1 · Accepted Answer

Maybe you could use the collections.Counter class? (Especially if you also want to count how many times each word occurs in the source document.)

http://docs.python.org/2/library/collections.html?highlight=counter#collections.Counter

from collections import Counter

def getUniqueWords(allWords):
    uniqueWords = Counter()

    for word in allWords:
        uniqueWords[word] += 1  # missing keys start at 0
    return list(uniqueWords.keys())
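If you do want the counts, a short sketch of how they could be read back follows. It assumes you pass the whole list straight to Counter instead of looping; count_words is a hypothetical name.

```python
from collections import Counter


def count_words(all_words):
    """Hypothetical helper: map each word to its number of occurrences."""
    return Counter(all_words)  # Counter accepts any iterable directly


counts = count_words(['a', 'b', 'c', 'a', 'b'])
print(sorted(counts))  # the unique words: ['a', 'b', 'c']
print(counts['a'])     # 2
print(counts['c'])     # 1
```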

On the other hand, if you only need the unique words and not their counts, just use a set:

def getUniqueWords(allWords):
    uniqueWords = set()

    for word in allWords:
        uniqueWords.add(word)
    return uniqueWords  # if you want to return them as a set
    # or: return list(uniqueWords)  # if you want to return a list

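And if you are restricted to loops and the input is relatively large, a loop plus binary search is a better choice than a loop alone, something like this: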

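def getUniqueWords(allWords):
    uw = []  # unique words, kept in sorted order
    for word in allWords:
        (lo, hi) = (0, len(uw) - 1)
        m = -1  # index of the match, or -1 if not found
        # binary search for word in the sorted list uw
        while hi >= lo and m == -1:
            mid = lo + (hi - lo) // 2  # integer division
            if uw[mid] == word:
                m = mid
            elif uw[mid] < word:
                lo = mid + 1
            else:
                hi = mid - 1
        if m == -1:
            # not found: lo is the insertion point that keeps uw sorted
            m = lo
            uw = uw[:m] + [word] + uw[m:]
    return uw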

If your input has around 100000 words, the difference between this loop and the simple loop is that your PC won't make noise while running the program :)
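If library helpers are allowed, the same idea can be sketched with the standard bisect module, which does the binary search for you. The function name below is my own, and like the hand-written version it returns the words in sorted order rather than in input order.

```python
import bisect


def get_unique_words_sorted(all_words):
    """Hypothetical helper: unique words, returned in sorted order."""
    unique_words = []
    for word in all_words:
        pos = bisect.bisect_left(unique_words, word)  # binary search
        # word is new if pos is past the end or the slot holds another word
        if pos == len(unique_words) or unique_words[pos] != word:
            unique_words.insert(pos, word)  # keeps the list sorted
    return unique_words


print(get_unique_words_sorted(['b', 'a', 'c', 'a', 'b']))  # ['a', 'b', 'c']
```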

score 0 · Accepted Answer

You can use a set to get the unique words:

def getUniqueWords(allWords):
    uniqueWords = list({i for i in allWords})
    return uniqueWords

print(getUniqueWords(['a','b','c','a','b']))

Result (sets are unordered, so the order may differ): ['c', 'a', 'b']
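If the order of the words matters, one possible variation is to use dict keys instead of a set. This is only a sketch and assumes Python 3.7 or later, where dicts keep insertion order; the function name is hypothetical.

```python
def get_unique_words_ordered(all_words):
    """Hypothetical helper: unique words in order of first appearance."""
    # dict keys are unique and, from Python 3.7 on, keep insertion order
    return list(dict.fromkeys(all_words))


print(get_unique_words_ordered(['a', 'b', 'c', 'a', 'b']))  # ['a', 'b', 'c']
```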
