python-2.7 - Python pairs have multiple copies of a word in list

Question

So I have the following code:

def stripNonAlphaNum(text):
    import re
    return re.compile(r'\W+', re.UNICODE).split(text)

def readText(fileStub):
  words = open(fileStub, 'r').read()
  words = words.lower() # Make it lowercase
  wordlist = sorted(stripNonAlphaNum(words))
  wordfreq = []
  for w in wordlist: # Increase count of one upon every iteration of the word.
    wordfreq.append(wordlist.count(w))
  return list(zip(wordlist, wordfreq))

It reads a file in, and then pairs each word with the number of times it occurs. The issue I'm facing is that when I print the result, every pair is repeated once for each occurrence of the word.

If I have some input given, I might get output like this:

('and', 27), ('and', 27), ('and', 27), ('and', 27), ('and', 27), ('and', 27), ('and', 27),.. (27 times)

Which is NOT what I want it to do.

Rather I would like it to give 1 output of the word and just one number like so:

('and', 27), ('able', 5), ('bat', 6).. etc

So how do I fix this?

score 1 · Accepted Answer

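You should consider using a dictionary. A dictionary works like a hash map, so it allows associative indexing: each word is a key that can be stored only once, so duplicates are not a problem.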

...
  wordfreq = {}
  for w in wordlist: 
    wordfreq[w] = wordlist.count(w)
  return wordfreq

If you really need to return a list, use return wordfreq.items()
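As a rough illustration of the dictionary approach, here is a minimal sketch on a made-up word list (the sample data is an assumption, not from the question). In Python 2.7 items() already returns a list of pairs; wrapping it in list() is harmless there and keeps the snippet working on Python 3, where items() returns a view.

```python
# Hypothetical sample input; in the question this comes from the file.
wordlist = ['and', 'able', 'and', 'bat', 'and']

wordfreq = {}
for w in wordlist:
    # Assigning to the same key again just overwrites it,
    # so each word ends up in the dictionary exactly once.
    wordfreq[w] = wordlist.count(w)

# One (word, count) pair per distinct word.
pairs = list(wordfreq.items())
```

Note that the order of the pairs is not guaranteed in Python 2.7, so sort them if the order matters.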

The only problem with this approach is that wordlist.count() is needlessly called more than once for each word. To avoid that, write for w in set(wordlist):
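The set-based loop could look roughly like the sketch below. The function names and the sample list are hypothetical. The second function is an alternative that is not part of the answer: it tallies in a single pass with dict.get, which avoids calling count() altogether.

```python
def count_with_set(wordlist):
    # count() runs once per distinct word instead of once per occurrence.
    wordfreq = {}
    for w in set(wordlist):
        wordfreq[w] = wordlist.count(w)
    return wordfreq


def count_single_pass(wordlist):
    # Alternative (an assumption, not from the answer): one pass over the
    # list, adding 1 to the running tally for each word seen.
    wordfreq = {}
    for w in wordlist:
        wordfreq[w] = wordfreq.get(w, 0) + 1
    return wordfreq


sample = ['and', 'able', 'and', 'bat', 'and']
set_result = count_with_set(sample)
single_pass_result = count_single_pass(sample)
```

Both functions return the same dictionary; the single-pass version may be preferable for long texts because each count() call scans the whole list.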

Edit for the follow-up question: if returning a list is fine, just use return sorted(wordfreq.items(), key=lambda t: t[1]). If you omit the key part, the result is sorted by word first, then by value.
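To make the difference between the two sort orders concrete, here is a small sketch using the counts from the question's desired output. The variable names are illustrative, and the reverse=True variant is an addition that the answer does not mention.

```python
wordfreq = {'and': 27, 'able': 5, 'bat': 6}

# Sort by count (second element of each pair), smallest first.
by_count = sorted(wordfreq.items(), key=lambda t: t[1])

# Without key, tuples compare element by element: word first, then count.
by_word = sorted(wordfreq.items())

# Most frequent word first (not in the original answer).
most_frequent_first = sorted(wordfreq.items(), key=lambda t: t[1], reverse=True)
```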
