python - Count word frequencies and build a dictionary from them

Question

I want to take every word from a text file and count the word frequencies in a dictionary.

Example: 'this is the textfile, and it is used to take words and count'

d = {'this': 1, 'is': 2, 'the': 1, ...}

I haven't gotten very far, and I just can't see how to finish it. My code so far:

import sys

argv = sys.argv[1]
data = open(argv)
words = data.read()
data.close()
wordfreq = {}
for i in words:
    #there should be a counter and somehow it must fill the dict.

score 15 · Accepted Answer

If you don't want to use collections.Counter, you can write your own function:

import sys

filename = sys.argv[1]
fp = open(filename)
data = fp.read()
words = data.split()
fp.close()

unwanted_chars = ".,-_"  # (and so on)
wordfreq = {}
for raw_word in words:
    word = raw_word.strip(unwanted_chars)
    if word not in wordfreq:
        wordfreq[word] = 0 
    wordfreq[word] += 1

For anything more elaborate, look into regular expressions.
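Before reaching for regular expressions, the list of unwanted characters could be taken from the standard library. The following is a minimal sketch, assuming that ASCII punctuation (string.punctuation) is what should be stripped; the function name count_words is only illustrative and not part of the answer above.

```python
import string

def count_words(text):
    """Count whitespace-separated words, ignoring surrounding punctuation."""
    wordfreq = {}
    for raw_word in text.split():
        # strip() only removes characters from both ends of the word
        word = raw_word.strip(string.punctuation)
        if not word:
            continue  # the token consisted of punctuation only
        wordfreq[word] = wordfreq.get(word, 0) + 1
    return wordfreq

sample = "this is the textfile, and it is used to take words and count"
print(count_words(sample))
```

Punctuation inside a word (for example the apostrophe in "don't") is left alone, because strip() never touches the middle of a string.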

score 13 · Accepted Answer

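Although using Counter from the collections library, as @Michael suggested, is the better approach, I am adding this answer just to improve your code. (I believe it will be a good answer for new Python learners.)

From the comment in your code, it seems you want to improve your own code. I think you can already read the file content into words (although I usually avoid the read() function and use something like for line in file_descriptor: instead).

Because words is a string, in the for loop for i in words: the loop variable i is not a word but a char. You are iterating over the characters in the string, not over the words in the string words. To understand this, note the following snippet: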

>>> for i in "Hi, h r u?":
...     print(i)
... 
H
i
,
 
h
 
r
 
u
?
>>>

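Since iterating over the string character by character instead of word by word is not what you want, you should use the split method of Python's string class to iterate word by word. The method returns a list of all the words in the string, using sep as the separator (if sep is not specified, it splits on runs of whitespace), optionally limiting the number of splits to maxsplit.
str.split(sep=None, maxsplit=-1)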
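As a small illustration of the separator and the split limit, here is a sketch on a made-up string (the variable names are only for this example):

```python
text = "Hi,  how are\tyou?"

default_split = text.split()         # splits on any run of whitespace
comma_split = text.split(",")        # splits on each comma only
limited_split = text.split(None, 2)  # at most two splits, counted from the left

print(default_split)
print(comma_split)
print(limited_split)
```

With no separator, consecutive whitespace characters count as a single delimiter, so no empty strings show up in the result.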

Note the code examples below:

Split:

>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?']

Loop with split:

>>> for i in "Hi, how are you?".split():
...     print(i)
... 
Hi,
how
are
you?

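This looks like what you need, except for the word Hi,: because split() splits on whitespace by default, Hi, is kept as a single string, and (obviously) you don't want that.

To count the frequency of words in a file, one good solution is to use regular expressions. But first, to keep the answer simple, I will use the replace() method. The method str.replace(old, new[, count]) returns a copy of the string in which occurrences of old have been replaced with new, optionally limiting the number of replacements to count.

Now check the code example below to see what I suggest: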

>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?'] # it has , with Hi
>>> "Hi, how are you?".replace(',', ' ').split()
['Hi', 'how', 'are', 'you?'] # , replaced by space then split

Loop:

>>> for word in "Hi, how are you?".replace(',', ' ').split():
...     print(word)
... 
Hi
how
are
you?

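Now, how to count the frequencies:

One way is to use Counter as @Michael suggested, but to stick with your approach, in which you start from an empty dictionary, do something like the following code example: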

words = f.read()
wordfreq = {}
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
    #                ^^ add 1 to 0 or old value from dict

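What am I doing? Because wordfreq is initially empty, you cannot read wordfreq[word] the first time a word appears (it would raise a KeyError). So I used the dict method setdefault.

dict.setdefault(key, default=None) is similar to get(), but it also sets dict[key]=default if key is not already in the dict. So the first time a new word comes up, I set it to 0 in the dict using setdefault, then add 1 and assign the result back to the same dict.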
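To make the difference between setdefault and get concrete, here is a short sketch on a toy word list (the names with_setdefault, with_get and probe are only for this example). Both give the same counts; they differ only in whether the missing key is stored during the lookup.

```python
words = ["is", "this", "is"]

with_setdefault = {}
for word in words:
    # setdefault inserts 0 for a new key, then returns the stored value
    with_setdefault[word] = with_setdefault.setdefault(word, 0) + 1

with_get = {}
for word in words:
    # get returns 0 for a new key without inserting anything
    with_get[word] = with_get.get(word, 0) + 1

probe = {}
probe.get("missing", 0)         # probe stays empty
probe_after_get = dict(probe)
probe.setdefault("missing", 0)  # probe now holds {'missing': 0}

print(with_setdefault, with_get, probe)
```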

I have written equivalent code using with open instead of a plain open.

import os

# open() does not expand '~' by itself, so expand it explicitly
with open(os.path.expanduser('~/Desktop/file')) as f:
    words = f.read()
    wordfreq = {}
    for word in words.replace(',', ' ').split():
        wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print(wordfreq)

Run it like this:

$ cat file  # file is 
this is the textfile, and it is used to take words and count
$ python work.py  # indented manually 
{'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 
 'it': 1, 'to': 1, 'take': 1, 'words': 1, 
 'the': 1, 'textfile': 1}

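Using re.split(pattern, string, maxsplit=0, flags=0)

Just change the for loop to for i in re.split(r"[,\s]+", words): and it should produce the correct output.

Edit: it is better to find all runs of alphanumeric characters, because you may have more than one kind of punctuation.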

>>> re.findall(r'[\w]+', words) # manually indent output  
['this', 'is', 'the', 'textfile', 'and', 
  'it', 'is', 'used', 'to', 'take', 'words', 'and', 'count']

Use the for loop: for word in re.findall(r'[\w]+', words):
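To see why findall is the more robust of the two, here is a small comparison on a sentence with extra punctuation. The sentence is made up for this sketch; only the two patterns come from the answer.

```python
import re

words = "this is the textfile, and it is used... to take words; and count."

# Splitting on commas and whitespace leaves other punctuation attached
split_tokens = re.split(r"[,\s]+", words)

# Collecting runs of word characters drops every kind of punctuation
found_tokens = re.findall(r"\w+", words)

print(split_tokens)
print(found_tokens)
```

Note that \w also matches digits and the underscore, so tokens such as file_1 are kept whole.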

How I would write the code without using read():

The file is:

$ cat file
This is the text file, and it is used to take words and count. And multiple
Lines can be present in this file.
It is also possible that Same words repeated in with capital letters.

The code is:

$ cat work.py
import re
wordfreq = {}
with open('file') as f:
    for line in f:
        for word in re.findall(r'[\w]+', line.lower()):
            wordfreq[word] = wordfreq.setdefault(word, 0) + 1
  
print(wordfreq)

lower() is used to convert uppercase letters to lowercase.

Output:

$python work.py  # manually strip output  
{'and': 3, 'letters': 1, 'text': 1, 'is': 3, 
 'it': 2, 'file': 2, 'in': 2, 'also': 1, 'same': 1, 
 'to': 1, 'take': 1, 'capital': 1, 'be': 1, 'used': 1, 
 'multiple': 1, 'that': 1, 'possible': 1, 'repeated': 1, 
 'words': 2, 'with': 1, 'present': 1, 'count': 1, 'this': 2, 
 'lines': 1, 'can': 1, 'the': 1}

score 10 · Accepted Answer

from collections import Counter
t = 'this is the textfile, and it is used to take words and count'

dict(Counter(t.split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}

Or, better, remove the punctuation before counting:

dict(Counter(t.replace(',', '').replace('.', '').split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile': 1}
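If more kinds of punctuation can occur, chaining replace() calls gets long. One possible variation, sketched here under the assumption that words are runs of word characters and that case should be ignored, tokenizes with re.findall and then uses Counter.most_common to pull out the most frequent words:

```python
import re
from collections import Counter

t = 'this is the textfile, and it is used to take words and count'

counts = Counter(re.findall(r"\w+", t.lower()))
top_two = counts.most_common(2)  # list of (word, count) pairs, highest first

print(dict(counts))
print(top_two)
```

A Counter returns 0 for words it has never seen, so counts['missing'] does not raise a KeyError.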

score 2 · Accepted Answer

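The following takes the string, splits it into a list with split(), loops over that list with for, and counts the frequency of each item in the sentence with Python's count function, count(). The word i and its frequency are placed as a tuple in an empty list ls, which is then converted to key-value pairs with dict().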

sentence = 'this is the textfile, and it is used to take words and count'.split()
ls = []
for i in sentence:
    word_count = sentence.count(i)  # Python's list method count()
    ls.append((i, word_count))

dict_ = dict(ls)

print(dict_)

Output: {'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}
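One thing to be aware of: the loop calls count() once for every occurrence of a word, and dict() then silently overwrites the duplicates. A possible tightening, shown here only as a sketch, is to iterate over the unique words instead (freq is a name chosen for this example):

```python
sentence = 'this is the textfile, and it is used to take words and count'.split()

# One count() call per unique word; count() itself still scans the whole list
freq = {word: sentence.count(word) for word in set(sentence)}

print(freq)
```

This is still quadratic in the worst case, so for large texts a single-pass dictionary or collections.Counter is likely the better choice.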

score 1 · Accepted Answer

sentence = "this is the textfile, and it is used to take words and count"

# split the sentence into words
# and iterate through every word

counter_dict = {}
for word in sentence.lower().split():
    # add the word to counter_dict, initialized with 0
    if word not in counter_dict:
        counter_dict[word] = 0
    # increase its count by 1
    counter_dict[word] += 1

score 1 · Accepted Answer

# Open your text file and count the word frequencies
with open("Counter.txt", 'r') as file_obj:
    w_list = file_obj.read()
print(w_list.split())
di = dict()
for word in w_list.split():
    if word in di:
        di[word] = di[word] + 1
    else:
        di[word] = 1

# Find the most frequently used word
largest = -1
maxusedword = ''
for k, v in di.items():
    print(k, v)
    if v > largest:
        largest = v
        maxusedword = k

print(maxusedword, largest)
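The search for the most used word can also be written with the built-in max(). This is only a sketch on a small hand-made dictionary standing in for di, since Counter.txt is not available here:

```python
di = {"this": 1, "is": 2, "the": 1, "and": 2}

# max() iterates over the keys and compares them by their counts
maxusedword = max(di, key=di.get)
largest = di[maxusedword]

print(maxusedword, largest)
```

When several words share the highest count, max() returns the first one it encounters, which matches the behavior of the explicit loop with v > largest.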

score 1 · Accepted Answer

You can also use a defaultdict with int as its default type.

from collections import defaultdict
wordDict = defaultdict(int)
text = 'this is the textfile, and it is used to take words and count'.split(" ")
for word in text:
    wordDict[word] += 1

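Explanation: we initialize a defaultdict whose values are of type int. This way the default value for any key is 0, and we don't need to check whether the key exists in the dictionary. We then split the text on spaces into a list of words, iterate over the list, and increment the count for each word.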
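Where the default of 0 comes from can be checked directly. The sketch below uses illustrative variable names (default_value, first_lookup, plain) that are not part of the answer:

```python
from collections import defaultdict

wordDict = defaultdict(int)

default_value = int()           # the factory called with no arguments gives 0
first_lookup = wordDict["new"]  # missing key: the factory runs and the key is stored
wordDict["new"] += 1

plain = dict(wordDict)          # convert back to a regular dict when done counting

print(default_value, first_lookup, plain)
```

Keep in mind that merely reading a missing key stores it, so it is safer to test membership with in if you only want to look.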

score 1 · Accepted Answer

wordList = 'this is the textfile, and it is used to take words and count'.split()
wordFreq = {}

# Logic: word not in the dict, give it a value of 1. if key already present, +1.
for word in wordList:
    if word not in wordFreq:
        wordFreq[word] = 1
    else:
        wordFreq[word] += 1

print(wordFreq)

score 0 · Accepted Answer

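My approach is to do things from the ground up:

Remove punctuation from the text input.
Make a list of words.
Remove empty strings.
Iterate through the list.
Make each new word a key in the dictionary with a value of 1.
If a word already exists as a key, increase its value by one.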

text = '''this is the textfile, and it is used to take words and count'''
word = '' #This will hold each word

wordList = [] #This will be collection of words
for ch in text: #traversing through the text character by character
#if character is between a-z or A-Z or 0-9 then it's valid character and add to word string..
    if (ch >= 'a' and ch <= 'z') or (ch >= 'A' and ch <= 'Z') or (ch >= '0' and ch <= '9'): 
        word += ch
    elif ch == ' ': #if character is equal to single space means it's a separator
        wordList.append(word) # append the word in list
        word = '' #empty the word to collect the next word
wordList.append(word)  #the last word to append in list as loop ended before adding it to list
print(wordList)

wordCountDict = {} #empty dictionary which will hold the word count
for word in wordList: #traverse through the word list
    if wordCountDict.get(word.lower(), 0) == 0: #if word doesn't exist then make an entry into dic with value 1
        wordCountDict[word.lower()] = 1
    else: #if the word exists then increment the value by one
        wordCountDict[word.lower()] = wordCountDict[word.lower()] + 1
print(wordCountDict)

Another approach:

text = '''this is the textfile, and it is used to take words and count'''
for ch in '.\'!")(,;:?-\n':
    text = text.replace(ch, ' ')
wordsArray = text.split(' ')
wordDict = {}
for word in wordsArray:
    if len(word) == 0:
        continue
    else:
        wordDict[word.lower()] = wordDict.get(word.lower(), 0) + 1
print(wordDict)
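The character-by-character replace loop could also be expressed with a translation table. This is a sketch that assumes all ASCII punctuation from string.punctuation should become a space, which is a slightly larger set than the characters listed above:

```python
import string

text = '''this is the textfile, and it is used to take words and count'''

# Map every ASCII punctuation character to a space
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
cleaned = text.translate(table).lower()

wordDict = {}
for word in cleaned.split():  # split() with no argument drops empty strings
    wordDict[word] = wordDict.get(word, 0) + 1

print(wordDict)
```

Because split() is called without an argument, the explicit check for zero-length words is no longer needed.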

score 0 · Accepted Answer

Here is one more function:

def wcount(filename):
    counts = dict()
    with open(filename) as file:
        a = file.read().split()
        # words = [b.rstrip() for b in a]
    for word in a:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts
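For large files it may be preferable not to load everything with read(). The following sketch shows a line-by-line variant; wcount_by_line is a hypothetical name, and the temporary file exists only so that the example runs on its own:

```python
import os
import tempfile

def wcount_by_line(filename):
    """Hypothetical variant of wcount that reads the file one line at a time."""
    counts = {}
    with open(filename) as file:
        for line in file:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
    return counts

# Demo on a throwaway file so the sketch is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("this is the textfile\nand it is used\n")
    path = tmp.name

result = wcount_by_line(path)
os.remove(path)
print(result)
```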

score 0 · Accepted Answer

def play_with_words(text):
    input_split = text.split(",")
    input_split.sort()
    count = {}
    for i in input_split:
        if i in count:
            count[i] += 1
        else:
            count[i] = 1
    return count

text = "I,I,here,where,you,are"

print(play_with_words(text))

score 0 · Accepted Answer

Write a Python program to create a list of strings by taking input from the user and then create a dictionary containing each string along with its frequency (e.g. if the list is ['apple', 'banana', 'fig', 'apple', 'fig', 'banana', 'grapes', 'fig', 'grapes', 'apple'] then the output should be {'apple': 3, 'banana': 2, 'fig': 3, 'grapes': 2}).

lst = []
d = dict()
print("ENTER ZERO NUMBER FOR EXIT !!!!!!!!!!!!")
while True:
    user = input('enter string element :: -- ')
    if user == "0":
        break
    else:
        lst.append(user)
print("LIST ELEMENTS ARE :: ", lst)
l = len(lst)
for i in range(l):
    c = 0
    for j in range(l):
        if lst[i] == lst[j]:
            c += 1
    d[lst[i]] = c
print("dictionary is  :: ", d)
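The nested loops compare every element with every other element. A single pass is enough, as in this sketch, which uses the list from the task statement in place of user input:

```python
lst = ['apple', 'banana', 'fig', 'apple', 'fig',
       'banana', 'grapes', 'fig', 'grapes', 'apple']

d = {}
for item in lst:
    # one pass: each element is looked at exactly once
    d[item] = d.get(item, 0) + 1

print("dictionary is  :: ", d)
```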

score 0 · Accepted Answer

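You can also take this approach. You first need to read the file and store the contents of the text file in a variable as a string. This way, you don't need to use or import any external libraries.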

s = "this is the textfile, and it is used to take words and count"

s = s.split(" ")
d = dict()
for i in s:
  c = ""
  if i.isalpha():
    if i not in d:
      d[i] = 1
    else:
      d[i] += 1
  else:
    # keep only the alphabetic characters of the token
    for j in i:
      if j.isalpha():
        c += j
    if c not in d:
      d[c] = 1
    else:
      d[c] += 1


print(d)

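Result:

{'this': 1, 'is': 2, 'the': 1, 'textfile': 1, 'and': 2, 'it': 1, 'used': 1, 'to': 1, 'take': 1, 'words': 1, 'count': 1}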
