python - Count grouped by category

Question

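I wrote a script that goes through the data and uses a regular expression to check for emoticons; whenever one is found, a counter is incremented. The count for each category should then be written to a list, e.g. category ne has 25 emoticons, category fr has 45, and so on. This is where it goes wrong. The result I get is: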

[1, 'ag', 2, 'dg', 3, 'dg', 4, 'fr', 5, 'fr', 6, 'fr', 7, 'fr', 8, 'hp', 9, 'hp', 10, 'hp', 11, 'hp', 12, 'hp', 13, 'hp', 14, 'hp', 15, 'hp', 16, 'hp', 17, 'hp', 18, 'hp', 19, 'hp', 20, 'hp', 21, 'hp', 22, 'hp', 23, 'hp', 24, 'hp', 25, 'ne', 26, 'ne', 27, 'ne', 28, 'ne', 29, 'ne', 30, 'ne', 31, 'ne', 32, 'ne', 33, 'ne', 34, 'ne', 35, 'ne', 36, 'ne', 37, 'ne', 38]

The fileids have the form shown below: one big folder contains 7 smaller folders (one per category), and each category folder holds about 100 files:

data/ne/567.txt

The data in each .txt file is a single sentence that looks like this:

I am so happy today :)

This is my script:

import re

# corpus is assumed to be an NLTK-style corpus reader defined elsewhere
counter = 0
lijst = []
for fileid in corpus.fileids():
    for sentence in corpus.sents(fileid):
        cat = str(fileid.split('/')[0])
        s = " ".join(sentence)
        m = re.search(r'(:\)|:\(|:\s|:\D|:o|:@)+', s)
        if m is not None:
            counter += 1
            lijst += [counter] + [cat]

score 1 · Accepted Answer

You should do this instead, keeping one count per category rather than a single running counter:

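import collections
import re

# corpus is assumed to be an NLTK-style corpus reader defined elsewhere
counts = collections.defaultdict(int)  # one counter per category, starting at 0
for fileid in corpus.fileids():
    for sentence in corpus.sents(fileid):
        cat = str(fileid.split('/')[0])
        s = " ".join(sentence)
        # count every non-overlapping match in the sentence, not just the first
        counts[cat] += len(re.findall(r'(:\)|:\(|:\s|:\D|:o|:@)+', s))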
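To try the idea without the corpus, here is a minimal self-contained sketch. The `samples` list, the `count_by_category` helper and the simplified `EMOTICON` pattern are all assumptions made for illustration; they are not part of the original script. The pattern only matches `:)`, `:(`, `:D`, `:o` and `:@`, and it counts each emoticon separately rather than runs of them.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the corpus: (fileid, sentence) pairs.
samples = [
    ("ne/567.txt", "I am so happy today :)"),
    ("ne/568.txt", "nothing to see here"),
    ("fr/12.txt", "oh no :( but fine :)"),
    ("hp/3.txt", "great :) :)"),
]

# Simplified pattern (an assumption): a colon followed by one of ) ( D o @
EMOTICON = re.compile(r":[()Do@]")


def count_by_category(pairs):
    """Return a dict mapping category name to total emoticon count."""
    counts = defaultdict(int)
    for fileid, sentence in pairs:
        category = fileid.split("/")[0]  # first path component is the category
        counts[category] += len(EMOTICON.findall(sentence))
    return dict(counts)


counts = count_by_category(samples)
# One (category, count) pair per category, sorted by category name.
as_list = sorted(counts.items())
print(as_list)
```

Because the counts live in a dict keyed by category, turning them into the per-category list asked for in the question is a single `sorted(counts.items())` call at the end.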
