I'm currently working with the lingspam dataset, processing it by counting word occurrences across 600 files (400 emails and 200 spam messages). I've already normalized every word with the Porter Stemmer algorithm, and I'd also like my results to be standardized across every file for further processing, but I'm not sure how to go about it.
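(For context, the stemming step is roughly the following minimal sketch, assuming NLTK's PorterStemmer; I include it only so the stemmed terms like 'univers' and 'sale' below make sense:)

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ['sales', 'university', 'universities']:
    print(word, '->', stemmer.stem(word))
# sales -> sale, university -> univers, universities -> univers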
What I have so far
To get the output below, I need to be able to add items that may not be present in a given file, in ascending order.
printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0), ('univers', 0), ('sale', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 0), ('univers', 2), ('sale', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0), ('univers', 0), ('sale', 1)]
which I then plan to convert into vectors using numpy:
[0, 0, 0]
[0, 2, 0]
[0, 0, 1]
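For illustration, this is the kind of normalization I have in mind (a minimal sketch with made-up values; a Counter returns 0 for keys it has never seen, so indexing it by a fixed search_list should give same-length, same-order vectors for every file):

import numpy as np
from collections import Counter

search_list = ['money', 'univers', 'sale']  # fixed word order for every file
counts = Counter(['univers', 'univers'])    # counts from one hypothetical file

vector = np.array([counts[word] for word in search_list])
print(vector)  # [0 2 0]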
instead of the output I currently get:
printing from ./../lingspam_results/spmsgb165.txt.out
[]
printing from ./../lingspam_results/spmsgb166.txt.out
[('univers', 2)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('sale', 1)]
How can I normalize the results from the Counter module into ascending order (while also adding items from my search_list that may not exist in the Counter results)? I've tried something below; it simply reads from each text file and counts words based on the search_list.
import os
from collections import Counter

def parse_bag(directory, search_list):
    # walk the directory tree and count words in every file
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            count_words(os.path.join(dirpath, f), search_list)

def count_words(filename, search_list):
    # keep only the words we are searching for, then count them
    with open(filename, 'r') as fh:
        textwords = fh.read().split()
    filteredwords = [t for t in textwords if t in search_list]
    wordfreq = Counter(filteredwords).most_common(5)
    print("printing from " + filename)
    print(wordfreq)

search_list = ['sale', 'univers', 'money']
parse_bag("./../lingspam_results", search_list)
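One direction I've been considering (just a sketch; count_words_normalized is a hypothetical name, and I don't know if this is the idiomatic approach): skip most_common entirely and index the Counter by search_list, since a Counter reports 0 for words it never saw:

from collections import Counter

def count_words_normalized(filename, search_list):
    # hypothetical variant of count_words above: same counting, but the
    # result always contains every search term, in a fixed order
    with open(filename, 'r') as fh:
        counts = Counter(t for t in fh.read().split() if t in search_list)
    # use sorted(search_list) here if alphabetical (ascending) order is wanted
    return [(word, counts[word]) for word in search_list]

Is that the right way to go, or does Counter/numpy offer something built in for this?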
Thanks