
I am currently trying to process the Ling-Spam dataset by counting word occurrences across 600 files (400 emails and 200 spam messages). I have already normalised every word with the Porter stemmer algorithm, and I would also like the results for each file to be standardised for further processing, but I am not sure how to do that.

What I have so far

To get the output below, I need to be able to add items that may not exist in a file, sorted in ascending order:

printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0, 'univers', 0,  'sales', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2, 'univers', 0,  'sales', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0, 'univers', 0,  'sales', 1)]

Which I then plan to convert into vectors using numpy:

[0,0,0]
[2,0,0]
[0,0,0]

instead of:

printing from ./../lingspam_results/spmsgb165.txt.out
[]
printing from ./../lingspam_results/spmsgb166.txt.out
[('univers', 2)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('sale', 1)]

How can I normalise the results from the Counter module into ascending order, while also adding items from my search_list that may not exist in the counter results? I have attempted something below; it simply reads each text file and counts words based on search_list.
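For reference, the gap can be reproduced with a minimal standalone Counter session (a sketch, independent of the dataset files):

```python
from collections import Counter

counts = Counter(["money", "money", "sale"])

# Looking up a missing key returns 0, but the key is NOT stored,
# so it will never show up in items() on its own
print(counts["univers"])    # -> 0
print("univers" in counts)  # -> False

# Sorting items() gives a stable alphabetical order,
# but only over the keys that actually occurred
print(sorted(counts.items()))  # -> [('money', 2), ('sale', 1)]
```

This is why the per-file output lists differ in length and order: absent search terms simply never enter the counter.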

import os
from collections import Counter

def parse_bag(directory, search_list):
    # walk the directory tree and count the search terms in every file
    for (dirpath, dirnames, filenames) in os.walk(directory):
        for f in filenames:
            path = os.path.join(dirpath, f)
            count_words(path, search_list)

def count_words(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    wordfreq = Counter(filteredwords).most_common(5)
    print("printing from " + filename)
    print(wordfreq)

search_list = ['sale', 'univers', 'money']
parse_bag("./../lingspam_results", search_list)

Thanks


3 Answers


From your question, it sounds like your requirement is to have the same words listed in a consistent order across all files, together with their counts. This should do it for you:

def count_words(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    counter = Counter(filteredwords)
    for w in search_list:
        counter[w] += 0        # ensure every search term exists, even at 0
    wordfreq = sorted(counter.items())
    print("printing from " + filename)
    print(wordfreq)

search_list = ['sale', 'univers', 'money']

Sample output:

printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0), ('sale', 1), ('univers', 0)]

I don't think you want most_common at all here, since you specifically don't want each file's contents to affect the ordering or the length of the list.
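To go from those sorted (word, count) pairs to the numpy vectors mentioned in the question, a minimal sketch (counts_to_vector is a hypothetical helper name, not part of the code above):

```python
import numpy as np
from collections import Counter

def counts_to_vector(text, search_list):
    """Count each search term in the text and return the counts as a
    numpy vector, ordered by the sorted search terms."""
    counter = Counter(t for t in text.split() if t in search_list)
    return np.array([counter[w] for w in sorted(search_list)])

search_list = ['sale', 'univers', 'money']
vec = counts_to_vector("money back money guarantee", search_list)
print(vec)  # order is ['money', 'sale', 'univers'] -> [2 0 0]
```

Because every file is projected onto the same sorted search_list, every vector has the same length and component order, which is what the downstream numpy processing needs.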

Answered 2012-10-05T05:39:46.040

A combination of jsbueno's and Mu Mind's answers:

def count_words_SO(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    counter = Counter(filteredwords)
    for w in search_list:
        counter[w] += 0        # ensure every search term exists, even at 0
    wordfreq = number_parse(counter)
    print("printing from " + filename)
    print(wordfreq)

def number_parse(counter, n=5):
    # sort (frequency, word) pairs descending and keep the top-n counts
    freq = sorted(((value, item) for item, value in counter.items()), reverse=True)
    return [item[0] for item in freq[:n]]

It works; with just a little more work I'll be ready for the Neural Network. Thanks everyone :)

printing from ./../lingspam_results/spmsgb19.txt.out
[0, 0, 0]
printing from ./../lingspam_results/spmsgb2.txt.out
[4, 0, 0]
printing from ./../lingspam_results/spmsgb20.txt.out
[10, 0, 0]
Answered 2012-10-05T05:57:35.973

The Counter(filteredwords) call you are using in your example already counts all the words, just as you intend. What it does not give you directly is the words ordered by frequency; for that you have to reprocess the items in the counter into a sequence of (frequency, word) tuples and sort those:

def most_common(counter, n=5):
    # sort (frequency, word) pairs descending and keep the top-n words
    freq = sorted(((value, item) for item, value in counter.items()), reverse=True)
    return [item[1] for item in freq[:n]]
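A self-contained check of this helper on a hand-built counter (Python 3 sketch):

```python
from collections import Counter

def most_common(counter, n=5):
    # sort (frequency, word) pairs descending and keep the top-n words
    freq = sorted(((value, item) for item, value in counter.items()), reverse=True)
    return [item[1] for item in freq[:n]]

c = Counter({'money': 4, 'sale': 1, 'univers': 2})
print(most_common(c, 2))  # -> ['money', 'univers']
```

Sorting the (value, item) tuples in reverse puts the highest counts first; slicing to n and keeping item[1] yields just the words.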
Answered 2012-10-05T04:54:50.960