The CSV looks like this; '|' marks the column boundaries.

2014-09-01 | I love chicken

2014-09-01 | I eat chicken

2014-09-02 | She loves chicken

2014-09-02 | Ha ha ha I love chicken

2014-09-03 | Blah Blah Blah

I want to process the data so that it looks like this.

2014-09-01 | 'i', 2 | 'love', 1 | 'chicken', 2 | 'eat', 1 |

2014-09-02 | 'she', 1 | 'love', 2 | 'chicken', 2 | 'ha', 3 | 'I', 1 |

2014-09-03 | 'blah', 3 |

DATE | WORD, WORDCOUNTS | WORD2, WORDCOUNTS2 | ...

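What approach should I use here? Ultimately I want to plot a graph with the dates on the x-axis and the word counts (frequency) on the y-axis.

Here is my best attempt so far.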

import csv
import MeCab

TestStartDate = "2013-11-11"
TestEndDate = "2014-06-10"

tagger = MeCab.Tagger()  # create the tagger once, not once per row

with open('Simplified.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # row[0] is the date column, row[1] is the text column
        if len(row) >= 2 and row[0][:10] == TestStartDate:
            rose = tagger.parse(row[1])
            #print(rose)
            wordCount = {}
            # with MeCab's default output, every other token is a word and the last one is 'EOS'
            wordList = rose.split()[:-1:2]
            for word in wordList:
                wordCount.setdefault(word, 0)
                wordCount[word] += 1
            for word, count in wordCount.items():
                print('"%s, %i"' % (word, count))

I plan to add the words and their counts to the data.

4 Answers

I suggest using Counter to do the counting.

from collections import Counter

stats = {}

with open('in.txt', 'r') as fin:
    for line in fin:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # '|' separates the date from the text
        key, _, text = line.partition('|')
        key = key.strip()
        # lowercase so that 'Ha' and 'ha' are counted together, as in the question
        counter = Counter(text.lower().split())
        # merge the counts of lines that share the same date
        if key in stats:
            stats[key] = stats[key] + counter
        else:
            stats[key] = counter

for key, counter in stats.items():
    print(key, '|', '|'.join(['"%s", %s' % (k, v) for k, v in counter.items()]), '|')
answered 2014-09-26T02:25:57.797

Here is a solution that uses defaultdict and Counter from the collections module.

import csv
from collections import Counter, defaultdict

# each new date starts with an empty Counter
date_words = defaultdict(Counter)

with open('test.csv', newline='') as psvfile:
    reader = csv.reader(psvfile, delimiter="|")

    for line in reader:
        if len(line) < 2:
            continue  # skip blank or malformed lines
        date = line[0].strip()
        words = line[1].split()

        # Counter.update adds to the counts already stored for this date
        date_words[date].update(words)

You may also want to look at the pandas library, which is good at handling dates and plotting.
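
Before anything can be plotted, the per-date counters have to be turned into x and y sequences. The following is only a sketch, assuming a mapping shaped like date_words above; the helper name series_for_word and the sample data are made up for illustration, and the resulting lists could then be passed to pandas or matplotlib.

```python
from collections import Counter

# Hypothetical sample shaped like date_words: date string -> Counter of words.
date_words = {
    '2014-09-02': Counter({'ha': 3, 'chicken': 2, 'i': 1}),
    '2014-09-01': Counter({'i': 2, 'chicken': 2, 'love': 1, 'eat': 1}),
    '2014-09-03': Counter({'blah': 3}),
}


def series_for_word(counts_by_date, word):
    """Return (dates, counts) for one word, with dates in ascending order."""
    dates = sorted(counts_by_date)  # ISO dates sort correctly as plain strings
    # A Counter returns 0 for missing keys, so absent words need no special case.
    counts = [counts_by_date[d][word] for d in dates]
    return dates, counts


dates, counts = series_for_word(date_words, 'chicken')
print(dates)   # x-axis values
print(counts)  # y-axis values
```

Sorting the dates first keeps the x-axis in chronological order regardless of the order in which the rows were read.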

answered 2014-09-26T02:27:42.103

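This works for me. Do you really need the trailing '|'? If you split the line on '|' again later, for example when feeding it to matplotlib or something else, you will get an empty string '' in the result.

The code below does not append a '|' to each result line. If you think it is necessary, just add a '|' to the return value of function d, like this: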
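
A small sketch of that effect, using a made-up result line (the variable names are only for illustration):

```python
# The same result line with and without a trailing '|'.
with_trailing = "2014-09-03 | 'blah',3|"
without_trailing = "2014-09-03 | 'blah',3"

fields_with = with_trailing.split('|')
fields_without = without_trailing.split('|')

print(fields_with)     # the last element is an empty string
print(fields_without)  # no empty element

# If the trailing '|' has to stay, one option is to drop empty fields after splitting.
cleaned = [field for field in fields_with if field.strip()]
print(cleaned == fields_without)
```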

return '%s| %s|'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))

===========

def d(s):
    # s looks like 'date | all words for that date'
    tokens = s.split('|')
    # split() without arguments also copes with repeated spaces
    words = tokens[-1].lower().split()
    return '%s| %s'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))

def wordcount():
    lines=[
        '2014-09-01 | I love chicken',
        '2014-09-01 | I eat chicken',
        '2014-09-02 | She loves chicken',
        '2014-09-02 | Ha ha ha I love chicken',
        '2014-09-03 | Blah Blah Blah'
    ]
    rows={}
    for line in lines:
        t_line = line.split(' | ')
        # collect all the text that belongs to the same date
        if t_line[0] not in rows:
            rows[t_line[0]]=''
        rows[t_line[0]]+=(' '+t_line[-1])
    newrows=[]
    for k,v in rows.items():
        newrows.append(d('%s | %s'%(k,v)))
    print('\n'.join(newrows))

wordcount()


>>2014-09-02 | 'love',1|'i',1|'she',1|'loves',1|'chicken',2|'ha',3
>>2014-09-03 | 'blah',3
>>2014-09-01 | 'i',2|'chicken',2|'love',1|'eat',1
answered 2014-09-26T02:01:31.510

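Read the input CSV and build a dictionary that maps each date to a Counter. For every row, update the Counter for that row's date with the words in the row. Then write rows of the form [date, (word1, count1), (word2, count2), ...]. This example sorts the dates and the words, but you can omit the sorting for better performance.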

from collections import Counter
import csv

data = {}

# newline='' is what the csv module expects for files in Python 3
with open('my_data.csv', newline='') as f:
    for row in csv.reader(f, delimiter='|'):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        date, words = row[0].strip(), row[1]
        data.setdefault(date, Counter()).update(words.split())

with open('my_counts.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|')

    for date in sorted(data):
        # one field per word, e.g. 'chicken', 2
        writer.writerow([date] + ["'{0}', {1}".format(word, count)
                                  for word, count in sorted(data[date].items())])
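
To see what the written rows look like without touching any files, the same idea can be run in memory. This is only a sketch: it assumes the file really uses '|' as its delimiter, it lowercases the words to match the output asked for in the question, and it uses io.StringIO in place of the two files.

```python
import csv
import io
from collections import Counter

# Sample rows from the question, assuming '|' is the real delimiter.
sample = (
    "2014-09-01|I love chicken\n"
    "2014-09-01|I eat chicken\n"
    "2014-09-02|She loves chicken\n"
    "2014-09-02|Ha ha ha I love chicken\n"
    "2014-09-03|Blah Blah Blah\n"
)

data = {}
for date, words in csv.reader(io.StringIO(sample), delimiter='|'):
    # Lowercasing is an assumption taken from the question's expected output.
    data.setdefault(date.strip(), Counter()).update(words.lower().split())

out = io.StringIO()
writer = csv.writer(out, delimiter='|', lineterminator='\n')
for date in sorted(data):
    writer.writerow([date] + ["'{0}', {1}".format(word, count)
                              for word, count in sorted(data[date].items())])

print(out.getvalue())
```

Note that "loves" and "love" are counted as different words here; merging them would need stemming or a tokenizer, which is outside the scope of this sketch.
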
answered 2014-09-26T02:16:36.230