My CSV looks like this. The '|' separates the columns.
2014-09-01 | I love chicken
2014-09-01 | I eat chicken
2014-09-02 | She loves chicken
2014-09-02 | Ha ha ha I love chicken
2014-09-03 | Blah Blah Blah
I want to process the data so that it looks like this.
2014-09-01 | 'i', 2 | 'love', 1 | 'chicken', 2 | 'eat', 1 |
2014-09-02 | 'she', 1 | 'love', 2 | 'chicken', 2 | 'ha', 3 | 'i', 1 |
2014-09-03 | 'blah', 3 |
DATE | WORD, WORDCOUNTS | WORD2, WORDCOUNTS2 | ...
What approach should I use here? Ultimately I want to draw a chart with the dates on the x axis and the word counts (frequencies) on the y axis.
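One possible approach, sketched under assumptions: group the rows by date and keep one `collections.Counter` per date. The sketch below assumes the rows are already parsed into (date, text) pairs, uses lowercasing and whitespace splitting in place of a real tokenizer, and does no stemming (so 'loves' and 'love' stay separate, unlike in the desired output above). The names `rows`, `count_words_by_date`, `dates`, and `totals` are hypothetical.

```python
from collections import Counter, OrderedDict

# Sample rows taken from the example above, as (date, text) pairs.
rows = [
    ("2014-09-01", "I love chicken"),
    ("2014-09-01", "I eat chicken"),
    ("2014-09-02", "She loves chicken"),
    ("2014-09-02", "Ha ha ha I love chicken"),
    ("2014-09-03", "Blah Blah Blah"),
]

def count_words_by_date(rows):
    """Return an ordered mapping date -> Counter of lowercased words."""
    counts = OrderedDict()
    for date, text in rows:
        # Assumption: whitespace splitting stands in for a real tokenizer.
        counts.setdefault(date, Counter()).update(text.lower().split())
    return counts

counts = count_words_by_date(rows)

# Series for a chart: dates for the x axis, total words per date for the y axis.
dates = list(counts)
totals = [sum(c.values()) for c in counts.values()]
```

`dates` and `totals` can then be handed to whichever plotting library is in use; a per-word series can be built the same way with `counts[date][word]`.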
Here is my best attempt so far.
import csv
import MeCab

TestStartDate = "2013-11-11"
TestEndDate = "2014-06-10"  # not used yet; only the start date is matched below

tagger = MeCab.Tagger()  # create the tagger once instead of once per row
with open('Simplified.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # row[0] is the date column, row[1] is the text column
        if len(row) >= 2 and row[0].strip() == TestStartDate:
            rose = tagger.parse(row[1])
            # print(rose)
            wordCount = {}
            # The parsed text alternates surface form and features and ends
            # with "EOS", so every other field (minus the last) is a word.
            wordList = rose.split()[:-1:2]
            for word in wordList:
                wordCount.setdefault(word, 0)
                wordCount[word] += 1
            for word, count in wordCount.items():
                print('"%s, %i"' % (word, count))
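The `split()[:-1:2]` slice only works if every token line contains exactly two whitespace-separated fields. A slightly more defensive sketch splits each line on the tab instead. It assumes MeCab's default output layout (one `surface<TAB>features` line per token, followed by an `EOS` line). `sample_output` is a hand-written stand-in with placeholder feature strings, not real tagger output, and `surfaces_from_mecab` is a hypothetical helper.

```python
from collections import Counter

def surfaces_from_mecab(parsed):
    """Extract surface forms from text in MeCab's default output layout."""
    surfaces = []
    for line in parsed.splitlines():
        if line == "EOS" or not line:
            continue  # skip the end-of-sentence marker and blank lines
        # Everything before the first tab is the surface form.
        surfaces.append(line.split("\t", 1)[0])
    return surfaces

# Hand-written stand-in for the string returned by tagger.parse(...)
# (assumed layout, placeholder feature strings).
sample_output = "chicken\tfeatures\nlove\tfeatures\nchicken\tfeatures\nEOS\n"

word_count = Counter(surfaces_from_mecab(sample_output))
print(word_count["chicken"])
```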
I plan to add the words and their counts to the data.
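To attach the counts in the `DATE | WORD, WORDCOUNTS | ...` layout described earlier, each date's counts can be flattened into one line. This is a minimal sketch, assuming the per-date counts are already available as a dict of dicts; `counts_by_date` and `format_row` are hypothetical names.

```python
def format_row(date, word_counts):
    """Build one "DATE | 'word', n | ..." line from a word -> count mapping."""
    # Sort by word so the output order is deterministic.
    cells = ["'%s', %d" % (word, n) for word, n in sorted(word_counts.items())]
    return " | ".join([date] + cells) + " |"

# Assumed input: counts already computed per date.
counts_by_date = {
    "2014-09-03": {"blah": 3},
    "2014-09-01": {"i": 2, "love": 1, "chicken": 2, "eat": 1},
}

for date in sorted(counts_by_date):
    print(format_row(date, counts_by_date[date]))
```

Sorting the words alphabetically is a choice made here for reproducible output; sorting by count or keeping first-seen order would work just as well.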