At this point I need to do two things, and I need your help with both:
- Best practices for cleaning the data: programmatically stripping the extra headers, the ">>>>>>>" reply markers, and the rest of the meaningless communication flotsam and jetsam (a rough sketch of what I'm imagining is just below this list).
- Once it is clean, how do I package it so it plays nicely with Django and SQLite?
- Do I convert it to CSV keyed by date, person, subject, and words, and then feed those into data classes in my database?
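The rough idea I have for the cleaning step (I'm not at all sure it's the right approach, and clean_body is just a name I made up) is to throw away the quoted-reply lines and the quoted-printable leftovers before tokenizing, something like:

import re

def clean_body(raw):
    # undo quoted-printable soft line breaks like "=\r\n" first
    text = re.sub(r'=\r\n', '', raw)
    # drop lines that are just quoted replies, e.g. the ">>>>>> wrote:" chains
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith('>')]
    # collapse the remaining whitespace into single spaces
    return re.sub(r'\s+', ' ', ' '.join(lines)).strip()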
Well, before I even get to the database I want to be able to sort the data and display it cleanly. I have almost no experience putting things into databases; the closest I've come is working with XML, CSV, and JSON.
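To make the Django/SQLite side concrete, this is roughly the shape of model I imagine each message ending up in once it's clean (Message, WordCount, and all the field names here are placeholders I invented, not anything I've actually built):

from django.db import models

class Message(models.Model):
    # one row per email pulled down from IMAP
    sender = models.EmailField()
    recipient = models.EmailField()
    subject = models.CharField(max_length=255)
    sent_at = models.DateTimeField()
    body = models.TextField()

class WordCount(models.Model):
    # one row per (message, word) pair, so counts can be summed per sender later
    message = models.ForeignKey(Message)
    word = models.CharField(max_length=100)
    count = models.IntegerField(default=1)

Does that sound like a sane layout, or is the CSV detour unnecessary?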
I need to get the n-grams ranked, e.g. how many times a given word shows up across one person's emails. I'm trying to get closer to understanding how people talk to me about topics and so on. It's a very basic version of Jon Kleinberg analyzing his own email.
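For the ranking itself I assume I can just sort the dictionary my parser builds, e.g. with collections.Counter from the standard library (top_words is just an illustrative helper):

from collections import Counter

def top_words(ngrams, limit=20):
    # ngrams is the dict the counter builds, mapping word -> occurrence count
    return Counter(ngrams).most_common(limit)

# e.g. top_words(feed, 3) might give [('31,', 5), ('we', 2), ('2012/1/31', 2)]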
Be gentle, be brutal, but please help :)!
My output currently looks like this: : 1, 'each': 1, 'Me': 1, 'IN!\r\n\r\n2012/1/31': 1, 'calculator.\r\n> >>>>>\r\n>>>>>>': 1, 'people': 1, '=97MB\r\n>\r\n>': 1, 'we': 2, 'wrote:\r\n>>>>>>\r\n>>>>>>': 1, '=\r\nwrote:\r\n>>>>>\r\n>>>>> >': 1, '2012/1/31': 2, 'is': 1, '31,': 5, '=97MB\r\n>>>>\r\n>>>>': 1, '1:45': 1, 'be\r\n>>>>>': 1, 'Sent':
import getpass, imaplib, email
# NGramCounter builds a dictionary relating ngrams (as tuples) to the number
# of times that ngram occurs in a text (as integers)
class NGramCounter(object):
    # a parameter n would be the 'order' (length) of the desired n-gram;
    # for now the counter only handles single words
    def __init__(self, text):
        self.text = text
        self.ngrams = dict()

    # tokenize breaks the given string up into units (whitespace-separated words)
    def tokenize(self):
        return self.text.split(" ")

    # parse tokenizes the text and visits every token in turn,
    # adding it to self.ngrams or incrementing its count
    def parse(self):
        tokens = self.tokenize()
        # Moves through every individual word in the text, increments counter if already found,
        # else sets count to 1
        for word in tokens:
            if word in self.ngrams:
                self.ngrams[word] += 1
            else:
                self.ngrams[word] = 1

    def get_ngrams(self):
        return self.ngrams
#loading profile for login
M = imaplib.IMAP4_SSL('imap.gmail.com')
M.login("EMAIL", "PASS")
M.select()
new = open('liamartinez.txt', 'w')
typ, data = M.search(None, 'FROM', 'SEARCHGOES_HERE') #Gets the ids of all messages from the given sender
def get_first_text_part(msg): #where should this be nested?
    maintype = msg.get_content_maintype()
    if maintype == 'multipart':
        for part in msg.get_payload():
            if part.get_content_maintype() == 'text':
                return part.get_payload()
    elif maintype == 'text':
        return msg.get_payload()
for num in data[0].split(): #Loops through all messages
    typ, msg_data = M.fetch(num, '(RFC822)') #Pulls message
    msg = email.message_from_string(msg_data[0][1]) #Puts message into easy-to-use python objects
    _from = msg['from'] #pull from
    _to = msg['to'] #pull to
    _subject = msg['subject'] #pull subject
    _body = get_first_text_part(msg) #pull body
    if _body:
        ngrams = NGramCounter(_body)
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
        print _feed
    # print 'Content-Type:',msg.get_content_type()
    # print _from
    # print _to
    # print _subject
    # print _body
    #
    new.write(_from)
    print '---------------------------------'
M.close()
M.logout()