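I have this code that converts an mbox to JSON. The goal is to load the resulting JSON into a MongoDB database. The code was tested on a dummy mbox, "example.mbox", and worked fine. However, when I ran it on a real mbox, I got unexpected output: it does generate the JSON file, but it also prints "Skipping MIME content in JSONification (multipart)". I don't want to skip anything!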
import sys
import mailbox
import email
import quopri
import json
import time
from BeautifulSoup import BeautifulSoup
from dateutil.parser import parse

MBOX = 'antonita.mbox'
OUT_FILE = MBOX + '.json'

def cleanContent(msg):
    # Decode message from "quoted printable" format, but first
    # re-encode, since decodestring will try to do a decode of its own
    msg = quopri.decodestring(msg.encode('utf-8'))
    # Strip out HTML tags, if any are present.
    # Bail on unknown encodings if errors happen in BeautifulSoup.
    try:
        soup = BeautifulSoup(msg)
    except:
        return ''
    return ''.join(soup.findAll(text=True))

# There's a lot of data to process, and the Pythonic way to do it is with a
# generator. See http://wiki.python.org/moin/Generators.
# Using a generator requires a trivial encoder to be passed to json for object
# serialization.
class Encoder(json.JSONEncoder):
    def default(self, o): return list(o)

# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break
        yield jsonifyMessage(msg)

def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')
    # The To, Cc, and Bcc fields, if present, could have multiple items.
    # Note that not all of these fields are necessarily defined.
    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r', '')\
                                 .replace(' ', '').decode('utf-8', 'ignore').split(',')
    for part in msg.walk():
        json_part = {}
        if part.get_content_maintype() != 'text':
            print >> sys.stderr, "Skipping MIME content in JSONification ({0})".format(part.get_content_maintype())
            continue
        json_part['contentType'] = part.get_content_type()
        content = part.get_payload(decode=False).decode('utf-8', 'ignore')
        json_part['content'] = cleanContent(content)
        json_msg['parts'].append(json_part)
    # Finally, convert date from asctime to milliseconds since epoch using the
    # $date descriptor so it imports "natively" as an ISODate object in MongoDB
    then = parse(json_msg['Date'])
    millis = int(time.mktime(then.timetuple())*1000 + then.microsecond/1000)
    json_msg['Date'] = {'$date' : millis}
    return json_msg

mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)

# Write each message out as a JSON object on a separate line
# for easy import into MongoDB via mongoimport
f = open(OUT_FILE, 'w')
for msg in gen_json_msgs(mbox):
    if msg != None:
        f.write(json.dumps(msg, cls=Encoder) + '\n')
f.close()
print "All done"
Output:
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
All done
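For context on why "multipart" shows up in that output: as far as I understand, msg.walk() yields the container part itself before it yields the parts inside it, so a multipart container is reported as skipped even when its text children are processed afterwards. The following is only a minimal sketch in Python 3 using the standard library; the message it builds and the helper names (build_sample_message, walk_types) are hypothetical and are not part of my script.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_sample_message():
    """Build a hypothetical message: two text alternatives plus one image."""
    outer = MIMEMultipart('related')
    outer['Subject'] = 'walk() demo'
    alternative = MIMEMultipart('alternative')
    alternative.attach(MIMEText('hello', 'plain'))
    alternative.attach(MIMEText('<p>hello</p>', 'html'))
    outer.attach(alternative)
    # Placeholder bytes, not a real PNG; the subtype is passed explicitly
    # so that no image-type guessing is needed.
    outer.attach(MIMEImage(b'\x89PNG fake bytes', _subtype='png'))
    return outer


def walk_types(msg):
    """List (content type, is container) for every part that walk() visits."""
    return [(part.get_content_type(), part.is_multipart())
            for part in msg.walk()]


for content_type, is_container in walk_types(build_sample_message()):
    print(content_type, '(container)' if is_container else '(leaf)')
```

In this sketch the two multipart entries are containers with no content of their own; only the leaf parts (text/plain, text/html, image/png) carry a payload.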
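Note: as some people have pointed out, by "I don't want to skip anything" I mean that I can JSONify most of the mbox, but not the multipart or image parts. The {for part in msg.walk(): ...} section of the code prints the "Skipping" message to show that the code really does skip multipart and image parts; without it, I would simply get a JSON file that lacks the binary data for the images and so on. That message will not be in the final code, once I figure out how to convert images and multipart content to JSON.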
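To make the goal concrete, here is a rough sketch of the kind of result I am after, not a finished solution. It is written in Python 3 with the standard library only, and it assumes that binary parts can be stored as base64 strings and that multipart containers can be passed over because their children are visited separately. The function name jsonify_parts, the 'encoding' and 'filename' keys, and the sample message are all hypothetical.

```python
import base64
import json
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def jsonify_parts(msg):
    """Return a JSON-serializable list with one entry per leaf MIME part."""
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            # A container has no payload of its own; walk() visits its children.
            continue
        # decode=True undoes the Content-Transfer-Encoding
        # (base64, quoted-printable) and returns bytes.
        raw = part.get_payload(decode=True) or b''
        entry = {'contentType': part.get_content_type()}
        filename = part.get_filename()
        if filename:
            entry['filename'] = filename
        if part.get_content_maintype() == 'text':
            charset = part.get_content_charset() or 'utf-8'
            try:
                entry['content'] = raw.decode(charset, 'replace')
            except LookupError:
                # Unknown charset name: fall back to UTF-8.
                entry['content'] = raw.decode('utf-8', 'replace')
        else:
            # Binary data cannot go into JSON directly, so store it as base64 text.
            entry['encoding'] = 'base64'
            entry['content'] = base64.b64encode(raw).decode('ascii')
        parts.append(entry)
    return parts


# Hypothetical sample message with one text part and one fake image.
demo = MIMEMultipart()
demo.attach(MIMEText('hello', 'plain'))
image = MIMEImage(b'\x89PNG fake bytes', _subtype='png')
image.add_header('Content-Disposition', 'attachment', filename='pixel.png')
demo.attach(image)

print(json.dumps(jsonify_parts(demo), indent=2))
```

With this approach, whoever reads the data back from the database would need to base64-decode the 'content' of every entry that has 'encoding' set to 'base64' to recover the original bytes.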