So I'm trying to import about 80 million page views from a log file and store them in the database as sessions, i.e. groups of page views where consecutive views are no more than 20 minutes apart.
So eventually, in my user database, I would like each user to have a list of dictionary objects like so:
{
    "id": "user1",
    "sessions": [
        {
            "start": ISODate("2011-04-03T23:21:59.639Z"),
            "end": ISODate("2011-04-03T23:50:05.518Z"),
            "page_loads": 136
        },
        {
            "start": ISODate("another date"),
            "end": ISODate("later date"),
            "page_loads": 20
        }
    ]
}
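To be concrete about the grouping rule, here's a minimal standalone sketch (made-up timestamps, no database involved) of what I mean by splitting on a 20-minute gap:

from datetime import datetime, timedelta

def split_into_sessions(timestamps, gap=timedelta(minutes=20)):
    # group sorted datetimes into sessions; a new session starts when the gap exceeds 20 minutes
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1]['end'] <= gap:
            sessions[-1]['end'] = ts  # still within the current session
            sessions[-1]['page_loads'] += 1
        else:
            sessions.append({'start': ts, 'end': ts, 'page_loads': 1})
    return sessions

# consecutive views no more than 20 minutes apart, then one 40 minutes later -> two sessions
views = [datetime(2011, 4, 3, 23, 21), datetime(2011, 4, 3, 23, 30),
         datetime(2011, 4, 3, 23, 50), datetime(2011, 4, 4, 0, 30)]
print split_into_sessions(views)  # first session has 3 page loads, second has 1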
Should be fairly simple. So I wrote this script:
from collections import defaultdict
from datetime import datetime
import gzip

howManyLinesTotal = 9999999  # from running wc -l in bash beforehand
count = 0

def blank_session():  # factory, so every user (and every new session) gets its own dict
    return {'page_loads': 0, 'start': 0, 'end': 0}

latest_sessions = defaultdict(blank_session)
f = gzip.open('pageviews.log.gz')  # placeholder path; the gigantic gzipped log
# db is the pymongo collection that holds one document per user

for line in f:
    line = line.split('\t')  # each entry is tab-delimited with: user \t datetime \t page_on_my_site
    user = line[1]  # grab the data from this line in the file
    timestamp = datetime.utcfromtimestamp(float(line[2]))

    latest_sessions[user]['page_loads'] += 1  # add one to this user's current session
    if latest_sessions[user]['start'] == 0:  # put in the start time if there isn't one
        latest_sessions[user]['start'] = timestamp
    if latest_sessions[user]['end'] == 0:  # put in the end time if there isn't one
        latest_sessions[user]['end'] = timestamp
    else:  # otherwise work out how much later this page view is
        diff = (timestamp - latest_sessions[user]['end']).total_seconds()  # .seconds alone would ignore whole days
        if diff > 1200:  # more than 20 minutes later, so save the finished session to the database
            db.update({'id': user}, {'$push': {'sessions': latest_sessions[user]}})
            latest_sessions[user] = blank_session()  # and start a fresh one
        else:
            latest_sessions[user]['end'] = timestamp  # otherwise just replace this endtime

    count += 1
    if count % 100000 == 0:  # output some nice stats every 100,000 lines
        print str(count) + '/' + str(howManyLinesTotal)

# now put the remaining last sessions in
for user in latest_sessions:
    db.update({'id': user}, {'$push': {'sessions': latest_sessions[user]}})
I'm getting about 0.002 seconds per line, which works out to roughly 44 hours for the 80 million page view file (0.002 s × 80,000,000 ≈ 160,000 s ≈ 44 hours).
This is with a 2 TB 7200 rpm Seagate HDD, 32 GB of RAM and a 3.4 GHz dual-core i3 processor.
Does this time sound reasonable, or am I making some horrendous mistakes?
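The per-line figure is just wall-clock timing around the loop, roughly like this (a sketch, not the exact code; process_line is a placeholder for the loop body above):

import time

start = time.time()
for i, line in enumerate(f, 1):
    process_line(line)  # placeholder for the body of the import loop above
    if i % 100000 == 0:
        per_line = (time.time() - start) / i
        print '%.5f s/line -> ~%.1f hours for 80M lines' % (per_line, per_line * 80e6 / 3600)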
EDIT: We're looking at 90,000+ users, i.e. 90,000+ keys in the defaultdict.
EDIT2: Here's the cProfile output on a much smaller 106 MB file. I commented out the actual MongoDB saves for testing purposes: http://pastebin.com/4XGtvYWD
EDIT3: Here's a bar-chart breakdown of that cProfile output: http://i.imgur.com/K6pu6xx.png
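The profile is just the standard cProfile/pstats pattern, roughly like this (a sketch; process_file is a placeholder that wraps the import loop above):

import cProfile
import pstats

cProfile.run('process_file()', 'import.prof')  # profile the import and dump stats to a file
stats = pstats.Stats('import.prof')
stats.sort_stats('cumulative').print_stats(20)  # the 20 most expensive calls by cumulative time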