I need to integrate just over half a terabyte of data (about 525GB) into MongoDB.
The data is visits to my website; each line of the file is a tab-delimited string.
This is my main loop:
import codecs
from datetime import datetime

f = codecs.open(directoryLocation + fileWithData, encoding='utf-8')
count = 0  # start the line counter
for line in f:
    print line
    line = line.split('\t')
    document = {
        'date': line[0],      # a date as a string
        'user_id': line[1],   # a string
        'datetime': line[2],  # a unix timestamp
        'url': line[3],       # a fairly long string
        'ref_url': line[4],   # another fairly long string
        'date_obj': datetime.utcfromtimestamp(float(line[2]))  # a date object
    }
    Visits.insert(document)
    # line integration time/stats
    count = count + 1  # increment the counter
    now = datetime.now()
    diff = now - startTime
    taken = diff.seconds
    avgPerLine = float(taken) / float(count)
    totalTimeLeft = (howManyLinesTotal - count) * avgPerLine
    print "Time left (mins): " + str(totalTimeLeft / 60)  # output the stats
    print "Avg per line: " + str(avgPerLine)
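(Aside: one common speedup I've seen suggested, though I haven't benchmarked it here, is to batch the inserts instead of calling insert() once per line. PyMongo 3+ has insert_many for this; older versions accept a list of documents passed to insert(). A minimal sketch, with the line parsing factored into its own function so it can be tested separately — build_document and insert_batched are names I made up for illustration:)

```python
from datetime import datetime

def build_document(line):
    """Turn one tab-delimited visit line into a MongoDB document."""
    fields = line.rstrip('\n').split('\t')
    return {
        'date': fields[0],       # a date as a string
        'user_id': fields[1],    # a string
        'datetime': fields[2],   # a unix timestamp
        'url': fields[3],        # a fairly long string
        'ref_url': fields[4],    # another fairly long string
        'date_obj': datetime.utcfromtimestamp(float(fields[2]))
    }

def insert_batched(lines, collection, batch_size=1000):
    """Build documents and insert them batch_size at a time,
    rather than one round-trip to MongoDB per line."""
    batch = []
    for line in lines:
        batch.append(build_document(line))
        if len(batch) >= batch_size:
            collection.insert_many(batch)  # insert_many needs PyMongo 3+
            batch = []
    if batch:  # flush whatever is left over
        collection.insert_many(batch)
```

Usage would be something like insert_batched(f, Visits), with f being the open file object.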
I'm currently getting about 0.00095 seconds per line, which is REALLY slow considering the amount of data I need to integrate.
Update: I enabled the PyMongo C extensions after checking that pymongo.has_c() was False. It's now hovering around 0.0007 or 0.0008 seconds per line. Still pretty slow. This is on a 3.3GHz Intel i3 with 16GB of RAM.
What else could be bottlenecking this loop? I pretty much have to have the 3 different date fields, but could get rid of one if it's totally slowing things down.
The stats are really useful because they show me how much time is left during these huge integrations. However, I guess calculating them might be slowing things down at this scale? Or could it be all the printing to the terminal?
EDIT:
I've run the loop, minus the actual insert, through cProfile; this is the result for 99,999 sample lines:
300001 function calls in 32.061 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 32.061 32.061 <string>:1(<module>)
1 31.326 31.326 32.061 32.061 cprofiler.py:14(myfunction)
100000 0.396 0.000 0.396 0.000 {built-in method now}
99999 0.199 0.000 0.199 0.000 {built-in method utcfromtimestamp}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
99999 0.140 0.000 0.140 0.000 {method 'split' of 'str' objects}
EDIT 2: Solved by Matt Tenenbaum. The solution: only output to the terminal once every 10,000 or so lines.
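For reference, applying that fix just means guarding the stats block with a modulo check so it only runs occasionally. A sketch of it factored into a helper (progress_report is a name I invented; it returns None on the lines where nothing should be printed):

```python
from datetime import datetime

def progress_report(count, start_time, total_lines, every=10000):
    """Return a progress string every `every` lines, otherwise None,
    so the terminal is only written to occasionally."""
    if count % every != 0:
        return None
    taken = (datetime.now() - start_time).seconds
    avg_per_line = float(taken) / float(count)
    time_left = (total_lines - count) * avg_per_line
    return "Time left (mins): %s | Avg per line: %s" % (time_left / 60, avg_per_line)
```

Inside the loop, the two print statements become: msg = progress_report(count, startTime, howManyLinesTotal), then print msg only if it isn't None.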