
I need to integrate just over half a terabyte of data into MongoDB (about 525GB).

The data consists of visits to my website; each line of the file is a tab-delimited string.

This is my main loop:

import codecs
from datetime import datetime

f = codecs.open(directoryLocation + fileWithData, encoding='utf-8')
count = 0 #start the line counter
startTime = datetime.now() #reference point for the per-line stats below

for line in f:
    print line
    line = line.rstrip('\n').split('\t') #strip the newline so it doesn't end up in the last field

    document = {
        'date': line[0],     # a date as a string
        'user_id': line[1],  # a string
        'datetime': line[2], # a unix timestamp
        'url': line[3],      # a fairly long string
        'ref_url': line[4],  # another fairly long string
        'date_obj': datetime.utcfromtimestamp(float(line[2])) #a date object
    }

    Visits.insert(document)

    #line integration time/stats
    count = count + 1 #increment the counter
    now = datetime.now()
    diff = now - startTime
    taken = diff.seconds
    avgPerLine = float(taken) / float(count)
    totalTimeLeft = (howManyLinesTotal - count) * avgPerLine #howManyLinesTotal is set elsewhere
    print "Time left (mins): " + str(totalTimeLeft/60) #output the stats
    print "Avg per line: " + str(avgPerLine)

I'm currently getting about 0.00095 seconds per line, which is REALLY slow considering the amount of data I need to integrate.

Update: I checked that pymongo.has_c() was False, so I've now enabled the PyMongo C extensions; it's hovering around 0.0007 or 0.0008 seconds per line. Still pretty slow. This is on a 3.3GHz Intel i3 with 16GB RAM.
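For reference, the check itself is just this (pymongo.has_c() returns True once the C extensions are built and installed):

    import pymongo

    # False means the pure-Python BSON encoder is in use; reinstalling
    # pymongo with its C extensions compiled speeds up document encoding
    print pymongo.has_c()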

What else could be bottlenecking this loop? I pretty much have to have the three different date fields, but I could get rid of one if it's really slowing things down.

The stats are really useful because they show me how much longer these huge integrations will take. However, I guess calculating them might be slowing things down at this scale? Could it be all the printing to the terminal?
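One way to test this is to wrap the loop in a function and run it under cProfile; a minimal sketch (the name myfunction matches the profile output below):

    import cProfile

    def myfunction():
        # the integration loop from above, minus the Visits.insert() call
        for line in f:
            print line
            line = line.rstrip('\n').split('\t')
            # ... build the document and compute the stats as before ...

    cProfile.run('myfunction()')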

EDIT:

I've put the loop, minus the actual insert, into cProfile, and this is the result for 99,999 sample lines:

         300001 function calls in 32.061 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   32.061   32.061 <string>:1(<module>)
        1   31.326   31.326   32.061   32.061 cprofiler.py:14(myfunction)
   100000    0.396    0.000    0.396    0.000 {built-in method now}
    99999    0.199    0.000    0.199    0.000 {built-in method utcfromtimestamp}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    99999    0.140    0.000    0.140    0.000 {method 'split' of 'str' objects}

EDIT 2: Solved by Matt Tenenbaum. The solution: only output to the terminal once every 10,000 or so lines.


2 Answers

  1. Delete all indexes that exist on the collection (rebuild them after the load)
  2. If you have replication, set w=0, j=0 so writes are not acknowledged
  3. Use bulk inserts (see the sketch after the link below)

Some performance tests: http://www.arangodb.org/2012/09/04/bulk-inserts-mongodb-couchdb-arangodb
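A minimal sketch of points 2 and 3, reusing the question's Visits collection with a hypothetical batch size of 1000. It assumes pymongo 2.4+, where insert() accepts a list of documents and a w=0 keyword to skip per-write acknowledgement (older drivers use safe=False instead):

    from datetime import datetime

    batch = []
    for line in f:
        fields = line.rstrip('\n').split('\t')
        batch.append({
            'date': fields[0],
            'user_id': fields[1],
            'datetime': fields[2],
            'url': fields[3],
            'ref_url': fields[4],
            'date_obj': datetime.utcfromtimestamp(float(fields[2])),
        })
        if len(batch) == 1000:
            Visits.insert(batch, w=0)  # one round trip for the whole batch
            batch = []

    if batch:
        Visits.insert(batch, w=0)  # flush the remainder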

answered 2012-12-20T20:26:16

As Zagorulkin says, it's important not to be doing a bunch of index maintenance while inserting, so make sure there are no operative indexes on the collection.
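For example, a sketch against the question's Visits collection (the user_id index at the end is a hypothetical one you might want to rebuild afterwards):

    # drop all secondary indexes before the load; the mandatory _id index remains
    Visits.drop_indexes()

    # ... run the integration loop ...

    # rebuild whatever indexes you need once the load is done, e.g.:
    Visits.ensure_index('user_id')  # hypothetical index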

Beyond that, you probably want to limit the feedback to every 1000 lines (or some other number, based on how many lines you have to process and how much feedback you want). Instead of doing all the calculations and printing on every iteration, change the last block of your code to test count so that it only does that work once every 1000 iterations:

    #line integration time/stats
    count = count + 1 #increment the counter
    if count % 1000 == 0:
        now = datetime.now()
        diff = now - startTime
        taken = diff.seconds
        avgPerLine = float(taken) / float(count)
        totalTimeLeft = (howManyLinesTotal-count) * avgPerLine
        print "Time left (mins): " + str(totalTimeLeft/60) #output the stats
        print "Avg per line: " + str(avgPerLine)

Again, 1000 might not be the right number for you, but something like that will prevent a lot of this work from happening, while still giving you the feedback you're looking for.

answered 2012-12-20T22:29:25