
I have a large CSV file that I am importing into Django. At the moment, if my math is correct, it will be done in 32 hours! Is it possible to speed this up?

I have a CSV file with ~157,000 rows and 15 columns. I am reading this into my Django model and saving it off to a MySQL database. Here is where the magic happens:

import csv

reader = csv.reader(csvFile, delimiter=',', quotechar='"')
for row in reader:
    tmpRecord = Employee(
        emp_id=row[0],  # Primary key
        name=row[1],
        # snipped for brevity; other columns assigned
        group_abbr=row[14],
    )

    # Reorder "LASTNAME FIRSTNAME MIDDLE" into "Firstname Middle Lastname".
    # str.split replaces the deprecated string.split() module function.
    pieces = tmpRecord.name.title().split(" ")
    newName = pieces[1]
    try:
        newName += " " + pieces[2]  # middle name is optional
    except IndexError:
        pass
    newName += " " + pieces[0]
    tmpRecord.name = newName

    # One database round trip per row
    tmpRecord.save()

The "pieces" chunk is taking the name field from "LASTNAME FIRSTNAME MIDDLE" and turning it into "Firstname Middle Lastname".
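For reference, the same reordering can be pulled into a small helper, which makes it easier to test apart from the import loop. This is only a sketch: `reorder_name` is a hypothetical name, and unlike the loop above it keeps every part after the last name (not just one middle name) and returns one-word names unchanged instead of raising IndexError.

```python
def reorder_name(raw_name):
    """Turn 'LASTNAME FIRSTNAME [MIDDLE ...]' into 'Firstname [Middle ...] Lastname'."""
    # split() with no argument also collapses repeated spaces
    pieces = raw_name.title().split()
    if len(pieces) < 2:
        # Nothing to reorder for empty or one-word names
        return " ".join(pieces)
    last_name, given_names = pieces[0], pieces[1:]
    return " ".join(given_names + [last_name])
```

Inside the loop this would be used as `tmpRecord.name = reorder_name(row[1])`.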

This will be run about once a month to update the database with new employees and any changes to existing employee records. Most of the time an existing record isn't going to change, but any one (or more) of its fields might. Is there a check I could add that takes less time than just calling save() on each record?
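One possible shape for such a check, as a rough sketch: read the current table contents once, then compare each CSV row in memory and only touch the rows that are new or different. The function name `split_rows` and the dict-of-tuples layout are assumptions for illustration. The `existing` snapshot could probably be built with a single query such as `Employee.objects.values_list(...)`, and the values would need to be normalized to the same types as the CSV strings before comparing.

```python
def split_rows(incoming, existing):
    """Classify incoming rows as new or changed.

    incoming: dict mapping emp_id -> tuple of field values from the CSV
    existing: dict mapping emp_id -> tuple of field values now in the database
    Returns (new_ids, changed_ids); unchanged rows appear in neither list.
    """
    new_ids = []
    changed_ids = []
    for emp_id, fields in incoming.items():
        if emp_id not in existing:
            new_ids.append(emp_id)       # never seen before: needs an INSERT
        elif existing[emp_id] != fields:
            changed_ids.append(emp_id)   # at least one field differs: needs an UPDATE
    return new_ids, changed_ids
```

With this split, only the ids in `changed_ids` need a save() call, and the new ones can be inserted in bulk.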

At the moment, this is taking about 15 seconds per 20 records to complete! Is there a way I can speed this up (substantially)?

UPDATE:

If it matters, the emp_id is the table's primary key. No employee ever has the same id as a previous employee (retired employees included).


2 Answers


I think bulk_create will help you: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create

If you run into problems with rows that already exist in the database, insert the data into a separate table first, then fix things up with SQL queries.
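A minimal sketch of how the inserts could be batched, assuming Django 1.4 or later (where bulk_create is available). `bulk_insert` and the batch size of 500 are hypothetical choices, not anything from the Django docs. Note that bulk_create only inserts and does not call save() on each object, so rows whose primary key already exists would have to be handled separately.

```python
from itertools import islice


def bulk_insert(manager, objects, batch_size=500):
    """Send objects to manager.bulk_create() in batches; return how many were sent.

    manager is expected to be a Django model manager such as Employee.objects
    (anything with a bulk_create(list) method works).
    """
    iterator = iter(objects)
    total = 0
    while True:
        # Take the next batch_size objects without loading everything at once
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        manager.bulk_create(batch)  # one multi-row INSERT per batch
        total += len(batch)
    return total
```

In the import script this might be called as `bulk_insert(Employee.objects, records)`, where `records` is the list (or generator) of unsaved Employee instances built from the CSV rows.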

Answered 2013-04-10T19:41:26.750

Perhaps you could use your Python script to prepare an intermediate CSV, then load that file with MySQL's LOAD DATA statement instead?

http://dev.mysql.com/doc/refman/5.6/en/load-data.html
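Something along these lines might work as the preparation step. It is only a sketch: `write_load_file`, `reorder_name`, the table name `employee` and the file name `employees_load.csv` are all made up for illustration, and the name column is assumed to be at index 1 as in the question. The REPLACE keyword in the statement makes MySQL overwrite rows whose primary key already exists, which seems to fit the monthly update case.

```python
import csv


def reorder_name(raw_name):
    """Turn 'LASTNAME FIRSTNAME [MIDDLE ...]' into 'Firstname [Middle ...] Lastname'."""
    pieces = raw_name.title().split()
    if len(pieces) < 2:
        return " ".join(pieces)
    return " ".join(pieces[1:] + [pieces[0]])


def write_load_file(source, target, name_column=1):
    """Copy CSV rows from source to target, fixing the name column on the way.

    source and target are open text file objects. Returns the number of rows written.
    """
    reader = csv.reader(source, delimiter=",", quotechar='"')
    # Use "\n" so the file matches LINES TERMINATED BY '\n' below
    writer = csv.writer(target, delimiter=",", quotechar='"', lineterminator="\n")
    count = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        row[name_column] = reorder_name(row[name_column])
        writer.writerow(row)
        count += 1
    return count


# Statement to run in MySQL afterwards (table and file names are hypothetical)
LOAD_SQL = r"""
LOAD DATA LOCAL INFILE 'employees_load.csv'
REPLACE INTO TABLE employee
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
"""
```

The column order in the intermediate file has to match the table's column order, or the statement needs an explicit column list.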

Answered 2013-04-10T19:46:00.737