I have a large CSV file that I am importing into Django. At the moment, if my math is correct, it will be done in 32 hours! Is it possible to speed this up?
I have a CSV file with ~157,000 rows and 15 columns. I am reading this into my Django model and saving it off to a MySQL database. Here is where the magic happens:
import csv
import string

reader = csv.reader(csvFile, delimiter=',', quotechar='"')
for row in reader:
    tmpRecord = Employee(
        emp_id=row[0],   # primary key
        name=row[1],
        # snipped for brevity; other columns assigned
        group_abbr=row[14]
    )

    # Reorder "LASTNAME FIRSTNAME MIDDLE" into "Firstname Middle Lastname"
    pieces = string.split(tmpRecord.name.title(), " ")
    newName = pieces[1]
    try:
        newName += " " + pieces[2]
    except IndexError:
        pass
    newName += " " + pieces[0]

    tmpRecord.name = newName
    tmpRecord.save()
The "pieces" chunk is taking the name field from "LASTNAME FIRSTNAME MIDDLE" and turning it into "Firstname Middle Lastname".
This will be run about once a month to update the database with new employees and any changes to existing employee records. Much of the time an existing record won't change, but any one (or more) of the fields might. Is there a check I could add that would take less time than just calling save() on each record?
At the moment, this is taking about 15 seconds per 20 records to complete! Is there a way I can speed this up (substantially)?
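To make the "check before saving" idea concrete, this is roughly what I was picturing: pull the existing rows into memory once, then only call save() on rows that are new or have actually changed. Completely untested, and I'm only comparing two of the fields here for brevity:

# Sketch only: load existing employees keyed by primary key,
# then save only the rows that are new or differ from the database.
existing = dict((e.pk, e) for e in Employee.objects.all())

for row in reader:
    tmpRecord = Employee(emp_id=row[0], name=row[1], group_abbr=row[14])
    # ... name reshuffling as above ...
    old = existing.get(tmpRecord.emp_id)
    if old is None or old.name != tmpRecord.name or old.group_abbr != tmpRecord.group_abbr:
        tmpRecord.save()

I don't know whether holding ~157,000 Employee instances in memory like that is sensible, or whether the comparison actually saves anything over just saving unconditionally.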
UPDATE:
If it matters, emp_id is the table's primary key. No employee ever has the same id as a previous employee (retired employees included).
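For reference, the relevant part of the model looks roughly like this (field types and lengths here are just a sketch; the real model has all 15 columns):

from django.db import models

class Employee(models.Model):
    emp_id = models.CharField(max_length=30, primary_key=True)
    name = models.CharField(max_length=100)
    # ... other columns snipped ...
    group_abbr = models.CharField(max_length=10)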