python - Python : What's wrong with my code of multi processes inserting to MySQL?

Question

I've write a Python script to insert some data(300 millions) to a MySQL table:

#!/usr/bin/python

import os
import MySQLdb

from multiprocessing import Pool

class DB(object):

  def __init__(self):
    self.conn = MySQLdb.connect(host='localhost',user='root',passwd='xxx',db='xdd',port=3306)
    self.cur = self.conn.cursor()

  def insert(self, arr):
    self.cur.execute('insert into RAW_DATA values(null,%s,%s,%s,%s,%s,%s,%s)', arr)

  def close(self):
    self.conn.commit()
    self.cur.close()
    self.conn.close()


def Import(fname):
  db = DB()

  print 'importing ', fname
  with open('data/'+fname, 'r') as f:

    for line in f:
      arr = line.split()
      db.insert(arr)

  db.close()


if  __name__ == '__main__':
  # 800+ files
  files = [d for d in os.listdir('data') if d[-3:]=='txt']

  pool = Pool(processes = 10)
  pool.map(Import, files)

The problem is, the script runs very very slow, is there any obvious wrong using of multiprocessing ?

score 4 · Accepted Answer

是的，如果您将 3 亿行批量插入到同一个表中，那么您不应该尝试并行化此插入。所有插入都必须经过相同的瓶颈：更新索引，写入硬盘上的物理文件。这些操作需要独占访问底层资源（索引或磁盘头）。

您实际上在数据库上添加了一些无用的开销，现在必须处理多个并发事务。这会消耗内存，强制上下文切换，使磁盘读取头一直跳来跳去，等等。

将所有内容插入同一个线程中。

看起来您实际上是从一种 CSV 文件中导入数据。LOAD DATA INFILE您可能想要使用专为此目的而设计的内置MySQL 命令。如果您在调整此命令时需要一些帮助，请描述您的源文件。

python - Python : What's wrong with my code of multi processes inserting to MySQL?

1 回答 1

Related

Reference