A client has a financial-services website where, every morning, he receives a CSV spreadsheet full of leads. Each sheet typically has around 10k-15k records. What he wants is to upload that spreadsheet, have the server parse it, check his database for duplicate leads, insert the new ones, check the records against an external API, and then email the qualified leads out.

I've already built him an ad-hoc utility that does all of this, but the server and database appear to be overstressed by doing so much at once. He has to split the files into batches of 1,000 records, which annoys him. The utility uploads the file, walks through the spreadsheet, and does everything above, but it can't handle that many records in one pass.

So my question is: does anyone have general advice on how to approach something like this, and what you would consider for it? In particular, being able to upload a single file and then not worry about it for the rest of the day would put a big smile on his face.

Here is how I currently process the records (don't laugh):

<?php
// standard php file upload handler
include("upload.inc.php");

$conn = mysql_connect("localhost", "username", "password");
mysql_select_db("database", $conn);

if ($_FILES['csvFile']['name']) {

    $upload_dir = $_SERVER['DOCUMENT_ROOT'] . "/upload/files/";
    list($file, $errMsg) = upload('csvFile', $upload_dir, '');

    // clear the db table
    $sql = "DELETE FROM tempTable";
    $result = mysql_query($sql) or die("Error: " . mysql_error() . "<br>");

    // process the file
    $numRows = 0;
    $toBeRemoved = 0;
    $fileName = $upload_dir . $file;

    if (($handle = fopen($fileName, "r")) !== FALSE) {

        while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {

            // write the row to the mysql duplicate-checker table;
            // escape all 46 columns in one loop instead of spelling each out
            $fields = array();
            for ($i = 0; $i < 46; $i++) {
                $value = isset($data[$i]) ? $data[$i] : '';
                $fields[] = "'" . mysql_real_escape_string($value) . "'";
            }

            $sql = "INSERT INTO tempTable (process_date,firstname,middlename,lastname,ssn,dob,"
                 . "dl_number,dl_state,gender,military_active,amount_requested,residence_type,"
                 . "residence_length,address1,address2,city,state,zip,phone_home,phone_cell,"
                 . "contact_time,email,ip_addr,pay_frequency,net_income,first_payday,second_payday,"
                 . "employment_status,employer_name,job_title,hire_date,phone_work,phone_work2,"
                 . "bank_name,account_type,direct_deposit,reference1_firstname,reference1_lastname,"
                 . "reference1_relationship,phone_reference1,reference2_firstname,reference2_lastname,"
                 . "reference2_relationship,phone_reference2,routing_no,account_no) "
                 . "VALUES (" . implode(",", $fields) . ")";

            $result = mysql_query($sql) or die("Error: " . mysql_error() . "<br>");
            $numRows++;
        }

        fclose($handle);

        // now look for duplicates
        $sql_1 = "SELECT account_no, COUNT(*) FROM tempTable GROUP BY account_no";
        $result_1 = mysql_query($sql_1) or die("Error: " . mysql_error() . "<br>");

        while (list($acct, $numcount) = mysql_fetch_row($result_1)) {
            // if there is more than one, delete all of them
            if ($numcount > 1) {
                $toBeRemoved += $numcount;
                $sql_delete = "DELETE FROM tempTable WHERE (account_no = '$acct')";
                $result_delete = mysql_query($sql_delete) or die("Error: " . mysql_error() . "<br>");
            }
        }

        // now remove the records that already exist in the customer table
        $sql_2 = "SELECT account_no FROM customerTable";
        $result_2 = mysql_query($sql_2) or die("Error: " . mysql_error() . "<br>");

        while (list($acct) = mysql_fetch_row($result_2)) {
            $sql_delete = "DELETE FROM tempTable WHERE (account_no = '$acct')";
            $result_delete = mysql_query($sql_delete) or die("Error: " . mysql_error() . "<br>");
        }

        // now send the user to the new page with more options
        header("Location: finish.php");
    }
}
?>



<html>
<body>
<?php
if (!empty($_GET['msg'])) {
    $msg = $_GET['msg'];
    // escape before echoing so a crafted msg parameter can't inject markup
    echo "<strong>" . htmlspecialchars($msg) . "</strong>";
}
?>
<form method="post" action="<?php echo $_SERVER['PHP_SELF']; ?>" enctype="multipart/form-data">
Upload CSV File: <input type="file" name="csvFile" size="30">
<input type="submit">
</form>
</body>
</html>

3 Answers


You can use a background job runner for this kind of thing. I've been doing a lot of that lately:

https://github.com/seatgeek/djjob

http://seatgeek.com/blog/dev/djjob-a-php-port-of-delayed_job
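
The idea is that the upload handler only saves the file and enqueues a job, then returns immediately; a worker process does the heavy lifting outside the web request. A minimal sketch based on djjob's README (the ProcessLeadsCsvJob class, the processLeadsFile() helper, and the connection values are illustrative, not from the original post, and the exact configure() signature may differ by djjob version):

require_once "DJJob.php";

// point djjob at the database holding its jobs table
// (assumes the jobs table from djjob's setup SQL already exists)
DJJob::configure("mysql:host=127.0.0.1;dbname=djjob;", array(
    "mysql_user" => "username",
    "mysql_pass" => "password",
));

// djjob only requires that a job object have a perform() method
class ProcessLeadsCsvJob {
    private $fileName;

    public function __construct($fileName) {
        $this->fileName = $fileName;
    }

    public function perform() {
        // parse the CSV, dedupe, insert, hit the external API, send the
        // emails -- all of the existing logic, just outside the web request
        processLeadsFile($this->fileName); // hypothetical helper
    }
}

// in the upload handler, right after the file is saved:
DJJob::enqueue(new ProcessLeadsCsvJob($upload_dir . $file));

A separate worker script, run from cron or the shell, then pulls jobs off the queue and executes them, so the browser request finishes in milliseconds no matter how big the file is.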

Hope this helps.

answered 2012-06-17 at 22:11:07
0

1- Make sure both tempTable.account_no and customerTable.account_no are indexed (they may already be primary keys). You may also want ALTER TABLE tempTable ENGINE=MEMORY;, which makes MySQL keep the table entirely in memory (no disk writes, but the table contents are lost when the server shuts down).
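
In the upload script's own style, those one-time schema changes could be applied like this (a sketch; the index names are invented, and ENGINE=MEMORY should only be applied to tempTable, whose contents are disposable):

// one-time setup: index account_no on both tables so the duplicate
// checks can use index lookups instead of full table scans
mysql_query("ALTER TABLE tempTable ADD INDEX idx_temp_acct (account_no)")
    or die("Error: " . mysql_error());
mysql_query("ALTER TABLE customerTable ADD INDEX idx_cust_acct (account_no)")
    or die("Error: " . mysql_error());

// optional: keep the staging table purely in memory
mysql_query("ALTER TABLE tempTable ENGINE=MEMORY")
    or die("Error: " . mysql_error());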

2- Import the CSV file into the table:

ALTER TABLE tempTable DISABLE KEYS; -- otherwise, MySQL recomputes indexes after each line is inserted
LOAD DATA LOCAL INFILE 'yourfile.csv' INTO TABLE tempTable
FIELDS TERMINATED BY ',' ;
ALTER TABLE tempTable ENABLE KEYS; -- recompute all indexes in one go

You will of course need to customize the LOAD DATA command to match the exact format of your input file.
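
For example, if the feed keeps the column order from the script above and has no header row, the call from PHP might look like this (a sketch; the quoting and line-ending clauses are assumptions about the feed's format, and the MySQL client may need local-infile enabled for LOCAL to work):

$sql = "LOAD DATA LOCAL INFILE '" . mysql_real_escape_string($fileName) . "'
        INTO TABLE tempTable
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'
        (process_date, firstname, middlename, lastname, ssn, dob, dl_number,
         dl_state, gender, military_active, amount_requested, residence_type,
         residence_length, address1, address2, city, state, zip, phone_home,
         phone_cell, contact_time, email, ip_addr, pay_frequency, net_income,
         first_payday, second_payday, employment_status, employer_name,
         job_title, hire_date, phone_work, phone_work2, bank_name,
         account_type, direct_deposit, reference1_firstname,
         reference1_lastname, reference1_relationship, phone_reference1,
         reference2_firstname, reference2_lastname, reference2_relationship,
         phone_reference2, routing_no, account_no)";
mysql_query($sql) or die("Error: " . mysql_error());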

3- Delete the duplicates:

DELETE tempTable.*
FROM tempTable

-- match with the list of duplicates
LEFT JOIN (
    SELECT account_no
    FROM tempTable
    GROUP BY account_no
    HAVING COUNT(account_no) > 1
) AS duplicates
    ON duplicates.account_no = tempTable.account_no

-- match with records in customerTable
LEFT JOIN customerTable
    ON customerTable.account_no = tempTable.account_no

-- records with either duplicates, or with a match in customerTable
WHERE duplicates.account_no IS NOT NULL OR customerTable.account_no IS NOT NULL;
answered 2012-06-18 at 23:29:24

Try timing some of your database queries, and batch them together where you can. It may be that one of them is slowing everything down.
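
For instance, the per-row INSERT in the question can be collected into multi-row statements, cutting ~15k round trips down to a few dozen. A sketch (the 500-row batch size and the timing output are arbitrary choices, not from the original post):

$columns = "(process_date,firstname,middlename,lastname,ssn,dob,dl_number,"
         . "dl_state,gender,military_active,amount_requested,residence_type,"
         . "residence_length,address1,address2,city,state,zip,phone_home,"
         . "phone_cell,contact_time,email,ip_addr,pay_frequency,net_income,"
         . "first_payday,second_payday,employment_status,employer_name,"
         . "job_title,hire_date,phone_work,phone_work2,bank_name,account_type,"
         . "direct_deposit,reference1_firstname,reference1_lastname,"
         . "reference1_relationship,phone_reference1,reference2_firstname,"
         . "reference2_lastname,reference2_relationship,phone_reference2,"
         . "routing_no,account_no)";

$batch = array();
$batchSize = 500; // arbitrary; tune against max_allowed_packet
$t0 = microtime(true);

while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
    $fields = array();
    for ($i = 0; $i < 46; $i++) {
        $value = isset($data[$i]) ? $data[$i] : '';
        $fields[] = "'" . mysql_real_escape_string($value) . "'";
    }
    $batch[] = "(" . implode(",", $fields) . ")";

    // flush one multi-row INSERT per batch instead of one query per row
    if (count($batch) >= $batchSize) {
        mysql_query("INSERT INTO tempTable $columns VALUES " . implode(",", $batch))
            or die("Error: " . mysql_error());
        $batch = array();
    }
}

// flush whatever is left over
if ($batch) {
    mysql_query("INSERT INTO tempTable $columns VALUES " . implode(",", $batch))
        or die("Error: " . mysql_error());
}

error_log(sprintf("insert phase took %.3fs", microtime(true) - $t0));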

answered 2012-06-17 at 23:03:08