For a research project I am collecting data from a local bus company's GPS system (through their API). I created a PHP cron job that runs every minute to obtain data such as the vehicle, route ID, location, destination, etc. The data does not contain a unique "run number" for each bus route (a unique number that lets me track the progression of a single bus along its route), so I created my own: the script checks whether the vehicle ID, destination, and relative time are similar, and assigns the same unique "run ID" so that I can track the bus along its route. If no run ID exists, a random one is generated. (Any vehicle with the same "vid" and "pid" within 2 minutes of the last inserted row's "timeadded" is on the same run, and this is important for my research.)

Each time the cron runs (1 minute), approximately 80 rows are added into the database.

Initially the job would run quickly. However, with over 500,000 rows now, I've noticed the job can take upwards of 40 seconds. I believe it's because for each of the ~80 rows, it has to check the entire table ("vehicles") to see if the same run ID exists, essentially querying a large table and inserting a row 80 times. I want to get at least a week's worth of data (on day 4 now), at which point I can export the data, erase all rows, and start over. My question is: Is there any way I can refactor my PHP/SQL code to make the process run faster? It's been years since I've worked with SQL, so I'm sure there's a more ingenious way to insert all this data.

// Obtain data from XML
$xml = simplexml_load_file("url.xml");
foreach ($xml->vehicle as $vehicle) {
    $vid = $vehicle->vid;
    $tm = $vehicle->tmstmp;
    $dat = substr($vehicle->tmstmp, 0, 8);
    $tme = substr($vehicle->tmstmp, 9);
    $lat = $vehicle->lat;
    $lon = $vehicle->lon;
    $hdg = $vehicle->hdg;
    $pid = $vehicle->pid;
    $rt = $vehicle->rt;
    $des = $vehicle->des;
    $pdist = $vehicle->pdist;

    // Database connection and insert
    mysql_connect("redacted", "redacted", "redacted") or die(mysql_error());
    mysql_select_db("redacted") or die(mysql_error());
    // Look for a row from the same vehicle/pattern/route added in the last 2 minutes (rt is quoted, as in the INSERT below)
    $sql_findsim = "SELECT vid, pid, timeadded, run, rt FROM vehicles WHERE vid=" . mysql_real_escape_string($vid) . " AND pid=" . mysql_real_escape_string($pid) . " AND rt='" . mysql_real_escape_string($rt) . "' AND timeadded > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 2 MINUTE);";
    $handle = mysql_query($sql_findsim);
    $row = mysql_fetch_row($handle);
    $runid = $row[3];
    if($runid !== null) {
        $run = $runid;
    } else {
        $run = substr(md5(rand()), 0, 30);
    }
    $sql = "INSERT INTO vehicles (vid, tmstmp, dat, tme, lat, lon, hdg, pid, rt, des, pdist, run) VALUES ($vid,'$tm','$dat','$tme','$lat','$lon',$hdg,$pid,'$rt','$des',$pdist,'$run')";
    $result = mysql_query($sql);
}




Thanks for any help with refactoring this code to get it to run more quickly and efficiently.


2 Answers




vid,  pid, run, rt 

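If no matching row exists, insert one (make vid auto-increment). You can then check the id in the table above instead of checking vid in the vehicles table.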


answered 2013-11-11T05:12:31.663

Are there any indexes on the table? A composite index on (vid,pid,rt,timeadded) will make the query faster by avoiding a full table scan.

create index fastmagic on vehicles (vid,pid,rt,timeadded)

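Alternatively, you can skip the select altogether and just insert without assigning the random "run" value. This keeps your cron job at "constant time", because all you are doing is adding new data.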
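
Since the question mentions inserting a row 80 times per run, the insert-only approach could also send all of a minute's rows in a single statement. The following is only a sketch under assumptions: `buildBatchInsert` is a hypothetical helper (not part of the original code), the column list is copied from the question's INSERT minus `run`, and the result is meant to be passed to a prepared statement, for example PDO's `prepare()` followed by `execute($params)`.

```php
<?php
// Hypothetical helper: builds one multi-row INSERT plus its flat parameter list.
// Assumes every row is an array (or SimpleXMLElement values cast to string)
// keyed by the column names used in the question's INSERT, without "run".
function buildBatchInsert(array $rows): array
{
    if (count($rows) === 0) {
        throw new InvalidArgumentException('No rows to insert');
    }

    $columns = ['vid', 'tmstmp', 'dat', 'tme', 'lat', 'lon', 'hdg', 'pid', 'rt', 'des', 'pdist'];

    // One "(?,?,...,?)" group per row, joined by commas.
    $group = '(' . implode(',', array_fill(0, count($columns), '?')) . ')';
    $groups = implode(',', array_fill(0, count($rows), $group));

    // Flatten the values in the same order as the placeholders.
    $params = [];
    foreach ($rows as $row) {
        foreach ($columns as $column) {
            $params[] = (string) $row[$column];
        }
    }

    $sql = 'INSERT INTO vehicles (' . implode(', ', $columns) . ') VALUES ' . $groups;

    return [$sql, $params];
}
```

Using placeholders rather than string interpolation also means the values no longer need manual escaping.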

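After you have your week's worth of data, go back and write "second pass" code that steps through each row (select * from vehicles order by timeadded). For each row, do a "select" similar to the one you already have, then "update" the row you are currently working on.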
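
The run-assignment logic of such a second pass might look like the sketch below. It makes several assumptions: `assignRuns` is a hypothetical name, the rows have already been fetched in `timeadded` order with `timeadded` converted to a Unix timestamp, and the per-row "select" is swapped for an in-memory lookup of the last row seen for each vid/pid/rt combination. Writing the resulting `run` values back with an "update" is left out.

```php
<?php
// Hypothetical second-pass helper: assigns a run ID to every row.
// $rows must be sorted by 'timeadded' ascending (Unix timestamps assumed).
function assignRuns(array $rows, int $windowSeconds = 120): array
{
    // Last row seen per vehicle/pattern/route: key => ['time' => int, 'run' => string]
    $lastSeen = [];

    foreach ($rows as $i => $row) {
        $key = $row['vid'] . '|' . $row['pid'] . '|' . $row['rt'];

        if (isset($lastSeen[$key]) && $row['timeadded'] - $lastSeen[$key]['time'] < $windowSeconds) {
            // Same vehicle, pattern and route within the window: continue the run.
            $run = $lastSeen[$key]['run'];
        } else {
            // Otherwise start a new run with a random 30-character hex ID.
            $run = bin2hex(random_bytes(15));
        }

        $rows[$i]['run'] = $run;
        $lastSeen[$key] = ['time' => $row['timeadded'], 'run' => $run];
    }

    return $rows;
}

// Example: two pings one minute apart share a run; a ping much later starts a new one.
$tagged = assignRuns([
    ['vid' => 1, 'pid' => 10, 'rt' => '5', 'timeadded' => 0],
    ['vid' => 1, 'pid' => 10, 'rt' => '5', 'timeadded' => 60],
    ['vid' => 1, 'pid' => 10, 'rt' => '5', 'timeadded' => 400],
]);
```

Because each row is compared with the previous row of the same vehicle rather than with the current time, this pass does not depend on when it is executed.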


answered 2013-11-11T05:11:24.160