For a research project I am collecting data from a local bus company's GPS system (through their API). I wrote a PHP cron job that runs every minute and fetches each vehicle's ID, route ID, location, destination, and so on. The data does not contain a unique "run number" for each bus trip (an identifier that would let me track the progression of a single bus along its route), so I generate my own: for each incoming record, the script looks for a recently added row with the same "vid", "pid", and "rt", and if it finds one, it reuses that row's "run" ID so that I can track the bus along its route. If no matching row exists, a random run ID is generated. (Any vehicle with the same "vid" and "pid" within 2 minutes of the last inserted row's "timeadded" is on the same run, and this is important for my research.)
Each time the cron job runs (once a minute), approximately 80 rows are added to the database.
Initially the job would run quickly. However, with over 500,000 rows now, I've noticed the job can take upwards of 40 seconds. I believe it's because for each of the ~80 rows, it has to check the entire table ("vehicles") to see if the same run ID exists, essentially querying a large table and inserting a row 80 times. I want to get at least a week's worth of data (on day 4 now), at which point I can export the data, erase all rows, and start over. My question is: Is there any way I can refactor my PHP/SQL code to make the process run faster? It's been years since I've worked with SQL, so I'm sure there's a more ingenious way to insert all this data.
<?php
// Obtain data from XML
$xml = simplexml_load_file("url.xml");
foreach ($xml->vehicle as $vehicle) {
    $vid = $vehicle->vid;
    $tm = $vehicle->tmstmp;
    $dat = substr($vehicle->tmstmp, 0, 8);
    $tme = substr($vehicle->tmstmp, 9);
    $lat = $vehicle->lat;
    $lon = $vehicle->lon;
    $hdg = $vehicle->hdg;
    $pid = $vehicle->pid;
    $rt = $vehicle->rt;
    $des = $vehicle->des;
    $pdist = $vehicle->pdist;

    // Database connection and insert
    mysql_connect("redacted", "redacted", "redacted") or die(mysql_error());
    mysql_select_db("redacted") or die(mysql_error());

    // Look for a row from the same vehicle added in the last 2 minutes
    $sql_findsim = "SELECT vid, pid, timeadded, run, rt FROM vehicles"
        . " WHERE vid=" . mysql_real_escape_string($vid)
        . " AND pid=" . mysql_real_escape_string($pid)
        . " AND rt=" . mysql_real_escape_string($rt)
        . " AND timeadded > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 2 MINUTE);";
    $handle = mysql_query($sql_findsim);
    $row = mysql_fetch_row($handle);
    $runid = $row[3];

    // Reuse the existing run ID, or generate a new random one
    if ($runid !== null) {
        $run = $runid;
    } else {
        $run = substr(md5(rand()), 0, 30);
    }

    $sql = "INSERT INTO vehicles (vid, tmstmp, dat, tme, lat, lon, hdg, pid, rt, des, pdist, run)"
        . " VALUES ($vid,'$tm','$dat','$tme','$lat','$lon',$hdg,$pid,'$rt','$des',$pdist,'$run')";
    $result = mysql_query($sql);
    mysql_close();
}
?>
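For comparison, here is a minimal sketch of one way the same logic might be restructured. It is not a tested fix, and several things in it are assumptions: the function name storeVehicles() and the index name idx_run_lookup are hypothetical, PDO is swapped in for the mysql_* functions (which were removed in PHP 7), and the two-minute cutoff and "timeadded" value are computed in PHP as UTC strings rather than with DATE_SUB(CURRENT_TIMESTAMP, ...) and a column default, purely so the sketch is not tied to one database driver. The ideas it illustrates are connecting once instead of once per vehicle, preparing the two statements once, committing the batch in a single transaction, and giving the lookup an index so it does not have to scan the whole table.

```php
<?php
// Sketch only. storeVehicles() and idx_run_lookup are hypothetical names;
// the column names are taken from the INSERT in the original script.
// Assumes `timeadded` can be written explicitly as a 'Y-m-d H:i:s' UTC string.

// One-off schema change (run once, outside the cron job) so the lookup can
// use an index instead of scanning every row:
//   CREATE INDEX idx_run_lookup ON vehicles (vid, pid, rt, timeadded);

function storeVehicles(PDO $db, $vehicles, int $now): int
{
    // Prepare both statements once and reuse them for every vehicle.
    $find = $db->prepare(
        'SELECT run FROM vehicles
          WHERE vid = ? AND pid = ? AND rt = ? AND timeadded > ?
          ORDER BY timeadded DESC LIMIT 1'
    );
    $insert = $db->prepare(
        'INSERT INTO vehicles
            (vid, tmstmp, dat, tme, lat, lon, hdg, pid, rt, des, pdist, run, timeadded)
         VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'
    );

    $added  = gmdate('Y-m-d H:i:s', $now);
    $cutoff = gmdate('Y-m-d H:i:s', $now - 120); // two-minute window

    $count = 0;
    $db->beginTransaction(); // one commit for the whole batch
    foreach ($vehicles as $v) {
        $tm = (string) $v->tmstmp;

        // Reuse the run ID of the most recent matching row, if there is one.
        $find->execute([(string) $v->vid, (string) $v->pid, (string) $v->rt, $cutoff]);
        $run = $find->fetchColumn();
        $find->closeCursor();
        if ($run === false || $run === null) {
            $run = substr(md5(rand()), 0, 30); // same generator as the original
        }

        $insert->execute([
            (string) $v->vid, $tm, substr($tm, 0, 8), substr($tm, 9),
            (string) $v->lat, (string) $v->lon, (string) $v->hdg,
            (string) $v->pid, (string) $v->rt, (string) $v->des,
            (string) $v->pdist, $run, $added,
        ]);
        $count++;
    }
    $db->commit();

    return $count;
}
```

With the feed from the script above, this would be called once per cron run as storeVehicles($db, $xml->vehicle, time()), where $db is a single PDO connection opened before the call. On MySQL the cutoff could equally stay as DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 2 MINUTE) with "timeadded" left to its column default, provided every row is written the same way.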
Thanks for any help with refactoring this code to get it to run more quickly and efficiently.