I'm trying to play with a GTFS database, namely the one provided by the RATP for Paris and its suburbs.
The data set is huge: the stop_times
table alone has 14 million rows.
Here are the table schemas: https://github.com/mauryquijada/gtfs-mysql/blob/master/gtfs-sql.sql
I'm trying to find the most efficient way to get the routes available at a specific location. As far as I understand the GTFS spec, these are the tables and the links that lead from my data (lat/lon) to the routes:
stops | stop_times | trips | routes
-----------+----------------+------------+--------------
lat | stop_id | trip_id | route_id
lon | trip_id | route_id |
stop_id | | |
I have compiled what I want into three steps (one per link between the four tables above), published in this gist for clarity: https://gist.github.com/BenoitDuffez/4eba85e3598ebe6ece5f
Here's how I created this script.
First, I was able to find all the stops within walking distance (say, 200m) in less than a second. I use:
$ . mysql.ini && time mysql -h $host -N -B -u $user -p${pass} $name -e "SELECT stop_id, (6371000*acos(cos(radians(48.824699))*cos(radians(s.stop_lat))*cos(radians(2.3243)-radians(s.stop_lon))+sin(radians(48.824699))*sin(radians(s.stop_lat)))) AS distance
FROM stops s
GROUP BY s.stop_id
HAVING distance < 200
ORDER BY distance ASC" | awk '{print $1}'
3705271
4472979
4036891
4036566
3908953
3908755
3900765
3900693
3900607
4473141
3705272
4472978
4036892
4036472
4035057
3908952
3705288
3908814
3900832
3900672
3900752
3781623
3781622
real 0m0.797s
user 0m0.000s
sys 0m0.000s
Then, getting all the stop_times remaining today (i.e. with stop_times.departure_time greater than the current time, $(date +%T)) takes a lot of time:
"SELECT trip_id
FROM stop_times
WHERE
stop_id IN ($stops) AND departure_time >= '$now'
GROUP BY trip_id"
Here, $stops
contains the list of stop IDs obtained in the first step. An example run:
$ . mysql.ini && time mysql -h $host -N -B -u $user -p${pass} $name -e "SELECT trip_id
FROM stop_times
WHERE
    stop_id IN ($stops) AND departure_time >= '$now'
GROUP BY trip_id"
...
9916360850964321
9916360920964320
9916360920964321
real 1m21.399s
user 0m0.000s
sys 0m0.000s
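My guess is that this step is slow because there is no composite index covering the WHERE clause, so MySQL scans a large part of the 14M rows. A sketch of the fix, illustrated with SQLite and toy data so the plan can be inspected (the index name idx_stop_departure is my own; in MySQL the equivalent would be ALTER TABLE stop_times ADD INDEX ... on the same two columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE stop_times (
    trip_id TEXT, stop_id TEXT, departure_time TEXT)""")
conn.executemany(
    "INSERT INTO stop_times VALUES (?, ?, ?)",
    [("t1", "3705271", "10:00:00"),
     ("t2", "3705271", "23:00:00"),
     ("t3", "4472979", "09:30:00")])

# Composite index: the engine can jump straight to each
# (stop_id, departure_time >= now) range instead of scanning the table.
conn.execute("""CREATE INDEX idx_stop_departure
                ON stop_times (stop_id, departure_time)""")

plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT trip_id FROM stop_times
    WHERE stop_id IN ('3705271', '4472979')
      AND departure_time >= '12:00:00'
    GROUP BY trip_id""").fetchall()
print(plan)  # the plan should reference idx_stop_departure
```

With only ~20 stop IDs and a time lower bound, an index range scan should touch a few thousand rows at most rather than millions.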
There are more than 2,000 lines in this result.
My last step was to select all the routes that match these trip_id
s. It's quite easy, and rather fast:
$ . mysql.ini && time mysql -h $host -u $user -p${pass} $name -e "SELECT r.id, r.route_long_name FROM trips t, routes r WHERE t.trip_id IN (`cat trip_ids | tr '\n' '#' | sed -e 's/##$//' -e 's/#/,/g'`) AND r.route_id = t.route_id GROUP BY t.route_id"
+------+-------------------------------------------------------------------------+
| id | route_long_name |
+------+-------------------------------------------------------------------------+
| 290 | (PLACE DE CLICHY <-> CHATILLON METRO) - Aller |
| 291 | (PLACE DE CLICHY <-> CHATILLON METRO) - Retour |
| 404 | (PORTE D'ORLEANS-METRO <-> ECOLE VETERINAIRE DE MAISON-ALFORT) - Aller |
| 405 | (PORTE D'ORLEANS-METRO <-> ECOLE VETERINAIRE DE MAISON-ALFORT) - Retour |
| 453 | (PORTE D'ORLEANS-METRO <-> LYCEE POLYVALENT) - Retour |
| 457 | (PORTE D'ORLEANS-METRO <-> LYCEE POLYVALENT) - Retour |
| 479 | (PORTE D'ORLEANS-METRO <-> VELIZY 2) - Retour |
| 810 | (PLACE DE LA LIBERATION <-> GARE MONTPARNASSE) - Aller |
| 989 | (PORTE D'ORLEANS-METRO) - Retour |
| 1034 | (PLACE DE LA LIBERATION <-> HOTEL DE VILLE DE PARIS_4E__AR) - Aller |
+------+-------------------------------------------------------------------------+
real 0m1.070s
user 0m0.000s
sys 0m0.000s
Here, the file trip_ids
contains the ~2,000 trip IDs.
How can I get this result faster? Is there a better way to walk the data than the stops > stop_times > trips > routes
path I have taken?
The total time here is over 80 seconds for what is really ONE question: "What are the routes available 200m from this location?". That's too much...
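For reference, the single-query rewrite I'm considering would chain all four tables in one pass instead of three round trips glued together with shell. Sketched here with SQLite and toy data (SQLite lacks acos/radians by default, so this version filters stops with a bounding box; in MySQL I would keep the exact haversine clause from step one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stops      (stop_id TEXT, stop_lat REAL, stop_lon REAL);
CREATE TABLE stop_times (trip_id TEXT, stop_id TEXT, departure_time TEXT);
CREATE TABLE trips      (trip_id TEXT, route_id TEXT);
CREATE TABLE routes     (route_id TEXT, route_long_name TEXT);

INSERT INTO stops VALUES ('s1', 48.8247, 2.3243), ('s2', 48.9, 2.5);
INSERT INTO stop_times VALUES ('t1', 's1', '23:00:00'), ('t2', 's2', '23:00:00');
INSERT INTO trips VALUES ('t1', 'r1'), ('t2', 'r2');
INSERT INTO routes VALUES ('r1', 'PLACE DE CLICHY <-> CHATILLON METRO'),
                          ('r2', 'elsewhere');
""")

# One pass: nearby stops -> departures still to come -> trips -> routes.
query = """
SELECT DISTINCT r.route_id, r.route_long_name
FROM stops s
JOIN stop_times st ON st.stop_id  = s.stop_id
JOIN trips t       ON t.trip_id   = st.trip_id
JOIN routes r      ON r.route_id  = t.route_id
WHERE s.stop_lat BETWEEN ? AND ?
  AND s.stop_lon BETWEEN ? AND ?
  AND st.departure_time >= ?
"""
rows = conn.execute(query, (48.823, 48.826, 2.322, 2.327, '12:00:00')).fetchall()
print(rows)
```

This also avoids shipping 2,000 trip IDs back and forth through the mysql client between steps, which by itself accounts for part of the wall-clock time.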