In addition to being trivial to parse, these files can also be imported into SQLite with just a few commands. I often used SQL to debug missing vehicles or find the details of a Route when I was working on these renderings.
sqlite> SELECT route_id, route_short_name, route_color
FROM routes WHERE route_type = '1'
ORDER BY route_id ASC LIMIT 5;
route_id     route_short_name  route_color
-----------  ----------------  -----------
100110001:1  1                 FFCD00
100110002:2  2                 003CA6
100110003:3  3                 837902
100110004:4  4                 CF009E
100110005:5  5                 FF7E2E
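As a rough illustration of how little is needed to get GTFS files into SQLite, here is a minimal sketch in Python rather than the sqlite3 shell. The helper name `import_gtfs_table` is hypothetical, every column is stored as TEXT for simplicity, and the sample data is just two of the rows shown above; it is not necessarily the set of commands used for these renderings.

```python
import csv
import io
import sqlite3


def import_gtfs_table(conn, table, csv_file):
    """Load one GTFS CSV file into a table, storing every column as TEXT."""
    reader = csv.reader(csv_file)
    header = next(reader)  # the first line of a GTFS file names the columns
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


# Tiny stand-in for routes.txt, reusing two rows from the output above.
sample_routes = io.StringIO(
    "route_id,route_short_name,route_type,route_color\n"
    "100110001:1,1,1,FFCD00\n"
    "100110002:2,2,1,003CA6\n"
)

conn = sqlite3.connect(":memory:")
import_gtfs_table(conn, "routes", sample_routes)
rows = conn.execute(
    "SELECT route_id, route_color FROM routes "
    "WHERE route_type = '1' ORDER BY route_id"
).fetchall()
print(rows)
```

For a real feed, the same function could be called once per file, for example with `open("routes.txt", newline="", encoding="utf-8-sig")` so that a leading byte-order mark does not end up in the first column name.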
Given a route_id, we can list its scheduled trips. Here are two random trips on line 1:
sqlite> SELECT trip_id, trip_headsign FROM trips
WHERE route_id = '100110001:1'
ORDER BY RANDOM() LIMIT 2;
trip_id             trip_headsign
------------------  -------------------------
125111807-1_212922  La Défense (Grande Arche)
125025182-1_213183  Château de Vincennes
(trip_headsign is what you’d see on the schedule board as the vehicle’s destination)
Then for these two trips, let’s look up the first 5 planned stops, with their times and locations:
sqlite> SELECT substr(trips.trip_id, -6, 6) AS `trip`,
  stop_times.departure_time AS `departs`,
  stop_times.arrival_time AS `arrives`,
  stops.stop_name, stops.stop_lat, stops.stop_lon,
  stop_times.stop_sequence AS `seq`
FROM trips
LEFT JOIN stop_times ON (trips.trip_id = stop_times.trip_id)
LEFT JOIN stops ON (stop_times.stop_id = stops.stop_id)
WHERE trips.route_id = '100110001:1'
  AND trips.trip_id IN ('125111807-1_212922', '125025182-1_213183')
  AND CAST(stop_times.stop_sequence AS INTEGER) < 5
ORDER BY trips.trip_id ASC, CAST(stop_times.stop_sequence AS INTEGER) ASC;
trip     departs    arrives    stop_name                  stop_lat   stop_lon   seq
-------  ---------  ---------  -------------------------  ---------  ---------  ---
213183   08:14:00   08:14:00   La Défense (Grande Arche)  48.89182   2.23799    0
213183   08:15:00   08:15:00   Esplanade de la Défense    48.88835   2.24993    1
213183   08:17:00   08:17:00   Pont de Neuilly            48.88550   2.25852    2
213183   08:18:00   08:18:00   Les Sablons (Jardin d'acc  48.88129   2.27191    3
213183   08:20:00   08:20:00   Porte Maillot              48.87800   2.28246    4
212922   19:26:00   19:26:00   Château de Vincennes       48.84432   2.44055    0
212922   19:28:00   19:28:00   Bérault                    48.84536   2.42824    1
212922   19:29:00   19:29:00   Saint-Mandé                48.84623   2.419      2
212922   19:30:00   19:30:00   Porte de Vincennes         48.84701   2.41081    3
212922   19:32:00   19:32:00   Nation                     48.84811   2.39800    4
After adding a few indexes on the <type>_id fields, even complex queries over a large data set are pretty snappy.
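To make the indexing step concrete, here is a minimal sketch of indexes on the join and filter keys, using Python's sqlite3 module. The index names are hypothetical and the tables are reduced to a few columns. `EXPLAIN QUERY PLAN` is a quick way to check whether SQLite searches an index instead of scanning the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-ins for the GTFS tables; a real import has many more columns.
conn.executescript("""
    CREATE TABLE trips (route_id TEXT, trip_id TEXT, trip_headsign TEXT);
    CREATE TABLE stop_times (trip_id TEXT, stop_id TEXT, stop_sequence TEXT);
    CREATE TABLE stops (stop_id TEXT, stop_name TEXT);

    -- Index names are hypothetical; the columns are the join and filter keys.
    CREATE INDEX idx_trips_route_id ON trips (route_id);
    CREATE INDEX idx_stop_times_trip_id ON stop_times (trip_id);
    CREATE INDEX idx_stops_stop_id ON stops (stop_id);
""")

# The last column of each EXPLAIN QUERY PLAN row is a human-readable detail.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT trip_id FROM stop_times WHERE trip_id = ?",
    ("125111807-1_212922",),
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)
```

The printed plan should mention `idx_stop_times_trip_id`; the exact wording varies between SQLite versions.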
For reference, the GTFS data set for Paris lists 1,882 routes (lines) over which 443,026 vehicles will journey between 63,676 stops for a total of 9,921,422 scheduled stop times.
OpenMobilityData is a free repository of GTFS data providing access to over 1,200 feeds from public transit agencies in more than 50 countries. It is maintained by a Canadian non-profit organization and hosts high-quality data sets that are frequently updated.
Feeds that aren’t fully valid according to the GTFS specification are clearly labeled, with mistakes listed on the download page:
I also generated a longer version where the vehicles can be followed more easily. At 4 seconds of schedule time per frame and a 6-minute run time, they tend to zip across the map very quickly; at 2 seconds per frame, the slower video is 12 minutes long.
For this area, I started by cropping a map of the Bay Area to focus on San Francisco and Oakland. Here, we see mostly SFMTA buses and Muni metro light rail in San Francisco, along with a few BART trains crossing the Bay, Caltrain commuters coming from the south, and AC Transit buses in Oakland.
Finding data for the Bay Area is more challenging than for Paris and London, due to the number of operators involved. Thankfully, the Metropolitan Transportation Commission publishes a common API on 511.org to download GTFS schedules for 33 local operators. The rendering below is based on data from all of these feeds, even though not all companies operate in the cropped area.
The difference between this region and Paris or London is striking, and reflects the vast difference in public transport funding and use between these cities. Whereas Paris and its close suburbs have over 1,200 trains and 5,600 buses running at rush hour, we only see a maximum of 30 trains and 800 buses here.
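Counts like these can be approximated directly from the schedule tables. The sketch below is one possible approach, not necessarily how the figures above were obtained: it treats a trip as active between its first departure and its last arrival, and groups active trips by `route_type` (in GTFS, `1` is subway/metro and `3` is bus). The rows and the helper names `to_seconds` and `active_vehicles` are made up for illustration, and the query ignores service calendars, so on a real feed it would also count trips that do not run on the chosen day.

```python
import sqlite3


def to_seconds(gtfs_time):
    """Convert a GTFS 'HH:MM:SS' time to seconds; hours may exceed 24."""
    hours, minutes, seconds = (int(part) for part in gtfs_time.split(":"))
    return hours * 3600 + minutes * 60 + seconds


conn = sqlite3.connect(":memory:")
conn.create_function("to_seconds", 1, to_seconds)
conn.executescript("""
    CREATE TABLE routes (route_id TEXT, route_type TEXT);
    CREATE TABLE trips (route_id TEXT, trip_id TEXT);
    CREATE TABLE stop_times (trip_id TEXT, arrival_time TEXT, departure_time TEXT);

    -- Made-up rows for illustration: one metro trip and two bus trips.
    INSERT INTO routes VALUES ('R1', '1'), ('R2', '3');
    INSERT INTO trips VALUES ('R1', 'T1'), ('R2', 'T2'), ('R2', 'T3');
    INSERT INTO stop_times VALUES
        ('T1', '08:00:00', '08:00:00'), ('T1', '08:40:00', '08:40:00'),
        ('T2', '08:10:00', '08:10:00'), ('T2', '08:50:00', '08:50:00'),
        ('T3', '09:00:00', '09:00:00'), ('T3', '09:30:00', '09:30:00');
""")


def active_vehicles(conn, gtfs_time):
    """Count trips in progress at the given time, grouped by route_type."""
    query = """
        SELECT routes.route_type, COUNT(*)
        FROM (
            SELECT trip_id,
                   MIN(to_seconds(departure_time)) AS first_departure,
                   MAX(to_seconds(arrival_time)) AS last_arrival
            FROM stop_times GROUP BY trip_id
        ) AS spans
        JOIN trips ON trips.trip_id = spans.trip_id
        JOIN routes ON routes.route_id = trips.route_id
        WHERE ? BETWEEN spans.first_departure AND spans.last_arrival
        GROUP BY routes.route_type
        ORDER BY routes.route_type
    """
    return conn.execute(query, (to_seconds(gtfs_time),)).fetchall()


print(active_vehicles(conn, "08:30:00"))  # [('1', 1), ('3', 1)]
```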
This rendering is similar to the SF & Oakland one, but extended to show the entire San Francisco Bay, from San Jose in the south up to parts of Marin County in the north and Livermore in the east. We can spot the automated trains at San Francisco Airport, as well as the Blue, Green, and Orange lines of VTA light rail in the South Bay. There are even a few buses roaming around Half Moon Bay by the ocean.
At rush hour the map shows as many as 60 trains, 1,600 buses, 70 trams, and 10 boats.
I’d like to add more cities in the future if I can find high-quality data; I was thinking of trying New York City, Beijing, or Tokyo next. Depending on the amount of metadata included in the GTFS feeds, the process can take some time if I have to add color or route type information manually for a few routes.
Comments and suggestions are welcome! You can contact me on Twitter.
I’ve been playing with maps for the past few weeks in an effort to find the best location for my next flat, cross-referencing several data sources to assign scores to offers from agencies and property owners.
I started with a small rendering engine for OpenStreetMap data, incorporating more data sources as I found them. It turns out that creating a small rendering engine is pretty simple as long as you don’t bother drawing the roads:
The rendering on its own is clearly inferior to Google Maps, but the data itself is very valuable: being able to understand how streets are connected together makes it possible to implement search algorithms on top of a graph representation of the city. So instead of focusing my efforts on building a better renderer, I’ve decided to keep using Google Maps as a background image only and to implement my application using simple projections on top of that background.
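For readers curious what a "simple projection" onto a map background might involve, here is a minimal sketch. It assumes the background image uses the Web Mercator projection (as Google Maps does) and that the latitude and longitude of its edges are known. The bounds below are hypothetical values roughly around central London, and the function names are made up for this example.

```python
import math


def web_mercator(lat, lon):
    """Project WGS84 degrees to Web Mercator, normalized to 0..1 on both axes."""
    x = (lon + 180.0) / 360.0
    lat_rad = math.radians(lat)
    # asinh(tan(lat)) is the Mercator y; flip it so that y grows downwards.
    y = (1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0
    return x, y


def to_pixel(lat, lon, bounds, width, height):
    """Map a coordinate to pixel (x, y) inside a background image.

    bounds is (north, west, south, east) for the image edges, in degrees.
    """
    north, west, south, east = bounds
    left, top = web_mercator(north, west)
    right, bottom = web_mercator(south, east)
    x, y = web_mercator(lat, lon)
    px = (x - left) / (right - left) * width
    py = (y - top) / (bottom - top) * height
    return px, py


# Hypothetical bounds for a 3840 x 2160 background image.
bounds = (51.60, -0.30, 51.40, 0.10)
print(to_pixel(51.60, -0.30, bounds, 3840, 2160))  # top-left corner: (0.0, 0.0)
print(to_pixel(51.50, -0.10, bounds, 3840, 2160))  # near the image center
```

The horizontal axis is linear in longitude, while the vertical axis is not linear in latitude, which is why the point halfway between the north and south edges lands slightly below the vertical middle of the image.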
In my search for relevant data sources, I downloaded a copy of the Transport for London (TfL) schedule list, describing each train, bus, or boat and its planned stops with their geographical location. The files are in a TfL-specific format, which can be converted to the more accessible GTFS.
An application that combines several input feeds and prints out a result based on their interaction needs clean data sources with minimal bias and low error rates. In order to make sense of the TfL schedules, I plotted the journeys of every vehicle I knew of, looking for obvious gaps in the data. Tube trains and the DLR are plotted with colored circles, while buses are represented as red squares.
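Plotting a vehicle at an arbitrary moment requires estimating where it is between two scheduled stops. A minimal sketch of one way to do this is straight-line interpolation between consecutive stops, shown below with the first three stops of the Paris line 1 trip listed earlier. This is an assumption for illustration: the actual rendering code may follow the track geometry or handle dwell times differently, and `position_at` is a hypothetical name.

```python
def to_seconds(gtfs_time):
    """Convert a GTFS 'HH:MM:SS' time to seconds; hours may exceed 24."""
    hours, minutes, seconds = (int(part) for part in gtfs_time.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def position_at(stops, when):
    """Return (lat, lon) of a vehicle at time `when`, or None if not running.

    stops is a list of (time, lat, lon) tuples ordered by stop_sequence.
    """
    t = to_seconds(when)
    for (t0_text, lat0, lon0), (t1_text, lat1, lon1) in zip(stops, stops[1:]):
        t0, t1 = to_seconds(t0_text), to_seconds(t1_text)
        if t0 <= t <= t1:
            if t1 == t0:
                return lat0, lon0
            # Fraction of the way between the two stops, from 0.0 to 1.0.
            ratio = (t - t0) / (t1 - t0)
            return lat0 + ratio * (lat1 - lat0), lon0 + ratio * (lon1 - lon0)
    return None


# First three stops of trip 213183 from the query output.
stops = [
    ("08:14:00", 48.89182, 2.23799),  # La Défense (Grande Arche)
    ("08:15:00", 48.88835, 2.24993),  # Esplanade de la Défense
    ("08:17:00", 48.88550, 2.25852),  # Pont de Neuilly
]
print(position_at(stops, "08:16:00"))  # halfway between the last two stops
print(position_at(stops, "08:13:00"))  # None: the trip has not started yet
```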
After leaving this project untouched for almost 9 years, I recently re-discovered the code I used to generate these videos and decided to try re-rendering them in higher quality, this time in 4K at 60 frames per second.
Each video was generated with ffmpeg by processing 21,600 4K frames in PNG format (3,840 × 2,160), which amounts to a runtime of 6 minutes at 60 frames per second. The PNG files add up to over 200 GB per video, but the resulting video files are only 2-3 GB (H.264 is pretty impressive!).
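The numbers above can be checked with a little arithmetic. The sketch below assumes that the "4 seconds per frame" mentioned for the faster video means 4 seconds of schedule time per rendered frame, in which case a 6-minute video covers exactly one day. It also estimates the average PNG size and the uncompressed size of the frames for comparison.

```python
FPS = 60                # frames per second of the output video
RUNTIME_MINUTES = 6     # video length stated in the text
SECONDS_PER_FRAME = 4   # assumed: schedule time that elapses per frame

frames = RUNTIME_MINUTES * 60 * FPS
covered_hours = frames * SECONDS_PER_FRAME / 3600

# Average PNG size if the frames total 200 GB (the text says "over 200 GB").
average_png_mb = 200e9 / frames / 1e6

# Size of the same frames as uncompressed 24-bit RGB, for comparison.
raw_gb = frames * 3840 * 2160 * 3 / 1e9

print(frames)                    # 21600
print(covered_hours)             # 24.0
print(round(average_png_mb, 1))  # 9.3
print(round(raw_gb))             # 537
```

An average of roughly 9 MB per frame is in line with the sizes quoted for the sample frames.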
Here is what a 4K frame looks like (click to view the full 11 MB file):
And the same frame at the wider zoom level (full file is 9 MB):
This visualization made a few things clear for me:
I am confident that I have enough data to use in my application, even though there seem to be some minor issues with buses.
London has a lot of buses! The TfL data includes over 14 million stop times.
Tube lines are subject to frequent outages for maintenance, especially on the weekend. The videos don’t reflect these as they only show planned journeys.