Tracking bikeshare usage with GBFS feeds
The General Bikeshare Feed Specification (GBFS) is a popular standard for publishing live data about bikeshare systems, used by operators around the globe. Its stated purpose is to allow third-party applications to interface with the live data and to allow municipalities to monitor compliance with local regulations. For my purposes, it also allows me to monitor system usage in real time. Here I'll go through the steps I use to track bikeshare usage in real time using GBFS feeds. I currently publish live tracking of Vancouver bikeshare systems at @VanBikeShareBot and will soon be adding tracking of the Toronto, Hamilton and Montreal systems.
I've bundled the tools I use for GBFS monitoring into a Python package that can be found on GitHub. The Bikedata package is not a full-service GBFS client -- for that, you might prefer GBFS-client. Bikedata provides minimal functionality for querying GBFS feeds and returning Pandas dataframes, and implements some helper functions for persistent monitoring of bikeshare systems.
The GBFS spec¶
I won't go into detail about the GBFS spec; more information can be found on the project's GitHub page. Suffice it to say that a GBFS-compliant system offers several distinct JSON endpoints that provide information about the system. For example, Mobi Bikes in Vancouver provides:
- https://vancouver-gbfs.smoove.pro/gbfs/gbfs.json: A list of available feeds
- https://vancouver-gbfs.smoove.pro/gbfs/en/system_information.json: General system information
- https://vancouver-gbfs.smoove.pro/gbfs/en/station_information.json: Details about stations (short and long names, location, coordinates)
- https://vancouver-gbfs.smoove.pro/gbfs/en/station_status.json: Live status of stations (bikes available, free docks)
Mobi doesn't have free-floating bikes, but systems that do also provide a free_bike_status.json feed giving the locations of available free-floating bikes.
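All of these endpoints can be discovered programmatically from the top-level gbfs.json file rather than hard-coded. Here's a minimal sketch, assuming the GBFS v1-style layout where feeds are listed per language:
import json
import urllib.request

def discover_feeds(gbfs_url, lang='en'):
    # gbfs.json lists the available feeds per language:
    # {"data": {"en": {"feeds": [{"name": ..., "url": ...}, ...]}}}
    with urllib.request.urlopen(gbfs_url) as data_url:
        data = json.loads(data_url.read().decode())
    return {feed['name']: feed['url'] for feed in data['data'][lang]['feeds']}

discover_feeds('https://vancouver-gbfs.smoove.pro/gbfs/gbfs.json')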
Tracking station-based systems¶
Many bikeshare systems, especially those in dense city centres, only allow trips to begin and end at physical stations. To monitor these systems, I periodically query the station_status.json feed and record the number of bikes at each station. If the number of available bikes at a station decreases by N, I count that as N departures from the station. If it increases by M, I count that as M bikes returned.
import pandas as pd
import json
import urllib.request
import datetime as dt
import time
import matplotlib.pyplot as plt
%matplotlib notebook
def query_station_status(url):
    """Query a GBFS station_status feed and return a tidy dataframe."""
    with urllib.request.urlopen(url) as data_url:
        data = json.loads(data_url.read().decode())
    df = pd.DataFrame(data['data']['stations'])
    # drop inactive stations
    df = df[df.is_renting == 1]
    df = df[df.is_returning == 1]
    df = df.drop_duplicates(['station_id', 'last_reported'])
    # convert UNIX timestamps to UTC datetimes
    df.last_reported = df.last_reported.map(lambda x: dt.datetime.utcfromtimestamp(x))
    df['time'] = data['last_updated']
    df.time = df.time.map(lambda x: dt.datetime.utcfromtimestamp(x))
    df = df.set_index('time')
    df.index = df.index.tz_localize('UTC')
    return df
station_url = 'https://vancouver-gbfs.smoove.pro/gbfs/en/station_status.json'
df = query_station_status(station_url)
df.head()
df = pd.pivot_table(df,columns='station_id',index='time',values='num_bikes_available')
df
Running the above code periodically lets me build up a dataframe of the number of bikes at each station over time. Below is an example that runs for 10 minutes.
stations_df = pd.DataFrame()
for i in range(10):
    df = query_station_status(station_url)
    df = pd.pivot_table(df, columns='station_id', index='time', values='num_bikes_available')
    stations_df = pd.concat([stations_df, df])
    time.sleep(60)
stations_df
To compute the number of trips started at each station at each time interval, I compute the difference in number of bikes between each successive query.
taken_df = stations_df - stations_df.shift(1)
taken_df = taken_df.fillna(0.0).astype(int)
# a negative change means bikes departed; keep departures and flip the sign
taken_df[taken_df > 0] = 0
taken_df = taken_df*-1
taken_df
Summing across each row gives us the number of bikes taken during each interval. Summing down each column gives us trips per station. Summing both gives the total number of trips.
taken_df.sum(1)
taken_df.sum().head()
taken_df.sum(1).sum()
There are some limitations to this method. The Mobi GBFS feed only updates every 60 seconds or so (this may vary across systems). If one person takes a bike from a station and another person returns a bike to that station within the same update period, no trip will be registered. These "collision" events can be estimated and accounted for (see Chardon et al, "Estimating bike-share trips using station level data"). Experimental physicists might be reminded of the concept of dead time of a detector. I haven't yet implemented a correction for collision events in my work.
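To get a feel for the size of this effect, here's a toy simulation (purely illustrative, not the correction from the paper): treat the departures and returns at a station during one polling interval as independent Poisson draws, and compare the true departure count with what the snapshot-difference method detects.
import numpy as np

# Toy dead-time simulation: as stations get busier, more departures are
# masked by returns landing in the same polling interval.
rng = np.random.default_rng(0)
for rate in [0.1, 1.0, 5.0]:  # mean events per polling interval
    departures = rng.poisson(rate, 10000)
    returns = rng.poisson(rate, 10000)
    detected = np.clip(departures - returns, 0, None)  # net decreases only
    print(f'rate={rate}: detected {detected.sum()} of {departures.sum()} departures')
At low rates the two counts agree closely; at high rates the detected count falls well short of the truth.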
This model also doesn't account for bike rebalancing done by system staff. Staff periodically move bikes from crowded stations to empty stations to keep the system balanced. One might guess at rebalancing events by noting when a large number of bikes appears or disappears near-instantaneously at the same station. This is also not currently implemented in my work.
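A crude version of that heuristic is easy to sketch: flag any interval where a station's count jumps by some threshold (here 6 bikes, the same arbitrary cutoff suggested in the summary below):
# Flag probable rebalancing events: 6+ bikes appearing or disappearing
# at one station within a single polling interval (arbitrary threshold).
changes = stations_df.diff()
rebalancings = changes[changes.abs() >= 6].stack()
rebalancings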
Tracking free bike systems¶
Some bikeshare systems allow trips to be started and ended anywhere within the system boundary. This model seems to be more popular with scooter-share systems (which also use GBFS) and in smaller or less dense areas. As an example, I'll use the HOPR bikeshare at the University of British Columbia. The procedure for tracking free bikes is very similar to that for tracking stations.
free_bikes_url = 'https://gbfs.hopr.city/api/gbfs/13/free_bike_status'
def query_free_bikes(url):
    """Query a GBFS free_bike_status feed and return a tidy dataframe."""
    with urllib.request.urlopen(url) as data_url:
        data = json.loads(data_url.read().decode())
    df = pd.DataFrame(data['data']['bikes'])
    df['bike_id'] = df['bike_id'].astype(str)
    # convert the feed's UNIX timestamp to a UTC datetime index
    df['time'] = data['last_updated']
    df.time = df.time.map(lambda x: dt.datetime.utcfromtimestamp(x))
    df = df.set_index('time')
    df.index = df.index.tz_localize('UTC')
    return df
df = query_free_bikes(free_bikes_url)
df.head()
By querying the free_bike_status.json feed, we retrieve the latitude and longitude of each available bike in the system, excluding bikes in use. (Sidenote: in working out this procedure, I discovered a bug in some bikeshare systems that allowed the tracking of bikes while in use.)
Initially I used the bike_id field to track free bikes: when a given bike_id disappeared from the feed I inferred that the bike was in use, and when it returned to the feed I inferred the trip was over. The GBFS maintainers have decided that this is a security vulnerability, presumably because if you know the ID of the bike an individual is using you can track where that person went with it. In the future, bike_id will either be rotated periodically or removed from the feed entirely, so I no longer rely on it at all.
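For reference, the bike_id approach amounts to diffing the set of visible IDs between successive polls. A minimal sketch, which only works on feeds that still publish stable IDs:
# Infer trip starts/ends from bike_id churn between two polls.
# Only meaningful on feeds that still publish stable bike IDs.
ids_before = set(query_free_bikes(free_bikes_url).bike_id)
time.sleep(60)
ids_after = set(query_free_bikes(free_bikes_url).bike_id)
trips_started = ids_before - ids_after  # IDs that vanished from the feed
trips_ended = ids_after - ids_before    # IDs that (re)appeared in the feed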
Instead, I round the coordinates of each bike to four decimal places and treat each coordinate pair as if it were a station. Most coordinates will only have one bike, but designated bike parking spots might have multiple bikes at the same coordinates.
df = df.round(4)
df['coords'] = list(zip(df.lat, df.lon))
# count bikes at each rounded coordinate pair
df = pd.pivot_table(df, values='lat', index='time', columns='coords', aggfunc='count').fillna(0)
df
Similarly to above, let's query the free_bike_status.json feed for 10 minutes and then infer trips by computing the change in the number of bikes at each set of coordinates.
bikes_df = pd.DataFrame()
for i in range(10):
    df = query_free_bikes(free_bikes_url)
    df = df.round(4)
    df['coords'] = list(zip(df.lat, df.lon))
    df['coords'] = df['coords'].map(lambda x: str(x))
    df = pd.pivot_table(df, values='lat', index='time', columns='coords', aggfunc='count').fillna(0)
    bikes_df = pd.concat([bikes_df, df], sort=True)
    bikes_df = bikes_df.fillna(0)
    time.sleep(60)
bikes_df
taken_df = (bikes_df - bikes_df.shift(1)).fillna(0).astype(int)
# a negative change means a bike left those coordinates; keep departures
taken_df[taken_df > 0] = 0
taken_df = taken_df*-1
taken_df = taken_df.stack()
taken_df = taken_df.reset_index()
taken_df.columns = ['time', 'coords', 'trips']
# convert the stringified coordinate tuples back into (lat, lon) floats
taken_df.coords = taken_df.coords.map(lambda y: tuple([float(x) for x in y.strip("'()").split(',')]))
taken_df = taken_df.set_index('time')
taken_df = taken_df[taken_df.trips > 0]
taken_df
In this example we only capture one trip during the 10 minutes of tracking. We have the coordinates of where the trip started, but we don't know its duration or where it ended. We can also track returns (see the sketch below), but we can't match up departures with returns without more information. As discussed above, this is a deliberate decision by the GBFS maintainers to ensure user privacy.
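Tracking returns is the mirror image of the departures computation above: keep the positive changes instead of the negative ones. A sketch (coordinates are left as strings here for brevity):
# Returns are the mirror image of departures: keep the positive changes.
returned_df = (bikes_df - bikes_df.shift(1)).fillna(0).astype(int)
returned_df[returned_df < 0] = 0
returned_df = returned_df.stack().reset_index()
returned_df.columns = ['time', 'coords', 'returns']
returned_df = returned_df[returned_df.returns > 0]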
Tracking free bikes runs into problems similar to those of station-based systems. Bikes taken offline for any reason -- rebalancing, maintenance, being disabled -- will be counted as a trip. There is also still the possibility of collision events if a bike is left at a bike rack just as a different bike is taken from the same rack, or if one bike is used for two consecutive trips by different people without enough time between the trips for the bike to show back up in the feed.
Hybrid systems¶
Some systems, such as Hamilton's SoBi bikeshare, combine the station-based and free bike models. Trips typically start and end at designated stations, but for a small fee users can leave bikes locked up anywhere in the system zone. To track these systems, I run both the station-based and free-bikes algorithms above and combine the counts to determine the total trips in the system.
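In code the combination is just a sum. A sketch, where station_taken_df and free_taken_df stand in for the per-interval departure counts produced by the two methods above (hypothetical names; the two series are resampled to a common frequency before adding):
# Combine per-interval departure counts from the two methods.
# station_taken_df and free_taken_df are assumed to be the wide
# time-by-location count frames produced above.
station_trips = station_taken_df.sum(1).resample('5min').sum()
free_trips = free_taken_df.sum(1).resample('5min').sum()
total_trips = station_trips.add(free_trips, fill_value=0)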
Other systems, like HOPR at UBC, provide both the station and free bike feeds, but bikes are double counted. For example, a bike left at a designated "station" (in reality just a marked area on the sidewalk) would show up in both feeds. In this case, I only watch the free bikes feed. Some familiarity with how each system actually operates is required to properly interpret the GBFS feeds.
Validation¶
Mobi generously provides official system data of all trips taken since 2017. Here I'll use that data to check how the official numbers compare to my tracking data. First, I'll load the system data. For this exercise, I'll limit the date range to January through September 2019 inclusive. I've done some minor cleaning of Mobi's official data to standardize field names and drop nonsensical records.
system_df = pd.read_csv('/home/msj/mobi-data/Mobi_System_Data_Prepped_full.csv',index_col=0,parse_dates=True)
system_df = system_df['2019-01-01':'2019-09-30']
system_daily_df = system_df.groupby(pd.Grouper(freq='d')).size()
system_daily_df.index = system_daily_df.index.tz_localize('America/Vancouver')
system_daily_df.head()
len(system_daily_df)
I'll similarly load my tracking data for the same period.
tracking_df = pd.read_csv('/home/msj/bikeshares/mobi_bikes/data/taken_hourly.csv',index_col=0,parse_dates=True)
tracking_df.index = tracking_df.index.tz_convert('America/Vancouver')
tracking_df = tracking_df['2019-01-01':'2019-09-30']
tracking_daily_df = tracking_df.sum(1).groupby(pd.Grouper(freq='d')).sum()
tracking_daily_df.head()
len(tracking_daily_df)
First let's look at a correlation plot. If my tracking estimates were exactly the same as the official data, the points would fall on the straight line x = y.
f,ax = plt.subplots()
ax.scatter(system_daily_df,tracking_daily_df,s=1)
ax.plot([0,6000],[0,6000],color='k',linestyle='--')
ax.set_xlabel('System count (daily)')
ax.set_ylabel('Tracking count (daily)')
Not bad! I do see a bit of a trend: overcounting trips on slow days and undercounting trips on busy days. This tracks with the limitations mentioned above. On slow days there are few collision events, so the error is dominated by rebalancings. As the days get busier, we get more and more collisions until they become the dominant error and I start undercounting trips. The half-dozen or so large outliers are due to my tracker going down for hours at a time and missing hundreds of trips (note that all the large outliers are underestimates).
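To put a number on the agreement, pandas can compute the Pearson correlation between the two daily series directly:
# Pearson correlation between official and tracked daily trip counts
system_daily_df.corr(tracking_daily_df)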
delta = system_daily_df - tracking_daily_df
The Bland-Altman plot is often used to visualize systematic differences between two measurement types. Here we plot the average of the two measurements on the x axis and the difference between the two measurements on the y axis. A trend in the differences implies a bias in one of the measurements.
average = (system_daily_df + tracking_daily_df)/2
f,ax = plt.subplots()
ax.scatter(x=average, y=delta)
# mean difference and 95% limits of agreement (±1.96 standard deviations)
ax.hlines(delta.mean(), average.min(), average.max(), linestyles='solid')
ax.hlines([delta.mean()+delta.std()*1.96, delta.mean()-delta.std()*1.96],
          average.min(), average.max(), linestyles='dashed')
ax.set_ylabel('Difference in measurements')
ax.set_xlabel('Average of measurements')
ax.set_title(f'Bland-Altman Plot (mean difference={delta.mean():.02f}, std={delta.std():.02f})')
We observe a slight bias between low and high count days as discussed above. But the vast majority of observations are within 2 standard deviations of the mean, which implies that in general we have good agreement between the two methods.
Summary¶
In general, the above algorithm is a good technique for estimating trip counts from a bikeshare system's GBFS feed. Comparing the results of this technique with Mobi's official trip data shows strong correlation. Some bias is likely caused by rebalancing of bikes by the operator (noticeable on low-trip days) as well as missed trips when a bike is parked at the same time as a bike is being taken out at the same station (more noticeable on high-trip days).
Further improvements could be made by filtering out events that are likely rebalancings (6+ bikes coming or going from a single station at the same time) and by correcting for missed trips due to simultaneous events. I hope to revisit this post once I've implemented these features and see whether they improve the performance.