I've started wondering if there are any network effects in Mobi trip data. I don't know much about network analysis, but luckily there are Python packages to do the work for me: NetworkX provides general network analysis tools and basic plotting, and Community provides tools for determining clusters in a network. In this post, I'll use these tools to investigate how stations cluster together based on the frequency of trips between pairs of stations.
The General Bikeshare Feed Specification (GBFS) is a popular standard, used by systems around the globe, for publishing live data about bikeshare systems. Its stated purpose is to allow third-party applications to interface with the live data and to allow municipalities to monitor compliance with local regulations. For my purposes, it also allows me to monitor system usage in real time. Here I'll go through the steps I use to track bikeshare usage in real time using GBFS feeds. I currently publish live tracking of Vancouver bikeshare systems at @VanBikeShareBot and will soon be adding tracking of Toronto, Hamilton and Montreal systems.
I've bundled the tools I use for GBFS monitoring into a Python package that can be found on GitHub. The Bikedata package is not a full-service GBFS client -- for that, you might prefer GBFS-client. Bikedata provides minimal functionality for querying GBFS feeds and returning pandas DataFrames, and implements some helper functions for persistent monitoring of bikeshare systems.
The GBFS spec
I won't go into detail about the GBFS spec; more information can be found on the project's GitHub page. Suffice it to say that a GBFS-compliant system offers several distinct JSON endpoints that provide information about the system. For example, Mobi Bikes in Vancouver provides:
- https://vancouver-gbfs.smoove.pro/gbfs/gbfs.json: A list of available feeds
- https://vancouver-gbfs.smoove.pro/gbfs/en/system_information.json: General system information
- https://vancouver-gbfs.smoove.pro/gbfs/en/station_information.json: Details about stations (short and long names, location, coordinates)
- https://vancouver-gbfs.smoove.pro/gbfs/en/station_status.json: Live status of stations (bikes available, free docks)
Mobi doesn't have free-floating bikes, but systems that do also have a free_bike_status.json feed providing the location of available free-floating bikes.
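As a rough illustration of how the list of available feeds can be used, the sketch below turns a gbfs.json-style document into a lookup from feed name to URL. The sample payload is hand-written and trimmed for illustration (it is not a captured response), and `feed_urls` is a hypothetical helper, not a function from Bikedata.

```python
import json

# Hand-written sample shaped like a GBFS auto-discovery document.
# Illustrative only: not a real response from the Mobi endpoint.
SAMPLE_GBFS = json.loads("""
{
  "last_updated": 1511300000,
  "ttl": 10,
  "data": {
    "en": {
      "feeds": [
        {"name": "system_information",
         "url": "https://vancouver-gbfs.smoove.pro/gbfs/en/system_information.json"},
        {"name": "station_information",
         "url": "https://vancouver-gbfs.smoove.pro/gbfs/en/station_information.json"},
        {"name": "station_status",
         "url": "https://vancouver-gbfs.smoove.pro/gbfs/en/station_status.json"}
      ]
    }
  }
}
""")


def feed_urls(gbfs_doc, language="en"):
    """Return a {feed name: url} dict for one language of a gbfs.json document."""
    feeds = gbfs_doc["data"][language]["feeds"]
    return {feed["name"]: feed["url"] for feed in feeds}


urls = feed_urls(SAMPLE_GBFS)
print(urls["station_status"])
```

In practice the document would be fetched from the gbfs.json URL rather than embedded as a string.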
Tracking station-based systems
Many bikeshare systems, especially those in dense city centres, only allow trips to begin and end at physical stations. To monitor these systems, I periodically query the station_status.json feed and record the number of bikes at each station. If the number of available bikes decreases by N, I count that as N departures from the station. If it increases by M, I count that as M bikes returned.
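The counting rule above can be sketched with two snapshots of bikes available per station. This is a minimal illustration, not the Bikedata implementation: `count_trips`, the station ids and the bike counts are all made up for the example.

```python
import pandas as pd


def count_trips(previous, current):
    """Infer departures and returns from two snapshots of bikes per station.

    Both arguments are Series indexed by station id. Stations missing from
    either snapshot are ignored rather than counted as activity.
    """
    change = (current - previous).dropna().astype(int)
    return pd.DataFrame(
        {
            # A drop of N bikes is counted as N departures.
            "departures": (-change).clip(lower=0),
            # A rise of M bikes is counted as M returns.
            "returns": change.clip(lower=0),
        }
    )


# Made-up snapshots from two consecutive queries.
previous = pd.Series({"0001": 10, "0002": 4, "0003": 7})
current = pd.Series({"0001": 8, "0002": 4, "0003": 10})

trips = count_trips(previous, current)
print(trips)
```

One limitation of this approach: a departure and a return at the same station between two queries cancel out and go uncounted, so shorter polling intervals should give better estimates.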
I came across @realtimebus the other day and immediately wanted to recreate it for Vancouver. Luckily, Translink has a real-time bus API that I could pull from, and making an animated GIF is pretty straightforward in Python. And since I already have some experience making Twitter bots with @VanBikeShareBot, I felt like I didn't really have a choice.
I've been working on how best to look at connections between the various Mobi bikeshare stations throughout Vancouver. One thing that quickly became obvious was that some visualizations are much too noisy with all stations, but work nicely when I group stations into neighbourhoods. Luckily, the city makes geometry files of the various neighbourhoods available in its open data collection, so I was able to use those to group the stations according to the city's official definitions. See the distributions of stations below.
Vancouver's Mobi bikeshare system has been up and running for over 2 years now, and with two full summers of activity it's time to take a look at how exactly Vancouverites are using their bikeshare system.
For over a year, I've been collecting real-time data about Mobi bike trips by monitoring public information about the number of bikes at each station and inferring trip activity from changes in those numbers. This has led to some fun uses: I have live figures updating constantly on my website, a Twitter bot tweets out daily stats at @VanBikeShareBot, and I've written a few blog posts.
As handy as those live trip estimates are, they're very much estimates and only give us information about how often certain stations are used. Luckily, Mobi has started publishing open system data. This data set gives us a registry of every Mobi bikeshare trip since the beginning of 2017, current to the end of 2018 as of this writing. With this we have access to trip start and endpoints, trip duration and distance, membership type and more. In this post, I'll summarize some of the things I've learned after spending some time looking into this data.
I came across a problem recently where I wanted to store metadata about each column in a pandas DataFrame. This could be done easily enough with a simple dictionary where each column name is a key, but I was hoping for something that would propagate with a DataFrame as it's sliced, copied, expanded, etc., without having to explicitly keep track of the metadata in my scripts.
My solution is to use the pandas MultiIndex. In this post I'll discuss the pros and cons of this approach, show some examples, and show how I've subclassed the pandas DataFrame to make this approach easier.
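To give a flavour of the idea, here is a minimal sketch using plain pandas (not the DataFrame subclass mentioned above). The column names, the `unit` level and the values are invented for illustration.

```python
import pandas as pd

# Two-level columns: the first level is the column name, the second level
# carries a piece of metadata (here, a unit) for that column.
columns = pd.MultiIndex.from_tuples(
    [("max_temp", "degC"), ("rain", "mm")],
    names=["name", "unit"],
)
df = pd.DataFrame([[21.0, 0.0], [14.5, 12.2], [9.0, 30.1]], columns=columns)

# Filtering rows and copying both keep the column MultiIndex,
# so the metadata travels with the data.
rainy = df[df[("rain", "mm")] > 0].copy()

# Recover a plain {name: unit} mapping from the surviving columns.
units = dict(list(rainy.columns))
print(units)
```

The trade-off is that every column selection now needs the full tuple (or a level-aware accessor), which is the kind of friction a subclass can smooth over.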
Mountain View Cemetery Open Data
One of my goals this year is to turn everything interesting I learn into a blog post. I've been using Vancouver's open data to find datasets to play around with, and the site includes some data about people interred at Mountain View Cemetery. A little morbid for a data vis exercise, but I'm doing it anyways.
If you have an Android phone, Google is tracking your location. If you have "Location History" turned on in your settings, Google is really tracking your location. If you have this setting turned off, Google will still record your location when an app that has proper permission (like Google Maps or Strava) requests your location. This is creepy and probably bad from a privacy standpoint, but it's cool and fun from a data and visualisation standpoint. In the interest of blog content, I've let Google track my location since I got my new phone ~1.5 years ago and it's now time to look at the data.
Google offers ways to look at your location history through their apps, but it's more fun to do it myself. Through myaccount.google.com, you can create an archive of your personal information being stored by Google. You can pick which services you're interested in, but for now I'm just looking for my location history data. It takes some time for Google to create the archive and make it available for download.
City staff recently reported to Council about the status of the Mobi bikeshare system. You can see the slideshow for yourself here. Results generally look positive for the future of bikeshare in Vancouver, but a comment on one slide stuck out for me: "Temperature influences ridership more than precipitation".
Is this really true? In my experience, biking on a cool, dry day is much more enjoyable than biking on a warm rainy day. There was no clue in the council report on how they came to this conclusion, but I suspect they used the eyeball test. But the cooler months in Vancouver are also the rainy months, so a more careful analysis is needed. I'd recently seen an example in Jake Vanderplas' Python Data Science Handbook that looked at the factors influencing bike ridership in Seattle, so I decided to do a similar thing for Vancouver.
I've described a bit more about how I collected the Mobi trip data in a previous post. While the data is unofficial and has some clear sources of error, it should be reliable enough to look at usage trends. For weather data, I wrote a small scraper to grab historical Vancouver weather from weather.gc.ca. All the code I used for this post is available on my GitHub page.
First, let's take a zoomed-out look at bike usage from late April to early November 2017. Daily highs and rainfall are plotted on the same scale, in degrees Celsius and millimeters of rain respectively. Weekends are highlighted by grey bars. The first thing we see is the obvious broad trend across the seasons, matching up with the temperature trend. This is probably what city staff noticed. Next, we see that on days with sharp drop-offs from the broader trend, there's almost always some rainfall. So far so good!
(Days with missing data are days I had computer downtime before I moved my scraper to a cloud server.)
A few things to note. Weather data is per day, and for just one Vancouver weather station. There are certainly days where it pours overnight and the day is clear, or it rains more in one part of the city than another. That said, let's take a closer look at a few weeks.
Here are two consecutive weekends where there was substantial rain on Sunday but none on Saturday. The drop-off is clear on both weekends. But later in the second week, there is a day with much less rainfall that has almost the same drop-off in number of trips.
Here's another three week stretch. Again, days with rain clearly show reduced Mobi usage. But usage also follows the temperature line! How much of the variation in bike usage is due to temperature, and how much is due to rain?
First let's look at temperature and rainfall separately.
The relationship between temperature and bike share trips is strong and exactly what you'd expect. More people ride on warm days! I've coloured the data by rainfall to see if there are any interesting outliers, but the rainy days are all well within the trend. We had pretty great weather all summer this year, so there are no examples of really rainy days with warm temperatures.
Rainfall also shows a clear relationship with daily trips. But it's not linear like temperature. There's a band of zero rain days that correlate with temperature, then a linear segment as ridership falls off with increased rainfall, then ridership hits a baseline below which it doesn't decrease. Apparently regardless of the amount of rain there's something like 600-800 users who will take out a bike no matter what. Cool!
So, we need to make a model that incorporates both temperature and rainfall to try to separate their effects. But to be as accurate as possible, we should include any other prominent factors. I showed in an earlier post that over the course of a day, weekdays and weekends show different ridership patterns. But it turns out there's no obvious difference in the total number of trips.
If we're going to think about weekday vs weekend, let's just include each day of the week as a separate factor. Stat holiday vs not holiday should also be included. Since there's such a dramatic difference between days with any rain and days with no rain, let's include "dry" days as a factor. The last factor I'll include is hours of daylight -- when it gets dark before 5pm, it's hard to say whether it's the temperature or the darkness that has more of an effect on someone's decision to ride a Mobi.
Trips ~ Temperature + Rainfall + Dry + Holiday + Daylight + Monday + Tuesday + Wednesday + Thursday + Friday + Saturday + Sunday
To fit the model, I'll use the OLS (ordinary least squares) class from the statsmodels Python package.
OLS Regression Results
==============================================================================
Dep. Variable:                  Trips   R-squared:                       0.812
Model:                            OLS   Adj. R-squared:                  0.801
Method:                 Least Squares   F-statistic:                     69.65
Date:                Tue, 21 Nov 2017   Prob (F-statistic):           2.49e-58
Time:                        13:57:32   Log-Likelihood:                -1376.9
No. Observations:                 189   AIC:                             2778.
Df Residuals:                     177   BIC:                             2817.
Df Model:                          11
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
Max Temp       117.3302      8.079     14.522      0.000     101.386     133.274
Total Rainmm   -29.8730      6.715     -4.449      0.000     -43.124     -16.622
Mon           -523.1535      209.4     -2.498      0.013    -936.401    -109.906
Tue           -523.9437      206.4     -2.537      0.012    -931.439    -116.448
Wed           -497.6093      209.4     -2.376      0.019    -910.967     -84.252
Thu           -429.2738      213.2     -2.013      0.046    -850.082      -8.465
Fri           -368.8536      207.5     -1.777      0.077    -778.440      40.733
Sat           -487.4666      206.5     -2.360      0.019    -895.072     -79.861
Sun           -755.6349      207.6     -3.640      0.000   -1165.364    -345.906
Dry            362.8911     75.665      4.796      0.000     213.569     512.214
Holiday       -277.1649      181.6     -1.526      0.129    -635.701      81.371
daylight_hrs    37.2471     15.672      2.377      0.019       6.319      68.175
==============================================================================
Omnibus:                        9.879   Durbin-Watson:                   1.087
Prob(Omnibus):                  0.007   Jarque-Bera (JB):               20.264
Skew:                           0.116   Prob(JB):                     3.98e-05
Kurtosis:                       4.587   Cond. No.                         472.
==============================================================================
We can take a few things away from this table. The adjusted R² value is 0.801, which means our model explains about 80% of the variance in the data. Pretty good! We can also say that, all other things being equal, there will be:
- 117 more trips for each increase of one degree Celsius
- 30 fewer trips for each additional 1 mm of rain
- 277 fewer trips on holidays versus equivalent non-holidays
- 37 more trips for each additional hour of daylight
Given our model, we can plot the expected number of trips for each day on top of the measured number of trips.
The model nails the macroscopic structure, driven by temperature and daylight, and does pretty well with the day-to-day variations which are probably more driven by rainfall and the day of the week.
There are a few big misses, though. Let's look at the days where our model misses the mark by over 1000 trips:
On July 11th, a Tuesday, our model expects a normal summer day but the counted number of trips drops off precipitously. The weather data for that day says 20 degrees and no rain, so I'm not sure what's going on. I've looked at my source data and don't see anything out of the ordinary, but I suspect there was a data acquisition issue that day.
On August 12th, a Saturday, the model understates the number of riders by a large margin. There was 0.2 mm of rain measured on this day, so the model is treating it as a rainy day, but perhaps it only rained a bit overnight and was clear the rest of the day.
Finally, on October 12th, a Thursday, the model again undercounts the number of trips, only expecting ~200 trips when in reality there were over a thousand! There was 35 mm of rain measured that day. The model expects the number of trips to keep decreasing linearly with rain, but in practice if someone is going to bike in 20 mm of rain they're probably also going to bike in 30 mm of rain. Our model doesn't account for this.
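Picking out big misses like the ones above comes down to thresholding the residuals. The sketch below shows the idea on a handful of invented numbers; the dates, counts and predictions are placeholders, not values from the data set or the fitted model.

```python
import pandas as pd

# Invented measured and modelled trip counts, for illustration only.
trips = pd.DataFrame(
    {
        "measured": [3900, 1800, 4100, 1100],
        "predicted": [3700, 3000, 2900, 950],
    },
    index=pd.to_datetime(["2017-06-01", "2017-06-02", "2017-06-03", "2017-06-04"]),
)

# Positive residual: the model undercounted; negative: it overcounted.
trips["residual"] = trips["measured"] - trips["predicted"]

# Keep only the days where the model is off by more than 1000 trips.
big_misses = trips[trips["residual"].abs() > 1000]
print(big_misses)
```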
There are some other potentially nonlinear effects that we're not including here. During the winter in Vancouver, the coldest days are often clear and sunny. I wonder if, as we get into December and January, the temperature trend might reverse and we'll see more trips on the coldest days. If that's the case, we may need a more complex model to really describe how weather affects bike share users.
All the source code used for data acquisition and analysis in this post is available on my GitHub page.
To see more posts like this, follow me on Twitter @MikeJarrett_.